* [PATCH v4 00/21] PowerPC/PowerNV: PCI Slot Management
@ 2015-05-01  6:02 ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

This series adds PCI slot support for the PowerPC PowerNV platform, which runs
on top of skiboot firmware. The patchset requires corresponding changes in
skiboot, which have been sent to skiboot@lists.ozlabs.org for review. skiboot
exposes the PCI slots through device-node properties, and the kernel uses those
properties to populate the PCI slots accordingly.

The original PCI infrastructure on the PowerNV platform can't support hotplug
because PEs are assigned at PHB fixup time, which happens only once during
system boot. To address this, the PCI infrastructure on PowerNV has been
reworked substantially. After the rework, a PE and its corresponding resources
(IODT, M32DT, M64 segments, DMA32 and bypass window) are assigned when a PCI
bridge's resources are updated; those resources might decide the PE# assigned
to the PE (e.g. M64 resources, on P8 strictly speaking). Each PE maintains a
reference count, which is (number of child PCI devices + 1). When the last
child PCI device leaves the PE, the PE and its resources are released and put
back into the free pool. With this design, the PE is released when the EEH PE
is released. PATCH[1 - 8] relate to this part.
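
As a rough illustration of the reference counting described above, here is a
standalone userspace sketch. It is not code from this series: every toy_* name
and the pool size are hypothetical, and it assumes the extra "+1" is a base
reference that the PE's owner drops last.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PE 8	/* hypothetical pool size, not a hardware limit */

struct toy_pe {
	int number;	/* PE# */
	int refcount;	/* child devices + 1 base reference */
};

static bool toy_pe_in_use[TOY_MAX_PE];	/* stand-in for the PE# bitmap */
static struct toy_pe toy_pes[TOY_MAX_PE];

/* Take the lowest free PE#; the new PE holds only the base reference. */
static struct toy_pe *toy_pe_alloc(void)
{
	for (int i = 0; i < TOY_MAX_PE; i++) {
		if (!toy_pe_in_use[i]) {
			toy_pe_in_use[i] = true;
			toy_pes[i].number = i;
			toy_pes[i].refcount = 1;
			return &toy_pes[i];
		}
	}
	return NULL;	/* pool exhausted */
}

/* Called when a child device joins the PE. */
static void toy_pe_get(struct toy_pe *pe)
{
	pe->refcount++;
}

/*
 * Drop one reference. When the count reaches zero the PE# goes back
 * to the free pool. Returns true if the PE was released.
 */
static bool toy_pe_put(struct toy_pe *pe)
{
	if (--pe->refcount > 0)
		return false;
	toy_pe_in_use[pe->number] = false;
	return true;
}
```

In this sketch a PE with two child devices has a count of 3, so it only
returns to the pool after both devices and the base reference are dropped.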

From skiboot's perspective, a PCI slot provides (hot/fundamental/complete)
resets to EEH. The kernel learns from the device-tree node whether skiboot
supports the various resets on a particular PCI slot. If it does, EEH uses the
functionality provided by skiboot. Besides, the device-tree nodes have to change
in order to support PCI hotplug. For example, when a PCI adapter is inserted
into a slot, its device-tree node should be added to the system dynamically.
Conversely, the device-tree node should be removed from the system when the PCI
adapter is going offline. Since pci_dn and eeh_dev have the same life cycle as
PCI device nodes, they should be added/removed accordingly during PCI hotplug.
PATCH[9 - 20] do the related work.

The last patch is the standalone PCI hotplug driver for the PowerNV platform.
When a PCI adapter is removed from a PCI slot, which is triggered by a command
in userland, skiboot powers off the slot to save power and removes the
device-tree nodes of all PCI devices behind the slot. Conversely, when an
adapter is added, power to the slot is turned on, the PCI devices behind the
slot are rescanned, and the device-tree nodes for the newly detected PCI devices
are built in skiboot. In both cases, skiboot sends a message to the kernel so
that the kernel can adjust the device-tree accordingly. At the same time, the
kernel also has to deallocate or allocate the PE# and its related resources for
the removed/added PCI devices.

Changelog
=========
v4:
   * Rebased to 4.1.RC1
   * Added API to unflatten an FDT blob into a device node sub-tree, which is
     attached to the indicated parent device node. The original mechanism based
     on a formatted string stream has been dropped.
   * The PATCH[v3 09/21] ("powerpc/eeh: Delay probing EEH device during hotplug")
     was picked up and sent to linux-ppc@ separately for review, as Richard's
     "VF EEH Support" depends on it.
v3:
   * Rebased to 4.1.RC0
   * PowerNV PCI infrastructure is totally refactored in order to support PCI
     hotplug. The PowerNV hotplug driver is also reworked a lot because of
     the changes in skiboot in order to support PCI hotplug.

Gavin Shan (21):
  pci: Add pcibios_setup_bridge()
  powerpc/powernv: Enable M64 on P7IOC
  powerpc/powernv: M64 support improvement
  powerpc/powernv: Improve IO and M32 mapping
  powerpc/powernv: Improve DMA32 segment assignment
  powerpc/powernv: Create PEs dynamically
  powerpc/powernv: Release PEs dynamically
  powerpc/powernv: Drop pnv_ioda_setup_dev_PE()
  powerpc/powernv: Use PCI slot reset infrastructure
  powerpc/powernv: Fundamental reset for PCI bus reset
  powerpc/pci: Don't scan empty slot
  powerpc/pci: Move pcibios_find_pci_bus() around
  powerpc/powernv: Introduce pnv_pci_poll()
  powerpc/powernv: Functions to get/reset PCI slot status
  powerpc/pci: Delay creating pci_dn
  powerpc/pci: Create eeh_dev while creating pci_dn
  powerpc/pci: Export traverse_pci_device_nodes()
  powerpc/pci: Update bridge windows on PCI plugging
  drivers/of: Support adding sub-tree
  powerpc/powernv: Select OF_DYNAMIC
  pci/hotplug: PowerPC PowerNV PCI hotplug driver

 arch/powerpc/include/asm/eeh.h                 |    7 +-
 arch/powerpc/include/asm/opal-api.h            |    7 +-
 arch/powerpc/include/asm/opal.h                |    7 +-
 arch/powerpc/include/asm/pci-bridge.h          |    7 +-
 arch/powerpc/include/asm/pnv-pci.h             |    5 +
 arch/powerpc/include/asm/ppc-pci.h             |    7 +-
 arch/powerpc/kernel/eeh_dev.c                  |   20 +-
 arch/powerpc/kernel/pci-common.c               |   18 +-
 arch/powerpc/kernel/pci-hotplug.c              |   44 +-
 arch/powerpc/kernel/pci_dn.c                   |  119 +-
 arch/powerpc/platforms/maple/pci.c             |   35 +-
 arch/powerpc/platforms/pasemi/pci.c            |    3 -
 arch/powerpc/platforms/powermac/pci.c          |   39 +-
 arch/powerpc/platforms/powernv/Kconfig         |    1 +
 arch/powerpc/platforms/powernv/eeh-powernv.c   |  245 ++--
 arch/powerpc/platforms/powernv/opal-wrappers.S |    3 +
 arch/powerpc/platforms/powernv/pci-ioda.c      | 1657 +++++++++++++++---------
 arch/powerpc/platforms/powernv/pci.c           |   64 +-
 arch/powerpc/platforms/powernv/pci.h           |   52 +-
 arch/powerpc/platforms/pseries/msi.c           |    4 +-
 arch/powerpc/platforms/pseries/pci_dlpar.c     |   32 -
 arch/powerpc/platforms/pseries/setup.c         |    9 +-
 drivers/of/dynamic.c                           |   19 +-
 drivers/of/fdt.c                               |  133 +-
 drivers/pci/hotplug/Kconfig                    |   12 +
 drivers/pci/hotplug/Makefile                   |    4 +
 drivers/pci/hotplug/powernv_php.c              |  146 +++
 drivers/pci/hotplug/powernv_php.h              |   78 ++
 drivers/pci/hotplug/powernv_php_slot.c         |  643 +++++++++
 drivers/pci/setup-bus.c                        |   12 +-
 include/linux/of.h                             |    2 +
 include/linux/of_fdt.h                         |    1 +
 include/linux/pci.h                            |    1 +
 33 files changed, 2473 insertions(+), 963 deletions(-)
 create mode 100644 drivers/pci/hotplug/powernv_php.c
 create mode 100644 drivers/pci/hotplug/powernv_php.h
 create mode 100644 drivers/pci/hotplug/powernv_php_slot.c

-- 
2.1.0


^ permalink raw reply	[flat|nested] 184+ messages in thread

* [PATCH v4 01/21] pci: Add pcibios_setup_bridge()
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

Currently, the PowerPC PowerNV platform uses ppc_md.pcibios_fixup(),
which is called once after PCI probing and resource assignment have
completed, to allocate platform-required resources for PCI devices:
PE#, IO and MMIO mapping, DMA address translation (TCE) table, etc.
Obviously, that's not hotplug friendly.

The patch adds a weak function, pcibios_setup_bridge(), which is called
by pci_setup_bridge(). The PowerPC PowerNV platform will override the
function to assign the above platform-required resources to newly added
PCI devices, in order to support PCI hotplug on the PowerPC PowerNV platform.
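
For readers unfamiliar with weak symbols, the hook pattern can be sketched
in plain userspace C as below. This is only an illustration under stated
assumptions: the toy_* names and flag values are hypothetical, and
__attribute__((weak)) is the GCC/Clang extension that the kernel's __weak
macro is built on.

```c
/* Hypothetical stand-ins for the IORESOURCE_* flags. */
#define TOY_IO		0x1UL
#define TOY_MEM		0x2UL
#define TOY_PREFETCH	0x4UL

/* Records what the generic code "programmed", for demonstration only. */
static unsigned long toy_programmed_type;

/* Generic helper: plays the role of pci_setup_bridge_resources(). */
void toy_setup_bridge_resources(unsigned long type)
{
	toy_programmed_type = type;
}

/*
 * Weak default: plays the role of pcibios_setup_bridge(). If another
 * object file defines a non-weak function with the same name, the
 * linker picks that definition instead of this one.
 */
__attribute__((weak)) void toy_pcibios_setup_bridge(unsigned long type)
{
	toy_setup_bridge_resources(type);
}

/* Generic entry point: plays the role of pci_setup_bridge(). */
void toy_setup_bridge(void)
{
	toy_pcibios_setup_bridge(TOY_IO | TOY_MEM | TOY_PREFETCH);
}

/*
 * A platform override would live in a separate source file, e.g.:
 *
 *	void toy_pcibios_setup_bridge(unsigned long type)
 *	{
 *		toy_setup_bridge_resources(type);
 *		... assign platform-specific resources here ...
 *	}
 */
```

Built on its own, this file takes the default path; linking in a file with
the override changes the behaviour without touching the generic code.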

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 drivers/pci/setup-bus.c | 12 +++++++++---
 include/linux/pci.h     |  1 +
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 4fd0cac..a7d0c3c 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -674,7 +674,8 @@ static void pci_setup_bridge_mmio_pref(struct pci_dev *bridge)
 	pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
 }
 
-static void __pci_setup_bridge(struct pci_bus *bus, unsigned long type)
+
+void pci_setup_bridge_resources(struct pci_bus *bus, unsigned long type)
 {
 	struct pci_dev *bridge = bus->self;
 
@@ -693,12 +694,17 @@ static void __pci_setup_bridge(struct pci_bus *bus, unsigned long type)
 	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
 }
 
+void __weak pcibios_setup_bridge(struct pci_bus *bus, unsigned long type)
+{
+	pci_setup_bridge_resources(bus, type);
+}
+
 void pci_setup_bridge(struct pci_bus *bus)
 {
 	unsigned long type = IORESOURCE_IO | IORESOURCE_MEM |
 				  IORESOURCE_PREFETCH;
 
-	__pci_setup_bridge(bus, type);
+	pcibios_setup_bridge(bus, type);
 }
 
 
@@ -1467,7 +1473,7 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
 		/* avoiding touch the one without PREF */
 		if (type & IORESOURCE_PREFETCH)
 			type = IORESOURCE_PREFETCH;
-		__pci_setup_bridge(bus, type);
+		pci_setup_bridge_resources(bus, type);
 		/* for next child res under same bridge */
 		r->flags = old_flags;
 	}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..68c5ef9 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1175,6 +1175,7 @@ void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *),
 		  void *userdata);
 int pci_cfg_space_size(struct pci_dev *dev);
 unsigned char pci_bus_max_busnr(struct pci_bus *bus);
+void pci_setup_bridge_resources(struct pci_bus *bus, unsigned long type);
 void pci_setup_bridge(struct pci_bus *bus);
 resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 					 unsigned long type);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 02/21] powerpc/powernv: Enable M64 on P7IOC
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

The patch enables the M64 window on P7IOC, which has already been enabled
on PHB3. Compared to PHB3, there are 16 M64 BARs and each of them is
divided into 8 segments, so each PHB can support 128 M64 segments.
Also, P7IOC has M64DT, which helps map one particular M64 segment# to
an arbitrary PE#. However, we just provide 128 M64 segments (16 BARs)
and a fixed mapping between PE# and M64 segment# in order to keep the
same logic for M64 support on PHB3 and P7IOC. In turn, we just need
different phb->init_m64() hooks for P7IOC and PHB3.
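
The arithmetic behind the fixed mapping can be sketched as follows. This is
a standalone illustration, not code from the patch: the toy_* names are
hypothetical, and the base address and segment size are taken as parameters
rather than read from any real PHB.

```c
#include <stdint.h>

#define TOY_M64_BARS		16	/* M64 BARs per PHB, per the text above */
#define TOY_SEGS_PER_BAR	8	/* segments per M64 BAR */

struct toy_m64_slot {
	unsigned int bar;	/* which M64 BAR covers the segment */
	unsigned int seg;	/* segment index inside that BAR */
	uint64_t addr;		/* start address of the segment */
};

/*
 * With the fixed mapping, PE#x owns M64 segment#x. Segments are laid
 * out back to back from m64_base, 8 per BAR.
 */
static struct toy_m64_slot toy_m64_slot_for_pe(unsigned int pe,
					       uint64_t m64_base,
					       uint64_t segsize)
{
	struct toy_m64_slot slot;

	slot.bar = pe / TOY_SEGS_PER_BAR;
	slot.seg = pe % TOY_SEGS_PER_BAR;
	slot.addr = m64_base + (uint64_t)pe * segsize;
	return slot;
}
```

For example, PE#19 lands in BAR#2, segment 3 of that BAR, which is the same
"pe_number / 8" and "pe_number % 8" split that the patch passes to OPAL.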

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 115 ++++++++++++++++++++++++++----
 1 file changed, 103 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f8bc950..646962f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -165,6 +165,67 @@ static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
 	clear_bit(pe, phb->ioda.pe_alloc);
 }
 
+static int pnv_ioda1_init_m64(struct pnv_phb *phb)
+{
+	struct resource *r;
+	int seg;
+	s64 rc;
+
+	/* Each PHB supports 16 separate M64 BARs, each of which are
+	 * divided into 8 segments. So there are number of M64 segments
+	 * as total PE#, which is 128.
+	 */
+	for (seg = 0; seg < phb->ioda.total_pe; seg += 8) {
+		unsigned long base;
+
+		base = phb->ioda.m64_base + seg * phb->ioda.m64_segsize;
+		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+						 OPAL_M64_WINDOW_TYPE,
+						 seg / 8,
+						 base,
+						 0, /* unused */
+						 8 * phb->ioda.m64_segsize);
+		if (rc != OPAL_SUCCESS) {
+			pr_warn("  Failure %lld configuring M64 BAR#%d on PHB#%d\n",
+				rc, seg / 8, phb->hose->global_number);
+			goto fail;
+		}
+
+		rc = opal_pci_phb_mmio_enable(phb->opal_id,
+					      OPAL_M64_WINDOW_TYPE,
+					      seg / 8,
+					      OPAL_ENABLE_M64_SPLIT);
+		if (rc != OPAL_SUCCESS) {
+			pr_warn("  Failure %lld enabling M64 BAR#%d on PHB#%d\n",
+				rc, seg / 8, phb->hose->global_number);
+			goto fail;
+		}
+	}
+
+	/* Strip of the segment used by the reserved PE, which
+	 * is expected to be 0 or last supported PE#
+	 */
+	r = &phb->hose->mem_resources[1];
+	if (phb->ioda.reserved_pe == 0)
+		r->start += phb->ioda.m64_segsize;
+	else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1))
+		r->end -= phb->ioda.m64_segsize;
+	else
+		pr_warn("  Cannot strip M64 segment for reserved PE#%d\n",
+			phb->ioda.reserved_pe);
+
+	return 0;
+
+fail:
+	for ( ; seg >= 0; seg -= 8)
+		opal_pci_phb_mmio_enable(phb->opal_id,
+					 OPAL_M64_WINDOW_TYPE,
+					 seg / 8,
+					 OPAL_DISABLE_M64);
+
+	return -EIO;
+}
+
 /* The default M64 BAR is shared by all PEs */
 static int pnv_ioda2_init_m64(struct pnv_phb *phb)
 {
@@ -222,7 +283,7 @@ fail:
 	return -EIO;
 }
 
-static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
+static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb)
 {
 	resource_size_t sgsz = phb->ioda.m64_segsize;
 	struct pci_dev *pdev;
@@ -248,8 +309,8 @@ static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
 	}
 }
 
-static int pnv_ioda2_pick_m64_pe(struct pnv_phb *phb,
-				 struct pci_bus *bus, int all)
+static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
+				struct pci_bus *bus, int all)
 {
 	resource_size_t segsz = phb->ioda.m64_segsize;
 	struct pci_dev *pdev;
@@ -346,6 +407,28 @@ done:
 			pe->master = master_pe;
 			list_add_tail(&pe->list, &master_pe->slaves);
 		}
+
+		/* P7IOC supports M64DT, which helps mapping M64 segment
+		 * to one particular PE#. Unfortunately, PHB3 has fixed
+		 * mapping between M64 segment and PE#. In order for same
+		 * logic for P7IOC and PHB3, we enforce fixed mapping
+		 * between M64 segment and PE# on P7IOC.
+		 */
+		if (phb->type == PNV_PHB_IODA1) {
+			int64_t rc;
+
+			rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+							 pe->pe_number,
+							 OPAL_M64_WINDOW_TYPE,
+							 pe->pe_number / 8,
+							 pe->pe_number % 8);
+			if (rc != OPAL_SUCCESS)
+				pr_warn("%s: Failure %lld mapping "
+					"M64 for PHB#%d-PE#%d\n",
+					__func__, rc,
+					phb->hose->global_number,
+					pe->pe_number);
+		}
 	}
 
 	kfree(pe_alloc);
@@ -360,12 +443,6 @@ static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
 	const u32 *r;
 	u64 pci_addr;
 
-	/* FIXME: Support M64 for P7IOC */
-	if (phb->type != PNV_PHB_IODA2) {
-		pr_info("  Not support M64 window\n");
-		return;
-	}
-
 	if (!firmware_has_feature(FW_FEATURE_OPALv3)) {
 		pr_info("  Firmware too old to support M64 window\n");
 		return;
@@ -394,9 +471,23 @@ static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
 
 	/* Use last M64 BAR to cover M64 window */
 	phb->ioda.m64_bar_idx = 15;
-	phb->init_m64 = pnv_ioda2_init_m64;
-	phb->reserve_m64_pe = pnv_ioda2_reserve_m64_pe;
-	phb->pick_m64_pe = pnv_ioda2_pick_m64_pe;
+	phb->reserve_m64_pe = pnv_ioda_reserve_m64_pe;
+	phb->pick_m64_pe = pnv_ioda_pick_m64_pe;
+	switch (phb->type) {
+	case PNV_PHB_IODA1:
+		phb->init_m64 = pnv_ioda1_init_m64;
+		break;
+	case PNV_PHB_IODA2:
+		phb->init_m64 = pnv_ioda2_init_m64;
+		break;
+	default:
+		phb->init_m64 = NULL;
+		phb->reserve_m64_pe = NULL;
+		phb->pick_m64_pe = NULL;
+		phb->ioda.m64_size = 0;
+		phb->ioda.m64_segsize = 0;
+		phb->ioda.m64_base = 0;
+	}
 }
 
 static void pnv_ioda_freeze_pe(struct pnv_phb *phb, int pe_no)
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 03/21] powerpc/powernv: M64 support improvement
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

There is a limitation, either imposed by hardware or enforced (on
P7IOC): M64 segment#x can only be assigned to PE#x, while IO and M32
segments can be mapped to an arbitrary PE# via IODT and M32DT. So the
PE number must be x if M64 segment#x has been assigned to the PE, and
each PE owns at most one M64 segment. Currently, we reserve PE#
according to the root port's M64 window. That won't be reliable once
we extend the M64 window of the root port, or of the upstream port of
the PCIe switch behind the root port, to the PHB's M64 window in order
to support PCI hotplug in the future.
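
As a rough illustration of which buses still get M64 PEs reserved, the
rule can be expressed in terms of how far a bus sits below the root
bus. This is only a userspace sketch: struct toy_bus, toy_bus_depth()
and toy_need_m64_pe() are hypothetical stand-ins, not kernel
interfaces, and the has_normal_dev flag replaces the walk over the
bus's device list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct pci_bus: only what the rule needs. */
struct toy_bus {
	struct toy_bus *parent;		/* NULL for the root bus */
	bool has_normal_dev;		/* any non-bridge device on this bus? */
};

/* Number of bridges between the root bus and this bus. */
static int toy_bus_depth(const struct toy_bus *bus)
{
	int depth = 0;

	while (bus->parent) {
		bus = bus->parent;
		depth++;
	}
	return depth;
}

/* Returns true if M64 PEs should be reserved for devices on the bus. */
static bool toy_need_m64_pe(const struct toy_bus *bus)
{
	if (!bus)
		return false;

	switch (toy_bus_depth(bus)) {
	case 0:		/* root bus */
		return false;
	case 1:		/* bus behind the root port */
		return bus->has_normal_dev;
	case 2:		/* bus behind the top-level switch upstream port */
		return false;
	default:	/* anything deeper */
		return true;
	}
}
```

With this shape, a bus behind the root port that carries only a switch
is skipped, while the same bus with an adapter plugged directly into it
is not.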

The patch reserves PE# for M64 segments according to the M64 resources
of the PCI devices (not bridges) contained in the PE. It also tracks
the M64 segments consumed by each PE, so that they can be released
when the PCI device is unplugged.
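
To make the segment-to-PE arithmetic concrete, here is a minimal
userspace sketch of turning one device resource into a range of PE
numbers. The names struct toy_res, struct toy_pe_range and
toy_m64_pe_range() are hypothetical. The sketch computes the upper
limit as "last touched segment + 1" rather than with the ALIGN()
expression used in the patch; the two agree whenever the resource ends
on the last byte of a segment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the address range of a struct resource. */
struct toy_res {
	uint64_t start;
	uint64_t end;		/* inclusive, as in the kernel */
};

/* Half-open range [first, limit) of M64 segments, i.e. PE numbers. */
struct toy_pe_range {
	uint64_t first;
	uint64_t limit;
};

static struct toy_pe_range toy_m64_pe_range(const struct toy_res *r,
					    uint64_t m64_base,
					    uint64_t segsz)
{
	struct toy_pe_range range;

	assert(segsz != 0);
	assert(r->start >= m64_base && r->end >= r->start);

	/* M64 segment#x maps to PE#x, so the index is the PE number. */
	range.first = (r->start - m64_base) / segsz;
	/* 'end' is inclusive: the last touched segment is (end - base) / segsz. */
	range.limit = (r->end - m64_base) / segsz + 1;
	return range;
}
```

A caller would then reserve every PE number in [first, limit), which is
what the loops over pe_no in the diff below do.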

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 190 ++++++++++++++++++------------
 arch/powerpc/platforms/powernv/pci.h      |  10 +-
 2 files changed, 122 insertions(+), 78 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 646962f..a994882 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -283,28 +283,78 @@ fail:
 	return -EIO;
 }
 
-static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb)
+/* We extend the M64 window of root port, or the upstream bridge port
+ * of the PCIE switch behind root port. So we shouldn't reserve PEs
+ * for M64 resources because there are no (normal) PCI devices consuming
+ * M64 resources on the PCI buses leading from root port, or the upstream
+ * bridge port. The function returns true if the indicated PCI bus needs
+ * reserved PEs because of M64 resources in advance. Otherwise, the
+ * function returns false.
+ */
+static bool pnv_ioda_need_m64_pe(struct pnv_phb *phb,
+				 struct pci_bus *bus)
 {
-	resource_size_t sgsz = phb->ioda.m64_segsize;
+	/* Root bus */
+	if (!bus || pci_is_root_bus(bus))
+		return false;
+
+	/* Bus leading from root port. We need to check what types of
+	 * PCI devices are on the bus. If it only connects PCI bridges,
+	 * we don't need to reserve M64 PEs for it. Otherwise, we still
+	 * need to do that.
+	 */
+	if (pci_is_root_bus(bus->self->bus)) {
+		struct pci_dev *pdev;
+
+		list_for_each_entry(pdev, &bus->devices, bus_list) {
+			if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+				return true;
+		}
+
+		return false;
+	}
+
+	/* Bus leading from the upstream bridge port on top level */
+	if (pci_is_root_bus(bus->self->bus->self->bus))
+		return false;
+
+	return true;
+}
+
+static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
+				    struct pci_bus *bus)
+{
+	resource_size_t segsz = phb->ioda.m64_segsize;
 	struct pci_dev *pdev;
 	struct resource *r;
-	int base, step, i;
+	unsigned long pe_no, limit;
+	int i;
 
-	/*
-	 * Root bus always has full M64 range and root port has
-	 * M64 range used in reality. So we're checking root port
-	 * instead of root bus.
+	if (!pnv_ioda_need_m64_pe(phb, bus))
+		return;
+
+	/* The bridge's M64 window might have been extended to the
+	 * PHB's M64 window in order to support PCI hotplug. So the
+	 * bridge's M64 window can't be relied on for picking the
+	 * PE# for the PCI bus behind it. We have to check the M64
+	 * resources consumed by the PCI devices, which sit on the
+	 * PCI bus.
 	 */
-	list_for_each_entry(pdev, &phb->hose->bus->devices, bus_list) {
-		for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
-			r = &pdev->resource[PCI_BRIDGE_RESOURCES + i];
-			if (!r->parent ||
-			    !pnv_pci_is_mem_pref_64(r->flags))
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+#ifdef CONFIG_PCI_IOV
+			if (i >= PCI_IOV_RESOURCES && i <= PCI_IOV_RESOURCE_END)
+				continue;
+#endif
+			r = &pdev->resource[i];
+			if (!r->flags || r->start >= r->end ||
+			    !r->parent || !pnv_pci_is_mem_pref_64(r->flags))
 				continue;
 
-			base = (r->start - phb->ioda.m64_base) / sgsz;
-			for (step = 0; step < resource_size(r) / sgsz; step++)
-				pnv_ioda_reserve_pe(phb, base + step);
+			pe_no = (r->start - phb->ioda.m64_base) / segsz;
+			limit = ALIGN(r->end - phb->ioda.m64_base, segsz) / segsz;
+			for (; pe_no < limit; pe_no++)
+				pnv_ioda_reserve_pe(phb, pe_no);
 		}
 	}
 }
@@ -316,85 +366,64 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
 	struct pci_dev *pdev;
 	struct resource *r;
 	struct pnv_ioda_pe *master_pe, *pe;
-	unsigned long size, *pe_alloc;
-	bool found;
-	int start, i, j;
-
-	/* Root bus shouldn't use M64 */
-	if (pci_is_root_bus(bus))
-		return IODA_INVALID_PE;
-
-	/* We support only one M64 window on each bus */
-	found = false;
-	pci_bus_for_each_resource(bus, r, i) {
-		if (r && r->parent &&
-		    pnv_pci_is_mem_pref_64(r->flags)) {
-			found = true;
-			break;
-		}
-	}
+	unsigned long size, *pe_bitsmap;
+	unsigned long pe_no, limit;
+	int i;
 
-	/* No M64 window found ? */
-	if (!found)
+	if (!pnv_ioda_need_m64_pe(phb, bus))
 		return IODA_INVALID_PE;
 
-	/* Allocate bitmap */
+        /* Allocate bitmap */
 	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
-	pe_alloc = kzalloc(size, GFP_KERNEL);
-	if (!pe_alloc) {
-		pr_warn("%s: Out of memory !\n",
-			__func__);
+	pe_bitsmap = kzalloc(size, GFP_KERNEL);
+	if (!pe_bitsmap) {
+		pr_warn("%s: Out of memory !\n", __func__);
 		return IODA_INVALID_PE;
 	}
 
-	/*
-	 * Figure out reserved PE numbers by the PE
-	 * the its child PEs.
-	 */
-	start = (r->start - phb->ioda.m64_base) / segsz;
-	for (i = 0; i < resource_size(r) / segsz; i++)
-		set_bit(start + i, pe_alloc);
-
-	if (all)
-		goto done;
-
-	/*
-	 * If the PE doesn't cover all subordinate buses,
-	 * we need subtract from reserved PEs for children.
+	/* The bridge's M64 window might be intentionally extended
+	 * to the PHB's M64 window to support PCI hotplug. So we have
+	 * to check the M64 resources consumed by the PCI devices
+	 * on the PCI bus.
 	 */
 	list_for_each_entry(pdev, &bus->devices, bus_list) {
-		if (!pdev->subordinate)
-			continue;
+		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+#ifdef CONFIG_PCI_IOV
+			if (i >= PCI_IOV_RESOURCES &&
+			    i <= PCI_IOV_RESOURCE_END)
+				continue;
+#endif
+			/* Don't scan bridge's window if the PE
+			 * doesn't contain its subordinate bus.
+			 */
+			if (!all && i >= PCI_BRIDGE_RESOURCES &&
+			    i <= PCI_BRIDGE_RESOURCE_END)
+				continue;
 
-		pci_bus_for_each_resource(pdev->subordinate, r, i) {
-			if (!r || !r->parent ||
-			    !pnv_pci_is_mem_pref_64(r->flags))
+			r = &pdev->resource[i];
+			if (!r->flags || r->start >= r->end ||
+			    !r->parent || !pnv_pci_is_mem_pref_64(r->flags))
 				continue;
 
-			start = (r->start - phb->ioda.m64_base) / segsz;
-			for (j = 0; j < resource_size(r) / segsz ; j++)
-				clear_bit(start + j, pe_alloc);
-                }
-        }
+			pe_no = (r->start - phb->ioda.m64_base) / segsz;
+			limit = ALIGN(r->end - phb->ioda.m64_base, segsz) / segsz;
+			for (; pe_no < limit; pe_no++)
+				set_bit(pe_no, pe_bitsmap);
+		}
+	}
 
-	/*
-	 * the current bus might not own M64 window and that's all
-	 * contributed by its child buses. For the case, we needn't
-	 * pick M64 dependent PE#.
-	 */
-	if (bitmap_empty(pe_alloc, phb->ioda.total_pe)) {
-		kfree(pe_alloc);
+	/* No M64 window found ? */
+	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
+		kfree(pe_bitsmap);
 		return IODA_INVALID_PE;
 	}
 
-	/*
-	 * Figure out the master PE and put all slave PEs to master
-	 * PE's list to form compound PE.
+	/* Figure out the master PE and put all slave PEs
+	 * to master PE's list to form compound PE.
 	 */
-done:
 	master_pe = NULL;
 	i = -1;
-	while ((i = find_next_bit(pe_alloc, phb->ioda.total_pe, i + 1)) <
+	while ((i = find_next_bit(pe_bitsmap, phb->ioda.total_pe, i + 1)) <
 		phb->ioda.total_pe) {
 		pe = &phb->ioda.pe_array[i];
 
@@ -408,6 +437,13 @@ done:
 			list_add_tail(&pe->list, &master_pe->slaves);
 		}
 
+		/* Pick the M64 segment, which should be available. Also,
+		 * those M64 segments consumed by slave PEs are contributed
+		 * to the master PE.
+		 */
+		BUG_ON(test_and_set_bit(pe->pe_number, phb->ioda.m64_segmap));
+		BUG_ON(test_and_set_bit(pe->pe_number, master_pe->m64_segmap));
+
 		/* P7IOC supports M64DT, which helps mapping M64 segment
 		 * to one particular PE#. Unfortunately, PHB3 has fixed
 		 * mapping between M64 segment and PE#. In order for same
@@ -431,7 +467,7 @@ done:
 		}
 	}
 
-	kfree(pe_alloc);
+	kfree(pe_bitsmap);
 	return master_pe->pe_number;
 }
 
@@ -1233,7 +1269,7 @@ static void pnv_pci_ioda_setup_PEs(void)
 
 		/* M64 layout might affect PE allocation */
 		if (phb->reserve_m64_pe)
-			phb->reserve_m64_pe(phb);
+			phb->reserve_m64_pe(phb, phb->hose->bus);
 
 		pnv_ioda_setup_PEs(hose->bus);
 	}
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 070ee88..19022cf 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -49,6 +49,13 @@ struct pnv_ioda_pe {
 	/* PE number */
 	unsigned int		pe_number;
 
+	/* IO/M32/M64 segments consumed by the PE. Each PE can
+	 * have one M64 segment at most, but M64 segments consumed
+	 * by slave PEs will be contributed to the master PE. One
+	 * PE can own multiple IO and M32 segments.
+	 */
+	unsigned long		m64_segmap[8];
+
 	/* "Weight" assigned to the PE for the sake of DMA resource
 	 * allocations
 	 */
@@ -114,7 +121,7 @@ struct pnv_phb {
 	u32 (*bdfn_to_pe)(struct pnv_phb *phb, struct pci_bus *bus, u32 devfn);
 	void (*shutdown)(struct pnv_phb *phb);
 	int (*init_m64)(struct pnv_phb *phb);
-	void (*reserve_m64_pe)(struct pnv_phb *phb);
+	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
 	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
 	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
 	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
@@ -153,6 +160,7 @@ struct pnv_phb {
 			struct mutex		pe_alloc_mutex;
 
 			/* M32 & IO segment maps */
+			unsigned long		m64_segmap[8];
 			unsigned int		*m32_segmap;
 			unsigned int		*io_segmap;
 			struct pnv_ioda_pe	*pe_array;
-- 
2.1.0



* [PATCH v4 04/21] powerpc/powernv: Improve IO and M32 mapping
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

The PHB's IO or M32 window is divided evenly into segments, each of
which can be mapped to an arbitrary PE# by IODT or M32DT. The current
code figures out the IO and M32 segments consumed by one particular PE
from the windows of the PE's upstream bridge. That won't be reliable
once we extend the IO or M32 window of the root port, or of the
upstream port of the PCIe switch behind the root port, to the PHB's IO
or M32 window in order to support PCI hotplug in the future.

The patch improves pnv_ioda_setup_pe_seg() to calculate the PE's
consumed IO or M32 segments from the devices it contains, rather than
from its upstream bridge's windows. Also, the logic for mapping IO and
M32 segments is combined to simplify the code. The IO and M32 segments
consumed by each PE are tracked as well, so that they can be released
when the PCI device is unplugged.
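
The combined mapping loop can be pictured with the following userspace
sketch, which marks the segments covered by one resource in both a
PHB-wide bitmap and a per-PE bitmap. All toy_* names are hypothetical,
the OPAL call is only indicated by a comment, and, unlike the loop in
the patch, the sketch first rounds the start down to a segment boundary
so that an unaligned range still marks every segment it touches.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define TOY_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_SEGMAP_LONGS	8	/* mirrors the [8] arrays in the patch */

static void toy_set_bit(unsigned int nr, unsigned long *map)
{
	map[nr / TOY_BITS_PER_LONG] |= 1UL << (nr % TOY_BITS_PER_LONG);
}

static int toy_test_bit(unsigned int nr, const unsigned long *map)
{
	return (int)((map[nr / TOY_BITS_PER_LONG] >>
		      (nr % TOY_BITS_PER_LONG)) & 1UL);
}

/*
 * Mark every segment touched by the window-relative, inclusive range
 * [start, end] in the PHB-wide map and in the PE's own map. Returns
 * the number of segments marked.
 */
static unsigned int toy_map_segments(uint64_t start, uint64_t end,
				     uint64_t segsize,
				     unsigned int total_segs,
				     unsigned long *phb_map,
				     unsigned long *pe_map)
{
	unsigned int index, count = 0;

	assert(segsize != 0 && start <= end);
	assert(total_segs <= TOY_SEGMAP_LONGS * TOY_BITS_PER_LONG);

	/* Round down so a partially covered first segment is counted once. */
	start -= start % segsize;
	index = (unsigned int)(start / segsize);

	while (index < total_segs && start <= end) {
		/* The kernel would ask firmware to map the segment here. */
		toy_set_bit(index, phb_map);
		toy_set_bit(index, pe_map);
		start += segsize;
		index++;
		count++;
	}
	return count;
}
```

Keeping a per-PE map next to the PHB-wide one is what would let an
unplug path clear exactly the segments that a given PE had claimed.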

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 150 ++++++++++++++++--------------
 arch/powerpc/platforms/powernv/pci.h      |  13 +--
 2 files changed, 85 insertions(+), 78 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a994882..7e6e266 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2543,77 +2543,92 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 }
 #endif /* CONFIG_PCI_IOV */
 
-/*
- * This function is supposed to be called on basis of PE from top
- * to bottom style. So the the I/O or MMIO segment assigned to
- * parent PE could be overrided by its child PEs if necessary.
- */
-static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
-				  struct pnv_ioda_pe *pe)
+static int pnv_ioda_map_pe_one_res(struct pci_controller *hose,
+				   struct pnv_ioda_pe *pe,
+				   struct resource *res)
 {
 	struct pnv_phb *phb = hose->private_data;
 	struct pci_bus_region region;
-	struct resource *res;
-	int i, index;
-	int rc;
+	unsigned int segsize, index;
+	unsigned long *segmap, *pe_segmap;
+	uint16_t win_type;
+	int64_t rc;
 
-	/*
-	 * NOTE: We only care PCI bus based PE for now. For PCI
-	 * device based PE, for example SRIOV sensitive VF should
-	 * be figured out later.
-	 */
-	BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)));
+	/* Check if we need to map the resource */
+	if (!res->parent || !res->flags ||
+	    res->start > res->end ||
+	    pnv_pci_is_mem_pref_64(res->flags))
+		return 0;
 
-	pci_bus_for_each_resource(pe->pbus, res, i) {
-		if (!res || !res->flags ||
-		    res->start > res->end)
-			continue;
+	if (res->flags & IORESOURCE_IO) {
+		segmap = phb->ioda.io_segmap;
+		pe_segmap = pe->io_segmap;
+		region.start = res->start - phb->ioda.io_pci_base;
+		region.end = res->end - phb->ioda.io_pci_base;
+		segsize = phb->ioda.io_segsize;
+		win_type = OPAL_IO_WINDOW_TYPE;
+	} else {
+		segmap = phb->ioda.m32_segmap;
+		pe_segmap = pe->m32_segmap;
+		region.start = res->start -
+			       hose->mem_offset[0] -
+			       phb->ioda.m32_pci_base;
+		region.end = res->end -
+			     hose->mem_offset[0] -
+			     phb->ioda.m32_pci_base;
+		segsize = phb->ioda.m32_segsize;
+		win_type = OPAL_M32_WINDOW_TYPE;
+	}
+
+	index = region.start / segsize;
+	while (index < phb->ioda.total_pe &&
+	       region.start <= region.end) {
+		rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+				pe->pe_number, win_type, 0, index);
+		if (rc != OPAL_SUCCESS) {
+			pr_warn("%s: Error %lld mapping (%d) seg#%d to PE#%d\n",
+				__func__, rc, win_type, index, pe->pe_number);
+			return -EIO;
+		}
 
-		if (res->flags & IORESOURCE_IO) {
-			region.start = res->start - phb->ioda.io_pci_base;
-			region.end   = res->end - phb->ioda.io_pci_base;
-			index = region.start / phb->ioda.io_segsize;
+		set_bit(index, segmap);
+		set_bit(index, pe_segmap);
+		region.start += segsize;
+		index++;
+	}
 
-			while (index < phb->ioda.total_pe &&
-			       region.start <= region.end) {
-				phb->ioda.io_segmap[index] = pe->pe_number;
-				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
-					pe->pe_number, OPAL_IO_WINDOW_TYPE, 0, index);
-				if (rc != OPAL_SUCCESS) {
-					pr_err("%s: OPAL error %d when mapping IO "
-					       "segment #%d to PE#%d\n",
-					       __func__, rc, index, pe->pe_number);
-					break;
-				}
+	return 0;
+}
 
-				region.start += phb->ioda.io_segsize;
-				index++;
-			}
-		} else if ((res->flags & IORESOURCE_MEM) &&
-			   !pnv_pci_is_mem_pref_64(res->flags)) {
-			region.start = res->start -
-				       hose->mem_offset[0] -
-				       phb->ioda.m32_pci_base;
-			region.end   = res->end -
-				       hose->mem_offset[0] -
-				       phb->ioda.m32_pci_base;
-			index = region.start / phb->ioda.m32_segsize;
-
-			while (index < phb->ioda.total_pe &&
-			       region.start <= region.end) {
-				phb->ioda.m32_segmap[index] = pe->pe_number;
-				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
-					pe->pe_number, OPAL_M32_WINDOW_TYPE, 0, index);
-				if (rc != OPAL_SUCCESS) {
-					pr_err("%s: OPAL error %d when mapping M32 "
-					       "segment#%d to PE#%d",
-					       __func__, rc, index, pe->pe_number);
-					break;
-				}
+static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
+				  struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *pdev;
+	struct resource *res;
+	int i;
 
-				region.start += phb->ioda.m32_segsize;
-				index++;
-			}
+	/* This function only works for bus dependent PE */
+	BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)));
+
+	list_for_each_entry(pdev, &pe->pbus->devices, bus_list) {
+		for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+			res = &pdev->resource[i];
+			if (pnv_ioda_map_pe_one_res(hose, pe, res))
+				return;
+		}
+
+		/* If the PE contains all subordinate PCI buses, the
+		 * resources of the child bridges should be mapped
+		 * to the PE as well.
+		 */
+		if (!(pe->flags & PNV_IODA_PE_BUS_ALL) ||
+		    (pdev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
+			continue;
+
+		for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
+			res = &pdev->resource[PCI_BRIDGE_RESOURCES + i];
+			if (pnv_ioda_map_pe_one_res(hose, pe, res))
+				return;
 		}
 	}
 }
@@ -2780,7 +2795,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 {
 	struct pci_controller *hose;
 	struct pnv_phb *phb;
-	unsigned long size, m32map_off, pemap_off, iomap_off = 0;
+	unsigned long size, pemap_off;
 	const __be64 *prop64;
 	const __be32 *prop32;
 	int len;
@@ -2865,19 +2880,10 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	/* Allocate aux data & arrays. We don't have IO ports on PHB3 */
 	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
-	m32map_off = size;
-	size += phb->ioda.total_pe * sizeof(phb->ioda.m32_segmap[0]);
-	if (phb->type == PNV_PHB_IODA1) {
-		iomap_off = size;
-		size += phb->ioda.total_pe * sizeof(phb->ioda.io_segmap[0]);
-	}
 	pemap_off = size;
 	size += phb->ioda.total_pe * sizeof(struct pnv_ioda_pe);
 	aux = memblock_virt_alloc(size, 0);
 	phb->ioda.pe_alloc = aux;
-	phb->ioda.m32_segmap = aux + m32map_off;
-	if (phb->type == PNV_PHB_IODA1)
-		phb->ioda.io_segmap = aux + iomap_off;
 	phb->ioda.pe_array = aux + pemap_off;
 	set_bit(phb->ioda.reserved_pe, phb->ioda.pe_alloc);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 19022cf..f604bb7 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -54,6 +54,8 @@ struct pnv_ioda_pe {
 	 * by slave PEs will be contributed to the master PE. One
 	 * PE can own multiple IO and M32 segments.
 	 */
+	unsigned long		io_segmap[8];
+	unsigned long		m32_segmap[8];
 	unsigned long		m64_segmap[8];
 
 	/* "Weight" assigned to the PE for the sake of DMA resource
@@ -154,16 +156,15 @@ struct pnv_phb {
 			unsigned int		io_segsize;
 			unsigned int		io_pci_base;
 
-			/* PE allocation bitmap */
+			/* PE allocation */
+			struct pnv_ioda_pe	*pe_array;
 			unsigned long		*pe_alloc;
-			/* PE allocation mutex */
 			struct mutex		pe_alloc_mutex;
 
-			/* M32 & IO segment maps */
+			/* IO/M32/M64 segment bitmaps */
+			unsigned long		io_segmap[8];
+			unsigned long		m32_segmap[8];
 			unsigned long		m64_segmap[8];
-			unsigned int		*m32_segmap;
-			unsigned int		*io_segmap;
-			struct pnv_ioda_pe	*pe_array;
 
 			/* IRQ chip */
 			int			irq_chip_init;
-- 
2.1.0



-
-			while (index < phb->ioda.total_pe &&
-			       region.start <= region.end) {
-				phb->ioda.m32_segmap[index] = pe->pe_number;
-				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
-					pe->pe_number, OPAL_M32_WINDOW_TYPE, 0, index);
-				if (rc != OPAL_SUCCESS) {
-					pr_err("%s: OPAL error %d when mapping M32 "
-					       "segment#%d to PE#%d",
-					       __func__, rc, index, pe->pe_number);
-					break;
-				}
+static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
+				  struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *pdev;
+	struct resource *res;
+	int i;
 
-				region.start += phb->ioda.m32_segsize;
-				index++;
-			}
+	/* This function only works for bus dependent PE */
+	BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)));
+
+	list_for_each_entry(pdev, &pe->pbus->devices, bus_list) {
+		for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+			res = &pdev->resource[i];
+			if (pnv_ioda_map_pe_one_res(hose, pe, res))
+				return;
+		}
+
+		/* If the PE contains all subordinate PCI buses, the
+		 * resources of the child bridges should be mapped
+		 * to the PE as well.
+		 */
+		if (!(pe->flags & PNV_IODA_PE_BUS_ALL) ||
+		    (pdev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
+			continue;
+
+		for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
+			res = &pdev->resource[PCI_BRIDGE_RESOURCES + i];
+			if (pnv_ioda_map_pe_one_res(hose, pe, res))
+				return;
 		}
 	}
 }
@@ -2780,7 +2795,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 {
 	struct pci_controller *hose;
 	struct pnv_phb *phb;
-	unsigned long size, m32map_off, pemap_off, iomap_off = 0;
+	unsigned long size, pemap_off;
 	const __be64 *prop64;
 	const __be32 *prop32;
 	int len;
@@ -2865,19 +2880,10 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	/* Allocate aux data & arrays. We don't have IO ports on PHB3 */
 	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
-	m32map_off = size;
-	size += phb->ioda.total_pe * sizeof(phb->ioda.m32_segmap[0]);
-	if (phb->type == PNV_PHB_IODA1) {
-		iomap_off = size;
-		size += phb->ioda.total_pe * sizeof(phb->ioda.io_segmap[0]);
-	}
 	pemap_off = size;
 	size += phb->ioda.total_pe * sizeof(struct pnv_ioda_pe);
 	aux = memblock_virt_alloc(size, 0);
 	phb->ioda.pe_alloc = aux;
-	phb->ioda.m32_segmap = aux + m32map_off;
-	if (phb->type == PNV_PHB_IODA1)
-		phb->ioda.io_segmap = aux + iomap_off;
 	phb->ioda.pe_array = aux + pemap_off;
 	set_bit(phb->ioda.reserved_pe, phb->ioda.pe_alloc);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 19022cf..f604bb7 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -54,6 +54,8 @@ struct pnv_ioda_pe {
 	 * by slave PEs will be contributed to the master PE. One
 	 * PE can own multiple IO and M32 segments.
 	 */
+	unsigned long		io_segmap[8];
+	unsigned long		m32_segmap[8];
 	unsigned long		m64_segmap[8];
 
 	/* "Weight" assigned to the PE for the sake of DMA resource
@@ -154,16 +156,15 @@ struct pnv_phb {
 			unsigned int		io_segsize;
 			unsigned int		io_pci_base;
 
-			/* PE allocation bitmap */
+			/* PE allocation */
+			struct pnv_ioda_pe	*pe_array;
 			unsigned long		*pe_alloc;
-			/* PE allocation mutex */
 			struct mutex		pe_alloc_mutex;
 
-			/* M32 & IO segment maps */
+			/* IO/M32/M64 segment bitmaps */
+			unsigned long		io_segmap[8];
+			unsigned long		m32_segmap[8];
 			unsigned long		m64_segmap[8];
-			unsigned int		*m32_segmap;
-			unsigned int		*io_segmap;
-			struct pnv_ioda_pe	*pe_array;
 
 			/* IRQ chip */
 			int			irq_chip_init;
-- 
2.1.0

* [PATCH v4 05/21] powerpc/powernv: Improve DMA32 segment assignment
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

On P7IOC, the whole available DMA32 space, which is below the
MEM32 space, is evenly divided into 256MB segments. The number
of contiguous segments assigned to a particular PE depends on
the PE's DMA weight, which is derived from the type of each
PCI device contained in the PE, and on the PHB's DMA weight,
which is the accumulated DMA weight of the PEs contained in
the PHB. This means that the PHB's DMA weight calculation
depends on existing PEs, which works perfectly now but is not
hotplug friendly. As the whole available DMA32 space can be
assigned to one PE on PHB3, PHB3 doesn't have this issue.

The patch improves DMA32 segment assignment by removing the
dependency on existing PEs, which makes this piece of logic
friendly to PCI hotplug. Besides, it's always worthwhile to
track the DMA32 segments consumed by one PE, so that they can
be released at PCI unplugging time.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
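Illustration only, not part of the patch: a minimal userspace sketch
of the assignment scheme in the commit message above, where a PE asks
for a share of the DMA32 segments proportional to its weight and then
backs off to smaller contiguous runs until one fits. The names
(struct dma32_pool, dma32_segs_wanted(), dma32_alloc(), NSEGS_MAX)
are hypothetical, and a bool array stands in for the tce32_segmap
bitmap. One deliberate difference: the patch scales each device
weight by tce32_count so that the integer division cannot drop below
one segment, whereas this sketch keeps weights unscaled and clamps
the result to one instead.

```c
#include <assert.h>
#include <stdbool.h>

#define NSEGS_MAX 64	/* assumption: upper bound on DMA32 segments */

struct dma32_pool {
	unsigned int nsegs;	/* number of 256MB segments available */
	bool used[NSEGS_MAX];	/* stand-in for the segment bitmap */
};

/*
 * Find the first run of n contiguous free segments.
 * Returns the start index, or p->nsegs if there is no such run.
 */
static unsigned int find_free_run(const struct dma32_pool *p, unsigned int n)
{
	unsigned int i, run = 0;

	for (i = 0; i < p->nsegs; i++) {
		run = p->used[i] ? 0 : run + 1;
		if (run == n)
			return i + 1 - n;
	}
	return p->nsegs;
}

/* Proportional share of the segments, but never less than one */
static unsigned int dma32_segs_wanted(unsigned int pe_weight,
				      unsigned int phb_weight,
				      unsigned int nsegs)
{
	unsigned int segs;

	assert(pe_weight > 0 && pe_weight <= phb_weight && nsegs > 0);
	segs = (unsigned int)(((unsigned long long)pe_weight * nsegs) /
			      phb_weight);
	return segs ? segs : 1;
}

/*
 * Back-off allocation: try the wanted size first, then shrink by
 * one segment at a time. Returns the number of segments reserved
 * (0 if nothing is left) and stores the first segment in *base.
 */
static unsigned int dma32_alloc(struct dma32_pool *p, unsigned int want,
				unsigned int *base)
{
	unsigned int n, i, start;

	for (n = want; n > 0; n--) {
		start = find_free_run(p, n);
		if (start >= p->nsegs)
			continue;	/* no run of n segments, back off */
		for (i = start; i < start + n; i++)
			p->used[i] = true;
		*base = start;
		return n;
	}
	return 0;
}
```

With 256MB segments, a PE that gets (base, segs) covers the DMA
addresses from (base << 28) up to ((base + segs) << 28) - 1, which
matches the range printed by the pe_info() call in the diff below.
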
 arch/powerpc/platforms/powernv/pci-ioda.c | 204 ++++++++++++++++--------------
 arch/powerpc/platforms/powernv/pci.h      |  24 +---
 2 files changed, 116 insertions(+), 112 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7e6e266..9ef745e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -976,8 +976,11 @@ static void pnv_ioda_link_pe_by_weight(struct pnv_phb *phb,
 	list_add_tail(&pe->dma_link, &phb->ioda.pe_dma_list);
 }
 
-static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
+static unsigned int pnv_ioda_dev_dma_weight(struct pci_dev *dev)
 {
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+	struct pnv_phb *phb = hose->private_data;
+
 	/* This is quite simplistic. The "base" weight of a device
 	 * is 10. 0 means no DMA is to be accounted for it.
 	 */
@@ -990,14 +993,33 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
 	if (dev->class == PCI_CLASS_SERIAL_USB_UHCI ||
 	    dev->class == PCI_CLASS_SERIAL_USB_OHCI ||
 	    dev->class == PCI_CLASS_SERIAL_USB_EHCI)
-		return 3;
+		return 3 * phb->ioda.tce32_count;
 
 	/* Increase the weight of RAID (includes Obsidian) */
 	if ((dev->class >> 8) == PCI_CLASS_STORAGE_RAID)
-		return 15;
+		return 15 * phb->ioda.tce32_count;
 
 	/* Default */
-	return 10;
+	return 10 * phb->ioda.tce32_count;
+}
+
+static int __pnv_ioda_phb_dma_weight(struct pci_dev *pdev, void *data)
+{
+	unsigned int *dma_weight = data;
+
+	*dma_weight += pnv_ioda_dev_dma_weight(pdev);
+	return 0;
+}
+
+static void pnv_ioda_phb_dma_weight(struct pnv_phb *phb)
+{
+	phb->ioda.dma_weight = 0;
+	if (!phb->hose->bus)
+		return;
+
+	pci_walk_bus(phb->hose->bus,
+		     __pnv_ioda_phb_dma_weight,
+		     &phb->ioda.dma_weight);
 }
 
 #ifdef CONFIG_PCI_IOV
@@ -1156,7 +1178,7 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 			continue;
 		}
 		pdn->pe_number = pe->pe_number;
-		pe->dma_weight += pnv_ioda_dma_weight(dev);
+		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
 			pnv_ioda_setup_same_PE(dev->subordinate, pe);
 	}
@@ -1193,7 +1215,6 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
 	pe->pbus = bus;
 	pe->pdev = NULL;
-	pe->tce32_seg = -1;
 	pe->mve_number = -1;
 	pe->rid = bus->busn_res.start << 8;
 	pe->dma_weight = 0;
@@ -1223,14 +1244,6 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	/* Put PE to the list */
 	list_add_tail(&pe->list, &phb->ioda.pe_list);
 
-	/* Account for one DMA PE if at least one DMA capable device exist
-	 * below the bridge
-	 */
-	if (pe->dma_weight != 0) {
-		phb->ioda.dma_weight += pe->dma_weight;
-		phb->ioda.dma_pe_count++;
-	}
-
 	/* Link the PE */
 	pnv_ioda_link_pe_by_weight(phb, pe);
 }
@@ -1569,7 +1582,6 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		pe->flags = PNV_IODA_PE_VF;
 		pe->pbus = NULL;
 		pe->parent_dev = pdev;
-		pe->tce32_seg = -1;
 		pe->mve_number = -1;
 		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
 			   pci_iov_virtfn_devfn(pdev, vf_index);
@@ -1890,28 +1902,70 @@ void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
 		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
 }
 
-static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
-				      struct pnv_ioda_pe *pe, unsigned int base,
-				      unsigned int segs)
+static int pnv_pci_ioda1_dma_segment_alloc(struct pnv_phb *phb,
+					   struct pnv_ioda_pe *pe)
+{
+	unsigned int weight, base, segs;
+
+	/* We shouldn't already have 32-bits DMA associated */
+	if (WARN_ON(pe->tce32_seg_start || pe->tce32_seg_end))
+		return -EEXIST;
+
+	/* Needn't setup TCE table for non-DMA capable PE */
+	weight = pe->dma_weight;
+	if (!weight)
+		return -ENODEV;
+
+	/* Calculate the DMA segments that PE needs. It's guaranteed
+	 * that the PE will have one segment at least.
+	 */
+	if (weight < phb->ioda.dma_weight / phb->ioda.tce32_count)
+		weight = phb->ioda.dma_weight / phb->ioda.tce32_count;
+	segs = (weight * phb->ioda.tce32_count) / phb->ioda.dma_weight;
+
+	/* Reserve the DMA segments with back-off way, which should
+	 * give us one segment at least.
+	 */
+	do {
+		base = bitmap_find_next_zero_area(phb->ioda.tce32_segmap,
+						  phb->ioda.tce32_count,
+						  0, segs, 0);
+		if (base < phb->ioda.tce32_count)
+			bitmap_set(phb->ioda.tce32_segmap, base, segs);
+	} while ((base > phb->ioda.tce32_count) && (--segs));
+
+	/* There are possibly no DMA32 segments */
+	if (!segs)
+		return -ENODEV;
+
+	pe->tce32_seg_start = base;
+	pe->tce32_seg_end = base + segs;
+	return 0;
+}
+
+static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
+				       struct pnv_ioda_pe *pe)
 {
 
 	struct page *tce_mem = NULL;
 	const __be64 *swinvp;
 	struct iommu_table *tbl;
-	unsigned int i;
-	int64_t rc;
 	void *addr;
+	unsigned int base, segs, i;
+	int64_t rc;
 
 	/* XXX FIXME: Handle 64-bit only DMA devices */
 	/* XXX FIXME: Provide 64-bit DMA facilities & non-4K TCE tables etc.. */
-	/* XXX FIXME: Allocate multi-level tables on PHB3 */
 
-	/* We shouldn't already have a 32-bit DMA associated */
-	if (WARN_ON(pe->tce32_seg >= 0))
+	/* Allocate TCE32 segments */
+	if (pnv_pci_ioda1_dma_segment_alloc(phb, pe)) {
+		pe_err(pe, " Cannot set up 32-bit TCE table\n");
 		return;
+	}
 
-	/* Grab a 32-bit TCE table */
-	pe->tce32_seg = base;
+	/* Build a 32-bits TCE table */
+	base = pe->tce32_seg_start;
+	segs = pe->tce32_seg_end - pe->tce32_seg_start;
 	pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n",
 		(base << 28), ((base + segs) << 28) - 1);
 
@@ -1943,6 +1997,10 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		}
 	}
 
+	/* Print some info */
+	pe_info(pe, "DMA weight %d, assigned (%d %d) DMA32 segments\n",
+		pe->dma_weight, base, segs);
+
 	/* Setup linux iommu table */
 	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
@@ -1981,8 +2039,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	return;
  fail:
 	/* XXX Failure: Try to fallback to 64-bit only ? */
-	if (pe->tce32_seg >= 0)
-		pe->tce32_seg = -1;
+	bitmap_clear(phb->ioda.tce32_segmap, base, segs);
+	pe->tce32_seg_start = pe->tce32_seg_end = 0;
 	if (tce_mem)
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
@@ -2051,11 +2109,10 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	int64_t rc;
 
 	/* We shouldn't already have a 32-bit DMA associated */
-	if (WARN_ON(pe->tce32_seg >= 0))
+	if (WARN_ON(pe->tce32_seg_end > pe->tce32_seg_start))
 		return;
 
 	/* The PE will reserve all possible 32-bits space */
-	pe->tce32_seg = 0;
 	end = (1 << ilog2(phb->ioda.m32_pci_base));
 	tce_table_size = (end / 0x1000) * 8;
 	pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
@@ -2066,7 +2123,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				   get_order(tce_table_size));
 	if (!tce_mem) {
 		pe_err(pe, "Failed to allocate a 32-bit TCE memory\n");
-		goto fail;
+		return;
 	}
 	addr = page_address(tce_mem);
 	memset(addr, 0, tce_table_size);
@@ -2079,11 +2136,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 					pe->pe_number << 1, 1, __pa(addr),
 					tce_table_size, 0x1000);
 	if (rc) {
+		__free_pages(tce_mem, get_order(tce_table_size));
 		pe_err(pe, "Failed to configure 32-bit TCE table,"
 		       " err %ld\n", rc);
-		goto fail;
+		return;
 	}
 
+	/* Mark the DMA32 space as assigned and print some info */
+	pe->tce32_seg_end = pe->tce32_seg_start + 1;
+	pe_info(pe, "Assigned DMA32 space\n");
+
 	/* Setup linux iommu table */
 	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
@@ -2120,76 +2182,30 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	/* Also create a bypass window */
 	if (!pnv_iommu_bypass_disabled)
 		pnv_pci_ioda2_setup_bypass_pe(phb, pe);
-
-	return;
-fail:
-	if (pe->tce32_seg >= 0)
-		pe->tce32_seg = -1;
-	if (tce_mem)
-		__free_pages(tce_mem, get_order(tce_table_size));
 }
 
-static void pnv_ioda_setup_dma(struct pnv_phb *phb)
+void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
+			       struct pnv_ioda_pe *pe)
 {
-	struct pci_controller *hose = phb->hose;
-	unsigned int residual, remaining, segs, tw, base;
-	struct pnv_ioda_pe *pe;
-
-	/* If we have more PE# than segments available, hand out one
-	 * per PE until we run out and let the rest fail. If not,
-	 * then we assign at least one segment per PE, plus more based
-	 * on the amount of devices under that PE
+	/* Recalculate the PHB's total DMA weight, which depends on
+	 * PCI devices. That means the PCI devices beneath the PHB
+	 * should have been probed successfully. Otherwise, the
+	 * calculated PHB's DMA weight won't be accurate.
 	 */
-	if (phb->ioda.dma_pe_count > phb->ioda.tce32_count)
-		residual = 0;
-	else
-		residual = phb->ioda.tce32_count -
-			phb->ioda.dma_pe_count;
+	pnv_ioda_phb_dma_weight(phb);
 
-	pr_info("PCI: Domain %04x has %ld available 32-bit DMA segments\n",
-		hose->global_number, phb->ioda.tce32_count);
-	pr_info("PCI: %d PE# for a total weight of %d\n",
-		phb->ioda.dma_pe_count, phb->ioda.dma_weight);
-
-	/* Walk our PE list and configure their DMA segments, hand them
-	 * out one base segment plus any residual segments based on
-	 * weight
-	 */
-	remaining = phb->ioda.tce32_count;
-	tw = phb->ioda.dma_weight;
-	base = 0;
-	list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link) {
-		if (!pe->dma_weight)
-			continue;
-		if (!remaining) {
-			pe_warn(pe, "No DMA32 resources available\n");
-			continue;
-		}
-		segs = 1;
-		if (residual) {
-			segs += ((pe->dma_weight * residual)  + (tw / 2)) / tw;
-			if (segs > remaining)
-				segs = remaining;
-		}
+	if (phb->type == PNV_PHB_IODA1)
+		pnv_pci_ioda1_setup_dma_pe(phb, pe);
+	else if (phb->type == PNV_PHB_IODA2)
+		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+}
 
-		/*
-		 * For IODA2 compliant PHB3, we needn't care about the weight.
-		 * The all available 32-bits DMA space will be assigned to
-		 * the specific PE.
-		 */
-		if (phb->type == PNV_PHB_IODA1) {
-			pe_info(pe, "DMA weight %d, assigned %d DMA32 segments\n",
-				pe->dma_weight, segs);
-			pnv_pci_ioda_setup_dma_pe(phb, pe, base, segs);
-		} else {
-			pe_info(pe, "Assign DMA32 space\n");
-			segs = 0;
-			pnv_pci_ioda2_setup_dma_pe(phb, pe);
-		}
+static void pnv_ioda_setup_dma(struct pnv_phb *phb)
+{
+	struct pnv_ioda_pe *pe;
 
-		remaining -= segs;
-		base += segs;
-	}
+	list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
+		pnv_pci_ioda_setup_dma_pe(phb, pe);
 }
 
 #ifdef CONFIG_PCI_MSI
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index f604bb7..2784951 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -58,15 +58,11 @@ struct pnv_ioda_pe {
 	unsigned long		m32_segmap[8];
 	unsigned long		m64_segmap[8];
 
-	/* "Weight" assigned to the PE for the sake of DMA resource
-	 * allocations
-	 */
-	unsigned int		dma_weight;
-
-	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
-	int			tce32_seg;
-	int			tce32_segcount;
+	/* 32-bits DMA */
 	struct iommu_table	*tce32_table;
+	unsigned int		dma_weight;
+	unsigned int		tce32_seg_start;
+	unsigned int		tce32_seg_end;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
@@ -183,17 +179,9 @@ struct pnv_phb {
 			unsigned char		pe_rmap[0x10000];
 
 			/* 32-bit TCE tables allocation */
-			unsigned long		tce32_count;
-
-			/* Total "weight" for the sake of DMA resources
-			 * allocation
-			 */
 			unsigned int		dma_weight;
-			unsigned int		dma_pe_count;
-
-			/* Sorted list of used PE's, sorted at
-			 * boot for resource allocation purposes
-			 */
+			unsigned long		tce32_count;
+			unsigned long		tce32_segmap[8];
 			struct list_head	pe_dma_list;
 		} ioda;
 	};
-- 
2.1.0


* [PATCH v4 05/21] powerpc/powernv: Improve DMA32 segment assignment
@ 2015-05-01  6:02   ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: bhelgaas, linux-pci, Gavin Shan

For P7IOC, the whole available DMA32 space, which is below the
MEM32 space, is evenly divided into 256MB segments. How many
continuous segments assigned to one particular PE depends on
the PE's DMA weight that is figured out from the type of each
PCI devices contained in the PE, and PHB's DMA weight which is
accumulative DMA weight of PEs contained in the PHB. It means
that the PHB's DMA weight calculation depends on existing PEs,
which works perfectly now, but not hotplug friendly. As the
whole available DMA32 space can be assigned to one PE on PHB3,
so we don't have the issue on PHB3.

The patch improves DMA32 segment assignment by removing the
dependency of existing PEs to make the piece of logic friendly
to PCI hotplug. Besides, it's always worthy to trace the DMA32
segments consumed by one PE, which can be released at PCI
unplugging time.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 204 ++++++++++++++++--------------
 arch/powerpc/platforms/powernv/pci.h      |  24 +---
 2 files changed, 116 insertions(+), 112 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7e6e266..9ef745e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -976,8 +976,11 @@ static void pnv_ioda_link_pe_by_weight(struct pnv_phb *phb,
 	list_add_tail(&pe->dma_link, &phb->ioda.pe_dma_list);
 }
 
-static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
+static unsigned int pnv_ioda_dev_dma_weight(struct pci_dev *dev)
 {
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+	struct pnv_phb *phb = hose->private_data;
+
 	/* This is quite simplistic. The "base" weight of a device
 	 * is 10. 0 means no DMA is to be accounted for it.
 	 */
@@ -990,14 +993,33 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
 	if (dev->class == PCI_CLASS_SERIAL_USB_UHCI ||
 	    dev->class == PCI_CLASS_SERIAL_USB_OHCI ||
 	    dev->class == PCI_CLASS_SERIAL_USB_EHCI)
-		return 3;
+		return 3 * phb->ioda.tce32_count;
 
 	/* Increase the weight of RAID (includes Obsidian) */
 	if ((dev->class >> 8) == PCI_CLASS_STORAGE_RAID)
-		return 15;
+		return 15 * phb->ioda.tce32_count;
 
 	/* Default */
-	return 10;
+	return 10 * phb->ioda.tce32_count;
+}
+
+static int __pnv_ioda_phb_dma_weight(struct pci_dev *pdev, void *data)
+{
+	unsigned int *dma_weight = data;
+
+	*dma_weight += pnv_ioda_dev_dma_weight(pdev);
+	return 0;
+}
+
+static void pnv_ioda_phb_dma_weight(struct pnv_phb *phb)
+{
+	phb->ioda.dma_weight = 0;
+	if (!phb->hose->bus)
+		return;
+
+	pci_walk_bus(phb->hose->bus,
+		     __pnv_ioda_phb_dma_weight,
+		     &phb->ioda.dma_weight);
 }
 
 #ifdef CONFIG_PCI_IOV
@@ -1156,7 +1178,7 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 			continue;
 		}
 		pdn->pe_number = pe->pe_number;
-		pe->dma_weight += pnv_ioda_dma_weight(dev);
+		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
 			pnv_ioda_setup_same_PE(dev->subordinate, pe);
 	}
@@ -1193,7 +1215,6 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
 	pe->pbus = bus;
 	pe->pdev = NULL;
-	pe->tce32_seg = -1;
 	pe->mve_number = -1;
 	pe->rid = bus->busn_res.start << 8;
 	pe->dma_weight = 0;
@@ -1223,14 +1244,6 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	/* Put PE to the list */
 	list_add_tail(&pe->list, &phb->ioda.pe_list);
 
-	/* Account for one DMA PE if at least one DMA capable device exist
-	 * below the bridge
-	 */
-	if (pe->dma_weight != 0) {
-		phb->ioda.dma_weight += pe->dma_weight;
-		phb->ioda.dma_pe_count++;
-	}
-
 	/* Link the PE */
 	pnv_ioda_link_pe_by_weight(phb, pe);
 }
@@ -1569,7 +1582,6 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		pe->flags = PNV_IODA_PE_VF;
 		pe->pbus = NULL;
 		pe->parent_dev = pdev;
-		pe->tce32_seg = -1;
 		pe->mve_number = -1;
 		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
 			   pci_iov_virtfn_devfn(pdev, vf_index);
@@ -1890,28 +1902,70 @@ void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
 		pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
 }
 
-static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
-				      struct pnv_ioda_pe *pe, unsigned int base,
-				      unsigned int segs)
+static int pnv_pci_ioda1_dma_segment_alloc(struct pnv_phb *phb,
+					   struct pnv_ioda_pe *pe)
+{
+	unsigned int weight, base, segs;
+
+	/* We shouldn't already have 32-bits DMA associated */
+	if (WARN_ON(pe->tce32_seg_start || pe->tce32_seg_end))
+		return -EEXIST;
+
+	/* Needn't setup TCE table for non-DMA capable PE */
+	weight = pe->dma_weight;
+	if (!weight)
+		return -ENODEV;
+
+	/* Calculate the DMA segments that PE needs. It's guaranteed
+	 * that the PE will have one segment at least.
+	 */
+	if (weight < phb->ioda.dma_weight / phb->ioda.tce32_count)
+		weight = phb->ioda.dma_weight / phb->ioda.tce32_count;
+	segs = (weight * phb->ioda.tce32_count) / phb->ioda.dma_weight;
+
+	/* Reserve the DMA segments with back-off way, which should
+	 * give us one segment at least.
+	 */
+	do {
+		base = bitmap_find_next_zero_area(phb->ioda.tce32_segmap,
+						  phb->ioda.tce32_count,
+						  0, segs, 0);
+		if (base < phb->ioda.tce32_count)
+			bitmap_set(phb->ioda.tce32_segmap, base, segs);
+	} while ((base > phb->ioda.tce32_count) && (--segs));
+
+	/* There are possibly no DMA32 segments */
+	if (!segs)
+		return -ENODEV;
+
+	pe->tce32_seg_start = base;
+	pe->tce32_seg_end = base + segs;
+	return 0;
+}
+
+static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
+				       struct pnv_ioda_pe *pe)
 {
 
 	struct page *tce_mem = NULL;
 	const __be64 *swinvp;
 	struct iommu_table *tbl;
-	unsigned int i;
-	int64_t rc;
 	void *addr;
+	unsigned int base, segs, i;
+	int64_t rc;
 
 	/* XXX FIXME: Handle 64-bit only DMA devices */
 	/* XXX FIXME: Provide 64-bit DMA facilities & non-4K TCE tables etc.. */
-	/* XXX FIXME: Allocate multi-level tables on PHB3 */
 
-	/* We shouldn't already have a 32-bit DMA associated */
-	if (WARN_ON(pe->tce32_seg >= 0))
+	/* Allocate TCE32 segments */
+	if (pnv_pci_ioda1_dma_segment_alloc(phb, pe)) {
+		pe_err(pe, " Cannot setting up 32-bits TCE table\n");
 		return;
+	}
 
-	/* Grab a 32-bit TCE table */
-	pe->tce32_seg = base;
+	/* Build a 32-bits TCE table */
+	base = pe->tce32_seg_start;
+	segs = pe->tce32_seg_end - pe->tce32_seg_start;
 	pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n",
 		(base << 28), ((base + segs) << 28) - 1);
 
@@ -1943,6 +1997,10 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		}
 	}
 
+	/* Print some info */
+	pe_info(pe, "DMA weight %d, assigned (%d %d) DMA32 segments\n",
+		pe->dma_weight, base, segs);
+
 	/* Setup linux iommu table */
 	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
@@ -1981,8 +2039,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	return;
  fail:
 	/* XXX Failure: Try to fallback to 64-bit only ? */
-	if (pe->tce32_seg >= 0)
-		pe->tce32_seg = -1;
+	bitmap_clear(phb->ioda.tce32_segmap, base, segs);
+	pe->tce32_seg_start = pe->tce32_seg_end = 0;
 	if (tce_mem)
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
@@ -2051,11 +2109,10 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	int64_t rc;
 
 	/* We shouldn't already have a 32-bit DMA associated */
-	if (WARN_ON(pe->tce32_seg >= 0))
+	if (WARN_ON(pe->tce32_seg_end > pe->tce32_seg_start))
 		return;
 
 	/* The PE will reserve all possible 32-bits space */
-	pe->tce32_seg = 0;
 	end = (1 << ilog2(phb->ioda.m32_pci_base));
 	tce_table_size = (end / 0x1000) * 8;
 	pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
@@ -2066,7 +2123,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				   get_order(tce_table_size));
 	if (!tce_mem) {
 		pe_err(pe, "Failed to allocate a 32-bit TCE memory\n");
-		goto fail;
+		return;
 	}
 	addr = page_address(tce_mem);
 	memset(addr, 0, tce_table_size);
@@ -2079,11 +2136,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 					pe->pe_number << 1, 1, __pa(addr),
 					tce_table_size, 0x1000);
 	if (rc) {
+		__free_pages(tce_mem, get_order(tce_table_size));
 		pe_err(pe, "Failed to configure 32-bit TCE table,"
 		       " err %ld\n", rc);
-		goto fail;
+		return;
 	}
 
+	/* Print some info */
+	pe->tce32_seg_end = pe->tce32_seg_start + 1;
+	pe_info(pe, "Assigned DMA32 space\n");
+
 	/* Setup linux iommu table */
 	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
@@ -2120,76 +2182,30 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	/* Also create a bypass window */
 	if (!pnv_iommu_bypass_disabled)
 		pnv_pci_ioda2_setup_bypass_pe(phb, pe);
-
-	return;
-fail:
-	if (pe->tce32_seg >= 0)
-		pe->tce32_seg = -1;
-	if (tce_mem)
-		__free_pages(tce_mem, get_order(tce_table_size));
 }
 
-static void pnv_ioda_setup_dma(struct pnv_phb *phb)
+void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
+			       struct pnv_ioda_pe *pe)
 {
-	struct pci_controller *hose = phb->hose;
-	unsigned int residual, remaining, segs, tw, base;
-	struct pnv_ioda_pe *pe;
-
-	/* If we have more PE# than segments available, hand out one
-	 * per PE until we run out and let the rest fail. If not,
-	 * then we assign at least one segment per PE, plus more based
-	 * on the amount of devices under that PE
+	/* Recalculate the PHB's total DMA weight, which depends on
+	 * PCI devices. That means the PCI devices beneath the PHB
+	 * should have been probed successfully. Otherwise, the
+	 * calculated PHB's DMA weight won't be accurate.
 	 */
-	if (phb->ioda.dma_pe_count > phb->ioda.tce32_count)
-		residual = 0;
-	else
-		residual = phb->ioda.tce32_count -
-			phb->ioda.dma_pe_count;
+	pnv_ioda_phb_dma_weight(phb);
 
-	pr_info("PCI: Domain %04x has %ld available 32-bit DMA segments\n",
-		hose->global_number, phb->ioda.tce32_count);
-	pr_info("PCI: %d PE# for a total weight of %d\n",
-		phb->ioda.dma_pe_count, phb->ioda.dma_weight);
-
-	/* Walk our PE list and configure their DMA segments, hand them
-	 * out one base segment plus any residual segments based on
-	 * weight
-	 */
-	remaining = phb->ioda.tce32_count;
-	tw = phb->ioda.dma_weight;
-	base = 0;
-	list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link) {
-		if (!pe->dma_weight)
-			continue;
-		if (!remaining) {
-			pe_warn(pe, "No DMA32 resources available\n");
-			continue;
-		}
-		segs = 1;
-		if (residual) {
-			segs += ((pe->dma_weight * residual)  + (tw / 2)) / tw;
-			if (segs > remaining)
-				segs = remaining;
-		}
+	if (phb->type == PNV_PHB_IODA1)
+		pnv_pci_ioda1_setup_dma_pe(phb, pe);
+	else if (phb->type == PNV_PHB_IODA2)
+		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+}
 
-		/*
-		 * For IODA2 compliant PHB3, we needn't care about the weight.
-		 * The all available 32-bits DMA space will be assigned to
-		 * the specific PE.
-		 */
-		if (phb->type == PNV_PHB_IODA1) {
-			pe_info(pe, "DMA weight %d, assigned %d DMA32 segments\n",
-				pe->dma_weight, segs);
-			pnv_pci_ioda_setup_dma_pe(phb, pe, base, segs);
-		} else {
-			pe_info(pe, "Assign DMA32 space\n");
-			segs = 0;
-			pnv_pci_ioda2_setup_dma_pe(phb, pe);
-		}
+static void pnv_ioda_setup_dma(struct pnv_phb *phb)
+{
+	struct pnv_ioda_pe *pe;
 
-		remaining -= segs;
-		base += segs;
-	}
+	list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
+		pnv_pci_ioda_setup_dma_pe(phb, pe);
 }
 
 #ifdef CONFIG_PCI_MSI
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index f604bb7..2784951 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -58,15 +58,11 @@ struct pnv_ioda_pe {
 	unsigned long		m32_segmap[8];
 	unsigned long		m64_segmap[8];
 
-	/* "Weight" assigned to the PE for the sake of DMA resource
-	 * allocations
-	 */
-	unsigned int		dma_weight;
-
-	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
-	int			tce32_seg;
-	int			tce32_segcount;
+	/* 32-bits DMA */
 	struct iommu_table	*tce32_table;
+	unsigned int		dma_weight;
+	unsigned int		tce32_seg_start;
+	unsigned int		tce32_seg_end;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
@@ -183,17 +179,9 @@ struct pnv_phb {
 			unsigned char		pe_rmap[0x10000];
 
 			/* 32-bit TCE tables allocation */
-			unsigned long		tce32_count;
-
-			/* Total "weight" for the sake of DMA resources
-			 * allocation
-			 */
 			unsigned int		dma_weight;
-			unsigned int		dma_pe_count;
-
-			/* Sorted list of used PE's, sorted at
-			 * boot for resource allocation purposes
-			 */
+			unsigned long		tce32_count;
+			unsigned long		tce32_segmap[8];
 			struct list_head	pe_dma_list;
 		} ioda;
 	};
-- 
2.1.0

* [PATCH v4 06/21] powerpc/powernv: Create PEs dynamically
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

Currently, the PEs and their associated resources are assigned in
ppc_md.pcibios_fixup(). That function is called only once, after PCI
probing and resource assignment have finished, so it is not hotplug
friendly. This patch creates PEs dynamically from the
pci_controller_ops.setup_bridge() hook, which pcibios_setup_bridge()
invokes during both system boot and PCI hotplug whenever a PCI bridge's
windows are updated after resource assignment or reassignment. In the
partial hotplug case, where not all PCI devices belonging to the PE are
unplugged and plugged again, we only need to unbind and rebind the
affected PCI devices to the existing PE, without creating a new one.

Besides, unplugging the current adapter from a PCI slot and inserting
a different one might require additional resources (e.g. M32) in the
windows of the PCI bridge. The slot is assumed to be behind the root
port, or behind a downstream port of the PCIe switch behind the root
port. The parent bridge of the newly plugged adapter would reject the
request for more resources, leading to hotplug failure. To avoid this,
the patch extends the windows of the root port, or of the upstream
port of the PCIe switch behind the root port, to the PHB's windows
when pcibios_setup_bridge() is called.

There is no upstream bridge for the root bus, so we reserve a PE# for
it in advance (the last supported PE#, or the one next to the reserved
PE#) and set up the PE for the root bus in pcibios_setup_bridge().

The patch also changes the rule for assigning PE#: PE# reserved for
prefetchable 64-bit memory resources and SR-IOV VFs start from zero,
while PE# for dynamic allocation start from (ioda.total_pe - 1) and go
downwards. This is because prefetchable 64-bit memory resources are
usually allocated from the beginning of the PHB's aperture, and there
is a fixed mapping between PE# and those resources. The PE# for dynamic
allocation is flexible and has no such limitation.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |   1 +
 arch/powerpc/kernel/pci-common.c          |  10 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 307 ++++++++++++++++++++----------
 arch/powerpc/platforms/powernv/pci.h      |   4 +-
 4 files changed, 220 insertions(+), 102 deletions(-)
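
For readers following the PE# numbering rule described in the commit
message, here is a minimal userspace sketch of top-down allocation from
a bitmap. It is an illustration only, not the kernel code in this patch:
pe_reserve(), alloc_pe_top_down() and INVALID_PE are hypothetical names,
and the plain loop stands in for the kernel's bitmap helpers.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical names for illustration; not taken from the kernel. */
#define INVALID_PE	(-1)
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

static int pe_is_used(const unsigned long *map, unsigned int pe)
{
	return (int)((map[pe / BITS_PER_WORD] >> (pe % BITS_PER_WORD)) & 1UL);
}

/* Mark one specific PE# as taken, e.g. the reserved PE or root bus PE. */
static void pe_reserve(unsigned long *map, unsigned int pe)
{
	map[pe / BITS_PER_WORD] |= 1UL << (pe % BITS_PER_WORD);
}

/*
 * Dynamic allocation: scan from (total_pe - 1) down to 0 and take the
 * first free PE#, so that low PE# stay available for resources that
 * have a fixed mapping to them.
 */
static int alloc_pe_top_down(unsigned long *map, unsigned int total_pe)
{
	unsigned int pe = total_pe;

	while (pe-- > 0) {
		if (!pe_is_used(map, pe)) {
			pe_reserve(map, pe);
			return (int)pe;
		}
	}

	return INVALID_PE;
}
```

With 8 PEs and PE#0 and PE#7 already reserved, successive calls hand out
PE#6, PE#5 and so on, and return INVALID_PE once the map is full.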

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 1811c44..5367eb3 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -29,6 +29,7 @@ struct pci_controller_ops {
 
 	/* Called during PCI resource reassignment */
 	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
+	void		(*setup_bridge)(struct pci_bus *, unsigned long);
 	void		(*reset_secondary_bus)(struct pci_dev *dev);
 };
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 0d05406..01d2a84 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -134,6 +134,16 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
 	pci_reset_secondary_bus(dev);
 }
 
+void pcibios_setup_bridge(struct pci_bus *bus, unsigned long type)
+{
+	struct pci_controller *hose = pci_bus_to_host(bus);
+
+	if (hose->controller_ops.setup_bridge)
+		hose->controller_ops.setup_bridge(bus, type);
+	else
+		pci_setup_bridge_resources(bus, type);
+}
+
 #ifdef CONFIG_PCI_IOV
 resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev, int resno)
 {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9ef745e..910fb67 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -143,18 +143,23 @@ static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
 
 static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
-	unsigned long pe;
+	unsigned long pe_no;
+	unsigned long limit = phb->ioda.total_pe - 1;
 
 	do {
-		pe = find_next_zero_bit(phb->ioda.pe_alloc,
-					phb->ioda.total_pe, 0);
-		if (pe >= phb->ioda.total_pe)
+		pe_no = find_next_zero_bit(phb->ioda.pe_alloc,
+					   phb->ioda.total_pe, limit);
+		if (pe_no < phb->ioda.total_pe &&
+		    !test_and_set_bit(pe_no, phb->ioda.pe_alloc))
+			break;
+
+		if (--limit >= phb->ioda.total_pe)
 			return IODA_INVALID_PE;
-	} while(test_and_set_bit(pe, phb->ioda.pe_alloc));
+	} while(1);
 
-	phb->ioda.pe_array[pe].phb = phb;
-	phb->ioda.pe_array[pe].pe_number = pe;
-	return pe;
+	phb->ioda.pe_array[pe_no].phb = phb;
+	phb->ioda.pe_array[pe_no].pe_number = pe_no;
+	return pe_no;
 }
 
 static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
@@ -214,6 +219,13 @@ static int pnv_ioda1_init_m64(struct pnv_phb *phb)
 		pr_warn("  Cannot strip M64 segment for reserved PE#%d\n",
 			phb->ioda.reserved_pe);
 
+	/* Strip off the segment used by PE for PCI root bus,
+	 * which is last supported PE#, or one next to the
+	 * reserved PE#
+	 */
+	if (phb->ioda.root_pe_no != IODA_INVALID_PE)
+		r->end -= phb->ioda.m64_segsize;
+
 	return 0;
 
 fail:
@@ -264,13 +276,24 @@ static int pnv_ioda2_init_m64(struct pnv_phb *phb)
 	 */
 	r = &phb->hose->mem_resources[1];
 	if (phb->ioda.reserved_pe == 0)
-		r->start += phb->ioda.m64_segsize;
+		r->start += (phb->ioda.root_pe_no != IODA_INVALID_PE ?
+			     phb->ioda.m64_segsize * 2 :
+			     phb->ioda.m64_segsize);
 	else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1))
-		r->end -= phb->ioda.m64_segsize;
+		r->end -= (phb->ioda.root_pe_no != IODA_INVALID_PE ?
+			   phb->ioda.m64_segsize * 2 :
+			   phb->ioda.m64_segsize);
 	else
 		pr_warn("  Cannot strip M64 segment for reserved PE#%d\n",
 			phb->ioda.reserved_pe);
 
+	/* Strip off the segment used by PE for PCI root bus,
+	 * which is last supported PE#, or one next to the
+	 * reserved PE#
+	 */
+	if (phb->ioda.root_pe_no != IODA_INVALID_PE)
+		r->end -= phb->ioda.m64_segsize;
+
 	return 0;
 
 fail:
@@ -837,7 +860,7 @@ static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 
 	/* Clear the reverse map */
 	for (rid = pe->rid; rid < rid_end; rid++)
-		phb->ioda.pe_rmap[rid] = 0;
+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
 
 	/* Release from all parents PELT-V */
 	while (parent) {
@@ -1172,11 +1195,18 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		struct pci_dn *pdn = pci_get_pdn(dev);
 
-		if (pdn == NULL) {
-			pr_warn("%s: No device node associated with device !\n",
-				pci_name(dev));
+		if (!pdn) {
+			dev_warn(&dev->dev, "%s: No associated PCI data\n",
+				 __func__);
 			continue;
 		}
+
+		/* The PCI device might have been associated with the PE in
+		 * case of partial hotplug.
+		 */
+		if (pdn->pe_number != IODA_INVALID_PE)
+			continue;
+
 		pdn->pe_number = pe->pe_number;
 		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
@@ -1190,15 +1220,31 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
  * subordinate PCI devices and buses. The second type of PE is normally
  * orgiriated by PCIe-to-PCI bridge or PLX switch downstream ports.
  */
-static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
+static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 {
 	struct pci_controller *hose = pci_bus_to_host(bus);
 	struct pnv_phb *phb = hose->private_data;
 	struct pnv_ioda_pe *pe;
 	int pe_num = IODA_INVALID_PE;
 
+	/* For partial hotplug case, the PE instance hasn't been destroyed
+	 * yet. We shouldn't allocate a new one and assign resources to
+	 * it. The existing PE instance should be reused, but we should
+	 * associate the devices to the PE.
+	 */
+	pe_num = phb->ioda.pe_rmap[bus->number << 8];
+	if (pe_num != IODA_INVALID_PE) {
+		pe = &phb->ioda.pe_array[pe_num];
+		pnv_ioda_setup_same_PE(bus, pe);
+		return NULL;
+	}
+
+	/* PE number for root bus should have been reserved */
+	if (pci_is_root_bus(bus))
+		pe_num = phb->ioda.root_pe_no;
+
 	/* Check if PE is determined by M64 */
-	if (phb->pick_m64_pe)
+	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
 		pe_num = phb->pick_m64_pe(phb, bus, all);
 
 	/* The PE number isn't pinned by M64 */
@@ -1208,7 +1254,7 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	if (pe_num == IODA_INVALID_PE) {
 		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
 			__func__, pci_domain_nr(bus), bus->number);
-		return;
+		return NULL;
 	}
 
 	pe = &phb->ioda.pe_array[pe_num];
@@ -1220,18 +1266,18 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	pe->dma_weight = 0;
 
 	if (all)
-		pe_info(pe, "Secondary bus %d..%d associated with PE#%d\n",
-			bus->busn_res.start, bus->busn_res.end, pe_num);
+		pe_info(pe, "Secondary bus %d..%d associated\n",
+			bus->busn_res.start, bus->busn_res.end);
 	else
-		pe_info(pe, "Secondary bus %d associated with PE#%d\n",
-			bus->busn_res.start, pe_num);
+		pe_info(pe, "Secondary bus %d associated\n",
+			bus->busn_res.start);
 
 	if (pnv_ioda_configure_pe(phb, pe)) {
 		/* XXX What do we do here ? */
 		if (pe_num)
 			pnv_ioda_free_pe(phb, pe_num);
 		pe->pbus = NULL;
-		return;
+		return NULL;
 	}
 
 	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
@@ -1246,46 +1292,8 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 
 	/* Link the PE */
 	pnv_ioda_link_pe_by_weight(phb, pe);
-}
-
-static void pnv_ioda_setup_PEs(struct pci_bus *bus)
-{
-	struct pci_dev *dev;
-
-	pnv_ioda_setup_bus_PE(bus, 0);
 
-	list_for_each_entry(dev, &bus->devices, bus_list) {
-		if (dev->subordinate) {
-			if (pci_pcie_type(dev) == PCI_EXP_TYPE_PCI_BRIDGE)
-				pnv_ioda_setup_bus_PE(dev->subordinate, 1);
-			else
-				pnv_ioda_setup_PEs(dev->subordinate);
-		}
-	}
-}
-
-/*
- * Configure PEs so that the downstream PCI buses and devices
- * could have their associated PE#. Unfortunately, we didn't
- * figure out the way to identify the PLX bridge yet. So we
- * simply put the PCI bus and the subordinate behind the root
- * port to PE# here. The game rule here is expected to be changed
- * as soon as we can detected PLX bridge correctly.
- */
-static void pnv_pci_ioda_setup_PEs(void)
-{
-	struct pci_controller *hose, *tmp;
-	struct pnv_phb *phb;
-
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
-		phb = hose->private_data;
-
-		/* M64 layout might affect PE allocation */
-		if (phb->reserve_m64_pe)
-			phb->reserve_m64_pe(phb, phb->hose->bus);
-
-		pnv_ioda_setup_PEs(hose->bus);
-	}
+	return pe;
 }
 
 #ifdef CONFIG_PCI_IOV
@@ -2200,14 +2208,6 @@ void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
 }
 
-static void pnv_ioda_setup_dma(struct pnv_phb *phb)
-{
-	struct pnv_ioda_pe *pe;
-
-	list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
-		pnv_pci_ioda_setup_dma_pe(phb, pe);
-}
-
 #ifdef CONFIG_PCI_MSI
 static void pnv_ioda2_msi_eoi(struct irq_data *d)
 {
@@ -2649,34 +2649,6 @@ static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
 	}
 }
 
-static void pnv_pci_ioda_setup_seg(void)
-{
-	struct pci_controller *tmp, *hose;
-	struct pnv_phb *phb;
-	struct pnv_ioda_pe *pe;
-
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
-		phb = hose->private_data;
-		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
-			pnv_ioda_setup_pe_seg(hose, pe);
-		}
-	}
-}
-
-static void pnv_pci_ioda_setup_DMA(void)
-{
-	struct pci_controller *hose, *tmp;
-	struct pnv_phb *phb;
-
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
-		pnv_ioda_setup_dma(hose->private_data);
-
-		/* Mark the PHB initialization done */
-		phb = hose->private_data;
-		phb->initialized = 1;
-	}
-}
-
 static void pnv_pci_ioda_create_dbgfs(void)
 {
 #ifdef CONFIG_DEBUG_FS
@@ -2698,9 +2670,14 @@ static void pnv_pci_ioda_create_dbgfs(void)
 
 static void pnv_pci_ioda_fixup(void)
 {
-	pnv_pci_ioda_setup_PEs();
-	pnv_pci_ioda_setup_seg();
-	pnv_pci_ioda_setup_DMA();
+	struct pci_controller *tmp, *hose;
+	struct pnv_phb *phb;
+
+	/* Notify initialization of PHB done */
+	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
+		phb = hose->private_data;
+		phb->initialized = 1;
+	}
 
 	pnv_pci_ioda_create_dbgfs();
 
@@ -2751,6 +2728,115 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
 	return phb->ioda.io_segsize;
 }
 
+/*
+ * We are updating root port or the upstream bridge behind the root
+ * port with PHB's various windows in order to accommodate the changes
+ * on required resources during PCI (slot) hotplug, which is connected
+ * to either root port, or the downstream ports of PCIe switch behind
+ * the root port.
+ */
+static void pnv_pci_fixup_bridge_resources(struct pci_bus *bus,
+					   unsigned long type)
+{
+	struct pci_controller *hose = pci_bus_to_host(bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dev *bridge = bus->self;
+	struct resource *r, *w;
+	int i;
+
+	/* Check if we need to apply fixup to the bridge's resources */
+	if (!pci_is_root_bus(bridge->bus) &&
+	    !pci_is_root_bus(bridge->bus->self->bus)) {
+		pci_setup_bridge_resources(bus, type);
+		return;
+	}
+
+	/* Fix up the resources */
+	for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
+		r = &bridge->resource[PCI_BRIDGE_RESOURCES + i];
+		if (!r->flags || !r->parent)
+			continue;
+
+		w = NULL;
+		if (r->flags & type & IORESOURCE_IO)
+			w = &hose->io_resource;
+		else if (pnv_pci_is_mem_pref_64(r->flags) &&
+			 (type & IORESOURCE_PREFETCH) &&
+			 phb->ioda.m64_segsize)
+			w = &hose->mem_resources[1];
+		else if (r->flags & type & IORESOURCE_MEM)
+			w = &hose->mem_resources[0];
+
+		r->start = w->start;
+		r->end = w->end;
+	}
+
+	/* Update the resources */
+	pci_setup_bridge_resources(bus, type);
+}
+
+static void pnv_pci_setup_bridge(struct pci_bus *bus,
+				 unsigned long type)
+{
+	struct pci_controller *hose = pci_bus_to_host(bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dev *bridge = bus->self;
+	struct pci_dev *parent;
+	struct pnv_ioda_pe *pe;
+
+	/* The PCI bus might be behind a PCIE-to-PCI bridge. For that
+	 * case, the PCI bus should have been included to one PE. So
+	 * we needn't assign PE for it again.
+	 */
+	parent = bridge->bus ? bridge->bus->self : NULL;
+	while (parent) {
+		if (pci_pcie_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE)
+			return;
+
+		parent = parent->bus ? parent->bus->self : NULL;
+	}
+
+	/* Assign PE to root bus, which would be the parent PE and
+	 * should be populated prior to any other PEs.
+	 */
+	if (!phb->ioda.root_pe_populated) {
+		pe = pnv_ioda_setup_bus_PE(phb->hose->bus, 0);
+		if (pe && phb->ioda.root_pe_no == IODA_INVALID_PE)
+			phb->ioda.root_pe_no = pe->pe_number;
+		phb->ioda.root_pe_populated = 1;
+	}
+
+	/* Extend bridge's windows if necessary */
+	pnv_pci_fixup_bridge_resources(bus, type);
+
+	/* Don't assign PE to bus, which doesn't have any subordinate
+	 * PCI devices on it.
+	 */
+	if (list_empty(&bus->devices))
+		return;
+
+	/* Reserve PEs for M64 resource */
+	if (phb->reserve_m64_pe)
+		phb->reserve_m64_pe(phb, bus);
+
+	/* Assign PE. We might run here because of partial hotplug.
+	 * For the case, we just pick up the existing PE and should
+	 * not allocate resources again.
+	 */
+	if (pci_pcie_type(bridge) == PCI_EXP_TYPE_PCI_BRIDGE)
+		pe = pnv_ioda_setup_bus_PE(bus, 1);
+	else
+		pe = pnv_ioda_setup_bus_PE(bus, 0);
+	if (!pe)
+		return;
+
+	/* Setup MMIO mapping */
+	pnv_ioda_setup_pe_seg(hose, pe);
+
+	/* Setup DMA */
+	pnv_pci_ioda_setup_dma_pe(phb, pe);
+}
+
 #ifdef CONFIG_PCI_IOV
 static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
 						      int resno)
@@ -2901,7 +2987,22 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	aux = memblock_virt_alloc(size, 0);
 	phb->ioda.pe_alloc = aux;
 	phb->ioda.pe_array = aux + pemap_off;
-	set_bit(phb->ioda.reserved_pe, phb->ioda.pe_alloc);
+
+	/* Choose number of PE for root bus, which shouldn't consume
+	 * any M64 resource. So we avoid picking low-end PE#, which
+	 * is usually binding with 64-bits prefetchable memory resources
+	 * closely.
+	 */
+	pnv_ioda_reserve_pe(phb, phb->ioda.reserved_pe);
+	if (phb->ioda.reserved_pe == 0) {
+		phb->ioda.root_pe_no = phb->ioda.total_pe - 1;
+		pnv_ioda_reserve_pe(phb, phb->ioda.root_pe_no);
+	} else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1)) {
+		phb->ioda.root_pe_no = phb->ioda.reserved_pe - 1;
+		pnv_ioda_reserve_pe(phb, phb->ioda.root_pe_no);
+	} else {
+		phb->ioda.root_pe_no = IODA_INVALID_PE;
+	}
 
 	INIT_LIST_HEAD(&phb->ioda.pe_dma_list);
 	INIT_LIST_HEAD(&phb->ioda.pe_list);
@@ -2910,6 +3011,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	/* Calculate how many 32-bit TCE segments we have */
 	phb->ioda.tce32_count = phb->ioda.m32_pci_base >> 28;
 
+	/* Invalidate RID to PE# mapping */
+	memset(phb->ioda.pe_rmap, 0xff, sizeof(phb->ioda.pe_rmap));
+
 #if 0 /* We should really do that ... */
 	rc = opal_pci_set_phb_mem_window(opal->phb_id,
 					 window_type,
@@ -2958,6 +3062,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	 */
 	ppc_md.pcibios_fixup = pnv_pci_ioda_fixup;
 	pnv_pci_controller_ops.enable_device_hook = pnv_pci_enable_device_hook;
+	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
 	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
 	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
 	hose->controller_ops = pnv_pci_controller_ops;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 2784951..1bea3a8 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -134,6 +134,8 @@ struct pnv_phb {
 			/* Global bridge info */
 			unsigned int		total_pe;
 			unsigned int		reserved_pe;
+			unsigned int		root_pe_no;
+			unsigned int		root_pe_populated;
 
 			/* 32-bit MMIO window */
 			unsigned int		m32_size;
@@ -176,7 +178,7 @@ struct pnv_phb {
 			 * we are to support more than 256 PEs, indexed
 			 * bus { bus, devfn }
 			 */
-			unsigned char		pe_rmap[0x10000];
+			unsigned int		pe_rmap[0x10000];
 
 			/* 32-bit TCE tables allocation */
 			unsigned int		dma_weight;
-- 
2.1.0


* [PATCH v4 06/21] powerpc/powernv: Create PEs dynamically
@ 2015-05-01  6:02   ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: bhelgaas, linux-pci, Gavin Shan

Currently, the PEs and their associated resources are assigned in
ppc_md.pcibios_fixup(). That function is called only once, after PCI
probing and resource assignment have finished, so it is not hotplug
friendly. This patch creates PEs dynamically from the
pci_controller_ops.setup_bridge() hook, which pcibios_setup_bridge()
invokes during both system boot and PCI hotplug whenever a PCI bridge's
windows are updated after resource assignment or reassignment. In the
partial hotplug case, where not all PCI devices belonging to the PE are
unplugged and plugged again, we only need to unbind and rebind the
affected PCI devices to the existing PE, without creating a new one.

Besides, unplugging the current adapter from a PCI slot and inserting
a different one might require additional resources (e.g. M32) in the
windows of the PCI bridge. The slot is assumed to be behind the root
port, or behind a downstream port of the PCIe switch behind the root
port. The parent bridge of the newly plugged adapter would reject the
request for more resources, leading to hotplug failure. To avoid this,
the patch extends the windows of the root port, or of the upstream
port of the PCIe switch behind the root port, to the PHB's windows
when pcibios_setup_bridge() is called.

There is no upstream bridge for the root bus, so we reserve a PE# for
it in advance (the last supported PE#, or the one next to the reserved
PE#) and set up the PE for the root bus in pcibios_setup_bridge().

The patch also changes the rule for assigning PE#: PE# reserved for
prefetchable 64-bit memory resources and SR-IOV VFs start from zero,
while PE# for dynamic allocation start from (ioda.total_pe - 1) and go
downwards. This is because prefetchable 64-bit memory resources are
usually allocated from the beginning of the PHB's aperture, and there
is a fixed mapping between PE# and those resources. The PE# for dynamic
allocation is flexible and has no such limitation.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |   1 +
 arch/powerpc/kernel/pci-common.c          |  10 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 307 ++++++++++++++++++++----------
 arch/powerpc/platforms/powernv/pci.h      |   4 +-
 4 files changed, 220 insertions(+), 102 deletions(-)
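
The choice of the root bus PE# described in the commit message can be
sketched as a small standalone function. This is an illustration under
stated assumptions, not the kernel code: choose_root_pe() and
ROOT_PE_INVALID are hypothetical names, and the logic only mirrors the
three cases this patch adds to pnv_pci_init_ioda_phb().

```c
#include <assert.h>

/* Hypothetical sentinel for illustration; the kernel uses IODA_INVALID_PE. */
#define ROOT_PE_INVALID	0xffffffffu

/*
 * Pick the PE# for the root bus so that it sits next to the reserved
 * PE# at the far end of the range from the M64-bound PE#:
 *   reserved PE# is 0            -> root bus takes the last PE#
 *   reserved PE# is the last one -> root bus takes the one below it
 *   anything else                -> nothing is fixed in advance
 */
static unsigned int choose_root_pe(unsigned int reserved_pe,
				   unsigned int total_pe)
{
	assert(total_pe >= 2);

	if (reserved_pe == 0)
		return total_pe - 1;
	if (reserved_pe == total_pe - 1)
		return reserved_pe - 1;

	return ROOT_PE_INVALID;
}
```

In the third case the patch leaves root_pe_no invalid and fills it in
later, from whatever PE# pnv_ioda_setup_bus_PE() returns for the root
bus.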

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 1811c44..5367eb3 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -29,6 +29,7 @@ struct pci_controller_ops {
 
 	/* Called during PCI resource reassignment */
 	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
+	void		(*setup_bridge)(struct pci_bus *, unsigned long);
 	void		(*reset_secondary_bus)(struct pci_dev *dev);
 };
 
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 0d05406..01d2a84 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -134,6 +134,16 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
 	pci_reset_secondary_bus(dev);
 }
 
+void pcibios_setup_bridge(struct pci_bus *bus, unsigned long type)
+{
+	struct pci_controller *hose = pci_bus_to_host(bus);
+
+	if (hose->controller_ops.setup_bridge)
+		hose->controller_ops.setup_bridge(bus, type);
+	else
+		pci_setup_bridge_resources(bus, type);
+}
+
 #ifdef CONFIG_PCI_IOV
 resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev, int resno)
 {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9ef745e..910fb67 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -143,18 +143,23 @@ static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
 
 static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
-	unsigned long pe;
+	unsigned long pe_no;
+	unsigned long limit = phb->ioda.total_pe - 1;
 
 	do {
-		pe = find_next_zero_bit(phb->ioda.pe_alloc,
-					phb->ioda.total_pe, 0);
-		if (pe >= phb->ioda.total_pe)
+		pe_no = find_next_zero_bit(phb->ioda.pe_alloc,
+					   phb->ioda.total_pe, limit);
+		if (pe_no < phb->ioda.total_pe &&
+		    !test_and_set_bit(pe_no, phb->ioda.pe_alloc))
+			break;
+
+		if (--limit >= phb->ioda.total_pe)
 			return IODA_INVALID_PE;
-	} while(test_and_set_bit(pe, phb->ioda.pe_alloc));
+	} while(1);
 
-	phb->ioda.pe_array[pe].phb = phb;
-	phb->ioda.pe_array[pe].pe_number = pe;
-	return pe;
+	phb->ioda.pe_array[pe_no].phb = phb;
+	phb->ioda.pe_array[pe_no].pe_number = pe_no;
+	return pe_no;
 }
 
 static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
@@ -214,6 +219,13 @@ static int pnv_ioda1_init_m64(struct pnv_phb *phb)
 		pr_warn("  Cannot strip M64 segment for reserved PE#%d\n",
 			phb->ioda.reserved_pe);
 
+	/* Strip off the segment used by PE for PCI root bus,
+	 * which is last supported PE#, or one next to the
+	 * reserved PE#
+	 */
+	if (phb->ioda.root_pe_no != IODA_INVALID_PE)
+		r->end -= phb->ioda.m64_segsize;
+
 	return 0;
 
 fail:
@@ -264,13 +276,24 @@ static int pnv_ioda2_init_m64(struct pnv_phb *phb)
 	 */
 	r = &phb->hose->mem_resources[1];
 	if (phb->ioda.reserved_pe == 0)
-		r->start += phb->ioda.m64_segsize;
+		r->start += (phb->ioda.root_pe_no != IODA_INVALID_PE ?
+			     phb->ioda.m64_segsize * 2 :
+			     phb->ioda.m64_segsize);
 	else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1))
-		r->end -= phb->ioda.m64_segsize;
+		r->end -= (phb->ioda.root_pe_no != IODA_INVALID_PE ?
+			   phb->ioda.m64_segsize * 2 :
+			   phb->ioda.m64_segsize);
 	else
 		pr_warn("  Cannot strip M64 segment for reserved PE#%d\n",
 			phb->ioda.reserved_pe);
 
+	/* Strip off the segment used by PE for PCI root bus,
+	 * which is last supported PE#, or one next to the
+	 * reserved PE#
+	 */
+	if (phb->ioda.root_pe_no != IODA_INVALID_PE)
+		r->end -= phb->ioda.m64_segsize;
+
 	return 0;
 
 fail:
@@ -837,7 +860,7 @@ static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 
 	/* Clear the reverse map */
 	for (rid = pe->rid; rid < rid_end; rid++)
-		phb->ioda.pe_rmap[rid] = 0;
+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
 
 	/* Release from all parents PELT-V */
 	while (parent) {
@@ -1172,11 +1195,18 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		struct pci_dn *pdn = pci_get_pdn(dev);
 
-		if (pdn == NULL) {
-			pr_warn("%s: No device node associated with device !\n",
-				pci_name(dev));
+		if (!pdn) {
+			dev_warn(&dev->dev, "%s: No associated PCI data\n",
+				 __func__);
 			continue;
 		}
+
+		/* The PCI device might have been associated with the PE in
+		 * case of partial hotplug.
+		 */
+		if (pdn->pe_number != IODA_INVALID_PE)
+			continue;
+
 		pdn->pe_number = pe->pe_number;
 		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
@@ -1190,15 +1220,31 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
  * subordinate PCI devices and buses. The second type of PE is normally
  * orgiriated by PCIe-to-PCI bridge or PLX switch downstream ports.
  */
-static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
+static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 {
 	struct pci_controller *hose = pci_bus_to_host(bus);
 	struct pnv_phb *phb = hose->private_data;
 	struct pnv_ioda_pe *pe;
 	int pe_num = IODA_INVALID_PE;
 
+	/* For partial hotplug case, the PE instance hasn't been destroyed
+	 * yet. We shouldn't allocate a new one and assign resources to
+	 * it. The existing PE instance should be reused, but we should
+	 * associate the devices to the PE.
+	 */
+	pe_num = phb->ioda.pe_rmap[bus->number << 8];
+	if (pe_num != IODA_INVALID_PE) {
+		pe = &phb->ioda.pe_array[pe_num];
+		pnv_ioda_setup_same_PE(bus, pe);
+		return NULL;
+	}
+
+	/* PE number for root bus should have been reserved */
+	if (pci_is_root_bus(bus))
+		pe_num = phb->ioda.root_pe_no;
+
 	/* Check if PE is determined by M64 */
-	if (phb->pick_m64_pe)
+	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
 		pe_num = phb->pick_m64_pe(phb, bus, all);
 
 	/* The PE number isn't pinned by M64 */
@@ -1208,7 +1254,7 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	if (pe_num == IODA_INVALID_PE) {
 		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
 			__func__, pci_domain_nr(bus), bus->number);
-		return;
+		return NULL;
 	}
 
 	pe = &phb->ioda.pe_array[pe_num];
@@ -1220,18 +1266,18 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	pe->dma_weight = 0;
 
 	if (all)
-		pe_info(pe, "Secondary bus %d..%d associated with PE#%d\n",
-			bus->busn_res.start, bus->busn_res.end, pe_num);
+		pe_info(pe, "Secondary bus %d..%d associated\n",
+			bus->busn_res.start, bus->busn_res.end);
 	else
-		pe_info(pe, "Secondary bus %d associated with PE#%d\n",
-			bus->busn_res.start, pe_num);
+		pe_info(pe, "Secondary bus %d associated\n",
+			bus->busn_res.start);
 
 	if (pnv_ioda_configure_pe(phb, pe)) {
 		/* XXX What do we do here ? */
 		if (pe_num)
 			pnv_ioda_free_pe(phb, pe_num);
 		pe->pbus = NULL;
-		return;
+		return NULL;
 	}
 
 	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
@@ -1246,46 +1292,8 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 
 	/* Link the PE */
 	pnv_ioda_link_pe_by_weight(phb, pe);
-}
-
-static void pnv_ioda_setup_PEs(struct pci_bus *bus)
-{
-	struct pci_dev *dev;
-
-	pnv_ioda_setup_bus_PE(bus, 0);
 
-	list_for_each_entry(dev, &bus->devices, bus_list) {
-		if (dev->subordinate) {
-			if (pci_pcie_type(dev) == PCI_EXP_TYPE_PCI_BRIDGE)
-				pnv_ioda_setup_bus_PE(dev->subordinate, 1);
-			else
-				pnv_ioda_setup_PEs(dev->subordinate);
-		}
-	}
-}
-
-/*
- * Configure PEs so that the downstream PCI buses and devices
- * could have their associated PE#. Unfortunately, we didn't
- * figure out the way to identify the PLX bridge yet. So we
- * simply put the PCI bus and the subordinate behind the root
- * port to PE# here. The game rule here is expected to be changed
- * as soon as we can detected PLX bridge correctly.
- */
-static void pnv_pci_ioda_setup_PEs(void)
-{
-	struct pci_controller *hose, *tmp;
-	struct pnv_phb *phb;
-
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
-		phb = hose->private_data;
-
-		/* M64 layout might affect PE allocation */
-		if (phb->reserve_m64_pe)
-			phb->reserve_m64_pe(phb, phb->hose->bus);
-
-		pnv_ioda_setup_PEs(hose->bus);
-	}
+	return pe;
 }
 
 #ifdef CONFIG_PCI_IOV
@@ -2200,14 +2208,6 @@ void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
 }
 
-static void pnv_ioda_setup_dma(struct pnv_phb *phb)
-{
-	struct pnv_ioda_pe *pe;
-
-	list_for_each_entry(pe, &phb->ioda.pe_dma_list, dma_link)
-		pnv_pci_ioda_setup_dma_pe(phb, pe);
-}
-
 #ifdef CONFIG_PCI_MSI
 static void pnv_ioda2_msi_eoi(struct irq_data *d)
 {
@@ -2649,34 +2649,6 @@ static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
 	}
 }
 
-static void pnv_pci_ioda_setup_seg(void)
-{
-	struct pci_controller *tmp, *hose;
-	struct pnv_phb *phb;
-	struct pnv_ioda_pe *pe;
-
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
-		phb = hose->private_data;
-		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
-			pnv_ioda_setup_pe_seg(hose, pe);
-		}
-	}
-}
-
-static void pnv_pci_ioda_setup_DMA(void)
-{
-	struct pci_controller *hose, *tmp;
-	struct pnv_phb *phb;
-
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
-		pnv_ioda_setup_dma(hose->private_data);
-
-		/* Mark the PHB initialization done */
-		phb = hose->private_data;
-		phb->initialized = 1;
-	}
-}
-
 static void pnv_pci_ioda_create_dbgfs(void)
 {
 #ifdef CONFIG_DEBUG_FS
@@ -2698,9 +2670,14 @@ static void pnv_pci_ioda_create_dbgfs(void)
 
 static void pnv_pci_ioda_fixup(void)
 {
-	pnv_pci_ioda_setup_PEs();
-	pnv_pci_ioda_setup_seg();
-	pnv_pci_ioda_setup_DMA();
+	struct pci_controller *tmp, *hose;
+	struct pnv_phb *phb;
+
+	/* Notify initialization of PHB done */
+	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
+		phb = hose->private_data;
+		phb->initialized = 1;
+	}
 
 	pnv_pci_ioda_create_dbgfs();
 
@@ -2751,6 +2728,115 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
 	return phb->ioda.io_segsize;
 }
 
+/*
+ * We are updating root port or the upstream bridge behind the root
+ * port with PHB's various windows in order to accommodate the changes
+ * on required resources during PCI (slot) hotplug, which is connected
+ * to either root port, or the downstream ports of PCIe switch behind
+ * the root port.
+ */
+static void pnv_pci_fixup_bridge_resources(struct pci_bus *bus,
+					   unsigned long type)
+{
+	struct pci_controller *hose = pci_bus_to_host(bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dev *bridge = bus->self;
+	struct resource *r, *w;
+	int i;
+
+	/* Check if we need to apply fixup to the bridge's resources */
+	if (!pci_is_root_bus(bridge->bus) &&
+	    !pci_is_root_bus(bridge->bus->self->bus)) {
+		pci_setup_bridge_resources(bus, type);
+		return;
+	}
+
+	/* Fixup the resources */
+	for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
+		r = &bridge->resource[PCI_BRIDGE_RESOURCES + i];
+		if (!r->flags || !r->parent)
+			continue;
+
+		w = NULL;
+		if (r->flags & type & IORESOURCE_IO)
+			w = &hose->io_resource;
+		else if (pnv_pci_is_mem_pref_64(r->flags) &&
+			 (type & IORESOURCE_PREFETCH) &&
+			 phb->ioda.m64_segsize)
+			w = &hose->mem_resources[1];
+		else if (r->flags & type & IORESOURCE_MEM)
+			w = &hose->mem_resources[0];
+
+		r->start = w->start;
+		r->end = w->end;
+	}
+
+	/* Update the resources */
+	pci_setup_bridge_resources(bus, type);
+}
+
+static void pnv_pci_setup_bridge(struct pci_bus *bus,
+				 unsigned long type)
+{
+	struct pci_controller *hose = pci_bus_to_host(bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dev *bridge = bus->self;
+	struct pci_dev *parent;
+	struct pnv_ioda_pe *pe;
+
+	/* The PCI bus might be behind a PCIe-to-PCI bridge. In that
+	 * case, the PCI bus should already be included in one PE, so
+	 * we needn't assign a PE for it again.
+	 */
+	parent = bridge->bus ? bridge->bus->self : NULL;
+	while (parent) {
+		if (pci_pcie_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE)
+			return;
+
+		parent = parent->bus ? parent->bus->self : NULL;
+	}
+
+	/* Assign PE to root bus, which would be the parent PE and
+	 * should be populated prior to any other PEs.
+	 */
+	if (!phb->ioda.root_pe_populated) {
+		pe = pnv_ioda_setup_bus_PE(phb->hose->bus, 0);
+		if (pe && phb->ioda.root_pe_no == IODA_INVALID_PE)
+			phb->ioda.root_pe_no = pe->pe_number;
+		phb->ioda.root_pe_populated = 1;
+	}
+
+	/* Extend bridge's windows if necessary */
+	pnv_pci_fixup_bridge_resources(bus, type);
+
+	/* Don't assign PE to bus, which doesn't have any subordinate
+	 * PCI devices on it.
+	 */
+	if (list_empty(&bus->devices))
+		return;
+
+	/* Reserve PEs for M64 resource */
+	if (phb->reserve_m64_pe)
+		phb->reserve_m64_pe(phb, bus);
+
+	/* Assign PE. We might get here because of partial hotplug.
+	 * In that case, we just pick up the existing PE and should
+	 * not allocate resources again.
+	 */
+	if (pci_pcie_type(bridge) == PCI_EXP_TYPE_PCI_BRIDGE)
+		pe = pnv_ioda_setup_bus_PE(bus, 1);
+	else
+		pe = pnv_ioda_setup_bus_PE(bus, 0);
+	if (!pe)
+		return;
+
+	/* Setup MMIO mapping */
+	pnv_ioda_setup_pe_seg(hose, pe);
+
+	/* Setup DMA */
+	pnv_pci_ioda_setup_dma_pe(phb, pe);
+}
+
 #ifdef CONFIG_PCI_IOV
 static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
 						      int resno)
@@ -2901,7 +2987,22 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	aux = memblock_virt_alloc(size, 0);
 	phb->ioda.pe_alloc = aux;
 	phb->ioda.pe_array = aux + pemap_off;
-	set_bit(phb->ioda.reserved_pe, phb->ioda.pe_alloc);
+
+	/* Choose the PE number for the root bus, which shouldn't consume
+	 * any M64 resource. So we avoid picking a low-end PE#, which
+	 * is usually bound closely to 64-bit prefetchable memory
+	 * resources.
+	 */
+	pnv_ioda_reserve_pe(phb, phb->ioda.reserved_pe);
+	if (phb->ioda.reserved_pe == 0) {
+		phb->ioda.root_pe_no = phb->ioda.total_pe - 1;
+		pnv_ioda_reserve_pe(phb, phb->ioda.root_pe_no);
+	} else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1)) {
+		phb->ioda.root_pe_no = phb->ioda.reserved_pe - 1;
+		pnv_ioda_reserve_pe(phb, phb->ioda.root_pe_no);
+	} else {
+		phb->ioda.root_pe_no = IODA_INVALID_PE;
+	}
 
 	INIT_LIST_HEAD(&phb->ioda.pe_dma_list);
 	INIT_LIST_HEAD(&phb->ioda.pe_list);
@@ -2910,6 +3011,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	/* Calculate how many 32-bit TCE segments we have */
 	phb->ioda.tce32_count = phb->ioda.m32_pci_base >> 28;
 
+	/* Invalidate RID to PE# mapping */
+	memset(phb->ioda.pe_rmap, 0xff, sizeof(phb->ioda.pe_rmap));
+
 #if 0 /* We should really do that ... */
 	rc = opal_pci_set_phb_mem_window(opal->phb_id,
 					 window_type,
@@ -2958,6 +3062,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	 */
 	ppc_md.pcibios_fixup = pnv_pci_ioda_fixup;
 	pnv_pci_controller_ops.enable_device_hook = pnv_pci_enable_device_hook;
+	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
 	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
 	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
 	hose->controller_ops = pnv_pci_controller_ops;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 2784951..1bea3a8 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -134,6 +134,8 @@ struct pnv_phb {
 			/* Global bridge info */
 			unsigned int		total_pe;
 			unsigned int		reserved_pe;
+			unsigned int		root_pe_no;
+			unsigned int		root_pe_populated;
 
 			/* 32-bit MMIO window */
 			unsigned int		m32_size;
@@ -176,7 +178,7 @@ struct pnv_phb {
 			 * we are to support more than 256 PEs, indexed
 			 * bus { bus, devfn }
 			 */
-			unsigned char		pe_rmap[0x10000];
+			unsigned int		pe_rmap[0x10000];
 
 			/* 32-bit TCE tables allocation */
 			unsigned int		dma_weight;
-- 
2.1.0

* [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

The original code doesn't support releasing PEs dynamically, meaning
that a PE and its associated resources (IO, M32, M64 and DMA) can't
be released when a PCI adapter is unplugged from a hotpluggable slot.

The patch takes an object-oriented approach and introduces a reference
count to the PE, which is initialized to 1 and incremented by 1 when a
new PCI device joins the PE. Once the last PCI device leaves the
PE, the PE is released together with its associated
(IO, M32, M64, DMA) resources.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |   3 +
 arch/powerpc/kernel/pci-hotplug.c         |   5 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
 arch/powerpc/platforms/powernv/pci.h      |   4 +-
 4 files changed, 432 insertions(+), 238 deletions(-)
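
The counting rule described in the commit message can be modelled in user
space as follows. This is only a sketch for illustration: struct model_pe
and the model_pe_*() helpers are hypothetical names, a plain integer stands
in for the kref used by the patch, and no locking or atomicity is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct pnv_ioda_pe */
struct model_pe {
	unsigned int refcount;	/* 1 (base) + number of attached devices */
	bool released;		/* set once the release callback has run */
};

static void model_pe_release(struct model_pe *pe)
{
	/* The patch frees the DMA, segment and PELT state at this point */
	pe->released = true;
}

static void model_pe_init(struct model_pe *pe)
{
	pe->refcount = 1;
	pe->released = false;
}

static void model_pe_get(struct model_pe *pe)
{
	if (pe)
		pe->refcount++;
}

static void model_pe_put(struct model_pe *pe)
{
	if (!pe || !pe->refcount)
		return;

	/* Last device leaving: drop its reference and the base one together */
	if (pe->refcount == 2)
		pe->refcount = 0;
	else
		pe->refcount--;

	if (!pe->refcount)
		model_pe_release(pe);
}

/* Plug two devices into one PE, then unplug both of them again */
static void model_pe_demo(void)
{
	struct model_pe pe;

	model_pe_init(&pe);
	model_pe_get(&pe);	/* first device joins:  count 2 */
	model_pe_get(&pe);	/* second device joins: count 3 */
	model_pe_put(&pe);	/* first device leaves: count 2 */
	assert(!pe.released);
	model_pe_put(&pe);	/* last device leaves:  count 0 */
	assert(pe.released);
	assert(pe.refcount == 0);
}
```

A PE that never received a device (count still 1) is released by a single
put as well, which matches the error paths that call pnv_ioda_pe_put()
after pnv_ioda_configure_pe() fails.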

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 5367eb3..a6ad4b1 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -31,6 +31,9 @@ struct pci_controller_ops {
 	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
 	void		(*setup_bridge)(struct pci_bus *, unsigned long);
 	void		(*reset_secondary_bus)(struct pci_dev *dev);
+
+	/* Called when PCI device is released */
+	void		(*release_device)(struct pci_dev *);
 };
 
 /*
diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
index 7ed85a6..0040343 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -29,6 +29,11 @@
  */
 void pcibios_release_device(struct pci_dev *dev)
 {
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+
+	if (hose->controller_ops.release_device)
+		hose->controller_ops.release_device(dev);
+
 	eeh_remove_device(dev);
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 910fb67..ef8c216 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -12,6 +12,8 @@
 #undef DEBUG
 
 #include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/kref.h>
 #include <linux/pci.h>
 #include <linux/crash_dump.h>
 #include <linux/debugfs.h>
@@ -47,6 +49,8 @@
 /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
 #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
 
+static void pnv_ioda_release_pe(struct kref *kref);
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
 {
@@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
 		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
 }
 
-static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
+static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
 {
-	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
-		pr_warn("%s: Invalid PE %d on PHB#%x\n",
-			__func__, pe_no, phb->hose->global_number);
+	if (!pe)
+		return;
+
+	kref_get(&pe->kref);
+}
+
+static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
+{
+	unsigned int count;
+
+	if (!pe)
 		return;
+
+	/*
+	 * The count is initialized to 1 and incremented by 1 when
+	 * a new PCI device is bound to the PE. Once the last PCI
+	 * device leaves the PE, the base reference is dropped too
+	 * and the PE is released.
+	 */
+	count = atomic_read(&pe->kref.refcount);
+	if (count == 2)
+		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
+	else
+		kref_put(&pe->kref, pnv_ioda_release_pe);
+}
+
+static void pnv_pci_release_device(struct pci_dev *pdev)
+{
+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	struct pnv_ioda_pe *pe;
+
+	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+		pe = &phb->ioda.pe_array[pdn->pe_number];
+		pnv_ioda_pe_put(pe);
+		pdn->pe_number = IODA_INVALID_PE;
 	}
+}
 
-	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
-		pr_warn("%s: PE %d was assigned on PHB#%x\n",
-			__func__, pe_no, phb->hose->global_number);
+static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
+{
+	struct pnv_phb *phb = pe->phb;
+	int index, count;
+	unsigned long tbl_addr, tbl_size;
+
+	/* No DMA capability for slave PEs */
+	if (pe->flags & PNV_IODA_PE_SLAVE)
+		return;
+
+	/* Bypass DMA window */
+	if (phb->type == PNV_PHB_IODA2 &&
+	    pe->tce_bypass_enabled &&
+	    pe->tce32_table &&
+	    pe->tce32_table->set_bypass)
+		pe->tce32_table->set_bypass(pe->tce32_table, false);
+
+	/* 32-bits DMA window */
+	count = pe->tce32_seg_end - pe->tce32_seg_start;
+	tbl_addr = pe->tce32_table->it_base;
+	if (!count)
 		return;
+
+	/* Free IOMMU table */
+	iommu_free_table(pe->tce32_table,
+			 of_node_full_name(phb->hose->dn));
+
+	/* Deconfigure TCE table */
+	switch (phb->type) {
+	case PNV_PHB_IODA1:
+		for (index = 0; index < count; index++)
+			opal_pci_map_pe_dma_window(phb->opal_id,
+						   pe->pe_number,
+						   pe->tce32_seg_start + index,
+						   1,
+						   __pa(tbl_addr) +
+						   index * TCE32_TABLE_SIZE,
+						   0,
+						   0x1000);
+		bitmap_clear(phb->ioda.tce32_segmap,
+			     pe->tce32_seg_start,
+			     count);
+		tbl_size = TCE32_TABLE_SIZE * count;
+		break;
+	case PNV_PHB_IODA2:
+		opal_pci_map_pe_dma_window(phb->opal_id,
+					   pe->pe_number,
+					   pe->pe_number << 1,
+					   1,
+					   __pa(tbl_addr),
+					   0,
+					   0x1000);
+		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
+		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
+		break;
+	default:
+		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
+		return;
+	}
+
+	/* Free memory of IOMMU table */
+	free_pages(tbl_addr, get_order(tbl_size));
+	pe->tce32_table = NULL;
+	pe->tce32_seg_start = 0;
+	pe->tce32_seg_end = 0;
+}
+
+static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
+{
+	struct pnv_phb *phb = pe->phb;
+	unsigned long *segmap = NULL, *pe_segmap = NULL;
+	int i;
+	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
+				     OPAL_M32_WINDOW_TYPE,
+				     OPAL_M64_WINDOW_TYPE };
+
+	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
+		switch (win_type[win]) {
+		case OPAL_IO_WINDOW_TYPE:
+			segmap = phb->ioda.io_segmap;
+			pe_segmap = pe->io_segmap;
+			break;
+		case OPAL_M32_WINDOW_TYPE:
+			segmap = phb->ioda.m32_segmap;
+			pe_segmap = pe->m32_segmap;
+			break;
+		case OPAL_M64_WINDOW_TYPE:
+			segmap = phb->ioda.m64_segmap;
+			pe_segmap = pe->m64_segmap;
+			break;
+		}
+		i = -1;
+		while ((i = find_next_bit(pe_segmap,
+			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
+			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
+			    win_type[win] == OPAL_M32_WINDOW_TYPE)
+				opal_pci_map_pe_mmio_window(phb->opal_id,
+						phb->ioda.reserved_pe,
+						win_type[win], 0, i);
+			else if (phb->type == PNV_PHB_IODA1)
+				opal_pci_map_pe_mmio_window(phb->opal_id,
+						phb->ioda.reserved_pe,
+						win_type[win],
+						i / 8, i % 8);
+
+			clear_bit(i, pe_segmap);
+			clear_bit(i, segmap);
+		}
+	}
+}
+
+static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
+				  struct pnv_ioda_pe *parent,
+				  struct pnv_ioda_pe *child,
+				  bool is_add)
+{
+	const char *desc = is_add ? "adding" : "removing";
+	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
+			      OPAL_REMOVE_PE_FROM_DOMAIN;
+	struct pnv_ioda_pe *slave;
+	long rc;
+
+	/* Parent PE affects child PE */
+	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
+				child->pe_number, op);
+	if (rc != OPAL_SUCCESS) {
+		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
+			rc, desc);
+		return -ENXIO;
+	}
+
+	if (!(child->flags & PNV_IODA_PE_MASTER))
+		return 0;
+
+	/* Compound case: parent PE affects slave PEs */
+	list_for_each_entry(slave, &child->slaves, list) {
+		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
+					slave->pe_number, op);
+		if (rc != OPAL_SUCCESS) {
+			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
+				rc, desc);
+			return -ENXIO;
+		}
+	}
+
+	return 0;
+}
+
+static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
+{
+	struct pnv_phb *phb = pe->phb;
+	struct pnv_ioda_pe *slave;
+	struct pci_dev *pdev = NULL;
+	int ret;
+
+	/*
+	 * Clear PE frozen state. If it's master PE, we need
+	 * clear slave PE frozen state as well.
+	 */
+	opal_pci_eeh_freeze_clear(phb->opal_id,
+				  pe->pe_number,
+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+	if (pe->flags & PNV_IODA_PE_MASTER) {
+		list_for_each_entry(slave, &pe->slaves, list) {
+			opal_pci_eeh_freeze_clear(phb->opal_id,
+						  slave->pe_number,
+						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+		}
+	}
+
+	/*
+	 * Associate PE in PELT. We need add the PE into the
+	 * corresponding PELT-V as well. Otherwise, the error
+	 * originated from the PE might contribute to other
+	 * PEs.
+	 */
+	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
+	if (ret)
+		return ret;
+
+	/* For compound PEs, any one affects all of them */
+	if (pe->flags & PNV_IODA_PE_MASTER) {
+		list_for_each_entry(slave, &pe->slaves, list) {
+			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
+			if (ret)
+				return ret;
+		}
+	}
+
+	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
+		pdev = pe->pbus->self;
+	else if (pe->flags & PNV_IODA_PE_DEV)
+		pdev = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		pdev = pe->parent_dev->bus->self;
+#endif /* CONFIG_PCI_IOV */
+
+	while (pdev) {
+		struct pci_dn *pdn = pci_get_pdn(pdev);
+		struct pnv_ioda_pe *parent;
+
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			parent = &phb->ioda.pe_array[pdn->pe_number];
+			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
+			if (ret)
+				return ret;
+		}
+
+		pdev = pdev->bus->self;
+	}
+
+	return 0;
+}
+
+static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)
+{
+	struct pnv_phb *phb = pe->phb;
+	struct pci_dev *parent;
+	uint8_t bcomp, dcomp, fcomp;
+	long rid_end, rid;
+	int64_t rc;
+
+	/* Tear down MVE */
+	if (phb->type == PNV_PHB_IODA1 &&
+	    pe->mve_number != -1) {
+		rc = opal_pci_set_mve(phb->opal_id,
+				      pe->mve_number,
+				      phb->ioda.reserved_pe);
+		if (rc != OPAL_SUCCESS)
+			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
+				rc, pe->mve_number);
+		rc = opal_pci_set_mve_enable(phb->opal_id,
+					     pe->mve_number,
+					     OPAL_DISABLE_MVE);
+		if (rc != OPAL_SUCCESS)
+			pe_warn(pe, "Error %lld disabling MVE#%d\n",
+				rc, pe->mve_number);
+		pe->mve_number = -1;
+	}
+
+	/* Unmapping PELTV */
+	pnv_ioda_set_peltv(pe, false);
+
+	/* To unmap PELTM */
+	if (pe->pbus) {
+		int count;
+
+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
+		parent = pe->pbus->self;
+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
+			count = pe->pbus->busn_res.end -
+				pe->pbus->busn_res.start + 1;
+		else
+			count = 1;
+
+		switch(count) {
+		case  1: bcomp = OpalPciBusAll;   break;
+		case  2: bcomp = OpalPciBus7Bits; break;
+		case  4: bcomp = OpalPciBus6Bits; break;
+		case  8: bcomp = OpalPciBus5Bits; break;
+		case 16: bcomp = OpalPciBus4Bits; break;
+		case 32: bcomp = OpalPciBus3Bits; break;
+		default:
+			/* Fall back to the single-bus case */
+			pe_warn(pe, "Cannot support %d buses\n", count);
+			bcomp = OpalPciBusAll;
+		}
+		rid_end = pe->rid + (count << 8);
+	} else {
+#ifdef CONFIG_PCI_IOV
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+#endif
+			parent = pe->pdev->bus->self;
+		bcomp = OpalPciBusAll;
+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
+		rid_end = pe->rid + 1;
+	}
+
+	/* Clear RID mapping */
+	for (rid = pe->rid; rid < rid_end; rid++)
+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
+
+	/* Unmapping PELTM */
+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
+	if (rc)
+		pe_warn(pe, "Error %lld unmapping PELTM\n", rc);
+}
+
+static void pnv_ioda_release_pe(struct kref *kref)
+{
+	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
+	struct pnv_ioda_pe *tmp, *slave;
+	struct pnv_phb *phb = pe->phb;
+
+	pnv_ioda_release_pe_dma(pe);
+	pnv_ioda_release_pe_seg(pe);
+	pnv_ioda_deconfigure_pe(pe);
+
+	/* Release slave PEs for compound PE */
+	if (pe->flags & PNV_IODA_PE_MASTER) {
+		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
+			pnv_ioda_pe_put(slave);
+	}
+
+	/* Remove the PE from the various lists. We need to remove
+	 * a slave PE from its master's list.
+	 */
+	list_del(&pe->dma_link);
+	list_del(&pe->list);
+
+	/* Free PE number */
+	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
+}
+
+static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
+					    int pe_no)
+{
+	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
+
+	kref_init(&pe->kref);
+	pe->phb = phb;
+	pe->pe_number = pe_no;
+	INIT_LIST_HEAD(&pe->dma_link);
+	INIT_LIST_HEAD(&pe->list);
+
+	return pe;
+}
+
+static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
+					       int pe_no)
+{
+	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
+		pr_warn("%s: Invalid PE %d on PHB#%x\n",
+			__func__, pe_no, phb->hose->global_number);
+		return NULL;
 	}
 
-	phb->ioda.pe_array[pe_no].phb = phb;
-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
+	/*
+	 * The same PE might be reserved multiple times, which
+	 * is harmless.
+	 */
+	set_bit(pe_no, phb->ioda.pe_alloc);
+	return pnv_ioda_init_pe(phb, pe_no);
 }
 
-static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
+static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
 	unsigned long pe_no;
 	unsigned long limit = phb->ioda.total_pe - 1;
@@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 			break;
 
 		if (--limit >= phb->ioda.total_pe)
-			return IODA_INVALID_PE;
+			return NULL;
 	} while(1);
 
-	phb->ioda.pe_array[pe_no].phb = phb;
-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
-	return pe_no;
-}
-
-static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
-{
-	WARN_ON(phb->ioda.pe_array[pe].pdev);
-
-	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
-	clear_bit(pe, phb->ioda.pe_alloc);
+	return pnv_ioda_init_pe(phb, pe_no);
 }
 
 static int pnv_ioda1_init_m64(struct pnv_phb *phb)
@@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
 	}
 }
 
-static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
-				struct pci_bus *bus, int all)
+static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
+						struct pci_bus *bus,
+						int all)
 {
 	resource_size_t segsz = phb->ioda.m64_segsize;
 	struct pci_dev *pdev;
@@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
 	int i;
 
 	if (!pnv_ioda_need_m64_pe(phb, bus))
-		return IODA_INVALID_PE;
+		return NULL;
 
         /* Allocate bitmap */
 	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
 	pe_bitsmap = kzalloc(size, GFP_KERNEL);
 	if (!pe_bitsmap) {
 		pr_warn("%s: Out of memory !\n", __func__);
-		return IODA_INVALID_PE;
+		return NULL;
 	}
 
 	/* The bridge's M64 window might be extended to PHB's M64
@@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
 	/* No M64 window found ? */
 	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
 		kfree(pe_bitsmap);
-		return IODA_INVALID_PE;
+		return NULL;
 	}
 
 	/* Figure out the master PE and put all slave PEs
@@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
 	}
 
 	kfree(pe_bitsmap);
-	return master_pe->pe_number;
+	return master_pe;
 }
 
 static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
@@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
  * but in the meantime, we need to protect them to avoid warnings
  */
 #ifdef CONFIG_PCI_MSI
-static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
+static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
 {
 	struct pci_controller *hose = pci_bus_to_host(dev->bus);
 	struct pnv_phb *phb = hose->private_data;
@@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
 }
 #endif /* CONFIG_PCI_MSI */
 
-static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
-				  struct pnv_ioda_pe *parent,
-				  struct pnv_ioda_pe *child,
-				  bool is_add)
-{
-	const char *desc = is_add ? "adding" : "removing";
-	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
-			      OPAL_REMOVE_PE_FROM_DOMAIN;
-	struct pnv_ioda_pe *slave;
-	long rc;
-
-	/* Parent PE affects child PE */
-	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
-				child->pe_number, op);
-	if (rc != OPAL_SUCCESS) {
-		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
-			rc, desc);
-		return -ENXIO;
-	}
-
-	if (!(child->flags & PNV_IODA_PE_MASTER))
-		return 0;
-
-	/* Compound case: parent PE affects slave PEs */
-	list_for_each_entry(slave, &child->slaves, list) {
-		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
-					slave->pe_number, op);
-		if (rc != OPAL_SUCCESS) {
-			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
-				rc, desc);
-			return -ENXIO;
-		}
-	}
-
-	return 0;
-}
-
-static int pnv_ioda_set_peltv(struct pnv_phb *phb,
-			      struct pnv_ioda_pe *pe,
-			      bool is_add)
-{
-	struct pnv_ioda_pe *slave;
-	struct pci_dev *pdev = NULL;
-	int ret;
-
-	/*
-	 * Clear PE frozen state. If it's master PE, we need
-	 * clear slave PE frozen state as well.
-	 */
-	if (is_add) {
-		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
-					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
-		if (pe->flags & PNV_IODA_PE_MASTER) {
-			list_for_each_entry(slave, &pe->slaves, list)
-				opal_pci_eeh_freeze_clear(phb->opal_id,
-							  slave->pe_number,
-							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
-		}
-	}
-
-	/*
-	 * Associate PE in PELT. We need add the PE into the
-	 * corresponding PELT-V as well. Otherwise, the error
-	 * originated from the PE might contribute to other
-	 * PEs.
-	 */
-	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
-	if (ret)
-		return ret;
-
-	/* For compound PEs, any one affects all of them */
-	if (pe->flags & PNV_IODA_PE_MASTER) {
-		list_for_each_entry(slave, &pe->slaves, list) {
-			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
-			if (ret)
-				return ret;
-		}
-	}
-
-	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
-		pdev = pe->pbus->self;
-	else if (pe->flags & PNV_IODA_PE_DEV)
-		pdev = pe->pdev->bus->self;
-#ifdef CONFIG_PCI_IOV
-	else if (pe->flags & PNV_IODA_PE_VF)
-		pdev = pe->parent_dev->bus->self;
-#endif /* CONFIG_PCI_IOV */
-	while (pdev) {
-		struct pci_dn *pdn = pci_get_pdn(pdev);
-		struct pnv_ioda_pe *parent;
-
-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
-			parent = &phb->ioda.pe_array[pdn->pe_number];
-			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
-			if (ret)
-				return ret;
-		}
-
-		pdev = pdev->bus->self;
-	}
-
-	return 0;
-}
-
-#ifdef CONFIG_PCI_IOV
-static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
-{
-	struct pci_dev *parent;
-	uint8_t bcomp, dcomp, fcomp;
-	int64_t rc;
-	long rid_end, rid;
-
-	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
-	if (pe->pbus) {
-		int count;
-
-		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
-		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
-		parent = pe->pbus->self;
-		if (pe->flags & PNV_IODA_PE_BUS_ALL)
-			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
-		else
-			count = 1;
-
-		switch(count) {
-		case  1: bcomp = OpalPciBusAll;         break;
-		case  2: bcomp = OpalPciBus7Bits;       break;
-		case  4: bcomp = OpalPciBus6Bits;       break;
-		case  8: bcomp = OpalPciBus5Bits;       break;
-		case 16: bcomp = OpalPciBus4Bits;       break;
-		case 32: bcomp = OpalPciBus3Bits;       break;
-		default:
-			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
-			        count);
-			/* Do an exact match only */
-			bcomp = OpalPciBusAll;
-		}
-		rid_end = pe->rid + (count << 8);
-	} else {
-		if (pe->flags & PNV_IODA_PE_VF)
-			parent = pe->parent_dev;
-		else
-			parent = pe->pdev->bus->self;
-		bcomp = OpalPciBusAll;
-		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
-		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
-		rid_end = pe->rid + 1;
-	}
-
-	/* Clear the reverse map */
-	for (rid = pe->rid; rid < rid_end; rid++)
-		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
-
-	/* Release from all parents PELT-V */
-	while (parent) {
-		struct pci_dn *pdn = pci_get_pdn(parent);
-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
-			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
-						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
-			/* XXX What to do in case of error ? */
-		}
-		parent = parent->bus->self;
-	}
-
-	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
-				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
-
-	/* Disassociate PE in PELT */
-	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
-				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
-	if (rc)
-		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
-	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
-			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
-	if (rc)
-		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
-
-	pe->pbus = NULL;
-	pe->pdev = NULL;
-	pe->parent_dev = NULL;
-
-	return 0;
-}
-#endif /* CONFIG_PCI_IOV */
-
 static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *parent;
@@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 	}
 
 	/* Configure PELTV */
-	pnv_ioda_set_peltv(phb, pe, true);
+	pnv_ioda_set_peltv(pe, true);
 
 	/* Setup reverse map */
 	for (rid = pe->rid; rid < rid_end; rid++)
@@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 		if (pdn->pe_number != IODA_INVALID_PE)
 			continue;
 
+		/* Increase reference count of the parent PE */
+		pnv_ioda_pe_get(pe);
 		pdn->pe_number = pe->pe_number;
 		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
@@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 {
 	struct pci_controller *hose = pci_bus_to_host(bus);
 	struct pnv_phb *phb = hose->private_data;
-	struct pnv_ioda_pe *pe;
+	struct pnv_ioda_pe *pe = NULL;
 	int pe_num = IODA_INVALID_PE;
 
 	/* For partial hotplug case, the PE instance hasn't been destroyed
@@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	}
 
 	/* PE number for root bus should have been reserved */
-	if (pci_is_root_bus(bus))
-		pe_num = phb->ioda.root_pe_no;
+	if (pci_is_root_bus(bus) &&
+	    phb->ioda.root_pe_no != IODA_INVALID_PE)
+		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
 
 	/* Check if PE is determined by M64 */
-	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
-		pe_num = phb->pick_m64_pe(phb, bus, all);
+	if (!pe && phb->pick_m64_pe)
+		pe = phb->pick_m64_pe(phb, bus, all);
 
 	/* The PE number isn't pinned by M64 */
-	if (pe_num == IODA_INVALID_PE)
-		pe_num = pnv_ioda_alloc_pe(phb);
+	if (!pe)
+		pe = pnv_ioda_alloc_pe(phb);
 
-	if (pe_num == IODA_INVALID_PE) {
-		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
+	if (!pe) {
+		pr_warn("%s: Not enough PE# available for PCI bus %04x:%02x\n",
 			__func__, pci_domain_nr(bus), bus->number);
 		return NULL;
 	}
 
-	pe = &phb->ioda.pe_array[pe_num];
 	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
 	pe->pbus = bus;
 	pe->pdev = NULL;
@@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 
 	if (pnv_ioda_configure_pe(phb, pe)) {
 		/* XXX What do we do here ? */
-		if (pe_num)
-			pnv_ioda_free_pe(phb, pe_num);
-		pe->pbus = NULL;
+		pnv_ioda_pe_put(pe);
 		return NULL;
 	}
 
 	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
-			GFP_KERNEL, hose->node);
+				       GFP_KERNEL, hose->node);
 	pe->tce32_table->data = pe;
 
 	/* Associate it with all child devices */
@@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		list_del(&pe->list);
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
-		pnv_ioda_deconfigure_pe(phb, pe);
+		pnv_ioda_deconfigure_pe(pe);
 
-		pnv_ioda_free_pe(phb, pe->pe_number);
+		pnv_ioda_pe_put(pe);
 	}
 }
 
@@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 
 		if (pnv_ioda_configure_pe(phb, pe)) {
 			/* XXX What do we do here ? */
-			if (pe_num)
-				pnv_ioda_free_pe(phb, pe_num);
-			pe->pdev = NULL;
+			pnv_ioda_pe_put(pe);
 			continue;
 		}
 
@@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
 	struct pnv_ioda_pe *pe;
 	int rc;
 
-	pe = pnv_ioda_get_pe(dev);
+	pe = pnv_ioda_pci_dev_to_pe(dev);
 	if (!pe)
 		return -ENODEV;
 
@@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
 	struct pnv_ioda_pe *pe;
 	int rc;
 
-	if (!(pe = pnv_ioda_get_pe(dev)))
+	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
 		return -ENODEV;
 
 	/* Assign XIVE to PE */
@@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
 				  unsigned int hwirq, unsigned int virq,
 				  unsigned int is_64, struct msi_msg *msg)
 {
-	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
+	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
 	unsigned int xive_num = hwirq - phb->msi_base;
 	__be32 data;
 	int rc;
@@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
 	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
 	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
+	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
 	hose->controller_ops = pnv_pci_controller_ops;
 
 #ifdef CONFIG_PCI_IOV
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 1bea3a8..8b10f01 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -28,6 +28,7 @@ enum pnv_phb_model {
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
 struct pnv_ioda_pe {
+	struct kref		kref;
 	unsigned long		flags;
 	struct pnv_phb		*phb;
 
@@ -120,7 +121,8 @@ struct pnv_phb {
 	void (*shutdown)(struct pnv_phb *phb);
 	int (*init_m64)(struct pnv_phb *phb);
 	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
-	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
+	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
+					   struct pci_bus *bus, int all);
 	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
 	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
 	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
-- 
2.1.0



* [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
@ 2015-05-01  6:02   ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: bhelgaas, linux-pci, Gavin Shan

The original code doesn't support releasing PEs dynamically, meaning
that a PE and its associated resources (IO, M32, M64 and DMA) can't
be released when a PCI adapter is unplugged from a hotpluggable slot.

Following an object-oriented approach, the patch introduces a
reference count (kref) for each PE. The count is initialized to 1 and
incremented by 1 when a new PCI device joins the PE. Once the last
PCI device leaves the PE, the PE is released together with its
associated (IO, M32, M64, DMA) resources.
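
For illustration only, the counting rule can be modelled in plain
userspace C. This is a rough sketch, not the kernel code: struct
pe_model and the pe_model_*() helpers are hypothetical stand-ins for
struct pnv_ioda_pe, pnv_ioda_pe_get() and pnv_ioda_pe_put(), and a
plain integer replaces the kref, so none of the atomicity is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct pnv_ioda_pe. */
struct pe_model {
	unsigned int refcount;	/* base reference + attached devices */
	bool released;		/* set once the release step has run */
};

/* A freshly allocated PE holds only its base reference. */
void pe_model_init(struct pe_model *pe)
{
	pe->refcount = 1;
	pe->released = false;
}

/* A PCI device joins the PE. */
void pe_model_get(struct pe_model *pe)
{
	pe->refcount++;
}

/*
 * A reference is dropped. If the count is 2 (base reference plus the
 * last device), both references go away at once, so the PE is released
 * as soon as its last device leaves.
 */
void pe_model_put(struct pe_model *pe)
{
	assert(pe->refcount > 0);

	if (pe->refcount == 2)
		pe->refcount -= 2;
	else
		pe->refcount--;

	if (pe->refcount == 0)
		pe->released = true;	/* stands in for the release callback */
}

/* Two devices join and leave; returns true if the PE ends up released. */
bool pe_model_demo(void)
{
	struct pe_model pe;

	pe_model_init(&pe);
	pe_model_get(&pe);	/* first device joins:  count 2 */
	pe_model_get(&pe);	/* second device joins: count 3 */
	pe_model_put(&pe);	/* one device leaves:   count 2 */
	if (pe.released)
		return false;
	pe_model_put(&pe);	/* last device leaves:  count 0 */
	return pe.released;
}
```

A PE that never had a device attached (for example when configuring
it fails) is released by a single put, because only the base
reference has to be dropped in that case.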

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |   3 +
 arch/powerpc/kernel/pci-hotplug.c         |   5 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
 arch/powerpc/platforms/powernv/pci.h      |   4 +-
 4 files changed, 432 insertions(+), 238 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 5367eb3..a6ad4b1 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -31,6 +31,9 @@ struct pci_controller_ops {
 	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
 	void		(*setup_bridge)(struct pci_bus *, unsigned long);
 	void		(*reset_secondary_bus)(struct pci_dev *dev);
+
+	/* Called when PCI device is released */
+	void		(*release_device)(struct pci_dev *);
 };
 
 /*
diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
index 7ed85a6..0040343 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -29,6 +29,11 @@
  */
 void pcibios_release_device(struct pci_dev *dev)
 {
+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
+
+	if (hose->controller_ops.release_device)
+		hose->controller_ops.release_device(dev);
+
 	eeh_remove_device(dev);
 }
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 910fb67..ef8c216 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -12,6 +12,8 @@
 #undef DEBUG
 
 #include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/kref.h>
 #include <linux/pci.h>
 #include <linux/crash_dump.h>
 #include <linux/debugfs.h>
@@ -47,6 +49,8 @@
 /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
 #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
 
+static void pnv_ioda_release_pe(struct kref *kref);
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
 {
@@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
 		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
 }
 
-static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
+static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
 {
-	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
-		pr_warn("%s: Invalid PE %d on PHB#%x\n",
-			__func__, pe_no, phb->hose->global_number);
+	if (!pe)
+		return;
+
+	kref_get(&pe->kref);
+}
+
+static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
+{
+	unsigned int count;
+
+	if (!pe)
 		return;
+
+	/*
+	 * The count is initialized to 1 and incremented by 1 when
+	 * a new PCI device is bound to the PE. Once the last PCI
+	 * device leaves the PE, the base reference is dropped as
+	 * well and the PE is released.
+	 */
+	count = atomic_read(&pe->kref.refcount);
+	if (count == 2)
+		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
+	else
+		kref_put(&pe->kref, pnv_ioda_release_pe);
+}
+
+static void pnv_pci_release_device(struct pci_dev *pdev)
+{
+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+	struct pnv_phb *phb = hose->private_data;
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	struct pnv_ioda_pe *pe;
+
+	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+		pe = &phb->ioda.pe_array[pdn->pe_number];
+		pnv_ioda_pe_put(pe);
+		pdn->pe_number = IODA_INVALID_PE;
 	}
+}
 
-	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
-		pr_warn("%s: PE %d was assigned on PHB#%x\n",
-			__func__, pe_no, phb->hose->global_number);
+static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
+{
+	struct pnv_phb *phb = pe->phb;
+	int index, count;
+	unsigned long tbl_addr, tbl_size;
+
+	/* No DMA capability for slave PEs */
+	if (pe->flags & PNV_IODA_PE_SLAVE)
+		return;
+
+	/* Bypass DMA window */
+	if (phb->type == PNV_PHB_IODA2 &&
+	    pe->tce_bypass_enabled &&
+	    pe->tce32_table &&
+	    pe->tce32_table->set_bypass)
+		pe->tce32_table->set_bypass(pe->tce32_table, false);
+
+	/* 32-bits DMA window */
+	count = pe->tce32_seg_end - pe->tce32_seg_start;
+	if (!count || !pe->tce32_table)
 		return;
+	tbl_addr = pe->tce32_table->it_base;
+
+	/* Free IOMMU table */
+	iommu_free_table(pe->tce32_table,
+			 of_node_full_name(phb->hose->dn));
+
+	/* Deconfigure TCE table */
+	switch (phb->type) {
+	case PNV_PHB_IODA1:
+		for (index = 0; index < count; index++)
+			opal_pci_map_pe_dma_window(phb->opal_id,
+						   pe->pe_number,
+						   pe->tce32_seg_start + index,
+						   1,
+						   __pa(tbl_addr) +
+						   index * TCE32_TABLE_SIZE,
+						   0,
+						   0x1000);
+		bitmap_clear(phb->ioda.tce32_segmap,
+			     pe->tce32_seg_start,
+			     count);
+		tbl_size = TCE32_TABLE_SIZE * count;
+		break;
+	case PNV_PHB_IODA2:
+		opal_pci_map_pe_dma_window(phb->opal_id,
+					   pe->pe_number,
+					   pe->pe_number << 1,
+					   1,
+					   __pa(tbl_addr),
+					   0,
+					   0x1000);
+		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
+		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
+		break;
+	default:
+		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
+		return;
+	}
+
+	/* Free memory of IOMMU table */
+	free_pages(tbl_addr, get_order(tbl_size));
+	pe->tce32_table = NULL;
+	pe->tce32_seg_start = 0;
+	pe->tce32_seg_end = 0;
+}
+
+static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
+{
+	struct pnv_phb *phb = pe->phb;
+	unsigned long *segmap = NULL, *pe_segmap = NULL;
+	int i;
+	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
+				     OPAL_M32_WINDOW_TYPE,
+				     OPAL_M64_WINDOW_TYPE };
+
+	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
+		switch (win_type[win]) {
+		case OPAL_IO_WINDOW_TYPE:
+			segmap = phb->ioda.io_segmap;
+			pe_segmap = pe->io_segmap;
+			break;
+		case OPAL_M32_WINDOW_TYPE:
+			segmap = phb->ioda.m32_segmap;
+			pe_segmap = pe->m32_segmap;
+			break;
+		case OPAL_M64_WINDOW_TYPE:
+			segmap = phb->ioda.m64_segmap;
+			pe_segmap = pe->m64_segmap;
+			break;
+		}
+		i = -1;
+		while ((i = find_next_bit(pe_segmap,
+			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
+			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
+			    win_type[win] == OPAL_M32_WINDOW_TYPE)
+				opal_pci_map_pe_mmio_window(phb->opal_id,
+						phb->ioda.reserved_pe,
+						win_type[win], 0, i);
+			else if (phb->type == PNV_PHB_IODA1)
+				opal_pci_map_pe_mmio_window(phb->opal_id,
+						phb->ioda.reserved_pe,
+						win_type[win],
+						i / 8, i % 8);
+
+			clear_bit(i, pe_segmap);
+			clear_bit(i, segmap);
+		}
+	}
+}
+
+static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
+				  struct pnv_ioda_pe *parent,
+				  struct pnv_ioda_pe *child,
+				  bool is_add)
+{
+	const char *desc = is_add ? "adding" : "removing";
+	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
+			      OPAL_REMOVE_PE_FROM_DOMAIN;
+	struct pnv_ioda_pe *slave;
+	long rc;
+
+	/* Parent PE affects child PE */
+	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
+				child->pe_number, op);
+	if (rc != OPAL_SUCCESS) {
+		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
+			rc, desc);
+		return -ENXIO;
+	}
+
+	if (!(child->flags & PNV_IODA_PE_MASTER))
+		return 0;
+
+	/* Compound case: parent PE affects slave PEs */
+	list_for_each_entry(slave, &child->slaves, list) {
+		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
+					slave->pe_number, op);
+		if (rc != OPAL_SUCCESS) {
+			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
+				rc, desc);
+			return -ENXIO;
+		}
+	}
+
+	return 0;
+}
+
+static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
+{
+	struct pnv_phb *phb = pe->phb;
+	struct pnv_ioda_pe *slave;
+	struct pci_dev *pdev = NULL;
+	int ret;
+
+	/*
+	 * Clear PE frozen state. If it's master PE, we need
+	 * clear slave PE frozen state as well.
+	 */
+	opal_pci_eeh_freeze_clear(phb->opal_id,
+				  pe->pe_number,
+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+	if (pe->flags & PNV_IODA_PE_MASTER) {
+		list_for_each_entry(slave, &pe->slaves, list) {
+			opal_pci_eeh_freeze_clear(phb->opal_id,
+						  slave->pe_number,
+						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+		}
+	}
+
+	/*
+	 * Associate PE in PELT. We need add the PE into the
+	 * corresponding PELT-V as well. Otherwise, the error
+	 * originated from the PE might contribute to other
+	 * PEs.
+	 */
+	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
+	if (ret)
+		return ret;
+
+	/* For compound PEs, any one affects all of them */
+	if (pe->flags & PNV_IODA_PE_MASTER) {
+		list_for_each_entry(slave, &pe->slaves, list) {
+			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
+			if (ret)
+				return ret;
+		}
+	}
+
+	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
+		pdev = pe->pbus->self;
+	else if (pe->flags & PNV_IODA_PE_DEV)
+		pdev = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		pdev = pe->parent_dev->bus->self;
+#endif /* CONFIG_PCI_IOV */
+
+	while (pdev) {
+		struct pci_dn *pdn = pci_get_pdn(pdev);
+		struct pnv_ioda_pe *parent;
+
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			parent = &phb->ioda.pe_array[pdn->pe_number];
+			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
+			if (ret)
+				return ret;
+		}
+
+		pdev = pdev->bus->self;
+	}
+
+	return 0;
+}
+
+static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)
+{
+	struct pnv_phb *phb = pe->phb;
+	struct pci_dev *parent;
+	uint8_t bcomp, dcomp, fcomp;
+	long rid_end, rid;
+	int64_t rc;
+
+	/* Tear down MVE */
+	if (phb->type == PNV_PHB_IODA1 &&
+	    pe->mve_number != -1) {
+		rc = opal_pci_set_mve(phb->opal_id,
+				      pe->mve_number,
+				      phb->ioda.reserved_pe);
+		if (rc != OPAL_SUCCESS)
+			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
+				rc, pe->mve_number);
+		rc = opal_pci_set_mve_enable(phb->opal_id,
+					     pe->mve_number,
+					     OPAL_DISABLE_MVE);
+		if (rc != OPAL_SUCCESS)
+			pe_warn(pe, "Error %lld disabling MVE#%d\n",
+				rc, pe->mve_number);
+		pe->mve_number = -1;
+	}
+
+	/* Unmapping PELTV */
+	pnv_ioda_set_peltv(pe, false);
+
+	/* To unmap PELTM */
+	if (pe->pbus) {
+		int count;
+
+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
+		parent = pe->pbus->self;
+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
+			count = pe->pbus->busn_res.end -
+				pe->pbus->busn_res.start + 1;
+		else
+			count = 1;
+
+		switch(count) {
+		case  1: bcomp = OpalPciBusAll;   break;
+		case  2: bcomp = OpalPciBus7Bits; break;
+		case  4: bcomp = OpalPciBus6Bits; break;
+		case  8: bcomp = OpalPciBus5Bits; break;
+		case 16: bcomp = OpalPciBus4Bits; break;
+		case 32: bcomp = OpalPciBus3Bits; break;
+		default:
+			/* Fall back to the single-bus case */
+			pe_warn(pe, "Cannot support %d buses\n", count);
+			bcomp = OpalPciBusAll;
+		}
+		rid_end = pe->rid + (count << 8);
+	} else {
+#ifdef CONFIG_PCI_IOV
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+#endif
+			parent = pe->pdev->bus->self;
+		bcomp = OpalPciBusAll;
+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
+		rid_end = pe->rid + 1;
+	}
+
+	/* Clear RID mapping */
+	for (rid = pe->rid; rid < rid_end; rid++)
+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
+
+	/* Unmapping PELTM */
+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
+	if (rc)
+		pe_warn(pe, "Error %lld unmapping PELTM\n", rc);
+}
+
+static void pnv_ioda_release_pe(struct kref *kref)
+{
+	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
+	struct pnv_ioda_pe *tmp, *slave;
+	struct pnv_phb *phb = pe->phb;
+
+	pnv_ioda_release_pe_dma(pe);
+	pnv_ioda_release_pe_seg(pe);
+	pnv_ioda_deconfigure_pe(pe);
+
+	/* Release slave PEs for compound PE */
+	if (pe->flags & PNV_IODA_PE_MASTER) {
+		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
+			pnv_ioda_pe_put(slave);
+	}
+
+	/* Remove the PE from the various lists. A slave PE also
+	 * has to be removed from its master's list.
+	 */
+	list_del(&pe->dma_link);
+	list_del(&pe->list);
+
+	/* Free PE number */
+	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
+}
+
+static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
+					    int pe_no)
+{
+	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
+
+	kref_init(&pe->kref);
+	pe->phb = phb;
+	pe->pe_number = pe_no;
+	INIT_LIST_HEAD(&pe->dma_link);
+	INIT_LIST_HEAD(&pe->list);
+
+	return pe;
+}
+
+static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
+					       int pe_no)
+{
+	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
+		pr_warn("%s: Invalid PE %d on PHB#%x\n",
+			__func__, pe_no, phb->hose->global_number);
+		return NULL;
 	}
 
-	phb->ioda.pe_array[pe_no].phb = phb;
-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
+	/*
+	 * The same PE might be reserved multiple times, which
+	 * is harmless.
+	 */
+	set_bit(pe_no, phb->ioda.pe_alloc);
+	return pnv_ioda_init_pe(phb, pe_no);
 }
 
-static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
+static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
 	unsigned long pe_no;
 	unsigned long limit = phb->ioda.total_pe - 1;
@@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 			break;
 
 		if (--limit >= phb->ioda.total_pe)
-			return IODA_INVALID_PE;
+			return NULL;
 	} while(1);
 
-	phb->ioda.pe_array[pe_no].phb = phb;
-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
-	return pe_no;
-}
-
-static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
-{
-	WARN_ON(phb->ioda.pe_array[pe].pdev);
-
-	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
-	clear_bit(pe, phb->ioda.pe_alloc);
+	return pnv_ioda_init_pe(phb, pe_no);
 }
 
 static int pnv_ioda1_init_m64(struct pnv_phb *phb)
@@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
 	}
 }
 
-static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
-				struct pci_bus *bus, int all)
+static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
+						struct pci_bus *bus,
+						int all)
 {
 	resource_size_t segsz = phb->ioda.m64_segsize;
 	struct pci_dev *pdev;
@@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
 	int i;
 
 	if (!pnv_ioda_need_m64_pe(phb, bus))
-		return IODA_INVALID_PE;
+		return NULL;
 
         /* Allocate bitmap */
 	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
 	pe_bitsmap = kzalloc(size, GFP_KERNEL);
 	if (!pe_bitsmap) {
 		pr_warn("%s: Out of memory !\n", __func__);
-		return IODA_INVALID_PE;
+		return NULL;
 	}
 
 	/* The bridge's M64 window might be extended to PHB's M64
@@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
 	/* No M64 window found ? */
 	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
 		kfree(pe_bitsmap);
-		return IODA_INVALID_PE;
+		return NULL;
 	}
 
 	/* Figure out the master PE and put all slave PEs
@@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
 	}
 
 	kfree(pe_bitsmap);
-	return master_pe->pe_number;
+	return master_pe;
 }
 
 static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
@@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
  * but in the meantime, we need to protect them to avoid warnings
  */
 #ifdef CONFIG_PCI_MSI
-static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
+static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
 {
 	struct pci_controller *hose = pci_bus_to_host(dev->bus);
 	struct pnv_phb *phb = hose->private_data;
@@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
 }
 #endif /* CONFIG_PCI_MSI */
 
-static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
-				  struct pnv_ioda_pe *parent,
-				  struct pnv_ioda_pe *child,
-				  bool is_add)
-{
-	const char *desc = is_add ? "adding" : "removing";
-	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
-			      OPAL_REMOVE_PE_FROM_DOMAIN;
-	struct pnv_ioda_pe *slave;
-	long rc;
-
-	/* Parent PE affects child PE */
-	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
-				child->pe_number, op);
-	if (rc != OPAL_SUCCESS) {
-		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
-			rc, desc);
-		return -ENXIO;
-	}
-
-	if (!(child->flags & PNV_IODA_PE_MASTER))
-		return 0;
-
-	/* Compound case: parent PE affects slave PEs */
-	list_for_each_entry(slave, &child->slaves, list) {
-		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
-					slave->pe_number, op);
-		if (rc != OPAL_SUCCESS) {
-			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
-				rc, desc);
-			return -ENXIO;
-		}
-	}
-
-	return 0;
-}
-
-static int pnv_ioda_set_peltv(struct pnv_phb *phb,
-			      struct pnv_ioda_pe *pe,
-			      bool is_add)
-{
-	struct pnv_ioda_pe *slave;
-	struct pci_dev *pdev = NULL;
-	int ret;
-
-	/*
-	 * Clear PE frozen state. If it's master PE, we need
-	 * clear slave PE frozen state as well.
-	 */
-	if (is_add) {
-		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
-					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
-		if (pe->flags & PNV_IODA_PE_MASTER) {
-			list_for_each_entry(slave, &pe->slaves, list)
-				opal_pci_eeh_freeze_clear(phb->opal_id,
-							  slave->pe_number,
-							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
-		}
-	}
-
-	/*
-	 * Associate PE in PELT. We need add the PE into the
-	 * corresponding PELT-V as well. Otherwise, the error
-	 * originated from the PE might contribute to other
-	 * PEs.
-	 */
-	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
-	if (ret)
-		return ret;
-
-	/* For compound PEs, any one affects all of them */
-	if (pe->flags & PNV_IODA_PE_MASTER) {
-		list_for_each_entry(slave, &pe->slaves, list) {
-			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
-			if (ret)
-				return ret;
-		}
-	}
-
-	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
-		pdev = pe->pbus->self;
-	else if (pe->flags & PNV_IODA_PE_DEV)
-		pdev = pe->pdev->bus->self;
-#ifdef CONFIG_PCI_IOV
-	else if (pe->flags & PNV_IODA_PE_VF)
-		pdev = pe->parent_dev->bus->self;
-#endif /* CONFIG_PCI_IOV */
-	while (pdev) {
-		struct pci_dn *pdn = pci_get_pdn(pdev);
-		struct pnv_ioda_pe *parent;
-
-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
-			parent = &phb->ioda.pe_array[pdn->pe_number];
-			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
-			if (ret)
-				return ret;
-		}
-
-		pdev = pdev->bus->self;
-	}
-
-	return 0;
-}
-
-#ifdef CONFIG_PCI_IOV
-static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
-{
-	struct pci_dev *parent;
-	uint8_t bcomp, dcomp, fcomp;
-	int64_t rc;
-	long rid_end, rid;
-
-	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
-	if (pe->pbus) {
-		int count;
-
-		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
-		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
-		parent = pe->pbus->self;
-		if (pe->flags & PNV_IODA_PE_BUS_ALL)
-			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
-		else
-			count = 1;
-
-		switch(count) {
-		case  1: bcomp = OpalPciBusAll;         break;
-		case  2: bcomp = OpalPciBus7Bits;       break;
-		case  4: bcomp = OpalPciBus6Bits;       break;
-		case  8: bcomp = OpalPciBus5Bits;       break;
-		case 16: bcomp = OpalPciBus4Bits;       break;
-		case 32: bcomp = OpalPciBus3Bits;       break;
-		default:
-			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
-			        count);
-			/* Do an exact match only */
-			bcomp = OpalPciBusAll;
-		}
-		rid_end = pe->rid + (count << 8);
-	} else {
-		if (pe->flags & PNV_IODA_PE_VF)
-			parent = pe->parent_dev;
-		else
-			parent = pe->pdev->bus->self;
-		bcomp = OpalPciBusAll;
-		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
-		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
-		rid_end = pe->rid + 1;
-	}
-
-	/* Clear the reverse map */
-	for (rid = pe->rid; rid < rid_end; rid++)
-		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
-
-	/* Release from all parents PELT-V */
-	while (parent) {
-		struct pci_dn *pdn = pci_get_pdn(parent);
-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
-			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
-						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
-			/* XXX What to do in case of error ? */
-		}
-		parent = parent->bus->self;
-	}
-
-	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
-				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
-
-	/* Disassociate PE in PELT */
-	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
-				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
-	if (rc)
-		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
-	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
-			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
-	if (rc)
-		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
-
-	pe->pbus = NULL;
-	pe->pdev = NULL;
-	pe->parent_dev = NULL;
-
-	return 0;
-}
-#endif /* CONFIG_PCI_IOV */
-
 static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *parent;
@@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 	}
 
 	/* Configure PELTV */
-	pnv_ioda_set_peltv(phb, pe, true);
+	pnv_ioda_set_peltv(pe, true);
 
 	/* Setup reverse map */
 	for (rid = pe->rid; rid < rid_end; rid++)
@@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 		if (pdn->pe_number != IODA_INVALID_PE)
 			continue;
 
+		/* Increase reference count of the parent PE */
+		pnv_ioda_pe_get(pe);
 		pdn->pe_number = pe->pe_number;
 		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
@@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 {
 	struct pci_controller *hose = pci_bus_to_host(bus);
 	struct pnv_phb *phb = hose->private_data;
-	struct pnv_ioda_pe *pe;
+	struct pnv_ioda_pe *pe = NULL;
 	int pe_num = IODA_INVALID_PE;
 
 	/* For partial hotplug case, the PE instance hasn't been destroyed
@@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 	}
 
 	/* PE number for root bus should have been reserved */
-	if (pci_is_root_bus(bus))
-		pe_num = phb->ioda.root_pe_no;
+	if (pci_is_root_bus(bus) &&
+	    phb->ioda.root_pe_no != IODA_INVALID_PE)
+		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
 
 	/* Check if PE is determined by M64 */
-	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
-		pe_num = phb->pick_m64_pe(phb, bus, all);
+	if (!pe && phb->pick_m64_pe)
+		pe = phb->pick_m64_pe(phb, bus, all);
 
 	/* The PE number isn't pinned by M64 */
-	if (pe_num == IODA_INVALID_PE)
-		pe_num = pnv_ioda_alloc_pe(phb);
+	if (!pe)
+		pe = pnv_ioda_alloc_pe(phb);
 
-	if (pe_num == IODA_INVALID_PE) {
-		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
+	if (!pe) {
+		pr_warn("%s: Not enough PE# available for PCI bus %04x:%02x\n",
 			__func__, pci_domain_nr(bus), bus->number);
 		return NULL;
 	}
 
-	pe = &phb->ioda.pe_array[pe_num];
 	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
 	pe->pbus = bus;
 	pe->pdev = NULL;
@@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 
 	if (pnv_ioda_configure_pe(phb, pe)) {
 		/* XXX What do we do here ? */
-		if (pe_num)
-			pnv_ioda_free_pe(phb, pe_num);
-		pe->pbus = NULL;
+		pnv_ioda_pe_put(pe);
 		return NULL;
 	}
 
 	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
-			GFP_KERNEL, hose->node);
+				       GFP_KERNEL, hose->node);
 	pe->tce32_table->data = pe;
 
 	/* Associate it with all child devices */
@@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 		list_del(&pe->list);
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
-		pnv_ioda_deconfigure_pe(phb, pe);
+		pnv_ioda_deconfigure_pe(pe);
 
-		pnv_ioda_free_pe(phb, pe->pe_number);
+		pnv_ioda_pe_put(pe);
 	}
 }
 
@@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
 
 		if (pnv_ioda_configure_pe(phb, pe)) {
 			/* XXX What do we do here ? */
-			if (pe_num)
-				pnv_ioda_free_pe(phb, pe_num);
-			pe->pdev = NULL;
+			pnv_ioda_pe_put(pe);
 			continue;
 		}
 
@@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
 	struct pnv_ioda_pe *pe;
 	int rc;
 
-	pe = pnv_ioda_get_pe(dev);
+	pe = pnv_ioda_pci_dev_to_pe(dev);
 	if (!pe)
 		return -ENODEV;
 
@@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
 	struct pnv_ioda_pe *pe;
 	int rc;
 
-	if (!(pe = pnv_ioda_get_pe(dev)))
+	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
 		return -ENODEV;
 
 	/* Assign XIVE to PE */
@@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
 				  unsigned int hwirq, unsigned int virq,
 				  unsigned int is_64, struct msi_msg *msg)
 {
-	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
+	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
 	unsigned int xive_num = hwirq - phb->msi_base;
 	__be32 data;
 	int rc;
@@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
 	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
 	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
+	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
 	hose->controller_ops = pnv_pci_controller_ops;
 
 #ifdef CONFIG_PCI_IOV
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 1bea3a8..8b10f01 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -28,6 +28,7 @@ enum pnv_phb_model {
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
 struct pnv_ioda_pe {
+	struct kref		kref;
 	unsigned long		flags;
 	struct pnv_phb		*phb;
 
@@ -120,7 +121,8 @@ struct pnv_phb {
 	void (*shutdown)(struct pnv_phb *phb);
 	int (*init_m64)(struct pnv_phb *phb);
 	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
-	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
+	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
+					   struct pci_bus *bus, int all);
 	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
 	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
 	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
-- 
2.1.0


* [PATCH v4 08/21] powerpc/powernv: Drop pnv_ioda_setup_dev_PE()
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

Nobody is using this function, so the patch drops it.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 71 -------------------------------
 1 file changed, 71 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ef8c216..5cd8298 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1302,77 +1302,6 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
 }
 #endif /* CONFIG_PCI_IOV */
 
-#if 0
-static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
-{
-	struct pci_controller *hose = pci_bus_to_host(dev->bus);
-	struct pnv_phb *phb = hose->private_data;
-	struct pci_dn *pdn = pci_get_pdn(dev);
-	struct pnv_ioda_pe *pe;
-	int pe_num;
-
-	if (!pdn) {
-		pr_err("%s: Device tree node not associated properly\n",
-			   pci_name(dev));
-		return NULL;
-	}
-	if (pdn->pe_number != IODA_INVALID_PE)
-		return NULL;
-
-	/* PE#0 has been pre-set */
-	if (dev->bus->number == 0)
-		pe_num = 0;
-	else
-		pe_num = pnv_ioda_alloc_pe(phb);
-	if (pe_num == IODA_INVALID_PE) {
-		pr_warning("%s: Not enough PE# available, disabling device\n",
-			   pci_name(dev));
-		return NULL;
-	}
-
-	/* NOTE: We get only one ref to the pci_dev for the pdn, not for the
-	 * pointer in the PE data structure, both should be destroyed at the
-	 * same time. However, this needs to be looked at more closely again
-	 * once we actually start removing things (Hotplug, SR-IOV, ...)
-	 *
-	 * At some point we want to remove the PDN completely anyways
-	 */
-	pe = &phb->ioda.pe_array[pe_num];
-	pci_dev_get(dev);
-	pdn->pcidev = dev;
-	pdn->pe_number = pe_num;
-	pe->pdev = dev;
-	pe->pbus = NULL;
-	pe->tce32_seg = -1;
-	pe->mve_number = -1;
-	pe->rid = dev->bus->number << 8 | pdn->devfn;
-
-	pe_info(pe, "Associated device to PE\n");
-
-	if (pnv_ioda_configure_pe(phb, pe)) {
-		/* XXX What do we do here ? */
-		if (pe_num)
-			pnv_ioda_free_pe(phb, pe_num);
-		pdn->pe_number = IODA_INVALID_PE;
-		pe->pdev = NULL;
-		pci_dev_put(dev);
-		return NULL;
-	}
-
-	/* Assign a DMA weight to the device */
-	pe->dma_weight = pnv_ioda_dma_weight(dev);
-	if (pe->dma_weight != 0) {
-		phb->ioda.dma_weight += pe->dma_weight;
-		phb->ioda.dma_pe_count++;
-	}
-
-	/* Link the PE */
-	pnv_ioda_link_pe_by_weight(phb, pe);
-
-	return pe;
-}
-#endif /* Useful for SRIOV case */
-
 static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *dev;
-- 
2.1.0



* [PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

On the PowerNV platform, which runs on top of skiboot, all PE-level
resets should be routed to firmware if the bridge of the PE's primary
bus has the device-node property "ibm,reset-by-firmware". Otherwise,
the kernel has to issue a hot reset on the PE's primary bus regardless
of the requested reset type, which was the behaviour before the
firmware supported PCI slot reset. The changes therefore don't depend
on the PCI slot reset capability being exposed by the firmware.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
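Note (illustrative, not part of the patch): the bridge path in the diff
below builds the id handed to opal_pci_reset() from the bus number, the
devfn and the PHB's OPAL id. A minimal user-space sketch of that layout
follows. The helper name slot_reset_id() is hypothetical, and both the
meaning of bit 60 and the 16-bit bound on the PHB id are assumptions
inferred from the shifts in the diff, not something the patch states.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the id computed in
 * pnv_eeh_bridge_reset(): bit 60 is always set (assumed to tell the
 * firmware that the id names a slot rather than a PHB), followed by
 * the bus number, the devfn and the OPAL PHB id.
 */
static uint64_t slot_reset_id(uint8_t bus, uint8_t devfn, uint64_t phb_id)
{
	/* Assumption: the PHB id stays below bit 16, so fields can't overlap */
	assert(phb_id <= 0xffff);

	return (UINT64_C(1) << 60) |
	       ((uint64_t)bus << 24) |
	       ((uint64_t)devfn << 16) |
	       phb_id;
}
```

The sketch takes narrow integer parameters, so it widens each field to
uint64_t before shifting rather than relying on integer promotion.
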
 arch/powerpc/include/asm/eeh.h               |   1 +
 arch/powerpc/include/asm/opal.h              |   4 +-
 arch/powerpc/platforms/powernv/eeh-powernv.c | 206 +++++++++++++--------------
 3 files changed, 102 insertions(+), 109 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index c5eb86f..2793d24 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -190,6 +190,7 @@ enum {
 #define EEH_RESET_DEACTIVATE	0	/* Deactivate the PE reset	*/
 #define EEH_RESET_HOT		1	/* Hot reset			*/
 #define EEH_RESET_FUNDAMENTAL	3	/* Fundamental reset		*/
+#define EEH_RESET_COMPLETE	4	/* PHB complete reset           */
 #define EEH_LOG_TEMP		1	/* EEH temporary error log	*/
 #define EEH_LOG_PERM		2	/* EEH permanent error log	*/
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 042af1a..6d467df 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -129,7 +129,7 @@ int64_t opal_pci_map_pe_dma_window(uint64_t phb_id, uint16_t pe_number, uint16_t
 int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id, uint16_t pe_number,
 					uint16_t dma_window_number, uint64_t pci_start_addr,
 					uint64_t pci_mem_size);
-int64_t opal_pci_reset(uint64_t phb_id, uint8_t reset_scope, uint8_t assert_state);
+int64_t opal_pci_reset(uint64_t id, uint8_t reset_scope, uint8_t assert_state);
 
 int64_t opal_pci_get_hub_diag_data(uint64_t hub_id, void *diag_buffer,
 				   uint64_t diag_buffer_len);
@@ -145,7 +145,7 @@ int64_t opal_get_epow_status(__be64 *status);
 int64_t opal_set_system_attention_led(uint8_t led_action);
 int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe,
 			    __be16 *pci_error_type, __be16 *severity);
-int64_t opal_pci_poll(uint64_t phb_id);
+int64_t opal_pci_poll(uint64_t id, uint8_t *val);
 int64_t opal_return_cpu(void);
 int64_t opal_check_token(uint64_t token);
 int64_t opal_reinit_cpus(uint64_t flags);
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index ce738ab..3c01095 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -742,12 +742,12 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
 	return ret;
 }
 
-static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
+static s64 pnv_eeh_poll(uint64_t id)
 {
 	s64 rc = OPAL_HARDWARE;
 
 	while (1) {
-		rc = opal_pci_poll(phb->opal_id);
+		rc = opal_pci_poll(id, NULL);
 		if (rc <= 0)
 			break;
 
@@ -763,84 +763,38 @@ static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
 int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
 {
 	struct pnv_phb *phb = hose->private_data;
+	uint8_t scope;
 	s64 rc = OPAL_HARDWARE;
 
 	pr_debug("%s: Reset PHB#%x, option=%d\n",
 		 __func__, hose->global_number, option);
-
-	/* Issue PHB complete reset request */
-	if (option == EEH_RESET_FUNDAMENTAL ||
-	    option == EEH_RESET_HOT)
-		rc = opal_pci_reset(phb->opal_id,
-				    OPAL_RESET_PHB_COMPLETE,
-				    OPAL_ASSERT_RESET);
-	else if (option == EEH_RESET_DEACTIVATE)
-		rc = opal_pci_reset(phb->opal_id,
-				    OPAL_RESET_PHB_COMPLETE,
-				    OPAL_DEASSERT_RESET);
-	if (rc < 0)
-		goto out;
-
-	/*
-	 * Poll state of the PHB until the request is done
-	 * successfully. The PHB reset is usually PHB complete
-	 * reset followed by hot reset on root bus. So we also
-	 * need the PCI bus settlement delay.
-	 */
-	rc = pnv_eeh_phb_poll(phb);
-	if (option == EEH_RESET_DEACTIVATE) {
-		if (system_state < SYSTEM_RUNNING)
-			udelay(1000 * EEH_PE_RST_SETTLE_TIME);
-		else
-			msleep(EEH_PE_RST_SETTLE_TIME);
+	switch (option) {
+	case EEH_RESET_HOT:
+		scope = OPAL_RESET_PCI_HOT;
+		break;
+	case EEH_RESET_FUNDAMENTAL:
+		scope = OPAL_RESET_PCI_FUNDAMENTAL;
+		break;
+	case EEH_RESET_COMPLETE:
+		scope = OPAL_RESET_PHB_COMPLETE;
+		break;
+	case EEH_RESET_DEACTIVATE:
+		return 0;
+	default:
+		pr_warn("%s: Unsupported option %d\n",
+			__func__, option);
+		return -EINVAL;
 	}
-out:
-	if (rc != OPAL_SUCCESS)
-		return -EIO;
 
-	return 0;
-}
-
-static int pnv_eeh_root_reset(struct pci_controller *hose, int option)
-{
-	struct pnv_phb *phb = hose->private_data;
-	s64 rc = OPAL_HARDWARE;
+	/* Issue reset and poll until it's completed */
+	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
+	if (rc > 0)
+		rc = pnv_eeh_poll(phb->opal_id);
 
-	pr_debug("%s: Reset PHB#%x, option=%d\n",
-		 __func__, hose->global_number, option);
-
-	/*
-	 * During the reset deassert time, we needn't care
-	 * the reset scope because the firmware does nothing
-	 * for fundamental or hot reset during deassert phase.
-	 */
-	if (option == EEH_RESET_FUNDAMENTAL)
-		rc = opal_pci_reset(phb->opal_id,
-				    OPAL_RESET_PCI_FUNDAMENTAL,
-				    OPAL_ASSERT_RESET);
-	else if (option == EEH_RESET_HOT)
-		rc = opal_pci_reset(phb->opal_id,
-				    OPAL_RESET_PCI_HOT,
-				    OPAL_ASSERT_RESET);
-	else if (option == EEH_RESET_DEACTIVATE)
-		rc = opal_pci_reset(phb->opal_id,
-				    OPAL_RESET_PCI_HOT,
-				    OPAL_DEASSERT_RESET);
-	if (rc < 0)
-		goto out;
-
-	/* Poll state of the PHB until the request is done */
-	rc = pnv_eeh_phb_poll(phb);
-	if (option == EEH_RESET_DEACTIVATE)
-		msleep(EEH_PE_RST_SETTLE_TIME);
-out:
-	if (rc != OPAL_SUCCESS)
-		return -EIO;
-
-	return 0;
+	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
 }
 
-static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
+static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
 {
 	struct pci_dn *pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
 	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
@@ -891,14 +845,57 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
 	return 0;
 }
 
+static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
+{
+	struct pci_controller *hose;
+	struct pnv_phb *phb;
+	struct device_node *dn = dev ? pci_device_to_OF_node(dev) : NULL;
+	uint64_t id = (0x1ul << 60);
+	uint8_t scope;
+	s64 rc;
+
+	/*
+	 * If the firmware can't handle it, we will issue hot reset
+	 * on the secondary bus despite the requested reset type
+	 */
+	if (!dn || !of_get_property(dn, "ibm,reset-by-firmware", NULL))
+		return __pnv_eeh_bridge_reset(dev, option);
+
+	/* The firmware can handle the request */
+	switch (option) {
+	case EEH_RESET_HOT:
+		scope = OPAL_RESET_PCI_HOT;
+		break;
+	case EEH_RESET_FUNDAMENTAL:
+		scope = OPAL_RESET_PCI_FUNDAMENTAL;
+		break;
+	case EEH_RESET_DEACTIVATE:
+		return 0;
+	case EEH_RESET_COMPLETE:
+	default:
+		pr_warn("%s: Unsupported option %d on device %s\n",
+			__func__, option, pci_name(dev));
+		return -EINVAL;
+	}
+
+	hose = pci_bus_to_host(dev->bus);
+	phb = hose->private_data;
+	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
+	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
+	if (rc > 0)
+		rc = pnv_eeh_poll(id);
+
+	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
+}
+
 void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
 {
 	struct pci_controller *hose;
 
 	if (pci_is_root_bus(dev->bus)) {
 		hose = pci_bus_to_host(dev->bus);
-		pnv_eeh_root_reset(hose, EEH_RESET_HOT);
-		pnv_eeh_root_reset(hose, EEH_RESET_DEACTIVATE);
+		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
+		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
 	} else {
 		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
 		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
@@ -920,8 +917,9 @@ void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
 static int pnv_eeh_reset(struct eeh_pe *pe, int option)
 {
 	struct pci_controller *hose = pe->phb;
+	struct pnv_phb *phb;
 	struct pci_bus *bus;
-	int ret;
+	s64 rc;
 
 	/*
 	 * For PHB reset, we always have complete reset. For those PEs whose
@@ -937,43 +935,37 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
 	 * reset. The side effect is that EEH core has to clear the frozen
 	 * state explicitly after BAR restore.
 	 */
-	if (pe->type & EEH_PE_PHB) {
-		ret = pnv_eeh_phb_reset(hose, option);
-	} else {
-		struct pnv_phb *phb;
-		s64 rc;
+	if (pe->type & EEH_PE_PHB)
+		return pnv_eeh_phb_reset(hose, EEH_RESET_COMPLETE);
 
-		/*
-		 * The frozen PE might be caused by PAPR error injection
-		 * registers, which are expected to be cleared after hitting
-		 * frozen PE as stated in the hardware spec. Unfortunately,
-		 * that's not true on P7IOC. So we have to clear it manually
-		 * to avoid recursive EEH errors during recovery.
-		 */
-		phb = hose->private_data;
-		if (phb->model == PNV_PHB_MODEL_P7IOC &&
-		    (option == EEH_RESET_HOT ||
-		    option == EEH_RESET_FUNDAMENTAL)) {
-			rc = opal_pci_reset(phb->opal_id,
-					    OPAL_RESET_PHB_ERROR,
-					    OPAL_ASSERT_RESET);
-			if (rc != OPAL_SUCCESS) {
-				pr_warn("%s: Failure %lld clearing "
-					"error injection registers\n",
-					__func__, rc);
-				return -EIO;
-			}
+	/*
+	 * The frozen PE might be caused by PAPR error injection
+	 * registers, which are expected to be cleared after hitting
+	 * frozen PE as stated in the hardware spec. Unfortunately,
+	 * that's not true on P7IOC. So we have to clear it manually
+	 * to avoid recursive EEH errors during recovery.
+	 */
+	phb = hose->private_data;
+	if (phb->model == PNV_PHB_MODEL_P7IOC &&
+	    (option == EEH_RESET_HOT ||
+	    option == EEH_RESET_FUNDAMENTAL)) {
+		rc = opal_pci_reset(phb->opal_id,
+				    OPAL_RESET_PHB_ERROR,
+				    OPAL_ASSERT_RESET);
+		if (rc != OPAL_SUCCESS) {
+			pr_warn("%s: Failure %lld clearing error "
+				"injection registers on PHB#%d\n",
+				__func__, rc, hose->global_number);
+			return -EIO;
 		}
-
-		bus = eeh_pe_bus_get(pe);
-		if (pci_is_root_bus(bus) ||
-			pci_is_root_bus(bus->parent))
-			ret = pnv_eeh_root_reset(hose, option);
-		else
-			ret = pnv_eeh_bridge_reset(bus->self, option);
 	}
 
-	return ret;
+	/* Route the reset request to PHB or upstream bridge */
+	bus = eeh_pe_bus_get(pe);
+	if (pci_is_root_bus(bus))
+		return pnv_eeh_phb_reset(hose, option);
+
+	return pnv_eeh_bridge_reset(bus->self, option);
 }
 
 /**
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 10/21] powerpc/powernv: Fundamental reset for PCI bus reset
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

pnv_pci_reset_secondary_bus() resets the specified PCI bus, which is
led by a root complex or a PCI bridge. That means the function
shouldn't be called on a PCI root bus, so the patch removes the logic
for that case.

Also, some adapters beneath the indicated PCI bus may require a
fundamental reset in order to successfully reload their firmware
after the reset. The patch translates the hot reset into a
fundamental reset for that case.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
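Note (illustrative, not part of the patch): the choice between hot and
fundamental reset described in the commit message can be sketched in
user space as below. struct fake_dev, walk_devs() and pick_reset() are
hypothetical stand-ins for struct pci_dev, pci_walk_bus() and the logic
in pnv_pci_reset_secondary_bus(); only the early-stop behaviour of the
callback and the two reset values are taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as EEH_RESET_HOT / EEH_RESET_FUNDAMENTAL in asm/eeh.h */
enum { RESET_HOT = 1, RESET_FUNDAMENTAL = 3 };

/* Hypothetical stand-in for the one struct pci_dev field the walk reads */
struct fake_dev {
	int needs_freset;
};

/* Callback: latch the flag; a non-zero return ends the walk early */
static int dev_reset_type(const struct fake_dev *dev, void *data)
{
	int *freset = data;

	*freset |= dev->needs_freset;
	return *freset;
}

/* Flat-array stand-in for pci_walk_bus(); returns how many devices it visited */
static size_t walk_devs(const struct fake_dev *devs, size_t n,
			int (*cb)(const struct fake_dev *, void *),
			void *data)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (cb(&devs[i], data))
			return i + 1;

	return n;
}

/* Choose the reset type for the bus below a bridge */
static int pick_reset(const struct fake_dev *devs, size_t n)
{
	int freset = 0;

	assert(devs != NULL || n == 0);
	walk_devs(devs, n, dev_reset_type, &freset);

	return freset ? RESET_FUNDAMENTAL : RESET_HOT;
}
```

A bus with no devices, or with none that set the flag, keeps the hot
reset; a single device that sets it upgrades the whole bus reset.
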
 arch/powerpc/platforms/powernv/eeh-powernv.c | 35 +++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 3c01095..58e4dcf 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -888,18 +888,35 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
 	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
 }
 
-void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
+static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
 {
-	struct pci_controller *hose;
+	int *freset = data;
 
-	if (pci_is_root_bus(dev->bus)) {
-		hose = pci_bus_to_host(dev->bus);
-		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
-		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
-	} else {
-		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
-		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
+	/*
+	 * Stop the iteration immediately if there is any
+	 * one PCI device requesting fundamental reset
+	 */
+	*freset |= pdev->needs_freset;
+	return *freset;
+}
+
+void pnv_pci_reset_secondary_bus(struct pci_dev *pdev)
+{
+	int option = EEH_RESET_HOT;
+	int freset = 0;
+
+	/* Check if there're any PCI devices asking for fundamental reset */
+	if (pdev->subordinate) {
+		pci_walk_bus(pdev->subordinate,
+			     pnv_pci_dev_reset_type,
+			     &freset);
+		if (freset)
+			option = EEH_RESET_FUNDAMENTAL;
 	}
+
+	/* Issue the requested type of reset */
+	pnv_eeh_bridge_reset(pdev, option);
+	pnv_eeh_bridge_reset(pdev, EEH_RESET_DEACTIVATE);
 }
 
 /**
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 11/21] powerpc/pci: Don't scan empty slot
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

In the hotplug case, pcibios_add_pci_devices() is called to rescan
the specified PCI bus, which might not have any child devices.
Accessing the child device node of such a bus always crashes the
kernel. The patch skips scanning a PCI bus that has no child devices
in order to avoid the crash.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
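Note (illustrative, not part of the patch): the guard added to the
legacy probe path can be sketched in user space as below. struct
fake_node and slot_is_populated() are hypothetical; the sketch assumes
that PCI_DN() simply returns the node's data pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the struct device_node fields the check reads */
struct fake_node {
	struct fake_node *child;	/* first child node, or NULL */
	void *data;			/* what PCI_DN() is assumed to return */
};

/* Non-zero only if the slot has a child node that carries PCI data */
static int slot_is_populated(const struct fake_node *dn)
{
	/* The condition in the diff also dereferences dn without a NULL check */
	assert(dn != NULL);

	return dn->child != NULL && dn->child->data != NULL;
}
```

An empty slot (no child node) and a child without PCI data both yield
0, so the legacy scan is skipped in either case.
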
 arch/powerpc/kernel/pci-hotplug.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
index 0040343..651a866a 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -92,7 +92,8 @@ void pcibios_add_pci_devices(struct pci_bus * bus)
 	if (mode == PCI_PROBE_DEVTREE) {
 		/* use ofdt-based probe */
 		of_rescan_bus(dn, bus);
-	} else if (mode == PCI_PROBE_NORMAL) {
+	} else if (mode == PCI_PROBE_NORMAL &&
+		   dn->child && PCI_DN(dn->child)) {
 		/*
 		 * Use legacy probe. In the partial hotplug case, we
 		 * probably have grandchildren devices unplugged. So
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 12/21] powerpc/pci: Move pcibios_find_pci_bus() around
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:02   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

The patch moves pcibios_find_pci_bus() to the PPC kernel directory so
that it can be reused by the hotplug code for both the pSeries and
PowerNV platforms.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 arch/powerpc/kernel/pci-hotplug.c          | 36 ++++++++++++++++++++++++++++++
 arch/powerpc/platforms/pseries/pci_dlpar.c | 32 --------------------------
 2 files changed, 36 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
index 651a866a..67094fd 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -21,6 +21,42 @@
 #include <asm/firmware.h>
 #include <asm/eeh.h>
 
+static struct pci_bus *find_pci_bus(struct pci_bus *bus,
+				    struct device_node *dn)
+{
+	struct pci_bus *tmp, *child = NULL;
+	struct device_node *busdn;
+
+	busdn = pci_bus_to_OF_node(bus);
+	if (busdn == dn)
+		return bus;
+
+	list_for_each_entry(tmp, &bus->children, node) {
+		child = find_pci_bus(tmp, dn);
+		if (child)
+			break;
+	}
+
+	return child;
+}
+
+/**
+ * pcibios_find_pci_bus - find PCI bus according to the given device node
+ * @dn: Device node
+ *
+ * Find the corresponding PCI bus according to the given device node.
+ */
+struct pci_bus *pcibios_find_pci_bus(struct device_node *dn)
+{
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn  || !pdn->phb || !pdn->phb->bus)
+		return NULL;
+
+	return find_pci_bus(pdn->phb->bus, dn);
+}
+EXPORT_SYMBOL_GPL(pcibios_find_pci_bus);
+
 /**
  * pcibios_release_device - release PCI device
  * @dev: PCI device
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
index 5d4a3df..906dbaa 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -34,38 +34,6 @@
 
 #include "pseries.h"
 
-static struct pci_bus *
-find_bus_among_children(struct pci_bus *bus,
-                        struct device_node *dn)
-{
-	struct pci_bus *child = NULL;
-	struct pci_bus *tmp;
-	struct device_node *busdn;
-
-	busdn = pci_bus_to_OF_node(bus);
-	if (busdn == dn)
-		return bus;
-
-	list_for_each_entry(tmp, &bus->children, node) {
-		child = find_bus_among_children(tmp, dn);
-		if (child)
-			break;
-	};
-	return child;
-}
-
-struct pci_bus *
-pcibios_find_pci_bus(struct device_node *dn)
-{
-	struct pci_dn *pdn = dn->data;
-
-	if (!pdn  || !pdn->phb || !pdn->phb->bus)
-		return NULL;
-
-	return find_bus_among_children(pdn->phb->bus, dn);
-}
-EXPORT_SYMBOL_GPL(pcibios_find_pci_bus);
-
 struct pci_controller *init_phb_dynamic(struct device_node *dn)
 {
 	struct pci_controller *phb;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 13/21] powerpc/powernv: Introduce pnv_pci_poll()
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:03   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

We might not get some PCI slot information (e.g. power status)
immediately from the OPAL API. Instead, opal_pci_poll() needs to be
called to retrieve the required information.

The patch introduces pnv_pci_poll(), which is based on the original
pnv_eeh_poll(), to cover the above case.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
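The polling contract may be easier to follow outside the kernel. Below
is a minimal userspace sketch, assuming a fake firmware that reports
"busy, retry in 10 ms" a fixed number of times before finishing. The
names struct fake_fw, fake_poll(), fake_delay() and poll_until_done(),
and the FW_* codes, are hypothetical; the real function calls
opal_pci_poll() and sleeps with udelay() or msleep(). The sketch also
omits the extra opal_pci_poll() call that pnv_pci_poll() issues when a
value pointer is supplied.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define FW_SUCCESS	0	/* stands in for OPAL_SUCCESS */
#define FW_ERROR	(-1)	/* stands in for any negative OPAL error */

/* Hypothetical firmware model used only by this sketch. */
struct fake_fw {
	int busy_polls;		/* polls that still answer "wait 10 ms" */
	int64_t final_rc;	/* code reported once the work is done */
	uint8_t result;		/* value handed back through pval */
	int waited_ms;		/* total delay requested so far */
};

static int64_t fake_poll(struct fake_fw *fw, uint8_t *pval)
{
	if (fw->busy_polls > 0) {
		fw->busy_polls--;
		return 10;	/* positive: retry after that many ms */
	}

	if (fw->final_rc == FW_SUCCESS && pval)
		*pval = fw->result;

	return fw->final_rc;
}

/* Record the delay instead of sleeping so the sketch runs instantly. */
static void fake_delay(struct fake_fw *fw, int64_t ms)
{
	fw->waited_ms += (int)ms;
}

/*
 * A positive rval means "wait rval ms and poll again"; zero means
 * success; a negative value is an error and is mapped to -EIO.
 */
static int poll_until_done(struct fake_fw *fw, int64_t rval, uint8_t *pval)
{
	while (rval > 0) {
		fake_delay(fw, rval);
		rval = fake_poll(fw, pval);
	}

	return rval ? -EIO : 0;
}
```

Passing the return value of the initial firmware call as rval lets
callers skip the loop entirely when that call completed synchronously.
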
 arch/powerpc/platforms/powernv/eeh-powernv.c | 28 ++--------------------------
 arch/powerpc/platforms/powernv/pci.c         | 16 ++++++++++++++++
 arch/powerpc/platforms/powernv/pci.h         |  1 +
 3 files changed, 19 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 58e4dcf..9253b9e 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -742,24 +742,6 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
 	return ret;
 }
 
-static s64 pnv_eeh_poll(uint64_t id)
-{
-	s64 rc = OPAL_HARDWARE;
-
-	while (1) {
-		rc = opal_pci_poll(id, NULL);
-		if (rc <= 0)
-			break;
-
-		if (system_state < SYSTEM_RUNNING)
-			udelay(1000 * rc);
-		else
-			msleep(rc);
-	}
-
-	return rc;
-}
-
 int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
 {
 	struct pnv_phb *phb = hose->private_data;
@@ -788,10 +770,7 @@ int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
 
 	/* Issue reset and poll until it's completed */
 	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
-	if (rc > 0)
-		rc = pnv_eeh_poll(phb->opal_id);
-
-	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
+	return pnv_pci_poll(phb->opal_id, rc, NULL);
 }
 
 static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
@@ -882,10 +861,7 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
 	phb = hose->private_data;
 	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
 	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
-	if (rc > 0)
-		rc = pnv_eeh_poll(id);
-
-	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
+	return pnv_pci_poll(id, rc, NULL);
 }
 
 static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index bca2aeb..a2da9a3 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -44,6 +44,22 @@
 #define cfg_dbg(fmt...)	do { } while(0)
 //#define cfg_dbg(fmt...)	printk(fmt)
 
+int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval)
+{
+	while (rval > 0) {
+		if (system_state < SYSTEM_RUNNING)
+			udelay(1000 * rval);
+		else
+			msleep(rval);
+
+		rval = opal_pci_poll(id, pval);
+		if (rval == OPAL_SUCCESS && pval)
+			rval = opal_pci_poll(id, pval);
+	}
+
+	return rval ? -EIO : 0;
+}
+
 #ifdef CONFIG_PCI_MSI
 static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
 {
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 8b10f01..82c5539 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -202,6 +202,7 @@ struct pnv_phb {
 
 extern struct pci_ops pnv_pci_ops;
 
+int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval);
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
 int pnv_pci_cfg_read(struct pci_dn *pdn,
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 14/21] powerpc/powernv: Functions to get/reset PCI slot status
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:03   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

The patch exports 3 functions, which are based on the corresponding
OPAL APIs, to get or set PCI slot status. Those functions are going
to be used by the PCI hotplug module in subsequent patches:

   pnv_pci_get_presence_status()  opal_pci_get_presence_status()
   pnv_pci_get_power_status()     opal_pci_get_power_status()
   pnv_pci_set_power_status()     opal_pci_set_power_status()

In addition, the patch exports pnv_pci_hotplug_notifier() to allow
registering a PCI hotplug notifier, which will be used to receive
PCI hotplug messages from skiboot firmware.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
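The three wrappers share one shape: bail out with -ENXIO when the
firmware does not implement the call, otherwise issue it and translate
the result. A minimal userspace sketch of that shape follows, assuming
a fake slot object in place of the OPAL calls. struct fake_slot,
slot_get_power() and slot_set_power() are hypothetical names, and the
asynchronous (positive return code) path handled by pnv_pci_poll() is
left out for brevity.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical firmware model: one slot whose calls may be missing. */
struct fake_slot {
	bool has_token;	/* stands in for opal_check_token() succeeding */
	uint8_t power;	/* power state held by the fake firmware */
	int64_t rc;	/* return code of the fake firmware call */
};

static int slot_get_power(struct fake_slot *slot, uint8_t *status)
{
	if (!slot->has_token)
		return -ENXIO;	/* firmware lacks the call entirely */

	if (slot->rc != 0)
		return -EIO;	/* the firmware call itself failed */

	*status = slot->power;
	return 0;
}

static int slot_set_power(struct fake_slot *slot, uint8_t status)
{
	if (!slot->has_token)
		return -ENXIO;

	if (slot->rc != 0)
		return -EIO;

	slot->power = status;
	return 0;
}
```

Keeping "not supported" (-ENXIO) distinct from "failed" (-EIO) lets a
caller tell old firmware apart from a genuine slot error.
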
 arch/powerpc/include/asm/opal-api.h            |  7 +++-
 arch/powerpc/include/asm/opal.h                |  3 ++
 arch/powerpc/include/asm/pnv-pci.h             |  5 +++
 arch/powerpc/platforms/powernv/opal-wrappers.S |  3 ++
 arch/powerpc/platforms/powernv/pci.c           | 45 ++++++++++++++++++++++++++
 5 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 0321a90..29b407d 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -153,7 +153,10 @@
 #define OPAL_FLASH_READ				110
 #define OPAL_FLASH_WRITE			111
 #define OPAL_FLASH_ERASE			112
-#define OPAL_LAST				112
+#define OPAL_PCI_GET_PRESENCE_STATUS		116
+#define OPAL_PCI_GET_POWER_STATUS		117
+#define OPAL_PCI_SET_POWER_STATUS		118
+#define OPAL_LAST				118
 
 /* Device tree flags */
 
@@ -352,6 +355,8 @@ enum opal_msg_type {
 	OPAL_MSG_SHUTDOWN,		/* params[0] = 1 reboot, 0 shutdown */
 	OPAL_MSG_HMI_EVT,
 	OPAL_MSG_DPO,
+	OPAL_MSG_PRD,
+	OPAL_MSG_PCI_HOTPLUG,
 	OPAL_MSG_TYPE_MAX,
 };
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 6d467df..a0eb206 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -200,6 +200,9 @@ int64_t opal_flash_write(uint64_t id, uint64_t offset, uint64_t buf,
 		uint64_t size, uint64_t token);
 int64_t opal_flash_erase(uint64_t id, uint64_t offset, uint64_t size,
 		uint64_t token);
+int64_t opal_pci_get_presence_status(uint64_t id, uint8_t *status);
+int64_t opal_pci_get_power_status(uint64_t id, uint8_t *status);
+int64_t opal_pci_set_power_status(uint64_t id, uint8_t status);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/include/asm/pnv-pci.h b/arch/powerpc/include/asm/pnv-pci.h
index f9b4982..50d92a4 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -13,6 +13,11 @@
 #include <linux/pci.h>
 #include <misc/cxl.h>
 
+extern int pnv_pci_get_presence_status(uint64_t id, uint8_t *status);
+extern int pnv_pci_get_power_status(uint64_t id, uint8_t *status);
+extern int pnv_pci_set_power_status(uint64_t id, uint8_t status);
+extern int pnv_pci_hotplug_notifier(struct notifier_block *nb, bool enable);
+
 int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode);
 int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
 			   unsigned int virq);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index a7ade94..aa95dcb 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -295,3 +295,6 @@ OPAL_CALL(opal_i2c_request,			OPAL_I2C_REQUEST);
 OPAL_CALL(opal_flash_read,			OPAL_FLASH_READ);
 OPAL_CALL(opal_flash_write,			OPAL_FLASH_WRITE);
 OPAL_CALL(opal_flash_erase,			OPAL_FLASH_ERASE);
+OPAL_CALL(opal_pci_get_presence_status,		OPAL_PCI_GET_PRESENCE_STATUS);
+OPAL_CALL(opal_pci_get_power_status,		OPAL_PCI_GET_POWER_STATUS);
+OPAL_CALL(opal_pci_set_power_status,		OPAL_PCI_SET_POWER_STATUS);
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index a2da9a3..60e6d65 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -60,6 +60,51 @@ int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval)
 	return rval ? -EIO : 0;
 }
 
+int pnv_pci_get_presence_status(uint64_t id, uint8_t *status)
+{
+	long rc;
+
+	if (!opal_check_token(OPAL_PCI_GET_PRESENCE_STATUS))
+		return -ENXIO;
+
+	rc = opal_pci_get_presence_status(id, status);
+	return pnv_pci_poll(id, rc, status);
+}
+EXPORT_SYMBOL_GPL(pnv_pci_get_presence_status);
+
+int pnv_pci_get_power_status(uint64_t id, uint8_t *status)
+{
+	long rc;
+
+	if (!opal_check_token(OPAL_PCI_GET_POWER_STATUS))
+		return -ENXIO;
+
+	rc = opal_pci_get_power_status(id, status);
+	return pnv_pci_poll(id, rc, status);
+}
+EXPORT_SYMBOL_GPL(pnv_pci_get_power_status);
+
+int pnv_pci_set_power_status(uint64_t id, uint8_t status)
+{
+	long rc;
+
+	if (!opal_check_token(OPAL_PCI_SET_POWER_STATUS))
+		return -ENXIO;
+
+	rc = opal_pci_set_power_status(id, status);
+	return pnv_pci_poll(id, rc, NULL);
+}
+EXPORT_SYMBOL_GPL(pnv_pci_set_power_status);
+
+int pnv_pci_hotplug_notifier(struct notifier_block *nb, bool enable)
+{
+	if (enable)
+		return opal_message_notifier_register(OPAL_MSG_PCI_HOTPLUG, nb);
+
+	return opal_message_notifier_unregister(OPAL_MSG_PCI_HOTPLUG, nb);
+}
+EXPORT_SYMBOL_GPL(pnv_pci_hotplug_notifier);
+
 #ifdef CONFIG_PCI_MSI
 static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
 {
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 15/21] powerpc/pci: Delay creating pci_dn
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:03   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

The pci_dn instances are allocated from memblock or bootmem when
creating PCI controllers (hoses) in setup_arch(). PCI hotplug, which
will be supported by subsequent patches, will release PCI device
nodes and their corresponding pci_dn instances on unplug events.
Memory chunks allocated from memblock or bootmem are hard to reuse
after being released.

The patch delays creating pci_dn instances so that they can be
allocated from slab. In turn, their memory chunks can be reused
after being released. The creation of eeh_dev instances, which
depends on pci_dn, is delayed a bit as well.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/ppc-pci.h     |  1 -
 arch/powerpc/kernel/eeh_dev.c          |  2 +-
 arch/powerpc/kernel/pci_dn.c           | 40 +++++++++++++++++++---------------
 arch/powerpc/platforms/maple/pci.c     | 35 +++++++++++++++++------------
 arch/powerpc/platforms/pasemi/pci.c    |  3 ---
 arch/powerpc/platforms/powermac/pci.c  | 39 ++++++++++++++++++++-------------
 arch/powerpc/platforms/powernv/pci.c   |  3 ---
 arch/powerpc/platforms/pseries/setup.c |  1 -
 8 files changed, 68 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index 4122a86..7388316 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -40,7 +40,6 @@ void *traverse_pci_dn(struct pci_dn *root,
 		      void *(*fn)(struct pci_dn *, void *),
 		      void *data);
 
-extern void pci_devs_phb_init(void);
 extern void pci_devs_phb_init_dynamic(struct pci_controller *phb);
 
 /* From rtas_pci.h */
diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
index aabba94..f33ce5b 100644
--- a/arch/powerpc/kernel/eeh_dev.c
+++ b/arch/powerpc/kernel/eeh_dev.c
@@ -110,4 +110,4 @@ static int __init eeh_dev_phb_init(void)
 	return 0;
 }
 
-core_initcall(eeh_dev_phb_init);
+core_initcall_sync(eeh_dev_phb_init);
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index b3b4df9..d3833af 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -277,7 +277,7 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	struct device_node *parent;
 	struct pci_dn *pdn;
 
-	pdn = zalloc_maybe_bootmem(sizeof(*pdn), GFP_KERNEL);
+	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
 	if (pdn == NULL)
 		return NULL;
 	dn->data = pdn;
@@ -442,33 +442,37 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	traverse_pci_devices(dn, update_dn_pci_info, phb);
 }
 
-/** 
+static void pci_dev_pdn_setup(struct pci_dev *pdev)
+{
+	struct pci_dn *pdn;
+
+	if (pdev->dev.archdata.pci_data)
+		return;
+
+	/* Setup the fast path */
+	pdn = pci_get_pdn(pdev);
+	pdev->dev.archdata.pci_data = pdn;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
+
+/*
  * pci_devs_phb_init - Initialize phbs and pci devs under them.
- * 
- * This routine walks over all phb's (pci-host bridges) on the
- * system, and sets up assorted pci-related structures 
+ *
+ * This routine walks over all phb's (pci-host bridges) on
+ * the system, and sets up assorted pci-related structures
  * (including pci info in the device node structs) for each
  * pci device found underneath.  This routine runs once,
  * early in the boot sequence.
  */
-void __init pci_devs_phb_init(void)
+static int __init pci_devs_phb_init(void)
 {
 	struct pci_controller *phb, *tmp;
 
 	/* This must be done first so the device nodes have valid pci info! */
 	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
 		pci_devs_phb_init_dynamic(phb);
-}
-
-static void pci_dev_pdn_setup(struct pci_dev *pdev)
-{
-	struct pci_dn *pdn;
 
-	if (pdev->dev.archdata.pci_data)
-		return;
-
-	/* Setup the fast path */
-	pdn = pci_get_pdn(pdev);
-	pdev->dev.archdata.pci_data = pdn;
+	return 0;
 }
-DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
+
+core_initcall(pci_devs_phb_init);
diff --git a/arch/powerpc/platforms/maple/pci.c b/arch/powerpc/platforms/maple/pci.c
index a923230..04a69a8 100644
--- a/arch/powerpc/platforms/maple/pci.c
+++ b/arch/powerpc/platforms/maple/pci.c
@@ -568,6 +568,26 @@ void maple_pci_irq_fixup(struct pci_dev *dev)
 	DBG(" <- maple_pci_irq_fixup\n");
 }
 
+static int maple_pci_root_bridge_prepare(struct pci_host_bridge *bridge)
+{
+	struct pci_controller *hose = pci_bus_to_host(bridge->bus);
+	struct device_node *np, *child;
+
+	if (hose != u3_agp)
+		return 0;
+
+	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
+	 * assume there is no P2P bridge on the AGP bus, which should be a
+	 * safe assumptions hopefully.
+	 */
+	np = hose->dn;
+	PCI_DN(np)->busno = 0xf0;
+	for_each_child_of_node(np, child)
+		PCI_DN(child)->busno = 0xf0;
+
+	return 0;
+}
+
 void __init maple_pci_init(void)
 {
 	struct device_node *np, *root;
@@ -605,20 +625,7 @@ void __init maple_pci_init(void)
 	if (ht && maple_add_bridge(ht) != 0)
 		of_node_put(ht);
 
-	/* Setup the linkage between OF nodes and PHBs */ 
-	pci_devs_phb_init();
-
-	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
-	 * assume there is no P2P bridge on the AGP bus, which should be a
-	 * safe assumptions hopefully.
-	 */
-	if (u3_agp) {
-		struct device_node *np = u3_agp->dn;
-		PCI_DN(np)->busno = 0xf0;
-		for (np = np->child; np; np = np->sibling)
-			PCI_DN(np)->busno = 0xf0;
-	}
-
+	ppc_md.pcibios_root_bridge_prepare = maple_pci_root_bridge_prepare;
 	/* Tell pci.c to not change any resource allocations.  */
 	pci_add_flags(PCI_PROBE_ONLY);
 }
diff --git a/arch/powerpc/platforms/pasemi/pci.c b/arch/powerpc/platforms/pasemi/pci.c
index f3a68a0..10c4e8f 100644
--- a/arch/powerpc/platforms/pasemi/pci.c
+++ b/arch/powerpc/platforms/pasemi/pci.c
@@ -229,9 +229,6 @@ void __init pas_pci_init(void)
 			of_node_get(np);
 
 	of_node_put(root);
-
-	/* Setup the linkage between OF nodes and PHBs */
-	pci_devs_phb_init();
 }
 
 void __iomem *pasemi_pci_getcfgaddr(struct pci_dev *dev, int offset)
diff --git a/arch/powerpc/platforms/powermac/pci.c b/arch/powerpc/platforms/powermac/pci.c
index 59ab16f..368716f 100644
--- a/arch/powerpc/platforms/powermac/pci.c
+++ b/arch/powerpc/platforms/powermac/pci.c
@@ -878,6 +878,29 @@ void pmac_pci_irq_fixup(struct pci_dev *dev)
 #endif /* CONFIG_PPC32 */
 }
 
+#ifdef CONFIG_PPC64
+static int pmac_pci_root_bridge_prepare(struct pci_host_bridge *bridge)
+{
+	struct pci_controller *hose = pci_bus_to_host(bridge->bus);
+	struct device_node *np, *child;
+
+	if (hose != u3_agp)
+		return 0;
+
+	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
+	 * assume there is no P2P bridge on the AGP bus, which should be a
+	 * safe assumptions for now. We should do something better in the
+	 * future though
+	 */
+	np = hose->dn;
+	PCI_DN(np)->busno = 0xf0;
+	for_each_child_of_node(np, child)
+		PCI_DN(child)->busno = 0xf0;
+
+	return 0;
+}
+#endif /* CONFIG_PPC64 */
+
 void __init pmac_pci_init(void)
 {
 	struct device_node *np, *root;
@@ -914,22 +937,8 @@ void __init pmac_pci_init(void)
 	if (ht && pmac_add_bridge(ht) != 0)
 		of_node_put(ht);
 
-	/* Setup the linkage between OF nodes and PHBs */
-	pci_devs_phb_init();
-
-	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
-	 * assume there is no P2P bridge on the AGP bus, which should be a
-	 * safe assumptions for now. We should do something better in the
-	 * future though
-	 */
-	if (u3_agp) {
-		struct device_node *np = u3_agp->dn;
-		PCI_DN(np)->busno = 0xf0;
-		for (np = np->child; np; np = np->sibling)
-			PCI_DN(np)->busno = 0xf0;
-	}
 	/* pmac_check_ht_link(); */
-
+	ppc_md.pcibios_root_bridge_prepare = pmac_pci_root_bridge_prepare;
 #else /* CONFIG_PPC64 */
 	init_p2pbridge();
 	init_second_ohare();
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 60e6d65..21a4eb3 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -819,9 +819,6 @@ void __init pnv_pci_init(void)
 	for_each_compatible_node(np, NULL, "ibm,ioda2-phb")
 		pnv_pci_init_ioda2_phb(np);
 
-	/* Setup the linkage between OF nodes and PHBs */
-	pci_devs_phb_init();
-
 	/* Configure IOMMU DMA hooks */
 	ppc_md.tce_build = pnv_tce_build_vm;
 	ppc_md.tce_free = pnv_tce_free_vm;
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index df6a704..5f80758 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -482,7 +482,6 @@ static void __init find_and_init_phbs(void)
 	}
 
 	of_node_put(root);
-	pci_devs_phb_init();
 
 	/*
 	 * PCI_PROBE_ONLY and PCI_REASSIGN_ALL_BUS can be set via properties
-- 
2.1.0


* [PATCH v4 15/21] powerpc/pci: Delay creating pci_dn
@ 2015-05-01  6:03   ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: bhelgaas, linux-pci, Gavin Shan

The pci_dn instances are allocated from memblock or bootmem when the
PCI controllers (hoses) are created in setup_arch(). PCI hotplug,
which will be supported by subsequent patches, releases PCI device
nodes and their corresponding pci_dn instances on an unplug event.
Memory chunks allocated from memblock or bootmem are hard to reuse
after being released.

This patch delays creating the pci_dn instances so that they can be
allocated from slab. In turn, their memory can be reused without
problems after being released. The creation of eeh_dev instances,
which depends on pci_dn, is delayed a bit as well.
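
The ordering this relies on can be sketched in userspace C. In the
kernel, initcalls registered with core_initcall() run before those
registered with core_initcall_sync(), which is why the diff below moves
eeh_dev_phb_init() to the _sync variant. The enum, table and function
names here are hypothetical stand-ins, not kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for two adjacent kernel initcall levels. */
enum level { LEVEL_CORE = 0, LEVEL_CORE_SYNC = 1, LEVEL_COUNT };

typedef int (*initcall_t)(void);

struct initcall_entry {
	enum level level;
	initcall_t fn;
};

/* Records the order in which the simulated initcalls ran. */
static char trace[64];

static int pci_dn_setup(void)  { strcat(trace, "pci_dn;");  return 0; }
static int eeh_dev_setup(void) { strcat(trace, "eeh_dev;"); return 0; }

/* Registered in the "wrong" order on purpose: the level decides. */
static const struct initcall_entry initcalls[] = {
	{ LEVEL_CORE_SYNC, eeh_dev_setup },
	{ LEVEL_CORE,      pci_dn_setup  },
};

/* Run every entry of one level before moving on to the next level. */
static const char *run_initcalls(void)
{
	size_t n = sizeof(initcalls) / sizeof(initcalls[0]);

	trace[0] = '\0';
	for (int lvl = 0; lvl < LEVEL_COUNT; lvl++) {
		for (size_t i = 0; i < n; i++) {
			if (initcalls[i].level != (enum level)lvl)
				continue;
			int rc = initcalls[i].fn();
			assert(rc == 0);
			(void)rc;
		}
	}
	return trace;
}
```

Calling run_initcalls() leaves "pci_dn;eeh_dev;" in the trace, so the
consumer of pci_dn always runs after its producer regardless of the
registration order.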

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/ppc-pci.h     |  1 -
 arch/powerpc/kernel/eeh_dev.c          |  2 +-
 arch/powerpc/kernel/pci_dn.c           | 40 +++++++++++++++++++---------------
 arch/powerpc/platforms/maple/pci.c     | 35 +++++++++++++++++------------
 arch/powerpc/platforms/pasemi/pci.c    |  3 ---
 arch/powerpc/platforms/powermac/pci.c  | 39 ++++++++++++++++++++-------------
 arch/powerpc/platforms/powernv/pci.c   |  3 ---
 arch/powerpc/platforms/pseries/setup.c |  1 -
 8 files changed, 68 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index 4122a86..7388316 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -40,7 +40,6 @@ void *traverse_pci_dn(struct pci_dn *root,
 		      void *(*fn)(struct pci_dn *, void *),
 		      void *data);
 
-extern void pci_devs_phb_init(void);
 extern void pci_devs_phb_init_dynamic(struct pci_controller *phb);
 
 /* From rtas_pci.h */
diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
index aabba94..f33ce5b 100644
--- a/arch/powerpc/kernel/eeh_dev.c
+++ b/arch/powerpc/kernel/eeh_dev.c
@@ -110,4 +110,4 @@ static int __init eeh_dev_phb_init(void)
 	return 0;
 }
 
-core_initcall(eeh_dev_phb_init);
+core_initcall_sync(eeh_dev_phb_init);
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index b3b4df9..d3833af 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -277,7 +277,7 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	struct device_node *parent;
 	struct pci_dn *pdn;
 
-	pdn = zalloc_maybe_bootmem(sizeof(*pdn), GFP_KERNEL);
+	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
 	if (pdn == NULL)
 		return NULL;
 	dn->data = pdn;
@@ -442,33 +442,37 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	traverse_pci_devices(dn, update_dn_pci_info, phb);
 }
 
-/** 
+static void pci_dev_pdn_setup(struct pci_dev *pdev)
+{
+	struct pci_dn *pdn;
+
+	if (pdev->dev.archdata.pci_data)
+		return;
+
+	/* Setup the fast path */
+	pdn = pci_get_pdn(pdev);
+	pdev->dev.archdata.pci_data = pdn;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
+
+/*
  * pci_devs_phb_init - Initialize phbs and pci devs under them.
- * 
- * This routine walks over all phb's (pci-host bridges) on the
- * system, and sets up assorted pci-related structures 
+ *
+ * This routine walks over all phb's (pci-host bridges) on
+ * the system, and sets up assorted pci-related structures
  * (including pci info in the device node structs) for each
  * pci device found underneath.  This routine runs once,
  * early in the boot sequence.
  */
-void __init pci_devs_phb_init(void)
+static int __init pci_devs_phb_init(void)
 {
 	struct pci_controller *phb, *tmp;
 
 	/* This must be done first so the device nodes have valid pci info! */
 	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
 		pci_devs_phb_init_dynamic(phb);
-}
-
-static void pci_dev_pdn_setup(struct pci_dev *pdev)
-{
-	struct pci_dn *pdn;
 
-	if (pdev->dev.archdata.pci_data)
-		return;
-
-	/* Setup the fast path */
-	pdn = pci_get_pdn(pdev);
-	pdev->dev.archdata.pci_data = pdn;
+	return 0;
 }
-DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
+
+core_initcall(pci_devs_phb_init);
diff --git a/arch/powerpc/platforms/maple/pci.c b/arch/powerpc/platforms/maple/pci.c
index a923230..04a69a8 100644
--- a/arch/powerpc/platforms/maple/pci.c
+++ b/arch/powerpc/platforms/maple/pci.c
@@ -568,6 +568,26 @@ void maple_pci_irq_fixup(struct pci_dev *dev)
 	DBG(" <- maple_pci_irq_fixup\n");
 }
 
+static int maple_pci_root_bridge_prepare(struct pci_host_bridge *bridge)
+{
+	struct pci_controller *hose = pci_bus_to_host(bridge->bus);
+	struct device_node *np, *child;
+
+	if (hose != u3_agp)
+		return 0;
+
+	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
+	 * assume there is no P2P bridge on the AGP bus, which should be a
+	 * safe assumption, hopefully.
+	 */
+	np = hose->dn;
+	PCI_DN(np)->busno = 0xf0;
+	for_each_child_of_node(np, child)
+		PCI_DN(child)->busno = 0xf0;
+
+	return 0;
+}
+
 void __init maple_pci_init(void)
 {
 	struct device_node *np, *root;
@@ -605,20 +625,7 @@ void __init maple_pci_init(void)
 	if (ht && maple_add_bridge(ht) != 0)
 		of_node_put(ht);
 
-	/* Setup the linkage between OF nodes and PHBs */ 
-	pci_devs_phb_init();
-
-	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
-	 * assume there is no P2P bridge on the AGP bus, which should be a
-	 * safe assumptions hopefully.
-	 */
-	if (u3_agp) {
-		struct device_node *np = u3_agp->dn;
-		PCI_DN(np)->busno = 0xf0;
-		for (np = np->child; np; np = np->sibling)
-			PCI_DN(np)->busno = 0xf0;
-	}
-
+	ppc_md.pcibios_root_bridge_prepare = maple_pci_root_bridge_prepare;
 	/* Tell pci.c to not change any resource allocations.  */
 	pci_add_flags(PCI_PROBE_ONLY);
 }
diff --git a/arch/powerpc/platforms/pasemi/pci.c b/arch/powerpc/platforms/pasemi/pci.c
index f3a68a0..10c4e8f 100644
--- a/arch/powerpc/platforms/pasemi/pci.c
+++ b/arch/powerpc/platforms/pasemi/pci.c
@@ -229,9 +229,6 @@ void __init pas_pci_init(void)
 			of_node_get(np);
 
 	of_node_put(root);
-
-	/* Setup the linkage between OF nodes and PHBs */
-	pci_devs_phb_init();
 }
 
 void __iomem *pasemi_pci_getcfgaddr(struct pci_dev *dev, int offset)
diff --git a/arch/powerpc/platforms/powermac/pci.c b/arch/powerpc/platforms/powermac/pci.c
index 59ab16f..368716f 100644
--- a/arch/powerpc/platforms/powermac/pci.c
+++ b/arch/powerpc/platforms/powermac/pci.c
@@ -878,6 +878,29 @@ void pmac_pci_irq_fixup(struct pci_dev *dev)
 #endif /* CONFIG_PPC32 */
 }
 
+#ifdef CONFIG_PPC64
+static int pmac_pci_root_bridge_prepare(struct pci_host_bridge *bridge)
+{
+	struct pci_controller *hose = pci_bus_to_host(bridge->bus);
+	struct device_node *np, *child;
+
+	if (hose != u3_agp)
+		return 0;
+
+	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
+	 * assume there is no P2P bridge on the AGP bus, which should be a
+	 * safe assumption for now. We should do something better in the
+	 * future, though.
+	 */
+	np = hose->dn;
+	PCI_DN(np)->busno = 0xf0;
+	for_each_child_of_node(np, child)
+		PCI_DN(child)->busno = 0xf0;
+
+	return 0;
+}
+#endif /* CONFIG_PPC64 */
+
 void __init pmac_pci_init(void)
 {
 	struct device_node *np, *root;
@@ -914,22 +937,8 @@ void __init pmac_pci_init(void)
 	if (ht && pmac_add_bridge(ht) != 0)
 		of_node_put(ht);
 
-	/* Setup the linkage between OF nodes and PHBs */
-	pci_devs_phb_init();
-
-	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
-	 * assume there is no P2P bridge on the AGP bus, which should be a
-	 * safe assumptions for now. We should do something better in the
-	 * future though
-	 */
-	if (u3_agp) {
-		struct device_node *np = u3_agp->dn;
-		PCI_DN(np)->busno = 0xf0;
-		for (np = np->child; np; np = np->sibling)
-			PCI_DN(np)->busno = 0xf0;
-	}
 	/* pmac_check_ht_link(); */
-
+	ppc_md.pcibios_root_bridge_prepare = pmac_pci_root_bridge_prepare;
 #else /* CONFIG_PPC64 */
 	init_p2pbridge();
 	init_second_ohare();
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 60e6d65..21a4eb3 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -819,9 +819,6 @@ void __init pnv_pci_init(void)
 	for_each_compatible_node(np, NULL, "ibm,ioda2-phb")
 		pnv_pci_init_ioda2_phb(np);
 
-	/* Setup the linkage between OF nodes and PHBs */
-	pci_devs_phb_init();
-
 	/* Configure IOMMU DMA hooks */
 	ppc_md.tce_build = pnv_tce_build_vm;
 	ppc_md.tce_free = pnv_tce_free_vm;
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index df6a704..5f80758 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -482,7 +482,6 @@ static void __init find_and_init_phbs(void)
 	}
 
 	of_node_put(root);
-	pci_devs_phb_init();
 
 	/*
 	 * PCI_PROBE_ONLY and PCI_REASSIGN_ALL_BUS can be set via properties
-- 
2.1.0

* [PATCH v4 16/21] powerpc/pci: Create eeh_dev while creating pci_dn
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:03   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

The eeh_dev is always created from its pci_dn, but currently in a
separate initcall registered with core_initcall_sync(). This patch
creates the eeh_dev when the pci_dn is created, so that the two share
the same life cycle.
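
As a rough userspace illustration of that coupled life cycle: the
companion object is created inside the owner's constructor, the owner
is rolled back if that fails, and both are freed together. The
*_sketch types and helper names are hypothetical and only mirror the
shape of the kernel structures:

```c
#include <assert.h>
#include <stdlib.h>

struct eeh_dev_sketch;

/* Hypothetical owner object, standing in for struct pci_dn. */
struct pci_dn_sketch {
	struct eeh_dev_sketch *edev;
};

/* Hypothetical companion object, standing in for struct eeh_dev. */
struct eeh_dev_sketch {
	struct pci_dn_sketch *pdn;
};

/* Create the companion and link both directions; NULL on failure. */
static struct eeh_dev_sketch *edev_create(struct pci_dn_sketch *pdn)
{
	struct eeh_dev_sketch *edev = calloc(1, sizeof(*edev));

	if (!edev)
		return NULL;
	edev->pdn = pdn;
	pdn->edev = edev;
	return edev;
}

/* Create the owner together with its companion, or nothing at all. */
static struct pci_dn_sketch *pdn_create(void)
{
	struct pci_dn_sketch *pdn = calloc(1, sizeof(*pdn));

	if (!pdn)
		return NULL;
	if (!edev_create(pdn)) {
		free(pdn);	/* roll back the half-built owner */
		return NULL;
	}
	return pdn;
}

/* Release the companion first, then the owner. NULL is a no-op. */
static void pdn_destroy(struct pci_dn_sketch *pdn)
{
	if (!pdn)
		return;
	free(pdn->edev);
	pdn->edev = NULL;
	free(pdn);
}
```

With this shape a caller never sees an owner without its companion,
which is the property the patch is after.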

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h         |  6 ++++--
 arch/powerpc/kernel/eeh_dev.c          | 18 ++++--------------
 arch/powerpc/kernel/pci_dn.c           | 12 ++++++++++++
 arch/powerpc/platforms/pseries/setup.c |  6 +-----
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 2793d24..4ed88f6 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -269,7 +269,8 @@ void eeh_pe_restore_bars(struct eeh_pe *pe);
 const char *eeh_pe_loc_get(struct eeh_pe *pe);
 struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe);
 
-void *eeh_dev_init(struct pci_dn *pdn, void *data);
+struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
+			     struct pci_controller *phb);
 void eeh_dev_phb_init_dynamic(struct pci_controller *phb);
 int eeh_init(void);
 int __init eeh_ops_register(struct eeh_ops *ops);
@@ -322,7 +323,8 @@ static inline int eeh_init(void)
 	return 0;
 }
 
-static inline void *eeh_dev_init(struct pci_dn *pdn, void *data)
+static inline struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
+					   struct pci_controller *phb)
 {
 	return NULL;
 }
diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
index f33ce5b..7486932 100644
--- a/arch/powerpc/kernel/eeh_dev.c
+++ b/arch/powerpc/kernel/eeh_dev.c
@@ -44,14 +44,14 @@
 /**
  * eeh_dev_init - Create EEH device according to OF node
  * @pdn: PCI device node
- * @data: PHB
+ * @phb: PCI controller
  *
  * It will create EEH device according to the given OF node. The function
  * might be called by PCI emunation, DR, PHB hotplug.
  */
-void *eeh_dev_init(struct pci_dn *pdn, void *data)
+struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
+			     struct pci_controller *phb)
 {
-	struct pci_controller *phb = data;
 	struct eeh_dev *edev;
 
 	/* Allocate EEH device */
@@ -68,7 +68,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
 	edev->phb = phb;
 	INIT_LIST_HEAD(&edev->list);
 
-	return NULL;
+	return edev;
 }
 
 /**
@@ -80,16 +80,8 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
  */
 void eeh_dev_phb_init_dynamic(struct pci_controller *phb)
 {
-	struct pci_dn *root = phb->pci_data;
-
 	/* EEH PE for PHB */
 	eeh_phb_pe_create(phb);
-
-	/* EEH device for PHB */
-	eeh_dev_init(root, phb);
-
-	/* EEH devices for children OF nodes */
-	traverse_pci_dn(root, eeh_dev_init, phb);
 }
 
 /**
@@ -105,8 +97,6 @@ static int __init eeh_dev_phb_init(void)
 	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
 		eeh_dev_phb_init_dynamic(phb);
 
-	pr_info("EEH: devices created\n");
-
 	return 0;
 }
 
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index d3833af..abc81fa 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -276,6 +276,9 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	const __be32 *regs;
 	struct device_node *parent;
 	struct pci_dn *pdn;
+#ifdef CONFIG_EEH
+	struct eeh_dev *edev;
+#endif
 
 	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
 	if (pdn == NULL)
@@ -306,6 +309,15 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	/* Extended config space */
 	pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
 
+	/* Initialize EEH device */
+#ifdef CONFIG_EEH
+	edev = eeh_dev_init(pdn, phb);
+	if (!edev) {
+		kfree(pdn);
+		return NULL;
+	}
+#endif
+
 	/* Attach to parent node */
 	INIT_LIST_HEAD(&pdn->child_list);
 	INIT_LIST_HEAD(&pdn->list);
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 5f80758..92974aa 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -261,12 +261,8 @@ static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long act
 	switch (action) {
 	case OF_RECONFIG_ATTACH_NODE:
 		pci = np->parent->data;
-		if (pci) {
+		if (pci)
 			update_dn_pci_info(np, pci->phb);
-
-			/* Create EEH device for the OF node */
-			eeh_dev_init(PCI_DN(np), pci->phb);
-		}
 		break;
 	default:
 		err = NOTIFY_DONE;
-- 
2.1.0


* [PATCH v4 17/21] powerpc/pci: Export traverse_pci_device_nodes()
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:03   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

This patch exports the following functions so that the PCI hotplug
logic can reuse them to add or remove the pci_dn of every device node
under a specified PCI slot. The first two are renamed from existing
functions; the third is new.

   Exported function               Derived from
   traverse_pci_device_nodes()     traverse_pci_devices()
   add_pci_device_node_info()      update_dn_pci_info()
   remove_pci_device_node_info()   (newly added)

The patch also releases the eeh_dev when its corresponding pci_dn
is released, so that the two share the same life cycle.
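
The traversal contract can be sketched in userspace C: visit every
descendant of the start node depth first, call the callback on each,
and stop as soon as the callback returns non-NULL. This is a
simplification under stated assumptions; the kernel function is
iterative and only descends into PCI bridges, while the hypothetical
struct node and walk_children() below recurse into every child:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tree node using child/sibling links. */
struct node {
	const char *name;
	struct node *child;
	struct node *sibling;
};

typedef void *(*visit_fn)(struct node *, void *);

/*
 * Pre-order, depth-first walk over the descendants of start. The
 * start node itself is not visited. A non-NULL return value from fn
 * aborts the walk and is handed back to the caller.
 */
static void *walk_children(struct node *start, visit_fn fn, void *data)
{
	for (struct node *n = start->child; n; n = n->sibling) {
		void *ret;

		if (fn && (ret = fn(n, data)) != NULL)
			return ret;
		ret = walk_children(n, fn, data);
		if (ret)
			return ret;
	}
	return NULL;
}

/* Callback that counts nodes and never stops the walk. */
static void *count_node(struct node *n, void *data)
{
	(void)n;
	++*(int *)data;
	return NULL;
}

/* Callback that stops the walk at the first node with a given name. */
static void *match_name(struct node *n, void *data)
{
	return strcmp(n->name, (const char *)data) == 0 ? n : NULL;
}
```

A counting callback always returns NULL and therefore sees the whole
subtree, while a search callback returns the matching node to end the
walk early.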

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h  |  3 +-
 arch/powerpc/include/asm/ppc-pci.h     |  6 +--
 arch/powerpc/kernel/pci_dn.c           | 67 +++++++++++++++++++++++++++++-----
 arch/powerpc/platforms/pseries/msi.c   |  4 +-
 arch/powerpc/platforms/pseries/setup.c |  2 +-
 5 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index a6ad4b1..e0cb114 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -222,7 +222,8 @@ extern struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
 extern struct pci_dn *pci_get_pdn(struct pci_dev *pdev);
 extern struct pci_dn *add_dev_pci_data(struct pci_dev *pdev);
 extern void remove_dev_pci_data(struct pci_dev *pdev);
-extern void *update_dn_pci_info(struct device_node *dn, void *data);
+extern void *add_pci_device_node_info(struct device_node *dn, void *data);
+extern void remove_pci_device_node_info(struct device_node *dn);
 
 static inline int pci_device_from_OF_node(struct device_node *np,
 					  u8 *bus, u8 *devfn)
diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index 7388316..3f0874e 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -33,9 +33,9 @@ extern struct pci_dev *isa_bridge_pcidev;	/* may be NULL if no ISA bus */
 struct device_node;
 struct pci_dn;
 
-typedef void *(*traverse_func)(struct device_node *me, void *data);
-void *traverse_pci_devices(struct device_node *start, traverse_func pre,
-		void *data);
+void *traverse_pci_device_nodes(struct device_node *start,
+				void *(*fn)(struct device_node *, void *),
+				void *data);
 void *traverse_pci_dn(struct pci_dn *root,
 		      void *(*fn)(struct pci_dn *, void *),
 		      void *data);
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index abc81fa..6bd9d8c 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -265,11 +265,15 @@ void remove_dev_pci_data(struct pci_dev *pdev)
 #endif /* CONFIG_PCI_IOV */
 }
 
-/*
- * Traverse_func that inits the PCI fields of the device node.
- * NOTE: this *must* be done before read/write config to the device.
+/**
+ * add_pci_device_node_info - Add pci_dn for PCI device node
+ * @dn: PCI device node
+ * @data: additional argument (the PCI controller)
+ *
+ * Add pci_dn for the indicated PCI device node. The newly created
+ * pci_dn is added to the child list of the parent node's pci_dn.
  */
-void *update_dn_pci_info(struct device_node *dn, void *data)
+void *add_pci_device_node_info(struct device_node *dn, void *data)
 {
 	struct pci_controller *phb = data;
 	const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
@@ -328,8 +332,48 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 
 	return NULL;
 }
+EXPORT_SYMBOL(add_pci_device_node_info);
 
-/*
+/**
+ * remove_pci_device_node_info - Remove pci_dn from PCI device node
+ * @np: PCI device node
+ *
+ * Remove pci_dn from PCI device node. The pci_dn is also removed
+ * from the child list of the parent pci_dn.
+ */
+void remove_pci_device_node_info(struct device_node *np)
+{
+	struct pci_dn *pdn = np ? PCI_DN(np) : NULL;
+#ifdef CONFIG_EEH
+	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+#endif
+
+	if (!pdn)
+		return;
+
+#ifdef CONFIG_EEH
+	if (edev) {
+		pdn->edev = NULL;
+		kfree(edev);
+	}
+#endif
+
+	BUG_ON(!list_empty(&pdn->child_list));
+	list_del(&pdn->list);
+	if (pdn->parent)
+		of_node_put(pdn->parent->node);
+
+	np->data = NULL;
+	kfree(pdn);
+}
+EXPORT_SYMBOL(remove_pci_device_node_info);
+
+/**
+ * traverse_pci_device_nodes - Traverse children of indicated device node
+ * @start: indicated device node
+ * @fn: callback
+ * @data: additional parameter to the callback
+ *
  * Traverse a device tree stopping each PCI device in the tree.
  * This is done depth first.  As each node is processed, a "pre"
  * function is called and the children are processed recursively.
@@ -347,8 +391,9 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
  * one of these nodes we also assume its siblings are non-pci for
  * performance.
  */
-void *traverse_pci_devices(struct device_node *start, traverse_func pre,
-		void *data)
+void *traverse_pci_device_nodes(struct device_node *start,
+				void *(*fn)(struct device_node *, void *),
+				void *data)
 {
 	struct device_node *dn, *nextdn;
 	void *ret;
@@ -363,7 +408,7 @@ void *traverse_pci_devices(struct device_node *start, traverse_func pre,
 		if (classp)
 			class = of_read_number(classp, 1);
 
-		if (pre && ((ret = pre(dn, data)) != NULL))
+		if (fn && ((ret = fn(dn, data)) != NULL))
 			return ret;
 
 		/* If we are a PCI bridge, go down */
@@ -384,8 +429,10 @@ void *traverse_pci_devices(struct device_node *start, traverse_func pre,
 			nextdn = dn->sibling;
 		}
 	}
+
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(traverse_pci_device_nodes);
 
 static struct pci_dn *pci_dn_next_one(struct pci_dn *root,
 				      struct pci_dn *pdn)
@@ -441,7 +488,7 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	struct pci_dn *pdn;
 
 	/* PHB nodes themselves must not match */
-	update_dn_pci_info(dn, phb);
+	add_pci_device_node_info(dn, phb);
 	pdn = dn->data;
 	if (pdn) {
 		pdn->devfn = pdn->busno = -1;
@@ -451,7 +498,7 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	}
 
 	/* Update dn->phb ptrs for new phb and children devices */
-	traverse_pci_devices(dn, update_dn_pci_info, phb);
+	traverse_pci_device_nodes(dn, add_pci_device_node_info, phb);
 }
 
 static void pci_dev_pdn_setup(struct pci_dev *pdev)
diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index c8d24f9..9ebbd19 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -303,7 +303,7 @@ static int msi_quota_for_device(struct pci_dev *dev, int request)
 	memset(&counts, 0, sizeof(struct msi_counts));
 
 	/* Work out how many devices we have below this PE */
-	traverse_pci_devices(pe_dn, count_non_bridge_devices, &counts);
+	traverse_pci_device_nodes(pe_dn, count_non_bridge_devices, &counts);
 
 	if (counts.num_devices == 0) {
 		pr_err("rtas_msi: found 0 devices under PE for %s\n",
@@ -318,7 +318,7 @@ static int msi_quota_for_device(struct pci_dev *dev, int request)
 	/* else, we have some more calculating to do */
 	counts.requestor = pci_device_to_OF_node(dev);
 	counts.request = request;
-	traverse_pci_devices(pe_dn, count_spare_msis, &counts);
+	traverse_pci_device_nodes(pe_dn, count_spare_msis, &counts);
 
 	/* If the quota isn't an integer multiple of the total, we can
 	 * use the remainder as spare MSIs for anyone that wants them. */
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 92974aa..ed8c894 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -262,7 +262,7 @@ static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long act
 	case OF_RECONFIG_ATTACH_NODE:
 		pci = np->parent->data;
 		if (pci)
-			update_dn_pci_info(np, pci->phb);
+			add_pci_device_node_info(np, pci->phb);
 		break;
 	default:
 		err = NOTIFY_DONE;
-- 
2.1.0


* [PATCH v4 17/21] powerpc/pci: Export traverse_pci_device_nodes()
@ 2015-05-01  6:03   ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: bhelgaas, linux-pci, Gavin Shan

This patch exports the following functions so that the PCI hotplug
logic can reuse them to add or remove the pci_dn of every device node
under a specified PCI slot. The first two are renamed from existing
functions; the third is new.

   Exported function               Derived from
   traverse_pci_device_nodes()     traverse_pci_devices()
   add_pci_device_node_info()      update_dn_pci_info()
   remove_pci_device_node_info()   (newly added)

The patch also releases the eeh_dev when its corresponding pci_dn
is released, so that the two share the same life cycle.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h  |  3 +-
 arch/powerpc/include/asm/ppc-pci.h     |  6 +--
 arch/powerpc/kernel/pci_dn.c           | 67 +++++++++++++++++++++++++++++-----
 arch/powerpc/platforms/pseries/msi.c   |  4 +-
 arch/powerpc/platforms/pseries/setup.c |  2 +-
 5 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index a6ad4b1..e0cb114 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -222,7 +222,8 @@ extern struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
 extern struct pci_dn *pci_get_pdn(struct pci_dev *pdev);
 extern struct pci_dn *add_dev_pci_data(struct pci_dev *pdev);
 extern void remove_dev_pci_data(struct pci_dev *pdev);
-extern void *update_dn_pci_info(struct device_node *dn, void *data);
+extern void *add_pci_device_node_info(struct device_node *dn, void *data);
+extern void remove_pci_device_node_info(struct device_node *dn);
 
 static inline int pci_device_from_OF_node(struct device_node *np,
 					  u8 *bus, u8 *devfn)
diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index 7388316..3f0874e 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -33,9 +33,9 @@ extern struct pci_dev *isa_bridge_pcidev;	/* may be NULL if no ISA bus */
 struct device_node;
 struct pci_dn;
 
-typedef void *(*traverse_func)(struct device_node *me, void *data);
-void *traverse_pci_devices(struct device_node *start, traverse_func pre,
-		void *data);
+void *traverse_pci_device_nodes(struct device_node *start,
+				void *(*fn)(struct device_node *, void *),
+				void *data);
 void *traverse_pci_dn(struct pci_dn *root,
 		      void *(*fn)(struct pci_dn *, void *),
 		      void *data);
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index abc81fa..6bd9d8c 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -265,11 +265,15 @@ void remove_dev_pci_data(struct pci_dev *pdev)
 #endif /* CONFIG_PCI_IOV */
 }
 
-/*
- * Traverse_func that inits the PCI fields of the device node.
- * NOTE: this *must* be done before read/write config to the device.
+/**
+ * add_pci_device_node_info - Add pci_dn for PCI device node
+ * @dn: PCI device node
+ * @data: additional argument (the PCI controller)
+ *
+ * Add pci_dn for the indicated PCI device node. The newly created
+ * pci_dn is added to the child list of the parent node's pci_dn.
  */
-void *update_dn_pci_info(struct device_node *dn, void *data)
+void *add_pci_device_node_info(struct device_node *dn, void *data)
 {
 	struct pci_controller *phb = data;
 	const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
@@ -328,8 +332,48 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 
 	return NULL;
 }
+EXPORT_SYMBOL(add_pci_device_node_info);
 
-/*
+/**
+ * remove_pci_device_node_info - Remove pci_dn from PCI device node
+ * @np: PCI device node
+ *
+ * Remove pci_dn from PCI device node. The pci_dn is also removed
+ * from the child list of the parent pci_dn.
+ */
+void remove_pci_device_node_info(struct device_node *np)
+{
+	struct pci_dn *pdn = np ? PCI_DN(np) : NULL;
+#ifdef CONFIG_EEH
+	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
+#endif
+
+	if (!pdn)
+		return;
+
+#ifdef CONFIG_EEH
+	if (edev) {
+		pdn->edev = NULL;
+		kfree(edev);
+	}
+#endif
+
+	BUG_ON(!list_empty(&pdn->child_list));
+	list_del(&pdn->list);
+	if (pdn->parent)
+		of_node_put(pdn->parent->node);
+
+	np->data = NULL;
+	kfree(pdn);
+}
+EXPORT_SYMBOL(remove_pci_device_node_info);
+
+/**
+ * traverse_pci_device_nodes - Traverse children of indicated device node
+ * @start: indicated device node
+ * @fn: callback
+ * @data: additional parameter to the callback
+ *
  * Traverse a device tree stopping each PCI device in the tree.
  * This is done depth first.  As each node is processed, a "pre"
  * function is called and the children are processed recursively.
@@ -347,8 +391,9 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
  * one of these nodes we also assume its siblings are non-pci for
  * performance.
  */
-void *traverse_pci_devices(struct device_node *start, traverse_func pre,
-		void *data)
+void *traverse_pci_device_nodes(struct device_node *start,
+				void *(*fn)(struct device_node *, void *data),
+				void *data)
 {
 	struct device_node *dn, *nextdn;
 	void *ret;
@@ -363,7 +408,7 @@ void *traverse_pci_devices(struct device_node *start, traverse_func pre,
 		if (classp)
 			class = of_read_number(classp, 1);
 
-		if (pre && ((ret = pre(dn, data)) != NULL))
+		if (fn && ((ret = fn(dn, data)) != NULL))
 			return ret;
 
 		/* If we are a PCI bridge, go down */
@@ -384,8 +429,10 @@ void *traverse_pci_devices(struct device_node *start, traverse_func pre,
 			nextdn = dn->sibling;
 		}
 	}
+
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(traverse_pci_device_nodes);
 
 static struct pci_dn *pci_dn_next_one(struct pci_dn *root,
 				      struct pci_dn *pdn)
@@ -441,7 +488,7 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	struct pci_dn *pdn;
 
 	/* PHB nodes themselves must not match */
-	update_dn_pci_info(dn, phb);
+	add_pci_device_node_info(dn, phb);
 	pdn = dn->data;
 	if (pdn) {
 		pdn->devfn = pdn->busno = -1;
@@ -451,7 +498,7 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	}
 
 	/* Update dn->phb ptrs for new phb and children devices */
-	traverse_pci_devices(dn, update_dn_pci_info, phb);
+	traverse_pci_device_nodes(dn, add_pci_device_node_info, phb);
 }
 
 static void pci_dev_pdn_setup(struct pci_dev *pdev)
diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index c8d24f9..9ebbd19 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -303,7 +303,7 @@ static int msi_quota_for_device(struct pci_dev *dev, int request)
 	memset(&counts, 0, sizeof(struct msi_counts));
 
 	/* Work out how many devices we have below this PE */
-	traverse_pci_devices(pe_dn, count_non_bridge_devices, &counts);
+	traverse_pci_device_nodes(pe_dn, count_non_bridge_devices, &counts);
 
 	if (counts.num_devices == 0) {
 		pr_err("rtas_msi: found 0 devices under PE for %s\n",
@@ -318,7 +318,7 @@ static int msi_quota_for_device(struct pci_dev *dev, int request)
 	/* else, we have some more calculating to do */
 	counts.requestor = pci_device_to_OF_node(dev);
 	counts.request = request;
-	traverse_pci_devices(pe_dn, count_spare_msis, &counts);
+	traverse_pci_device_nodes(pe_dn, count_spare_msis, &counts);
 
 	/* If the quota isn't an integer multiple of the total, we can
 	 * use the remainder as spare MSIs for anyone that wants them. */
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 92974aa..ed8c894 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -262,7 +262,7 @@ static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long act
 	case OF_RECONFIG_ATTACH_NODE:
 		pci = np->parent->data;
 		if (pci)
-			update_dn_pci_info(np, pci->phb);
+			add_pci_device_node_info(np, pci->phb);
 		break;
 	default:
 		err = NOTIFY_DONE;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 18/21] powerpc/pci: Update bridge windows on PCI plugging
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:03   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

During a PCI plugging event, the PCI devices are rescanned and
their IO and MMIO resources are reassigned. However, the PowerNV
platform assigns the PE# based on those resources, and that
assignment is triggered by updating the windows of the bridge
of the PE's primary bus.

The patch updates the windows of the bridge of the PE's primary bus
if there is a valid bridge. Otherwise, the bus is assumed to be a
root bus or an SRIOV virtual bus, and no PE is assigned at PCI
plugging time.
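
The choice described above can be modelled in a few lines of userspace C.
This is only a sketch under assumptions: struct mock_dev, struct mock_bus,
enum assign_path and choose_assign_path() are hypothetical stand-ins, not
kernel symbols, and the enum values merely name which of the two PCI core
helpers the patch would end up calling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct pci_dev and struct pci_bus. */
struct mock_dev { int id; };
struct mock_bus {
	struct mock_dev *self;	/* upstream bridge; NULL on a root or SRIOV virtual bus */
};

enum assign_path {
	ASSIGN_NONE,	/* PCI_PROBE_ONLY is set: nothing is reassigned */
	ASSIGN_BRIDGE,	/* pci_assign_unassigned_bridge_resources(bus->self) */
	ASSIGN_BUS,	/* pci_assign_unassigned_bus_resources(bus) */
};

/* Mirror the branch added to pcibios_finish_adding_to_bus(). */
static enum assign_path choose_assign_path(const struct mock_bus *bus,
					    int probe_only)
{
	assert(bus != NULL);
	if (probe_only)
		return ASSIGN_NONE;
	return bus->self ? ASSIGN_BRIDGE : ASSIGN_BUS;
}
```

A bus with an upstream bridge takes the ASSIGN_BRIDGE path, which is the
one that updates the bridge windows and so lets the platform assign the PE.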

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-common.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 01d2a84..cb0bb3f 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1473,8 +1473,12 @@ void pcibios_finish_adding_to_bus(struct pci_bus *bus)
 	/* Allocate bus and devices resources */
 	pcibios_allocate_bus_resources(bus);
 	pcibios_claim_one_bus(bus);
-	if (!pci_has_flag(PCI_PROBE_ONLY))
-		pci_assign_unassigned_bus_resources(bus);
+	if (!pci_has_flag(PCI_PROBE_ONLY)) {
+		if (bus->self)
+			pci_assign_unassigned_bridge_resources(bus->self);
+		else
+			pci_assign_unassigned_bus_resources(bus);
+	}
 
 	/* Fixup EEH */
 	eeh_add_device_tree_late(bus);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:03   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: linux-pci, benh, bhelgaas, Gavin Shan, Grant Likely, Rob Herring

The requirement was raised when developing the PCI hotplug feature
for the PowerPC PowerNV platform, which runs on top of skiboot firmware.
When a PCI adapter is plugged into a PCI slot, the firmware rescans the
slot and builds an FDT (Flat Device Tree) blob, which is sent to the
PowerNV PCI hotplug driver for processing. The device nodes newly
constructed from the FDT blob are expected to be attached to the device
node of the PCI slot. Unfortunately, there seems to be no API
to support this scenario. The patch adds one, of_fdt_add_subtree().
The design behind it is as follows:

   * When the sub-tree FDT blob, which is owned by firmware, is
     received by the kernel, it is copied into a dynamically
     allocated blob. After that, the FDT blob owned by firmware
     isn't touched.
   * unflatten_dt_node() is reworked so that the device nodes at the
     current depth and deeper are constructed from the FDT blob. All
     device nodes built this way are marked with the flag
     OF_DYNAMIC_HYBIRD, which is similar to OF_DYNAMIC. A device node
     with the flag set can be freed, but not in the same way as an
     OF_DYNAMIC device node.
   * of_fdt_add_subtree() is the new API that does the work.
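
How the release path differs between the two flags can be summarised as a
small table in code. The sketch below is one reading of the of_node_release()
hunk in this patch, not kernel code: enum node_kind, the FREE_* bits and
release_plan() are hypothetical names, and the sketch leaves out node->data
and the per-node subtree blob, which the hunk handles separately.

```c
#include <assert.h>

/* Hypothetical classification of a device node by its flags. */
enum node_kind {
	NODE_STATIC,		/* neither flag set */
	NODE_DYNAMIC,		/* OF_DYNAMIC */
	NODE_DYNAMIC_HYBRID,	/* OF_DYNAMIC_HYBIRD, as spelled in the patch */
};

#define FREE_PROP_STRUCT	(1u << 0)	/* kfree(prop) */
#define FREE_PROP_NAME_VALUE	(1u << 1)	/* kfree(prop->name), kfree(prop->value) */
#define FREE_FULL_NAME		(1u << 2)	/* kfree(node->full_name) */
#define FREE_NODE		(1u << 3)	/* kfree(node) */

/* Return the set of allocations the release path would free. */
static unsigned int release_plan(enum node_kind kind)
{
	switch (kind) {
	case NODE_DYNAMIC:
		return FREE_PROP_STRUCT | FREE_PROP_NAME_VALUE |
		       FREE_FULL_NAME | FREE_NODE;
	case NODE_DYNAMIC_HYBRID:
		/* Names and values point into the copied FDT blob, and
		 * full_name lives inside the node allocation itself. */
		return FREE_PROP_STRUCT | FREE_NODE;
	default:
		return 0;
	}
}
```

The hybrid case skips the name, value and full_name frees because those
pointers were never separate allocations for nodes unflattened from a blob.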

Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 drivers/of/dynamic.c   |  19 +++++--
 drivers/of/fdt.c       | 133 ++++++++++++++++++++++++++++++++++++++++---------
 include/linux/of.h     |   2 +
 include/linux/of_fdt.h |   1 +
 4 files changed, 127 insertions(+), 28 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index 3351ef4..f562080 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -330,13 +330,22 @@ void of_node_release(struct kobject *kobj)
 		return;
 	}
 
-	if (!of_node_check_flag(node, OF_DYNAMIC))
+	/* Release the subtree */
+	if (node->subtree) {
+		kfree(node->subtree);
+		node->subtree = NULL;
+	}
+
+	if (!of_node_check_flag(node, OF_DYNAMIC) &&
+	    !of_node_check_flag(node, OF_DYNAMIC_HYBIRD))
 		return;
 
 	while (prop) {
 		struct property *next = prop->next;
-		kfree(prop->name);
-		kfree(prop->value);
+		if (of_node_check_flag(node, OF_DYNAMIC)) {
+			kfree(prop->name);
+			kfree(prop->value);
+		}
 		kfree(prop);
 		prop = next;
 
@@ -345,7 +354,9 @@ void of_node_release(struct kobject *kobj)
 			node->deadprops = NULL;
 		}
 	}
-	kfree(node->full_name);
+
+	if (of_node_check_flag(node, OF_DYNAMIC))
+		kfree(node->full_name);
 	kfree(node->data);
 	kfree(node);
 }
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index cde35c5d01..7659560 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -28,6 +28,10 @@
 #include <asm/setup.h>  /* for COMMAND_LINE_SIZE */
 #include <asm/page.h>
 
+#include "of_private.h"
+
+static int cur_node_depth;
+
 /*
  * of_fdt_limit_memory - limit the number of regions in the /memory node
  * @limit: maximum entries
@@ -168,20 +172,20 @@ static void *unflatten_dt_alloc(void **mem, unsigned long size,
  * @dad: Parent struct device_node
  * @fpsize: Size of the node path up at the current depth.
  */
-static void * unflatten_dt_node(void *blob,
-				void *mem,
-				int *poffset,
-				struct device_node *dad,
-				struct device_node **nodepp,
-				unsigned long fpsize,
-				bool dryrun)
+static void *unflatten_dt_node(void *blob,
+			       void *mem,
+			       int *poffset,
+			       struct device_node *dad,
+			       struct device_node **nodepp,
+			       unsigned long fpsize,
+			       bool dryrun,
+			       bool dynamic)
 {
 	const __be32 *p;
 	struct device_node *np;
 	struct property *pp, **prev_pp = NULL;
 	const char *pathp;
 	unsigned int l, allocl;
-	static int depth = 0;
 	int old_depth;
 	int offset;
 	int has_name = 0;
@@ -219,12 +223,18 @@ static void * unflatten_dt_node(void *blob,
 		}
 	}
 
-	np = unflatten_dt_alloc(&mem, sizeof(struct device_node) + allocl,
+	if (dynamic)
+		np = kzalloc(sizeof(struct device_node) + allocl, GFP_KERNEL);
+	else
+		np = unflatten_dt_alloc(&mem,
+				sizeof(struct device_node) + allocl,
 				__alignof__(struct device_node));
 	if (!dryrun) {
 		char *fn;
 		of_node_init(np);
 		np->full_name = fn = ((char *)np) + sizeof(*np);
+		if (dynamic)
+			of_node_set_flag(np, OF_DYNAMIC_HYBIRD);
 		if (new_format) {
 			/* rebuild full path for new format */
 			if (dad && dad->parent) {
@@ -267,8 +277,12 @@ static void * unflatten_dt_node(void *blob,
 		}
 		if (strcmp(pname, "name") == 0)
 			has_name = 1;
-		pp = unflatten_dt_alloc(&mem, sizeof(struct property),
-					__alignof__(struct property));
+
+		if (dynamic)
+			pp = kzalloc(sizeof(struct property), GFP_KERNEL);
+		else
+			pp = unflatten_dt_alloc(&mem, sizeof(struct property),
+						__alignof__(struct property));
 		if (!dryrun) {
 			/* We accept flattened tree phandles either in
 			 * ePAPR-style "phandle" properties, or the
@@ -309,8 +323,13 @@ static void * unflatten_dt_node(void *blob,
 		if (pa < ps)
 			pa = p1;
 		sz = (pa - ps) + 1;
-		pp = unflatten_dt_alloc(&mem, sizeof(struct property) + sz,
-					__alignof__(struct property));
+
+		if (dynamic)
+			pp = kzalloc(sizeof(struct property) + sz, GFP_KERNEL);
+		else
+			pp = unflatten_dt_alloc(&mem,
+						sizeof(struct property) + sz,
+						__alignof__(struct property));
 		if (!dryrun) {
 			pp->name = "name";
 			pp->length = sz;
@@ -334,13 +353,21 @@ static void * unflatten_dt_node(void *blob,
 			np->type = "<NULL>";
 	}
 
-	old_depth = depth;
-	*poffset = fdt_next_node(blob, *poffset, &depth);
-	if (depth < 0)
-		depth = 0;
-	while (*poffset > 0 && depth > old_depth)
-		mem = unflatten_dt_node(blob, mem, poffset, np, NULL,
-					fpsize, dryrun);
+	old_depth = cur_node_depth;
+	*poffset = fdt_next_node(blob, *poffset, &cur_node_depth);
+	while (*poffset > 0) {
+		if (cur_node_depth < old_depth)
+			break;
+
+		if (cur_node_depth == old_depth)
+			mem = unflatten_dt_node(blob, mem, poffset,
+						dad, NULL, fpsize,
+						dryrun, dynamic);
+		else if (cur_node_depth > old_depth)
+			mem = unflatten_dt_node(blob, mem, poffset,
+						np, NULL, fpsize,
+						dryrun, dynamic);
+	}
 
 	if (*poffset < 0 && *poffset != -FDT_ERR_NOTFOUND)
 		pr_err("unflatten: error %d processing FDT\n", *poffset);
@@ -379,8 +406,8 @@ static void * unflatten_dt_node(void *blob,
  * for the resulting tree
  */
 static void __unflatten_device_tree(void *blob,
-			     struct device_node **mynodes,
-			     void * (*dt_alloc)(u64 size, u64 align))
+				struct device_node **mynodes,
+				void * (*dt_alloc)(u64 size, u64 align))
 {
 	unsigned long size;
 	int start;
@@ -405,7 +432,9 @@ static void __unflatten_device_tree(void *blob,
 
 	/* First pass, scan for size */
 	start = 0;
-	size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL, NULL, 0, true);
+	cur_node_depth = 1;
+	size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL,
+						NULL, 0, true, false);
 	size = ALIGN(size, 4);
 
 	pr_debug("  size is %lx, allocating...\n", size);
@@ -420,7 +449,8 @@ static void __unflatten_device_tree(void *blob,
 
 	/* Second pass, do actual unflattening */
 	start = 0;
-	unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false);
+	cur_node_depth = 1;
+	unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false, false);
 	if (be32_to_cpup(mem + size) != 0xdeadbeef)
 		pr_warning("End of tree marker overwritten: %08x\n",
 			   be32_to_cpup(mem + size));
@@ -448,6 +478,61 @@ void of_fdt_unflatten_tree(unsigned long *blob,
 }
 EXPORT_SYMBOL_GPL(of_fdt_unflatten_tree);
 
+static void populate_sysfs_for_child_nodes(struct device_node *parent)
+{
+	struct device_node *child;
+
+	for_each_child_of_node(parent, child) {
+		__of_attach_node_sysfs(child);
+		populate_sysfs_for_child_nodes(child);
+	}
+}
+
+/**
+ * of_fdt_add_subtree - Create sub-tree of device nodes
+ * @parent: parent device node to which the sub-tree will attach
+ * @blob: flat device tree blob representing the sub-tree
+ *
+ * Copy over the FDT blob, which is passed from firmware, and then
+ * unflatten the sub-tree.
+ */
+void of_fdt_add_subtree(struct device_node *parent, void *blob)
+{
+	int start = 0;
+
+	/* Validate the header */
+	if (!blob || fdt_check_header(blob)) {
+		pr_err("%s: Invalid device-tree blob header at 0x%p\n",
+		       __func__, blob);
+		return;
+	}
+
+	/* Lazily free the flat blob left over from a previous call */
+	if (parent->subtree) {
+		kfree(parent->subtree);
+		parent->subtree = NULL;
+	}
+
+	/* Copy over the flat blob */
+	parent->subtree = kzalloc(fdt_totalsize(blob), GFP_KERNEL);
+	if (!parent->subtree) {
+		pr_err("%s: Cannot copy over device-tree blob\n",
+		       __func__);
+		return;
+	}
+
+	memcpy(parent->subtree, blob, fdt_totalsize(blob));
+
+	/* Unflatten it */
+	mutex_lock(&of_mutex);
+	cur_node_depth = 1;
+	unflatten_dt_node(parent->subtree, NULL, &start, parent, NULL,
+			  strlen(parent->full_name), false, true);
+	populate_sysfs_for_child_nodes(parent);
+	mutex_unlock(&of_mutex);
+}
+EXPORT_SYMBOL(of_fdt_add_subtree);
+
 /* Everything below here references initial_boot_params directly. */
 int __initdata dt_root_addr_cells;
 int __initdata dt_root_size_cells;
diff --git a/include/linux/of.h b/include/linux/of.h
index ddeaae6..ac50b02 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -60,6 +60,7 @@ struct device_node {
 	struct	device_node *sibling;
 	struct	kobject kobj;
 	unsigned long _flags;
+	void	*subtree;
 	void	*data;
 #if defined(CONFIG_SPARC)
 	const char *path_component_name;
@@ -222,6 +223,7 @@ static inline unsigned long of_read_ulong(const __be32 *cell, int size)
 #define OF_DETACHED	2 /* node has been detached from the device tree */
 #define OF_POPULATED	3 /* device already created for the node */
 #define OF_POPULATED_BUS	4 /* of_platform_populate recursed to children of this node */
+#define OF_DYNAMIC_HYBIRD	5 /* similar to OF_DYNAMIC, but partially */
 
 #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags)
 #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags)
diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
index 587ee50..1fb47d7 100644
--- a/include/linux/of_fdt.h
+++ b/include/linux/of_fdt.h
@@ -39,6 +39,7 @@ extern int of_fdt_match(const void *blob, unsigned long node,
 			const char *const *compat);
 extern void of_fdt_unflatten_tree(unsigned long *blob,
 			       struct device_node **mynodes);
+extern void of_fdt_add_subtree(struct device_node *parent, void *blob);
 
 /* TBD: Temporary export of fdt globals - remove when code fully merged */
 extern int __initdata dt_root_addr_cells;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 20/21] powerpc/powernv: Select OF_DYNAMIC
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:03   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

The device tree nodes are changed dynamically on PCI hotplug
events on the PowerNV platform. The patch selects OF_DYNAMIC on the
platform to support PCI hotplug.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 4b044d8..9c62631 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -18,4 +18,5 @@ config PPC_POWERNV
 	select CPU_FREQ_GOV_ONDEMAND
 	select CPU_FREQ_GOV_CONSERVATIVE
 	select PPC_DOORBELL
+	select OF_DYNAMIC
 	default y
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 21/21] pci/hotplug: PowerPC PowerNV PCI hotplug driver
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-01  6:03   ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-pci, benh, bhelgaas, Gavin Shan

The patch adds a standalone driver to support PCI hotplug
for the PowerPC PowerNV platform, which runs on top of skiboot firmware.
The firmware identifies hotpluggable slots and marks their device
tree nodes with the properties "ibm,slot-pluggable" and
"ibm,reset-by-firmware". The driver simply scans the device tree and
creates and registers PCI hotplug slots accordingly.

If the skiboot firmware doesn't support slot status retrieval, the PCI
slot device node shouldn't have the property "ibm,reset-by-firmware". In
that case, no valid PCI slots will be detected from the device tree.
The skiboot firmware doesn't export the capability to access attention
LEDs yet; that is left as future work.
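
The detection rule described above might look roughly like the following
userspace sketch. It assumes that a node counts as a hotplug slot only when
both properties are present, which is one reading of this description rather
than something the text states outright. struct mock_node, mock_has_prop()
and is_hotplug_slot() are hypothetical names; a real driver would query the
device tree through the OF API instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device node: just a list of property names. */
struct mock_node {
	const char *const *props;	/* NULL-terminated, or NULL if empty */
};

/* Return 1 if the node carries a property with the given name. */
static int mock_has_prop(const struct mock_node *n, const char *name)
{
	size_t i;

	for (i = 0; n->props && n->props[i]; i++)
		if (strcmp(n->props[i], name) == 0)
			return 1;
	return 0;
}

/* Assumed rule: a slot is usable only with both firmware properties. */
static int is_hotplug_slot(const struct mock_node *n)
{
	assert(n != NULL);
	return mock_has_prop(n, "ibm,slot-pluggable") &&
	       mock_has_prop(n, "ibm,reset-by-firmware");
}
```

Under this rule, firmware that omits "ibm,reset-by-firmware" yields no
slots at all, which matches the behaviour the second paragraph describes.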

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 drivers/pci/hotplug/Kconfig            |  12 +
 drivers/pci/hotplug/Makefile           |   4 +
 drivers/pci/hotplug/powernv_php.c      | 146 ++++++++
 drivers/pci/hotplug/powernv_php.h      |  78 ++++
 drivers/pci/hotplug/powernv_php_slot.c | 643 +++++++++++++++++++++++++++++++++
 5 files changed, 883 insertions(+)
 create mode 100644 drivers/pci/hotplug/powernv_php.c
 create mode 100644 drivers/pci/hotplug/powernv_php.h
 create mode 100644 drivers/pci/hotplug/powernv_php_slot.c

diff --git a/drivers/pci/hotplug/Kconfig b/drivers/pci/hotplug/Kconfig
index df8caec..ef55dae 100644
--- a/drivers/pci/hotplug/Kconfig
+++ b/drivers/pci/hotplug/Kconfig
@@ -113,6 +113,18 @@ config HOTPLUG_PCI_SHPC
 
 	  When in doubt, say N.
 
+config HOTPLUG_PCI_POWERNV
+	tristate "PowerPC PowerNV PCI Hotplug driver"
+	depends on PPC_POWERNV && EEH
+	help
+	  Say Y here if you run a PowerPC PowerNV platform that supports
+	  PCI Hotplug.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called powernv-php.
+
+	  When in doubt, say N.
+
 config HOTPLUG_PCI_RPA
 	tristate "RPA PCI Hotplug driver"
 	depends on PPC_PSERIES && EEH
diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
index 4a9aa08..a69665e 100644
--- a/drivers/pci/hotplug/Makefile
+++ b/drivers/pci/hotplug/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_HOTPLUG_PCI_PCIE)		+= pciehp.o
 obj-$(CONFIG_HOTPLUG_PCI_CPCI_ZT5550)	+= cpcihp_zt5550.o
 obj-$(CONFIG_HOTPLUG_PCI_CPCI_GENERIC)	+= cpcihp_generic.o
 obj-$(CONFIG_HOTPLUG_PCI_SHPC)		+= shpchp.o
+obj-$(CONFIG_HOTPLUG_PCI_POWERNV)	+= powernv-php.o
 obj-$(CONFIG_HOTPLUG_PCI_RPA)		+= rpaphp.o
 obj-$(CONFIG_HOTPLUG_PCI_RPA_DLPAR)	+= rpadlpar_io.o
 obj-$(CONFIG_HOTPLUG_PCI_SGI)		+= sgi_hotplug.o
@@ -50,6 +51,9 @@ ibmphp-objs		:=	ibmphp_core.o	\
 acpiphp-objs		:=	acpiphp_core.o	\
 				acpiphp_glue.o
 
+powernv-php-objs	:=	powernv_php.o	\
+				powernv_php_slot.o
+
 rpaphp-objs		:=	rpaphp_core.o	\
 				rpaphp_pci.o	\
 				rpaphp_slot.o
diff --git a/drivers/pci/hotplug/powernv_php.c b/drivers/pci/hotplug/powernv_php.c
new file mode 100644
index 0000000..5cf9e717
--- /dev/null
+++ b/drivers/pci/hotplug/powernv_php.c
@@ -0,0 +1,146 @@
+/*
+ * PCI Hotplug Driver for PowerPC PowerNV platform.
+ *
+ * Copyright Gavin Shan, IBM Corporation 2015.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sysfs.h>
+#include <linux/pci.h>
+#include <linux/pci_hotplug.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <asm/opal.h>
+#include <asm/pnv-pci.h>
+
+#include "powernv_php.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"Gavin Shan, IBM Corporation"
+#define DRIVER_DESC	"PowerPC PowerNV PCI Hotplug Driver"
+
+static struct notifier_block php_msg_nb = {
+	.notifier_call	= powernv_php_msg_handler,
+	.next		= NULL,
+	.priority	= 0,
+};
+
+static int powernv_php_register_one(struct device_node *dn)
+{
+	struct powernv_php_slot *slot;
+	const __be32 *prop32;
+	int ret;
+
+	/* Check if it's a hotpluggable slot */
+	prop32 = of_get_property(dn, "ibm,slot-pluggable", NULL);
+	if (!prop32 || !of_read_number(prop32, 1))
+		return 0;
+
+	prop32 = of_get_property(dn, "ibm,reset-by-firmware", NULL);
+	if (!prop32 || !of_read_number(prop32, 1))
+		return 0;
+
+	/* Allocate slot */
+	slot = powernv_php_slot_alloc(dn);
+	if (!slot)
+		return -ENODEV;
+
+	/* Register it */
+	ret = powernv_php_slot_register(slot);
+	if (ret) {
+		powernv_php_slot_put(slot);
+		return ret;
+	}
+
+	return powernv_php_slot_enable(slot->php_slot, false, false);
+}
+
+int powernv_php_register(struct device_node *dn)
+{
+	struct device_node *child;
+	int ret = 0;
+
+	/*
+	 * The parent slots should be registered before their
+	 * child slots.
+	 */
+	for_each_child_of_node(dn, child) {
+		ret = powernv_php_register_one(child);
+		if (ret)
+			break;
+
+		powernv_php_register(child);
+	}
+
+	return ret;
+}
+
+static void powernv_php_unregister_one(struct device_node *dn)
+{
+	struct powernv_php_slot *slot;
+
+	slot = powernv_php_slot_find(dn);
+	if (!slot)
+		return;
+
+	pci_hp_deregister(slot->php_slot);
+}
+
+void powernv_php_unregister(struct device_node *dn)
+{
+	struct device_node *child;
+
+	/* The child slots should go before their parent slots */
+	for_each_child_of_node(dn, child) {
+		powernv_php_unregister(child);
+		powernv_php_unregister_one(child);
+	}
+}
+
+static int __init powernv_php_init(void)
+{
+	struct device_node *dn;
+
+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+
+	/* Register hotplug message handler */
+	if (pnv_pci_hotplug_notifier(&php_msg_nb, true)) {
+		pr_warn("%s: Cannot register hotplug message notifier\n",
+			__func__);
+		return -EIO;
+	}
+
+	/* Scan PHB nodes and their children */
+	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
+		powernv_php_register(dn);
+	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
+		powernv_php_register(dn);
+
+	return 0;
+}
+
+static void __exit powernv_php_exit(void)
+{
+	struct device_node *dn;
+
+	pnv_pci_hotplug_notifier(&php_msg_nb, false);
+
+	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
+		powernv_php_unregister(dn);
+	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
+		powernv_php_unregister(dn);
+}
+
+module_init(powernv_php_init);
+module_exit(powernv_php_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/pci/hotplug/powernv_php.h b/drivers/pci/hotplug/powernv_php.h
new file mode 100644
index 0000000..87ba0d0
--- /dev/null
+++ b/drivers/pci/hotplug/powernv_php.h
@@ -0,0 +1,78 @@
+/*
+ * PCI Hotplug Driver for PowerPC PowerNV platform.
+ *
+ * Copyright Gavin Shan, IBM Corporation 2015.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef _POWERNV_PHP_H
+#define _POWERNV_PHP_H
+
+/* Slot power status */
+#define POWERNV_PHP_SLOT_POWER_OFF	0
+#define POWERNV_PHP_SLOT_POWER_ON	1
+
+/* Slot presence status */
+#define POWERNV_PHP_SLOT_EMPTY		0
+#define POWERNV_PHP_SLOT_PRESENT	1
+
+/* Slot attention status */
+#define POWERNV_PHP_SLOT_ATTEN_OFF	0
+#define POWERNV_PHP_SLOT_ATTEN_ON	1
+#define POWERNV_PHP_SLOT_ATTEN_IND	2
+#define POWERNV_PHP_SLOT_ATTEN_ACT	3
+
+struct powernv_php_slot {
+	struct kref		kref;
+	int			state;
+#define POWERNV_PHP_SLOT_STATE_INIT		0x0
+#define POWERNV_PHP_SLOT_STATE_REGISTER		0x1
+#define POWERNV_PHP_SLOT_STATE_POPULATED	0x2
+	char			*name;
+	struct device_node	*dn;
+	struct pci_bus		*bus;
+	uint64_t		id;
+	int			slot_no;
+	int			check_power_status;
+	int			status_confirmed;
+	struct opal_msg		*msg;
+	struct work_struct	work;
+	wait_queue_head_t	queue;
+	struct hotplug_slot	*php_slot;
+	struct powernv_php_slot	*parent;
+	void (*release)(struct kref *kref);
+	struct list_head	children;
+	struct list_head	link;
+};
+
+#define to_powernv_php_slot(kref) container_of(kref, struct powernv_php_slot, kref)
+
+static inline void powernv_php_slot_get(struct powernv_php_slot *slot)
+{
+	if (slot)
+		kref_get(&slot->kref);
+}
+
+static inline int powernv_php_slot_put(struct powernv_php_slot *slot)
+{
+	if (slot)
+		return kref_put(&slot->kref, slot->release);
+
+	return 0;
+}
+
+int powernv_php_msg_handler(struct notifier_block *nb,
+			    unsigned long type, void *message);
+struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn);
+struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn);
+int powernv_php_slot_register(struct powernv_php_slot *slot);
+int powernv_php_slot_enable(struct hotplug_slot *php_slot,
+			    bool rescan_bus, bool rescan_slot);
+int powernv_php_register(struct device_node *dn);
+void powernv_php_unregister(struct device_node *dn);
+
+#endif /* !_POWERNV_PHP_H */
diff --git a/drivers/pci/hotplug/powernv_php_slot.c b/drivers/pci/hotplug/powernv_php_slot.c
new file mode 100644
index 0000000..fc82355
--- /dev/null
+++ b/drivers/pci/hotplug/powernv_php_slot.c
@@ -0,0 +1,643 @@
+/*
+ * PCI Hotplug Driver for PowerPC PowerNV platform.
+ *
+ * Copyright Gavin Shan, IBM Corporation 2015.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sysfs.h>
+#include <linux/pci.h>
+#include <linux/pci_hotplug.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+
+#include <asm/opal.h>
+#include <asm/pnv-pci.h>
+#include <asm/ppc-pci.h>
+
+#include "powernv_php.h"
+
+static LIST_HEAD(php_slot_list);
+static DEFINE_SPINLOCK(php_slot_lock);
+
+/*
+ * Release firmware data for all child device nodes of the
+ * indicated one.
+ */
+static void release_device_nodes_info(struct device_node *np)
+{
+	struct device_node *child;
+
+	for_each_child_of_node(np, child) {
+		/* Depth first */
+		release_device_nodes_info(child);
+
+		remove_pci_device_node_info(child);
+	}
+}
+
+/*
+ * Release all subordinate device nodes of the indicated one.
+ * The device nodes on the deepest path are released first.
+ */
+static int release_device_nodes(struct device_node *parent)
+{
+	struct device_node *np, *child;
+	int ret = 0;
+
+	/* If the device node has children, remove them first */
+	for_each_child_of_node(parent, np) {
+		ret = release_device_nodes(np);
+		if (ret)
+			return ret;
+
+		/* The device shouldn't have live children */
+		child = of_get_next_child(np, NULL);
+		if (child) {
+			pr_err("%s: Live children of node <%s>\n",
+			       __func__, of_node_full_name(np));
+			of_node_put(child);
+			of_node_put(np);
+			return -EBUSY;
+		}
+
+		/* Detach the device node */
+		of_detach_node(np);
+		of_node_put(np);
+	}
+
+	return 0;
+}
+
+/*
+ * The function processes the message sent by firmware
+ * to remove all device tree nodes beneath the slot's
+ * node, and the associated auxiliary data.
+ */
+static void slot_power_off_handler(struct powernv_php_slot *slot)
+{
+	int ret;
+
+	/* Release the firmware data for the child device nodes */
+	release_device_nodes_info(slot->dn);
+
+	/* Release the child device nodes */
+	ret = release_device_nodes(slot->dn);
+	if (ret)
+		pr_warn("%s: Error %d releasing children of <%s>\n",
+			__func__, ret, of_node_full_name(slot->dn));
+
+	/* Confirm status change; the waiter sleeps uninterruptibly */
+	slot->status_confirmed = 1;
+	wake_up(&slot->queue);
+}
+
+static void slot_power_on_handler(struct powernv_php_slot *slot)
+{
+	struct opal_msg *msg = slot->msg;
+	unsigned long phys = be64_to_cpu(msg->params[2]);
+	unsigned long len = be64_to_cpu(msg->params[3]);
+	void *blob = (phys && len > 0) ? __va(phys) : NULL;
+
+	/* There might be nothing behind the slot yet */
+	if (!blob || !len)
+		goto out;
+
+	/* Copy the FDT blob and parse it */
+	of_fdt_add_subtree(slot->dn, blob);
+
+	/* Add device node firmware data */
+	traverse_pci_device_nodes(slot->dn,
+				  add_pci_device_node_info,
+				  pci_bus_to_host(slot->bus));
+
+out:
+	/* Confirm status change; the waiter sleeps uninterruptibly */
+	slot->status_confirmed = 1;
+	wake_up(&slot->queue);
+}
+
+static void powernv_php_slot_work(struct work_struct *data)
+{
+	struct powernv_php_slot *slot = container_of(data,
+						     struct powernv_php_slot,
+						     work);
+	uint64_t php_event = be64_to_cpu(slot->msg->params[0]);
+
+	switch (php_event) {
+	case 0: /* Slot power off */
+		slot_power_off_handler(slot);
+		break;
+	case 1: /* Slot power on */
+		slot_power_on_handler(slot);
+		break;
+	default:
+		pr_warn("%s: Unsupported hotplug event %lld\n",
+			__func__, php_event);
+	}
+
+	of_node_put(slot->dn);
+}
+
+int powernv_php_msg_handler(struct notifier_block *nb,
+			    unsigned long type, void *message)
+{
+	phandle h;
+	struct device_node *np;
+	struct powernv_php_slot *slot;
+	struct opal_msg *msg = message;
+
+	/* Check the message type */
+	if (type != OPAL_MSG_PCI_HOTPLUG) {
+		pr_warn("%s: Wrong message type %ld received!\n",
+			__func__, type);
+		return 0;
+	}
+
+	/* Find the device node */
+	h = (phandle)be64_to_cpu(msg->params[1]);
+	np = of_find_node_by_phandle(h);
+	if (!np) {
+		pr_warn("%s: No device node for phandle 0x%08x\n",
+			__func__, h);
+		return 0;
+	}
+
+	/* Find the slot */
+	slot = powernv_php_slot_find(np);
+	if (!slot) {
+		pr_warn("%s: No slot found for node <%s>\n",
+			__func__, of_node_full_name(np));
+		of_node_put(np);
+		return 0;
+	}
+
+	/* Schedule the work */
+	slot->msg = msg;
+	schedule_work(&slot->work);
+	return 0;
+}
+
+static int set_power_status(struct hotplug_slot *php_slot, u8 val)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	int ret;
+
+	/* Set power status */
+	slot->status_confirmed = 0;
+	ret = pnv_pci_set_power_status(slot->id, val);
+	if (ret) {
+		pr_warn("%s: Error %d powering %s slot %016llx\n",
+			__func__, ret, val ? "on" : "off", slot->id);
+		return ret;
+	}
+
+	/* Wait until the device tree is updated */
+	ret = wait_event_timeout(slot->queue,
+				 slot->status_confirmed,
+				 10 * HZ);
+	if (!ret) {
+		pr_warn("%s: Timeout completing power-%s slot %016llx\n",
+			__func__, val ? "on" : "off", slot->id);
+		return -ETIMEDOUT;
+	}
+
+	return 0;
+}
+
+static int get_power_status(struct hotplug_slot *php_slot, u8 *val)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	uint8_t state;
+	int ret;
+
+	/*
+	 * Retrieve power status from firmware. If we fail
+	 * getting that, the power status falls back to
+	 * being on.
+	 */
+	ret = pnv_pci_get_power_status(slot->id, &state);
+	if (ret) {
+		*val = POWERNV_PHP_SLOT_POWER_ON;
+		pr_warn("%s: Error %d getting power status of slot %016llx\n",
+			__func__, ret, slot->id);
+	} else {
+		*val = state ? POWERNV_PHP_SLOT_POWER_ON :
+			       POWERNV_PHP_SLOT_POWER_OFF;
+		php_slot->info->power_status = *val;
+	}
+
+	return 0;
+}
+
+static int get_adapter_status(struct hotplug_slot *php_slot, u8 *val)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	uint8_t state;
+	int ret;
+
+	/*
+	 * Retrieve presence status from firmware. If we can't
+	 * get that, it falls back to empty.
+	 */
+	ret = pnv_pci_get_presence_status(slot->id, &state);
+	if (ret >= 0) {
+		*val = state ? POWERNV_PHP_SLOT_PRESENT :
+			       POWERNV_PHP_SLOT_EMPTY;
+		php_slot->info->adapter_status = *val;
+	} else {
+		*val = POWERNV_PHP_SLOT_EMPTY;
+		pr_warn("%s: Error %d getting presence of slot %016llx\n",
+			__func__, ret, slot->id);
+	}
+
+	return ret < 0 ? ret : 0;
+}
+
+static int set_attention_status(struct hotplug_slot *php_slot, u8 val)
+{
+	/* The default operation is to turn on the attention indicator */
+	switch (val) {
+	case POWERNV_PHP_SLOT_ATTEN_OFF:
+	case POWERNV_PHP_SLOT_ATTEN_ON:
+	case POWERNV_PHP_SLOT_ATTEN_IND:
+	case POWERNV_PHP_SLOT_ATTEN_ACT:
+		break;
+	default:
+		val = POWERNV_PHP_SLOT_ATTEN_ON;
+	}
+
+	/* FIXME: Make it real once firmware supports it */
+	php_slot->info->attention_status = val;
+
+	return 0;
+}
+
+int powernv_php_slot_enable(struct hotplug_slot *php_slot,
+			    bool rescan_bus, bool rescan_slot)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	uint8_t presence, power_status;
+	int ret;
+
+	/* Check if the slot has been configured */
+	if (slot->state != POWERNV_PHP_SLOT_STATE_REGISTER)
+		return 0;
+
+	/* Retrieve slot presence status */
+	ret = php_slot->ops->get_adapter_status(php_slot, &presence);
+	if (ret) {
+		pr_warn("%s: Error %d getting presence of slot %016llx\n",
+			__func__, ret, slot->id);
+		return ret;
+	}
+
+	/* Skip the power check if there is nothing behind the slot */
+	if (presence == POWERNV_PHP_SLOT_EMPTY)
+		goto scan;
+
+	/*
+	 * If we detect something behind the slot, we need to make
+	 * sure the power supply to the slot is on. Otherwise, the
+	 * slot's downstream PCIe link would be down.
+	 *
+	 * The first time through, we don't change the power status,
+	 * to speed up boot, on the assumption that the firmware
+	 * supplies consistent slot power status: an empty slot always
+	 * has its power off and a non-empty slot has its power on.
+	 */
+	if (!slot->check_power_status) {
+		slot->check_power_status = 1;
+		goto scan;
+	}
+
+	/* Check the power status. Scan the slot if that's already on */
+	ret = php_slot->ops->get_power_status(php_slot, &power_status);
+	if (ret) {
+		pr_warn("%s: Error %d getting power status of slot %016llx\n",
+			__func__, ret, slot->id);
+		return ret;
+	}
+	if (power_status == POWERNV_PHP_SLOT_POWER_ON)
+		goto scan;
+
+	/* Power is off, turn it on and then scan the slot */
+	ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_ON);
+	if (ret) {
+		pr_warn("%s: Error %d powering on slot %016llx\n",
+			__func__, ret, slot->id);
+		return ret;
+	}
+
+scan:
+	switch (presence) {
+	case POWERNV_PHP_SLOT_PRESENT:
+		if (rescan_bus) {
+			pci_lock_rescan_remove();
+			pcibios_add_pci_devices(slot->bus);
+			pci_unlock_rescan_remove();
+		}
+
+		/* Rescan for child hotpluggable slots */
+		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
+		if (rescan_slot)
+			powernv_php_register(slot->dn);
+		break;
+	case POWERNV_PHP_SLOT_EMPTY:
+		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
+		break;
+	default:
+		pr_warn("%s: Invalid presence status %d of slot %016llx\n",
+			__func__, presence, slot->id);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int enable_slot(struct hotplug_slot *php_slot)
+{
+	return powernv_php_slot_enable(php_slot, true, true);
+}
+
+static int disable_slot(struct hotplug_slot *php_slot)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	uint8_t power_status;
+	int ret;
+
+	if (slot->state != POWERNV_PHP_SLOT_STATE_POPULATED)
+		return 0;
+
+	/* Remove all devices behind the slot */
+	pci_lock_rescan_remove();
+	pcibios_remove_pci_devices(slot->bus);
+	pci_unlock_rescan_remove();
+
+	/* Detach the child hotpluggable slots */
+	powernv_php_unregister(slot->dn);
+
+	/*
+	 * Check the power status and turn it off if necessary. If we
+	 * fail to get the power status, the power will be forced to
+	 * be off.
+	 */
+	ret = php_slot->ops->get_power_status(php_slot, &power_status);
+	if (ret || power_status == POWERNV_PHP_SLOT_POWER_ON) {
+		ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_OFF);
+		if (ret)
+			pr_warn("%s: Error %d powering off slot %016llx\n",
+				__func__, ret, slot->id);
+	}
+
+	/* Update slot state */
+	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
+	return 0;
+}
+
+static struct hotplug_slot_ops php_slot_ops = {
+	.get_power_status	= get_power_status,
+	.get_adapter_status	= get_adapter_status,
+	.set_attention_status	= set_attention_status,
+	.enable_slot		= enable_slot,
+	.disable_slot		= disable_slot,
+};
+
+static struct powernv_php_slot *php_slot_match(struct device_node *dn,
+					       struct powernv_php_slot *slot)
+{
+	struct powernv_php_slot *target, *tmp;
+
+	if (slot->dn == dn)
+		return slot;
+
+	list_for_each_entry(tmp, &slot->children, link) {
+		target = php_slot_match(dn, tmp);
+		if (target)
+			return target;
+	}
+
+	return NULL;
+}
+
+struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn)
+{
+	struct powernv_php_slot *slot, *tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&php_slot_lock, flags);
+	list_for_each_entry(tmp, &php_slot_list, link) {
+		slot = php_slot_match(dn, tmp);
+		if (slot) {
+			spin_unlock_irqrestore(&php_slot_lock, flags);
+			return slot;
+		}
+	}
+	spin_unlock_irqrestore(&php_slot_lock, flags);
+
+	return NULL;
+}
+
+static void php_slot_free(struct kref *kref)
+{
+	struct powernv_php_slot *slot = to_powernv_php_slot(kref);
+
+	WARN_ON(!list_empty(&slot->children));
+	kfree(slot->name);
+	kfree(slot);
+}
+
+static void php_slot_release(struct hotplug_slot *hp_slot)
+{
+	struct powernv_php_slot *slot = hp_slot->private;
+	unsigned long flags;
+
+	/* Remove from global or child list */
+	spin_lock_irqsave(&php_slot_lock, flags);
+	list_del(&slot->link);
+	spin_unlock_irqrestore(&php_slot_lock, flags);
+
+	/* Detach from parent, then drop our own reference */
+	powernv_php_slot_put(slot->parent);
+	powernv_php_slot_put(slot);
+}
+
+static bool php_slot_get_id(struct device_node *dn,
+			    uint64_t *id)
+{
+	struct device_node *parent = dn;
+	const __be64 *prop64;
+	const __be32 *prop32;
+
+	/*
+	 * The hotpluggable slot always has a compound Id, which
+	 * consists of a 16-bit PHB Id, a 16-bit bus/slot/function
+	 * number, and the compound indicator.
+	 */
+	*id = (0x1ul << 63);
+
+	/* Bus/Slot/Function number */
+	prop32 = of_get_property(dn, "reg", NULL);
+	if (!prop32)
+		return false;
+	*id |= ((of_read_number(prop32, 1) & 0x00ffff00) << 8);
+
+	/* PHB Id */
+	while ((parent = of_get_parent(parent))) {
+		if (!PCI_DN(parent)) {
+			of_node_put(parent);
+			break;
+		}
+
+		if (!of_device_is_compatible(parent, "ibm,ioda2-phb") &&
+		    !of_device_is_compatible(parent, "ibm,ioda-phb")) {
+			of_node_put(parent);
+			continue;
+		}
+
+		prop64 = of_get_property(parent, "ibm,opal-phbid", NULL);
+		if (!prop64) {
+			of_node_put(parent);
+			return false;
+		}
+
+		*id |= be64_to_cpup(prop64);
+		of_node_put(parent);
+		return true;
+	}
+
+	return false;
+}
+
+struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn)
+{
+	struct pci_bus *bus;
+	struct powernv_php_slot *slot;
+	const char *label;
+	uint64_t id;
+	int slot_no;
+	size_t size;
+	void *pmem;
+
+	/* Slot name */
+	label = of_get_property(dn, "ibm,slot-label", NULL);
+	if (!label)
+		return NULL;
+
+	/* Slot identifier */
+	if (!php_slot_get_id(dn, &id))
+		return NULL;
+
+	/* PCI bus */
+	bus = pcibios_find_pci_bus(dn);
+	if (!bus)
+		return NULL;
+
+	/* Slot number */
+	if (dn->child && PCI_DN(dn->child))
+		slot_no = PCI_SLOT(PCI_DN(dn->child)->devfn);
+	else
+		slot_no = -1;
+
+	/* Allocate slot */
+	size = sizeof(struct powernv_php_slot) +
+	       sizeof(struct hotplug_slot) +
+	       sizeof(struct hotplug_slot_info);
+	pmem = kzalloc(size, GFP_KERNEL);
+	if (!pmem) {
+		pr_warn("%s: Cannot allocate slot for node %s\n",
+			__func__, dn->full_name);
+		return NULL;
+	}
+
+	/* Assign memory blocks */
+	slot = pmem;
+	slot->php_slot = pmem + sizeof(struct powernv_php_slot);
+	slot->php_slot->info = pmem + sizeof(struct powernv_php_slot) +
+			      sizeof(struct hotplug_slot);
+	slot->name = kstrdup(label, GFP_KERNEL);
+	if (!slot->name) {
+		pr_warn("%s: Cannot populate name for node %s\n",
+			__func__, dn->full_name);
+		kfree(pmem);
+		return NULL;
+	}
+
+	/* Initialize slot */
+	kref_init(&slot->kref);
+	slot->state = POWERNV_PHP_SLOT_STATE_INIT;
+	slot->dn = dn;
+	slot->bus = bus;
+	slot->id = id;
+	slot->slot_no = slot_no;
+	INIT_WORK(&slot->work, powernv_php_slot_work);
+	init_waitqueue_head(&slot->queue);
+	slot->check_power_status = 0;
+	slot->status_confirmed = 0;
+	slot->release = php_slot_free;
+	slot->php_slot->ops = &php_slot_ops;
+	slot->php_slot->release = php_slot_release;
+	slot->php_slot->private = slot;
+	INIT_LIST_HEAD(&slot->children);
+	INIT_LIST_HEAD(&slot->link);
+
+	return slot;
+}
+
+int powernv_php_slot_register(struct powernv_php_slot *slot)
+{
+	struct powernv_php_slot *parent = NULL;
+	struct device_node *dn = slot->dn;
+	unsigned long flags;
+	int ret;
+
+	/* Avoid registering the same slot twice */
+	if (powernv_php_slot_find(slot->dn))
+		return -EEXIST;
+
+	/* Register slot */
+	ret = pci_hp_register(slot->php_slot, slot->bus,
+			      slot->slot_no, slot->name);
+	if (ret) {
+		pr_warn("%s: Cannot register slot %s (%d)\n",
+			__func__, slot->name, ret);
+		return ret;
+	}
+
+	/* Put into global or parent list */
+	while ((dn = of_get_parent(dn))) {
+		if (!PCI_DN(dn)) {
+			of_node_put(dn);
+			break;
+		}
+
+		parent = powernv_php_slot_find(dn);
+		if (parent) {
+			of_node_put(dn);
+			break;
+		}
+	}
+
+	spin_lock_irqsave(&php_slot_lock, flags);
+	if (parent) {
+		powernv_php_slot_get(parent);
+		slot->parent = parent;
+		list_add_tail(&slot->link, &parent->children);
+	} else {
+		list_add_tail(&slot->link, &php_slot_list);
+	}
+	spin_unlock_irqrestore(&php_slot_lock, flags);
+
+	/* Update slot state */
+	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
+	return 0;
+}
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 184+ messages in thread

* [PATCH v4 21/21] pci/hotplug: PowerPC PowerNV PCI hotplug driver
@ 2015-05-01  6:03   ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-01  6:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: bhelgaas, linux-pci, Gavin Shan

Add a standalone driver to support PCI hotplug on the PowerPC PowerNV
platform, which runs on top of skiboot firmware. The firmware identifies
hotpluggable slots and marks their device tree nodes with the
"ibm,slot-pluggable" and "ibm,reset-by-firmware" properties. The driver
scans the device tree and creates and registers a PCI hotplug slot for
each such node.

If the skiboot firmware doesn't support slot status retrieval, the PCI
slot device nodes won't have the "ibm,reset-by-firmware" property. In
that case, no valid PCI slot is detected from the device tree.
The skiboot firmware doesn't export the capability to access attention
LEDs yet; that is left as future work.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 drivers/pci/hotplug/Kconfig            |  12 +
 drivers/pci/hotplug/Makefile           |   4 +
 drivers/pci/hotplug/powernv_php.c      | 146 ++++++++
 drivers/pci/hotplug/powernv_php.h      |  78 ++++
 drivers/pci/hotplug/powernv_php_slot.c | 643 +++++++++++++++++++++++++++++++++
 5 files changed, 883 insertions(+)
 create mode 100644 drivers/pci/hotplug/powernv_php.c
 create mode 100644 drivers/pci/hotplug/powernv_php.h
 create mode 100644 drivers/pci/hotplug/powernv_php_slot.c

diff --git a/drivers/pci/hotplug/Kconfig b/drivers/pci/hotplug/Kconfig
index df8caec..ef55dae 100644
--- a/drivers/pci/hotplug/Kconfig
+++ b/drivers/pci/hotplug/Kconfig
@@ -113,6 +113,18 @@ config HOTPLUG_PCI_SHPC
 
 	  When in doubt, say N.
 
+config HOTPLUG_PCI_POWERNV
+	tristate "PowerPC PowerNV PCI Hotplug driver"
+	depends on PPC_POWERNV && EEH
+	help
+	  Say Y here if you run a PowerPC PowerNV platform that supports
+	  PCI Hotplug.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called powernv-php.
+
+	  When in doubt, say N.
+
 config HOTPLUG_PCI_RPA
 	tristate "RPA PCI Hotplug driver"
 	depends on PPC_PSERIES && EEH
diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
index 4a9aa08..a69665e 100644
--- a/drivers/pci/hotplug/Makefile
+++ b/drivers/pci/hotplug/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_HOTPLUG_PCI_PCIE)		+= pciehp.o
 obj-$(CONFIG_HOTPLUG_PCI_CPCI_ZT5550)	+= cpcihp_zt5550.o
 obj-$(CONFIG_HOTPLUG_PCI_CPCI_GENERIC)	+= cpcihp_generic.o
 obj-$(CONFIG_HOTPLUG_PCI_SHPC)		+= shpchp.o
+obj-$(CONFIG_HOTPLUG_PCI_POWERNV)	+= powernv-php.o
 obj-$(CONFIG_HOTPLUG_PCI_RPA)		+= rpaphp.o
 obj-$(CONFIG_HOTPLUG_PCI_RPA_DLPAR)	+= rpadlpar_io.o
 obj-$(CONFIG_HOTPLUG_PCI_SGI)		+= sgi_hotplug.o
@@ -50,6 +51,9 @@ ibmphp-objs		:=	ibmphp_core.o	\
 acpiphp-objs		:=	acpiphp_core.o	\
 				acpiphp_glue.o
 
+powernv-php-objs	:=	powernv_php.o	\
+				powernv_php_slot.o
+
 rpaphp-objs		:=	rpaphp_core.o	\
 				rpaphp_pci.o	\
 				rpaphp_slot.o
diff --git a/drivers/pci/hotplug/powernv_php.c b/drivers/pci/hotplug/powernv_php.c
new file mode 100644
index 0000000..5cf9e717
--- /dev/null
+++ b/drivers/pci/hotplug/powernv_php.c
@@ -0,0 +1,146 @@
+/*
+ * PCI Hotplug Driver for PowerPC PowerNV platform.
+ *
+ * Copyright Gavin Shan, IBM Corporation 2015.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sysfs.h>
+#include <linux/pci.h>
+#include <linux/pci_hotplug.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <asm/opal.h>
+#include <asm/pnv-pci.h>
+
+#include "powernv_php.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"Gavin Shan, IBM Corporation"
+#define DRIVER_DESC	"PowerPC PowerNV PCI Hotplug Driver"
+
+static struct notifier_block php_msg_nb = {
+	.notifier_call	= powernv_php_msg_handler,
+	.next		= NULL,
+	.priority	= 0,
+};
+
+static int powernv_php_register_one(struct device_node *dn)
+{
+	struct powernv_php_slot *slot;
+	const __be32 *prop32;
+	int ret;
+
+	/* Check if it's a hotpluggable slot */
+	prop32 = of_get_property(dn, "ibm,slot-pluggable", NULL);
+	if (!prop32 || !of_read_number(prop32, 1))
+		return 0;
+
+	prop32 = of_get_property(dn, "ibm,reset-by-firmware", NULL);
+	if (!prop32 || !of_read_number(prop32, 1))
+		return 0;
+
+	/* Allocate slot */
+	slot = powernv_php_slot_alloc(dn);
+	if (!slot)
+		return -ENODEV;
+
+	/* Register it */
+	ret = powernv_php_slot_register(slot);
+	if (ret) {
+		powernv_php_slot_put(slot);
+		return ret;
+	}
+
+	return powernv_php_slot_enable(slot->php_slot, false, false);
+}
+
+int powernv_php_register(struct device_node *dn)
+{
+	struct device_node *child;
+	int ret = 0;
+
+	/*
+	 * The parent slots should be registered before their
+	 * child slots.
+	 */
+	for_each_child_of_node(dn, child) {
+		ret = powernv_php_register_one(child);
+		if (ret)
+			break;
+
+		powernv_php_register(child);
+	}
+
+	return ret;
+}
+
+static void powernv_php_unregister_one(struct device_node *dn)
+{
+	struct powernv_php_slot *slot;
+
+	slot = powernv_php_slot_find(dn);
+	if (!slot)
+		return;
+
+	pci_hp_deregister(slot->php_slot);
+}
+
+void powernv_php_unregister(struct device_node *dn)
+{
+	struct device_node *child;
+
+	/* The child slots should go before their parent slots */
+	for_each_child_of_node(dn, child) {
+		powernv_php_unregister(child);
+		powernv_php_unregister_one(child);
+	}
+}
+
+static int __init powernv_php_init(void)
+{
+	struct device_node *dn;
+
+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+
+	/* Register hotplug message handler */
+	if (pnv_pci_hotplug_notifier(&php_msg_nb, true)) {
+		pr_warn("%s: Cannot register hotplug message notifier\n",
+			__func__);
+		return -EIO;
+	}
+
+	/* Scan PHB nodes and their children */
+	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
+		powernv_php_register(dn);
+	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
+		powernv_php_register(dn);
+
+	return 0;
+}
+
+static void __exit powernv_php_exit(void)
+{
+	struct device_node *dn;
+
+	pnv_pci_hotplug_notifier(&php_msg_nb, false);
+
+	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
+		powernv_php_unregister(dn);
+	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
+		powernv_php_unregister(dn);
+}
+
+module_init(powernv_php_init);
+module_exit(powernv_php_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/pci/hotplug/powernv_php.h b/drivers/pci/hotplug/powernv_php.h
new file mode 100644
index 0000000..87ba0d0
--- /dev/null
+++ b/drivers/pci/hotplug/powernv_php.h
@@ -0,0 +1,78 @@
+/*
+ * PCI Hotplug Driver for PowerPC PowerNV platform.
+ *
+ * Copyright Gavin Shan, IBM Corporation 2015.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef _POWERNV_PHP_H
+#define _POWERNV_PHP_H
+
+/* Slot power status */
+#define POWERNV_PHP_SLOT_POWER_OFF	0
+#define POWERNV_PHP_SLOT_POWER_ON	1
+
+/* Slot presence status */
+#define POWERNV_PHP_SLOT_EMPTY		0
+#define POWERNV_PHP_SLOT_PRESENT	1
+
+/* Slot attention status */
+#define POWERNV_PHP_SLOT_ATTEN_OFF	0
+#define POWERNV_PHP_SLOT_ATTEN_ON	1
+#define POWERNV_PHP_SLOT_ATTEN_IND	2
+#define POWERNV_PHP_SLOT_ATTEN_ACT	3
+
+struct powernv_php_slot {
+	struct kref		kref;
+	int			state;
+#define POWERNV_PHP_SLOT_STATE_INIT		0x0
+#define POWERNV_PHP_SLOT_STATE_REGISTER		0x1
+#define POWERNV_PHP_SLOT_STATE_POPULATED	0x2
+	char			*name;
+	struct device_node	*dn;
+	struct pci_bus		*bus;
+	uint64_t		id;
+	int			slot_no;
+	int			check_power_status;
+	int			status_confirmed;
+	struct opal_msg		*msg;
+	struct work_struct	work;
+	wait_queue_head_t	queue;
+	struct hotplug_slot	*php_slot;
+	struct powernv_php_slot	*parent;
+	void (*release)(struct kref *kref);
+	struct list_head	children;
+	struct list_head	link;
+};
+
+#define to_powernv_php_slot(kref) container_of(kref, struct powernv_php_slot, kref)
+
+static inline void powernv_php_slot_get(struct powernv_php_slot *slot)
+{
+	if (slot)
+		kref_get(&slot->kref);
+}
+
+static inline int powernv_php_slot_put(struct powernv_php_slot *slot)
+{
+	if (slot)
+		return kref_put(&slot->kref, slot->release);
+
+	return 0;
+}
+
+int powernv_php_msg_handler(struct notifier_block *nb,
+			    unsigned long type, void *message);
+struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn);
+struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn);
+int powernv_php_slot_register(struct powernv_php_slot *slot);
+int powernv_php_slot_enable(struct hotplug_slot *php_slot,
+			    bool rescan_bus, bool rescan_slot);
+int powernv_php_register(struct device_node *dn);
+void powernv_php_unregister(struct device_node *dn);
+
+#endif /* !_POWERNV_PHP_H */
diff --git a/drivers/pci/hotplug/powernv_php_slot.c b/drivers/pci/hotplug/powernv_php_slot.c
new file mode 100644
index 0000000..fc82355
--- /dev/null
+++ b/drivers/pci/hotplug/powernv_php_slot.c
@@ -0,0 +1,643 @@
+/*
+ * PCI Hotplug Driver for PowerPC PowerNV platform.
+ *
+ * Copyright Gavin Shan, IBM Corporation 2015.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sysfs.h>
+#include <linux/pci.h>
+#include <linux/pci_hotplug.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+
+#include <asm/opal.h>
+#include <asm/pnv-pci.h>
+#include <asm/ppc-pci.h>
+
+#include "powernv_php.h"
+
+static LIST_HEAD(php_slot_list);
+static DEFINE_SPINLOCK(php_slot_lock);
+
+/*
+ * Release firmware data for all child device nodes of the
+ * indicated one.
+ */
+static void release_device_nodes_info(struct device_node *np)
+{
+	struct device_node *child;
+
+	for_each_child_of_node(np, child) {
+		/* Depth first: descendants before the child itself */
+		release_device_nodes_info(child);
+
+		remove_pci_device_node_info(child);
+	}
+}
+
+/*
+ * Release all subordinate device nodes of the indicated one.
+ * The deepest device nodes are released first.
+ */
+static int release_device_nodes(struct device_node *parent)
+{
+	struct device_node *np, *child;
+	int ret = 0;
+
+	/* If the device node has children, remove them first */
+	for_each_child_of_node(parent, np) {
+		ret = release_device_nodes(np);
+		if (ret)
+			return ret;
+
+		/* The device node shouldn't have any remaining children */
+		child = of_get_next_child(np, NULL);
+		if (child) {
+			pr_err("%s: Alive children of node <%s>\n",
+			       __func__, of_node_full_name(np));
+			of_node_put(child);
+			of_node_put(np);
+			return -EBUSY;
+		}
+
+		/* Detach the device node */
+		of_detach_node(np);
+		of_node_put(np);
+	}
+
+	return 0;
+}
+
+/*
+ * Handle the power-off message sent by firmware: remove all
+ * device tree nodes beneath the slot's node, along with the
+ * associated auxiliary data.
+ */
+static void slot_power_off_handler(struct powernv_php_slot *slot)
+{
+	int ret;
+
+	/* Release the firmware data for the child device nodes */
+	release_device_nodes_info(slot->dn);
+
+	/* Release the child device nodes */
+	ret = release_device_nodes(slot->dn);
+	if (ret)
+		pr_warn("%s: Error %d releasing children of <%s>\n",
+			__func__, ret, of_node_full_name(slot->dn));
+
+	/* Confirm status change */
+	slot->status_confirmed = 1;
+	wake_up(&slot->queue);
+}
+
+static void slot_power_on_handler(struct powernv_php_slot *slot)
+{
+	struct opal_msg *msg = slot->msg;
+	unsigned long phys = be64_to_cpu(msg->params[2]);
+	unsigned long len = be64_to_cpu(msg->params[3]);
+	void *blob = (phys && len > 0) ? __va(phys) : NULL;
+
+	/* There might be nothing behind the slot yet */
+	if (!blob || !len)
+		goto out;
+
+	/* Copy the FDT blob and parse it */
+	of_fdt_add_subtree(slot->dn, blob);
+
+	/* Add device node firmware data */
+	traverse_pci_device_nodes(slot->dn,
+				  add_pci_device_node_info,
+				  pci_bus_to_host(slot->bus));
+
+out:
+	/* Confirm status change */
+	slot->status_confirmed = 1;
+	wake_up(&slot->queue);
+}
+
+static void powernv_php_slot_work(struct work_struct *data)
+{
+	struct powernv_php_slot *slot = container_of(data,
+						     struct powernv_php_slot,
+						     work);
+	uint64_t php_event = be64_to_cpu(slot->msg->params[0]);
+
+	switch (php_event) {
+	case 0: /* Slot power off */
+		slot_power_off_handler(slot);
+		break;
+	case 1: /* Slot power on */
+		slot_power_on_handler(slot);
+		break;
+	default:
+		pr_warn("%s: Unsupported hotplug event %llu\n",
+			__func__, php_event);
+	}
+
+	of_node_put(slot->dn);
+}
+
+int powernv_php_msg_handler(struct notifier_block *nb,
+			    unsigned long type, void *message)
+{
+	phandle h;
+	struct device_node *np;
+	struct powernv_php_slot *slot;
+	struct opal_msg *msg = message;
+
+	/* Check the message type */
+	if (type != OPAL_MSG_PCI_HOTPLUG) {
+		pr_warn("%s: Wrong message type %lu received!\n",
+			__func__, type);
+		return 0;
+	}
+
+	/* Find the device node */
+	h = (phandle)be64_to_cpu(msg->params[1]);
+	np = of_find_node_by_phandle(h);
+	if (!np) {
+		pr_warn("%s: No device node for phandle 0x%08x\n",
+			__func__, h);
+		return 0;
+	}
+
+	/* Find the slot */
+	slot = powernv_php_slot_find(np);
+	if (!slot) {
+		pr_warn("%s: No slot found for node <%s>\n",
+			__func__, of_node_full_name(np));
+		of_node_put(np);
+		return 0;
+	}
+
+	/* Schedule the work */
+	slot->msg = msg;
+	schedule_work(&slot->work);
+	return 0;
+}
+
+static int set_power_status(struct hotplug_slot *php_slot, u8 val)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	int ret;
+
+	/* Set power status */
+	slot->status_confirmed = 0;
+	ret = pnv_pci_set_power_status(slot->id, val);
+	if (ret) {
+		pr_warn("%s: Error %d powering %s slot %016llx\n",
+			__func__, ret, val ? "on" : "off", slot->id);
+		return ret;
+	}
+
+	/* Wait until the device tree is updated */
+	ret = wait_event_timeout(slot->queue,
+				 slot->status_confirmed,
+				 10 * HZ);
+	if (!ret) {
+		pr_warn("%s: Timeout powering %s slot %016llx\n",
+			__func__, val ? "on" : "off", slot->id);
+		return -ETIMEDOUT;
+	}
+
+	return 0;
+}
+
+static int get_power_status(struct hotplug_slot *php_slot, u8 *val)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	uint8_t state;
+	int ret;
+
+	/*
+	 * Retrieve the power status from firmware. If that
+	 * fails, the reported power status falls back to
+	 * "on".
+	 */
+	ret = pnv_pci_get_power_status(slot->id, &state);
+	if (ret) {
+		*val = POWERNV_PHP_SLOT_POWER_ON;
+		pr_warn("%s: Error %d getting power status of slot %016llx\n",
+			__func__, ret, slot->id);
+	} else {
+		*val = state ? POWERNV_PHP_SLOT_POWER_ON :
+			       POWERNV_PHP_SLOT_POWER_OFF;
+		php_slot->info->power_status = *val;
+	}
+
+	return 0;
+}
+
+static int get_adapter_status(struct hotplug_slot *php_slot, u8 *val)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	uint8_t state;
+	int ret;
+
+	/*
+	 * Retrieve the presence status from firmware. If that
+	 * fails, the slot is reported as empty.
+	 */
+	ret = pnv_pci_get_presence_status(slot->id, &state);
+	if (ret >= 0) {
+		*val = state ? POWERNV_PHP_SLOT_PRESENT :
+			       POWERNV_PHP_SLOT_EMPTY;
+		php_slot->info->adapter_status = *val;
+	} else {
+		*val = POWERNV_PHP_SLOT_EMPTY;
+		pr_warn("%s: Error %d getting presence of slot %016llx\n",
+			__func__, ret, slot->id);
+	}
+
+	return ret < 0 ? ret : 0;
+}
+
+static int set_attention_status(struct hotplug_slot *php_slot, u8 val)
+{
+	/* The default operation is to turn on the attention indicator */
+	switch (val) {
+	case POWERNV_PHP_SLOT_ATTEN_OFF:
+	case POWERNV_PHP_SLOT_ATTEN_ON:
+	case POWERNV_PHP_SLOT_ATTEN_IND:
+	case POWERNV_PHP_SLOT_ATTEN_ACT:
+		break;
+	default:
+		val = POWERNV_PHP_SLOT_ATTEN_ON;
+	}
+
+	/* FIXME: Make it real once firmware supports it */
+	php_slot->info->attention_status = val;
+
+	return 0;
+}
+
+int powernv_php_slot_enable(struct hotplug_slot *php_slot,
+			    bool rescan_bus, bool rescan_slot)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	uint8_t presence, power_status;
+	int ret;
+
+	/* Check if the slot has been configured */
+	if (slot->state != POWERNV_PHP_SLOT_STATE_REGISTER)
+		return 0;
+
+	/* Retrieve slot presence status */
+	ret = php_slot->ops->get_adapter_status(php_slot, &presence);
+	if (ret) {
+		pr_warn("%s: Error %d getting presence of slot %016llx\n",
+			__func__, ret, slot->id);
+		return ret;
+	}
+
+	/* Proceed if there is nothing behind the slot */
+	if (presence == POWERNV_PHP_SLOT_EMPTY)
+		goto scan;
+
+	/*
+	 * If we detect something behind the slot, we need to
+	 * make sure the power supply to the slot is on. Otherwise,
+	 * the slot's downstream PCIe link would be down.
+	 *
+	 * The first time through, we don't change the power status,
+	 * to speed up system boot, on the assumption that firmware
+	 * supplies a consistent slot power status: an empty slot
+	 * always has its power off and a non-empty slot has it on.
+	 */
+	if (!slot->check_power_status) {
+		slot->check_power_status = 1;
+		goto scan;
+	}
+
+	/* Check the power status. Scan the slot if that's already on */
+	ret = php_slot->ops->get_power_status(php_slot, &power_status);
+	if (ret) {
+		pr_warn("%s: Error %d getting power status of slot %016llx\n",
+			__func__, ret, slot->id);
+		return ret;
+	}
+	if (power_status == POWERNV_PHP_SLOT_POWER_ON)
+		goto scan;
+
+	/* Power is off, turn it on and then scan the slot */
+	ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_ON);
+	if (ret) {
+		pr_warn("%s: Error %d powering on slot %016llx\n",
+			__func__, ret, slot->id);
+		return ret;
+	}
+
+scan:
+	switch (presence) {
+	case POWERNV_PHP_SLOT_PRESENT:
+		if (rescan_bus) {
+			pci_lock_rescan_remove();
+			pcibios_add_pci_devices(slot->bus);
+			pci_unlock_rescan_remove();
+		}
+
+		/* Rescan for child hotpluggable slots */
+		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
+		if (rescan_slot)
+			powernv_php_register(slot->dn);
+		break;
+	case POWERNV_PHP_SLOT_EMPTY:
+		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
+		break;
+	default:
+		pr_warn("%s: Invalid presence status %d of slot %016llx\n",
+			__func__, presence, slot->id);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int enable_slot(struct hotplug_slot *php_slot)
+{
+	return powernv_php_slot_enable(php_slot, true, true);
+}
+
+static int disable_slot(struct hotplug_slot *php_slot)
+{
+	struct powernv_php_slot *slot = php_slot->private;
+	uint8_t power_status;
+	int ret;
+
+	if (slot->state != POWERNV_PHP_SLOT_STATE_POPULATED)
+		return 0;
+
+	/* Remove all devices behind the slot */
+	pci_lock_rescan_remove();
+	pcibios_remove_pci_devices(slot->bus);
+	pci_unlock_rescan_remove();
+
+	/* Detach the child hotpluggable slots */
+	powernv_php_unregister(slot->dn);
+
+	/*
+	 * Check the power status and turn it off if necessary. If we
+	 * fail to get the power status, the power will be forced to
+	 * be off.
+	 */
+	ret = php_slot->ops->get_power_status(php_slot, &power_status);
+	if (ret || power_status == POWERNV_PHP_SLOT_POWER_ON) {
+		ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_OFF);
+		if (ret)
+			pr_warn("%s: Error %d powering off slot %016llx\n",
+				__func__, ret, slot->id);
+	}
+
+	/* Update slot state */
+	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
+	return 0;
+}
+
+static struct hotplug_slot_ops php_slot_ops = {
+	.get_power_status	= get_power_status,
+	.get_adapter_status	= get_adapter_status,
+	.set_attention_status	= set_attention_status,
+	.enable_slot		= enable_slot,
+	.disable_slot		= disable_slot,
+};
+
+static struct powernv_php_slot *php_slot_match(struct device_node *dn,
+					       struct powernv_php_slot *slot)
+{
+	struct powernv_php_slot *target, *tmp;
+
+	if (slot->dn == dn)
+		return slot;
+
+	list_for_each_entry(tmp, &slot->children, link) {
+		target = php_slot_match(dn, tmp);
+		if (target)
+			return target;
+	}
+
+	return NULL;
+}
+
+struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn)
+{
+	struct powernv_php_slot *slot, *tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&php_slot_lock, flags);
+	list_for_each_entry(tmp, &php_slot_list, link) {
+		slot = php_slot_match(dn, tmp);
+		if (slot) {
+			spin_unlock_irqrestore(&php_slot_lock, flags);
+			return slot;
+		}
+	}
+	spin_unlock_irqrestore(&php_slot_lock, flags);
+
+	return NULL;
+}
+
+static void php_slot_free(struct kref *kref)
+{
+	struct powernv_php_slot *slot = to_powernv_php_slot(kref);
+
+	WARN_ON(!list_empty(&slot->children));
+	kfree(slot->name);
+	kfree(slot);
+}
+
+static void php_slot_release(struct hotplug_slot *hp_slot)
+{
+	struct powernv_php_slot *slot = hp_slot->private;
+	unsigned long flags;
+
+	/* Remove from global or child list */
+	spin_lock_irqsave(&php_slot_lock, flags);
+	list_del(&slot->link);
+	spin_unlock_irqrestore(&php_slot_lock, flags);
+
+	/* Detach from parent, then drop our own reference */
+	powernv_php_slot_put(slot->parent);
+	powernv_php_slot_put(slot);
+}
+
+static bool php_slot_get_id(struct device_node *dn,
+			    uint64_t *id)
+{
+	struct device_node *parent = dn;
+	const __be64 *prop64;
+	const __be32 *prop32;
+
+	/*
+	 * A hotpluggable slot always has a compound Id, which
+	 * consists of a 16-bit PHB Id, a 16-bit bus/slot/function
+	 * number, and the compound indicator (bit 63).
+	 */
+	*id = (0x1ul << 63);
+
+	/* Bus/Slot/Function number */
+	prop32 = of_get_property(dn, "reg", NULL);
+	if (!prop32)
+		return false;
+	*id |= ((of_read_number(prop32, 1) & 0x00ffff00) << 8);
+
+	/* PHB Id */
+	while ((parent = of_get_parent(parent))) {
+		if (!PCI_DN(parent)) {
+			of_node_put(parent);
+			break;
+		}
+
+		if (!of_device_is_compatible(parent, "ibm,ioda2-phb") &&
+		    !of_device_is_compatible(parent, "ibm,ioda-phb")) {
+			of_node_put(parent);
+			continue;
+		}
+
+		prop64 = of_get_property(parent, "ibm,opal-phbid", NULL);
+		if (!prop64) {
+			of_node_put(parent);
+			return false;
+		}
+
+		*id |= be64_to_cpup(prop64);
+		of_node_put(parent);
+		return true;
+	}
+
+	return false;
+}
+
+struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn)
+{
+	struct pci_bus *bus;
+	struct powernv_php_slot *slot;
+	const char *label;
+	uint64_t id;
+	int slot_no;
+	size_t size;
+	void *pmem;
+
+	/* Slot name */
+	label = of_get_property(dn, "ibm,slot-label", NULL);
+	if (!label)
+		return NULL;
+
+	/* Slot identifier */
+	if (!php_slot_get_id(dn, &id))
+		return NULL;
+
+	/* PCI bus */
+	bus = pcibios_find_pci_bus(dn);
+	if (!bus)
+		return NULL;
+
+	/* Slot number */
+	if (dn->child && PCI_DN(dn->child))
+		slot_no = PCI_SLOT(PCI_DN(dn->child)->devfn);
+	else
+		slot_no = -1;
+
+	/* Allocate slot */
+	size = sizeof(struct powernv_php_slot) +
+	       sizeof(struct hotplug_slot) +
+	       sizeof(struct hotplug_slot_info);
+	pmem = kzalloc(size, GFP_KERNEL);
+	if (!pmem) {
+		pr_warn("%s: Cannot allocate slot for node %s\n",
+			__func__, dn->full_name);
+		return NULL;
+	}
+
+	/* Assign memory blocks */
+	slot = pmem;
+	slot->php_slot = pmem + sizeof(struct powernv_php_slot);
+	slot->php_slot->info = pmem + sizeof(struct powernv_php_slot) +
+			      sizeof(struct hotplug_slot);
+	slot->name = kstrdup(label, GFP_KERNEL);
+	if (!slot->name) {
+		pr_warn("%s: Cannot populate name for node %s\n",
+			__func__, dn->full_name);
+		kfree(pmem);
+		return NULL;
+	}
+
+	/* Initialize slot */
+	kref_init(&slot->kref);
+	slot->state = POWERNV_PHP_SLOT_STATE_INIT;
+	slot->dn = dn;
+	slot->bus = bus;
+	slot->id = id;
+	slot->slot_no = slot_no;
+	INIT_WORK(&slot->work, powernv_php_slot_work);
+	init_waitqueue_head(&slot->queue);
+	slot->check_power_status = 0;
+	slot->status_confirmed = 0;
+	slot->release = php_slot_free;
+	slot->php_slot->ops = &php_slot_ops;
+	slot->php_slot->release = php_slot_release;
+	slot->php_slot->private = slot;
+	INIT_LIST_HEAD(&slot->children);
+	INIT_LIST_HEAD(&slot->link);
+
+	return slot;
+}
+
+int powernv_php_slot_register(struct powernv_php_slot *slot)
+{
+	struct powernv_php_slot *parent = NULL;
+	struct device_node *dn = slot->dn;
+	unsigned long flags;
+	int ret;
+
+	/* Avoid registering the same slot twice */
+	if (powernv_php_slot_find(slot->dn))
+		return -EEXIST;
+
+	/* Register slot */
+	ret = pci_hp_register(slot->php_slot, slot->bus,
+			      slot->slot_no, slot->name);
+	if (ret) {
+		pr_warn("%s: Cannot register slot %s (%d)\n",
+			__func__, slot->name, ret);
+		return ret;
+	}
+
+	/* Put into global or parent list */
+	while ((dn = of_get_parent(dn))) {
+		if (!PCI_DN(dn)) {
+			of_node_put(dn);
+			break;
+		}
+
+		parent = powernv_php_slot_find(dn);
+		if (parent) {
+			of_node_put(dn);
+			break;
+		}
+	}
+
+	spin_lock_irqsave(&php_slot_lock, flags);
+	if (parent) {
+		powernv_php_slot_get(parent);
+		slot->parent = parent;
+		list_add_tail(&slot->link, &parent->children);
+	} else {
+		list_add_tail(&slot->link, &php_slot_list);
+	}
+	spin_unlock_irqrestore(&php_slot_lock, flags);
+
+	/* Update slot state */
+	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
+	return 0;
+}
-- 
2.1.0


* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01  6:03   ` Gavin Shan
@ 2015-05-01 12:54     ` Rob Herring
  -1 siblings, 0 replies; 184+ messages in thread
From: Rob Herring @ 2015-05-01 12:54 UTC (permalink / raw)
  To: Gavin Shan
  Cc: linuxppc-dev, linux-pci, Benjamin Herrenschmidt, Bjorn Helgaas,
	Grant Likely, devicetree

+dt list

On Fri, May 1, 2015 at 1:03 AM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
> The requirement is raised when developing the PCI hotplug feature
> for PowerPC PowerNV platform, which runs on top of skiboot firmware.
> When plugging PCI adapter to one PCI slot, the firmware rescans the
> slot and build FDT (Flat Device Tree) blob, which is sent to the
> PowerNV PCI hotplug driver for processing. The new constructed device
> nodes from the FDT blob are expected to be attached to the device
> node of the PCI slot. Unfortunately, it seems we don't have a API
> to support the scenario. The patch intends to support it by newly
> introduced function of_fdt_add_subtree(), the design behind it is
> shown as below:
>
>    * When the sub-tree FDT blob, which is owned by firmware, is
>      received by kernel. It's copied over to the blob, which is
>      dynamically allocated. Since then, the FDT blob owned by
>      firmware isn't touched.
>    * Rework unflatten_dt_node() so that the device nodes in current
>      and deeper depth have been constructed from the FDT blob. All
>      device nodes are marked with flag OF_DYNAMIC_HYBIRD, which is

Perhaps you meant HYBRID?

>      similar to OF_DYNAMIC. However, device node with the flag set
>      can be free'd, but in the way other than that for OF_DYNAMIC
>      device nodes.

The difference seems to be whether you allocate space or just point to
the FDT for various strings/data. Is that right?

>    * of_fdt_add_subtree() is the introduced API to do the work.

Have you looked at overlays and if so why do they not work for your purposes?

Why do you need to do this with the flattened tree?

Rob

>
> Cc: Grant Likely <grant.likely@linaro.org>
> Cc: Rob Herring <robh+dt@kernel.org>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  drivers/of/dynamic.c   |  19 +++++--
>  drivers/of/fdt.c       | 133 ++++++++++++++++++++++++++++++++++++++++---------
>  include/linux/of.h     |   2 +
>  include/linux/of_fdt.h |   1 +
>  4 files changed, 127 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index 3351ef4..f562080 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -330,13 +330,22 @@ void of_node_release(struct kobject *kobj)
>                 return;
>         }
>
> -       if (!of_node_check_flag(node, OF_DYNAMIC))
> +       /* Release the subtree */
> +       if (node->subtree) {
> +               kfree(node->subtree);
> +               node->subtree = NULL;
> +       }
> +
> +       if (!of_node_check_flag(node, OF_DYNAMIC) &&
> +           !of_node_check_flag(node, OF_DYNAMIC_HYBIRD))
>                 return;
>
>         while (prop) {
>                 struct property *next = prop->next;
> -               kfree(prop->name);
> -               kfree(prop->value);
> +               if (of_node_check_flag(node, OF_DYNAMIC)) {
> +                       kfree(prop->name);
> +                       kfree(prop->value);
> +               }
>                 kfree(prop);
>                 prop = next;
>
> @@ -345,7 +354,9 @@ void of_node_release(struct kobject *kobj)
>                         node->deadprops = NULL;
>                 }
>         }
> -       kfree(node->full_name);
> +
> +       if (of_node_check_flag(node, OF_DYNAMIC))
> +               kfree(node->full_name);
>         kfree(node->data);
>         kfree(node);
>  }
> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> index cde35c5d01..7659560 100644
> --- a/drivers/of/fdt.c
> +++ b/drivers/of/fdt.c
> @@ -28,6 +28,10 @@
>  #include <asm/setup.h>  /* for COMMAND_LINE_SIZE */
>  #include <asm/page.h>
>
> +#include "of_private.h"
> +
> +static int cur_node_depth;
> +
>  /*
>   * of_fdt_limit_memory - limit the number of regions in the /memory node
>   * @limit: maximum entries
> @@ -168,20 +172,20 @@ static void *unflatten_dt_alloc(void **mem, unsigned long size,
>   * @dad: Parent struct device_node
>   * @fpsize: Size of the node path up at the current depth.
>   */
> -static void * unflatten_dt_node(void *blob,
> -                               void *mem,
> -                               int *poffset,
> -                               struct device_node *dad,
> -                               struct device_node **nodepp,
> -                               unsigned long fpsize,
> -                               bool dryrun)
> +static void *unflatten_dt_node(void *blob,
> +                              void *mem,
> +                              int *poffset,
> +                              struct device_node *dad,
> +                              struct device_node **nodepp,
> +                              unsigned long fpsize,
> +                              bool dryrun,
> +                              bool dynamic)
>  {
>         const __be32 *p;
>         struct device_node *np;
>         struct property *pp, **prev_pp = NULL;
>         const char *pathp;
>         unsigned int l, allocl;
> -       static int depth = 0;
>         int old_depth;
>         int offset;
>         int has_name = 0;
> @@ -219,12 +223,18 @@ static void * unflatten_dt_node(void *blob,
>                 }
>         }
>
> -       np = unflatten_dt_alloc(&mem, sizeof(struct device_node) + allocl,
> +       if (dynamic)
> +               np = kzalloc(sizeof(struct device_node) + allocl, GFP_KERNEL);
> +       else
> +               np = unflatten_dt_alloc(&mem,
> +                               sizeof(struct device_node) + allocl,
>                                 __alignof__(struct device_node));
>         if (!dryrun) {
>                 char *fn;
>                 of_node_init(np);
>                 np->full_name = fn = ((char *)np) + sizeof(*np);
> +               if (dynamic)
> +                       of_node_set_flag(np, OF_DYNAMIC_HYBIRD);
>                 if (new_format) {
>                         /* rebuild full path for new format */
>                         if (dad && dad->parent) {
> @@ -267,8 +277,12 @@ static void * unflatten_dt_node(void *blob,
>                 }
>                 if (strcmp(pname, "name") == 0)
>                         has_name = 1;
> -               pp = unflatten_dt_alloc(&mem, sizeof(struct property),
> -                                       __alignof__(struct property));
> +
> +               if (dynamic)
> +                       pp = kzalloc(sizeof(struct property), GFP_KERNEL);
> +               else
> +                       pp = unflatten_dt_alloc(&mem, sizeof(struct property),
> +                                               __alignof__(struct property));
>                 if (!dryrun) {
>                         /* We accept flattened tree phandles either in
>                          * ePAPR-style "phandle" properties, or the
> @@ -309,8 +323,13 @@ static void * unflatten_dt_node(void *blob,
>                 if (pa < ps)
>                         pa = p1;
>                 sz = (pa - ps) + 1;
> -               pp = unflatten_dt_alloc(&mem, sizeof(struct property) + sz,
> -                                       __alignof__(struct property));
> +
> +               if (dynamic)
> +                       pp = kzalloc(sizeof(struct property) + sz, GFP_KERNEL);
> +               else
> +                       pp = unflatten_dt_alloc(&mem,
> +                                               sizeof(struct property) + sz,
> +                                               __alignof__(struct property));
>                 if (!dryrun) {
>                         pp->name = "name";
>                         pp->length = sz;
> @@ -334,13 +353,21 @@ static void * unflatten_dt_node(void *blob,
>                         np->type = "<NULL>";
>         }
>
> -       old_depth = depth;
> -       *poffset = fdt_next_node(blob, *poffset, &depth);
> -       if (depth < 0)
> -               depth = 0;
> -       while (*poffset > 0 && depth > old_depth)
> -               mem = unflatten_dt_node(blob, mem, poffset, np, NULL,
> -                                       fpsize, dryrun);
> +       old_depth = cur_node_depth;
> +       *poffset = fdt_next_node(blob, *poffset, &cur_node_depth);
> +       while (*poffset > 0) {
> +               if (cur_node_depth < old_depth)
> +                       break;
> +
> +               if (cur_node_depth == old_depth)
> +                       mem = unflatten_dt_node(blob, mem, poffset,
> +                                               dad, NULL, fpsize,
> +                                               dryrun, dynamic);
> +               else if (cur_node_depth > old_depth)
> +                       mem = unflatten_dt_node(blob, mem, poffset,
> +                                               np, NULL, fpsize,
> +                                               dryrun, dynamic);
> +       }
>
>         if (*poffset < 0 && *poffset != -FDT_ERR_NOTFOUND)
>                 pr_err("unflatten: error %d processing FDT\n", *poffset);
> @@ -379,8 +406,8 @@ static void * unflatten_dt_node(void *blob,
>   * for the resulting tree
>   */
>  static void __unflatten_device_tree(void *blob,
> -                            struct device_node **mynodes,
> -                            void * (*dt_alloc)(u64 size, u64 align))
> +                               struct device_node **mynodes,
> +                               void * (*dt_alloc)(u64 size, u64 align))
>  {
>         unsigned long size;
>         int start;
> @@ -405,7 +432,9 @@ static void __unflatten_device_tree(void *blob,
>
>         /* First pass, scan for size */
>         start = 0;
> -       size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL, NULL, 0, true);
> +       cur_node_depth = 1;
> +       size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL,
> +                                               NULL, 0, true, false);
>         size = ALIGN(size, 4);
>
>         pr_debug("  size is %lx, allocating...\n", size);
> @@ -420,7 +449,8 @@ static void __unflatten_device_tree(void *blob,
>
>         /* Second pass, do actual unflattening */
>         start = 0;
> -       unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false);
> +       cur_node_depth = 1;
> +       unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false, false);
>         if (be32_to_cpup(mem + size) != 0xdeadbeef)
>                 pr_warning("End of tree marker overwritten: %08x\n",
>                            be32_to_cpup(mem + size));
> @@ -448,6 +478,61 @@ void of_fdt_unflatten_tree(unsigned long *blob,
>  }
>  EXPORT_SYMBOL_GPL(of_fdt_unflatten_tree);
>
> +static void populate_sysfs_for_child_nodes(struct device_node *parent)
> +{
> +       struct device_node *child;
> +
> +       for_each_child_of_node(parent, child) {
> +               __of_attach_node_sysfs(child);
> +               populate_sysfs_for_child_nodes(child);
> +       }
> +}
> +
> +/**
> + * of_fdt_add_substree - Create sub-tree of device nodes
> + * @parent: parent device node to which the sub-tree will attach
> + * @blob: flat device tree blob representing the sub-tree
> + *
> + * Copy over the FDT blob, which passed from firmware, and then
> + * unflatten the sub-tree.
> + */
> +void of_fdt_add_subtree(struct device_node *parent, void *blob)
> +{
> +       int start = 0;
> +
> +       /* Validate the header */
> +       if (!blob || fdt_check_header(blob)) {
> +               pr_err("%s: Invalid device-tree blob header at 0x%p\n",
> +                      __func__, blob);
> +               return;
> +       }
> +
> +       /* Free the flat blob for last time lazily */
> +       if (parent->subtree) {
> +               kfree(parent->subtree);
> +               parent->subtree = NULL;
> +       }
> +
> +       /* Copy over the flat blob */
> +       parent->subtree = kzalloc(fdt_totalsize(blob), GFP_KERNEL);
> +       if (!parent->subtree) {
> +               pr_err("%s: Cannot copy over device-tree blob\n",
> +                      __func__);
> +               return;
> +       }
> +
> +       memcpy(parent->subtree, blob, fdt_totalsize(blob));
> +
> +       /* Unflatten it */
> +       mutex_lock(&of_mutex);
> +       cur_node_depth = 1;
> +       unflatten_dt_node(parent->subtree, NULL, &start, parent, NULL,
> +                         strlen(parent->full_name), false, true);
> +       populate_sysfs_for_child_nodes(parent);
> +       mutex_unlock(&of_mutex);
> +}
> +EXPORT_SYMBOL(of_fdt_add_subtree);
> +
>  /* Everything below here references initial_boot_params directly. */
>  int __initdata dt_root_addr_cells;
>  int __initdata dt_root_size_cells;
> diff --git a/include/linux/of.h b/include/linux/of.h
> index ddeaae6..ac50b02 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -60,6 +60,7 @@ struct device_node {
>         struct  device_node *sibling;
>         struct  kobject kobj;
>         unsigned long _flags;
> +       void    *subtree;
>         void    *data;
>  #if defined(CONFIG_SPARC)
>         const char *path_component_name;
> @@ -222,6 +223,7 @@ static inline unsigned long of_read_ulong(const __be32 *cell, int size)
>  #define OF_DETACHED    2 /* node has been detached from the device tree */
>  #define OF_POPULATED   3 /* device already created for the node */
>  #define OF_POPULATED_BUS       4 /* of_platform_populate recursed to children of this node */
> +#define OF_DYNAMIC_HYBIRD      5 /* similar to OF_DYNAMIC, but partially */
>
>  #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags)
>  #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags)
> diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
> index 587ee50..1fb47d7 100644
> --- a/include/linux/of_fdt.h
> +++ b/include/linux/of_fdt.h
> @@ -39,6 +39,7 @@ extern int of_fdt_match(const void *blob, unsigned long node,
>                         const char *const *compat);
>  extern void of_fdt_unflatten_tree(unsigned long *blob,
>                                struct device_node **mynodes);
> +extern void of_fdt_add_subtree(struct device_node *parent, void *blob);
>
>  /* TBD: Temporary export of fdt globals - remove when code fully merged */
>  extern int __initdata dt_root_addr_cells;
> --
> 2.1.0
>

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
@ 2015-05-01 12:54     ` Rob Herring
  0 siblings, 0 replies; 184+ messages in thread
From: Rob Herring @ 2015-05-01 12:54 UTC (permalink / raw)
  To: Gavin Shan
  Cc: devicetree, linux-pci, Grant Likely, Bjorn Helgaas, linuxppc-dev

+dt list

On Fri, May 1, 2015 at 1:03 AM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
> The requirement was raised while developing the PCI hotplug feature
> for the PowerPC PowerNV platform, which runs on top of skiboot firmware.
> When a PCI adapter is plugged into a PCI slot, the firmware rescans the
> slot and builds an FDT (Flat Device Tree) blob, which is sent to the
> PowerNV PCI hotplug driver for processing. The device nodes newly
> constructed from the FDT blob are expected to be attached to the device
> node of the PCI slot. Unfortunately, it seems we don't have an API
> to support this scenario. The patch supports it with the newly
> introduced function of_fdt_add_subtree(). The design behind it is
> as follows:
>
>    * When the sub-tree FDT blob, which is owned by firmware, is
>      received by the kernel, it is copied into a dynamically
>      allocated blob. From then on, the FDT blob owned by
>      firmware isn't touched.
>    * Rework unflatten_dt_node() so that the device nodes at the current
>      and deeper depths are constructed from the FDT blob. All
>      device nodes are marked with the flag OF_DYNAMIC_HYBIRD, which is

Perhaps you meant HYBRID?

>      similar to OF_DYNAMIC. However, a device node with the flag set
>      can be freed, but in a different way from OF_DYNAMIC
>      device nodes.

The difference seems to be whether you allocate space or just point to
the FDT for various strings/data. Is that right?

>    * of_fdt_add_subtree() is the introduced API to do the work.

Have you looked at overlays and if so why do they not work for your purposes?

Why do you need to do this with the flattened tree?

Rob
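
The ownership difference raised above (allocate space for strings and data, or just point into the FDT) can be illustrated with a small user-space sketch. This is not kernel code: `struct prop`, `NODE_DYNAMIC`, `NODE_HYBRID`, `prop_new()` and `props_release()` are hypothetical names that only model the of_node_release() change in the patch. A "hybrid" property borrows its name and value from the copied blob, so only the struct itself is freed; a "dynamic" property owns heap copies, so all three allocations are freed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for OF_DYNAMIC and OF_DYNAMIC_HYBIRD. */
enum node_kind { NODE_DYNAMIC, NODE_HYBRID };

struct prop {
	char *name;   /* heap copy (dynamic) or pointer into the blob (hybrid) */
	char *value;  /* same ownership rule as name */
	struct prop *next;
};

/* Portable replacement for strdup(), which is not part of ISO C99. */
static char *dup_str(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Create one property; strings are copied only for NODE_DYNAMIC. */
static struct prop *prop_new(enum node_kind kind, char *name, char *value)
{
	struct prop *p;

	assert(name && value);
	p = calloc(1, sizeof(*p));
	if (!p)
		return NULL;
	if (kind == NODE_DYNAMIC) {
		p->name = dup_str(name);
		p->value = dup_str(value);
		if (!p->name || !p->value) {
			free(p->name);
			free(p->value);
			free(p);
			return NULL;
		}
	} else {
		/* Hybrid: borrow the storage that lives inside the blob. */
		p->name = name;
		p->value = value;
	}
	return p;
}

/* Free a property list; returns how many heap allocations were released. */
static int props_release(enum node_kind kind, struct prop *p)
{
	int freed = 0;

	while (p) {
		struct prop *next = p->next;

		if (kind == NODE_DYNAMIC) {
			free(p->name);
			free(p->value);
			freed += 2;
		}
		free(p);
		freed++;
		p = next;
	}
	return freed;
}
```

In this model the blob must outlive every hybrid property, which is why the patch keeps the copied blob in parent->subtree until the parent node is released.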

>
> Cc: Grant Likely <grant.likely@linaro.org>
> Cc: Rob Herring <robh+dt@kernel.org>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  drivers/of/dynamic.c   |  19 +++++--
>  drivers/of/fdt.c       | 133 ++++++++++++++++++++++++++++++++++++++++---------
>  include/linux/of.h     |   2 +
>  include/linux/of_fdt.h |   1 +
>  4 files changed, 127 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index 3351ef4..f562080 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -330,13 +330,22 @@ void of_node_release(struct kobject *kobj)
>                 return;
>         }
>
> -       if (!of_node_check_flag(node, OF_DYNAMIC))
> +       /* Release the subtree */
> +       if (node->subtree) {
> +               kfree(node->subtree);
> +               node->subtree = NULL;
> +       }
> +
> +       if (!of_node_check_flag(node, OF_DYNAMIC) &&
> +           !of_node_check_flag(node, OF_DYNAMIC_HYBIRD))
>                 return;
>
>         while (prop) {
>                 struct property *next = prop->next;
> -               kfree(prop->name);
> -               kfree(prop->value);
> +               if (of_node_check_flag(node, OF_DYNAMIC)) {
> +                       kfree(prop->name);
> +                       kfree(prop->value);
> +               }
>                 kfree(prop);
>                 prop = next;
>
> @@ -345,7 +354,9 @@ void of_node_release(struct kobject *kobj)
>                         node->deadprops = NULL;
>                 }
>         }
> -       kfree(node->full_name);
> +
> +       if (of_node_check_flag(node, OF_DYNAMIC))
> +               kfree(node->full_name);
>         kfree(node->data);
>         kfree(node);
>  }
> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> index cde35c5d01..7659560 100644
> --- a/drivers/of/fdt.c
> +++ b/drivers/of/fdt.c
> @@ -28,6 +28,10 @@
>  #include <asm/setup.h>  /* for COMMAND_LINE_SIZE */
>  #include <asm/page.h>
>
> +#include "of_private.h"
> +
> +static int cur_node_depth;
> +
>  /*
>   * of_fdt_limit_memory - limit the number of regions in the /memory node
>   * @limit: maximum entries
> @@ -168,20 +172,20 @@ static void *unflatten_dt_alloc(void **mem, unsigned long size,
>   * @dad: Parent struct device_node
>   * @fpsize: Size of the node path up at the current depth.
>   */
> -static void * unflatten_dt_node(void *blob,
> -                               void *mem,
> -                               int *poffset,
> -                               struct device_node *dad,
> -                               struct device_node **nodepp,
> -                               unsigned long fpsize,
> -                               bool dryrun)
> +static void *unflatten_dt_node(void *blob,
> +                              void *mem,
> +                              int *poffset,
> +                              struct device_node *dad,
> +                              struct device_node **nodepp,
> +                              unsigned long fpsize,
> +                              bool dryrun,
> +                              bool dynamic)
>  {
>         const __be32 *p;
>         struct device_node *np;
>         struct property *pp, **prev_pp = NULL;
>         const char *pathp;
>         unsigned int l, allocl;
> -       static int depth = 0;
>         int old_depth;
>         int offset;
>         int has_name = 0;
> @@ -219,12 +223,18 @@ static void * unflatten_dt_node(void *blob,
>                 }
>         }
>
> -       np = unflatten_dt_alloc(&mem, sizeof(struct device_node) + allocl,
> +       if (dynamic)
> +               np = kzalloc(sizeof(struct device_node) + allocl, GFP_KERNEL);
> +       else
> +               np = unflatten_dt_alloc(&mem,
> +                               sizeof(struct device_node) + allocl,
>                                 __alignof__(struct device_node));
>         if (!dryrun) {
>                 char *fn;
>                 of_node_init(np);
>                 np->full_name = fn = ((char *)np) + sizeof(*np);
> +               if (dynamic)
> +                       of_node_set_flag(np, OF_DYNAMIC_HYBIRD);
>                 if (new_format) {
>                         /* rebuild full path for new format */
>                         if (dad && dad->parent) {
> @@ -267,8 +277,12 @@ static void * unflatten_dt_node(void *blob,
>                 }
>                 if (strcmp(pname, "name") == 0)
>                         has_name = 1;
> -               pp = unflatten_dt_alloc(&mem, sizeof(struct property),
> -                                       __alignof__(struct property));
> +
> +               if (dynamic)
> +                       pp = kzalloc(sizeof(struct property), GFP_KERNEL);
> +               else
> +                       pp = unflatten_dt_alloc(&mem, sizeof(struct property),
> +                                               __alignof__(struct property));
>                 if (!dryrun) {
>                         /* We accept flattened tree phandles either in
>                          * ePAPR-style "phandle" properties, or the
> @@ -309,8 +323,13 @@ static void * unflatten_dt_node(void *blob,
>                 if (pa < ps)
>                         pa = p1;
>                 sz = (pa - ps) + 1;
> -               pp = unflatten_dt_alloc(&mem, sizeof(struct property) + sz,
> -                                       __alignof__(struct property));
> +
> +               if (dynamic)
> +                       pp = kzalloc(sizeof(struct property) + sz, GFP_KERNEL);
> +               else
> +                       pp = unflatten_dt_alloc(&mem,
> +                                               sizeof(struct property) + sz,
> +                                               __alignof__(struct property));
>                 if (!dryrun) {
>                         pp->name = "name";
>                         pp->length = sz;
> @@ -334,13 +353,21 @@ static void * unflatten_dt_node(void *blob,
>                         np->type = "<NULL>";
>         }
>
> -       old_depth = depth;
> -       *poffset = fdt_next_node(blob, *poffset, &depth);
> -       if (depth < 0)
> -               depth = 0;
> -       while (*poffset > 0 && depth > old_depth)
> -               mem = unflatten_dt_node(blob, mem, poffset, np, NULL,
> -                                       fpsize, dryrun);
> +       old_depth = cur_node_depth;
> +       *poffset = fdt_next_node(blob, *poffset, &cur_node_depth);
> +       while (*poffset > 0) {
> +               if (cur_node_depth < old_depth)
> +                       break;
> +
> +               if (cur_node_depth == old_depth)
> +                       mem = unflatten_dt_node(blob, mem, poffset,
> +                                               dad, NULL, fpsize,
> +                                               dryrun, dynamic);
> +               else if (cur_node_depth > old_depth)
> +                       mem = unflatten_dt_node(blob, mem, poffset,
> +                                               np, NULL, fpsize,
> +                                               dryrun, dynamic);
> +       }
>
>         if (*poffset < 0 && *poffset != -FDT_ERR_NOTFOUND)
>                 pr_err("unflatten: error %d processing FDT\n", *poffset);
> @@ -379,8 +406,8 @@ static void * unflatten_dt_node(void *blob,
>   * for the resulting tree
>   */
>  static void __unflatten_device_tree(void *blob,
> -                            struct device_node **mynodes,
> -                            void * (*dt_alloc)(u64 size, u64 align))
> +                               struct device_node **mynodes,
> +                               void * (*dt_alloc)(u64 size, u64 align))
>  {
>         unsigned long size;
>         int start;
> @@ -405,7 +432,9 @@ static void __unflatten_device_tree(void *blob,
>
>         /* First pass, scan for size */
>         start = 0;
> -       size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL, NULL, 0, true);
> +       cur_node_depth = 1;
> +       size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL,
> +                                               NULL, 0, true, false);
>         size = ALIGN(size, 4);
>
>         pr_debug("  size is %lx, allocating...\n", size);
> @@ -420,7 +449,8 @@ static void __unflatten_device_tree(void *blob,
>
>         /* Second pass, do actual unflattening */
>         start = 0;
> -       unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false);
> +       cur_node_depth = 1;
> +       unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false, false);
>         if (be32_to_cpup(mem + size) != 0xdeadbeef)
>                 pr_warning("End of tree marker overwritten: %08x\n",
>                            be32_to_cpup(mem + size));
> @@ -448,6 +478,61 @@ void of_fdt_unflatten_tree(unsigned long *blob,
>  }
>  EXPORT_SYMBOL_GPL(of_fdt_unflatten_tree);
>
> +static void populate_sysfs_for_child_nodes(struct device_node *parent)
> +{
> +       struct device_node *child;
> +
> +       for_each_child_of_node(parent, child) {
> +               __of_attach_node_sysfs(child);
> +               populate_sysfs_for_child_nodes(child);
> +       }
> +}
> +
> +/**
> + * of_fdt_add_subtree - Create sub-tree of device nodes
> + * @parent: parent device node to which the sub-tree will attach
> + * @blob: flat device tree blob representing the sub-tree
> + *
> + * Copy over the FDT blob, which is passed from firmware, and then
> + * unflatten the sub-tree.
> + */
> +void of_fdt_add_subtree(struct device_node *parent, void *blob)
> +{
> +       int start = 0;
> +
> +       /* Validate the header */
> +       if (!blob || fdt_check_header(blob)) {
> +               pr_err("%s: Invalid device-tree blob header at 0x%p\n",
> +                      __func__, blob);
> +               return;
> +       }
> +
> +       /* Lazily free the flat blob left over from the previous call */
> +       if (parent->subtree) {
> +               kfree(parent->subtree);
> +               parent->subtree = NULL;
> +       }
> +
> +       /* Copy over the flat blob */
> +       parent->subtree = kzalloc(fdt_totalsize(blob), GFP_KERNEL);
> +       if (!parent->subtree) {
> +               pr_err("%s: Cannot copy over device-tree blob\n",
> +                      __func__);
> +               return;
> +       }
> +
> +       memcpy(parent->subtree, blob, fdt_totalsize(blob));
> +
> +       /* Unflatten it */
> +       mutex_lock(&of_mutex);
> +       cur_node_depth = 1;
> +       unflatten_dt_node(parent->subtree, NULL, &start, parent, NULL,
> +                         strlen(parent->full_name), false, true);
> +       populate_sysfs_for_child_nodes(parent);
> +       mutex_unlock(&of_mutex);
> +}
> +EXPORT_SYMBOL(of_fdt_add_subtree);
> +
>  /* Everything below here references initial_boot_params directly. */
>  int __initdata dt_root_addr_cells;
>  int __initdata dt_root_size_cells;
> diff --git a/include/linux/of.h b/include/linux/of.h
> index ddeaae6..ac50b02 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -60,6 +60,7 @@ struct device_node {
>         struct  device_node *sibling;
>         struct  kobject kobj;
>         unsigned long _flags;
> +       void    *subtree;
>         void    *data;
>  #if defined(CONFIG_SPARC)
>         const char *path_component_name;
> @@ -222,6 +223,7 @@ static inline unsigned long of_read_ulong(const __be32 *cell, int size)
>  #define OF_DETACHED    2 /* node has been detached from the device tree */
>  #define OF_POPULATED   3 /* device already created for the node */
>  #define OF_POPULATED_BUS       4 /* of_platform_populate recursed to children of this node */
> +#define OF_DYNAMIC_HYBIRD      5 /* similar to OF_DYNAMIC, but partially */
>
>  #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags)
>  #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags)
> diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
> index 587ee50..1fb47d7 100644
> --- a/include/linux/of_fdt.h
> +++ b/include/linux/of_fdt.h
> @@ -39,6 +39,7 @@ extern int of_fdt_match(const void *blob, unsigned long node,
>                         const char *const *compat);
>  extern void of_fdt_unflatten_tree(unsigned long *blob,
>                                struct device_node **mynodes);
> +extern void of_fdt_add_subtree(struct device_node *parent, void *blob);
>
>  /* TBD: Temporary export of fdt globals - remove when code fully merged */
>  extern int __initdata dt_root_addr_cells;
> --
> 2.1.0
>

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01 12:54     ` Rob Herring
@ 2015-05-01 15:22       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-01 15:22 UTC (permalink / raw)
  To: Rob Herring
  Cc: Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas, Grant Likely,
	devicetree

On Fri, 2015-05-01 at 07:54 -0500, Rob Herring wrote:

> The difference seems to be whether you allocate space or just point to
> the FDT for various strings/data. Is that right?
> 
> >    * of_fdt_add_subtree() is the introduced API to do the work.
> 
> Have you looked at overlays and if so why do they not work for your purposes?
> 
> Why do you need to do this with the flattened tree?

The basic idea I asked Gavin to implement is that since the FW needs to
provide a bunch of DT updates to Linux at runtime in the form of new
nodes below an existing one, rather than doing it via some new/custom
format, instead, have it send a bit of FDT blob to expand under an
existing node.

As for the details of Gavin's implementation, I haven't looked at it in
detail yet, so there might be issues there. However, I don't know what
you mean by "overlays"; any pointers?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01 15:22       ` Benjamin Herrenschmidt
@ 2015-05-01 18:46         ` Rob Herring
  -1 siblings, 0 replies; 184+ messages in thread
From: Rob Herring @ 2015-05-01 18:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas, Grant Likely,
	devicetree

On Fri, May 1, 2015 at 10:22 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Fri, 2015-05-01 at 07:54 -0500, Rob Herring wrote:
>
>> The difference seems to be whether you allocate space or just point to
>> the FDT for various strings/data. Is that right?
>>
>> >    * of_fdt_add_subtree() is the introduced API to do the work.
>>
>> Have you looked at overlays and if so why do they not work for your purposes?
>>
>> Why do you need to do this with the flattened tree?
>
> The basic idea I asked Gavin to implement is that since the FW needs to
> provide a bunch of DT updates to Linux at runtime in the form of new
> nodes below an existing one, rather than doing it via some new/custom
> format, instead, have it send a bit of FDT blob to expand under an
> existing node.

Overlay = an FDT blob to graft into a live running system. Sounds like
the same thing.

> As for the details of Gavin's implementation, I haven't looked at it in
> detail yet, so there might be issues there. However, I don't know what
> you mean by "overlays"; any pointers?

CONFIG_OF_OVERLAY

http://events.linuxfoundation.org/sites/events/files/slides/dynamic-dt-keynote-v3.pdf

Rob

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01 18:46         ` Rob Herring
@ 2015-05-01 22:57           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-01 22:57 UTC (permalink / raw)
  To: Rob Herring
  Cc: Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas, Grant Likely,
	devicetree

On Fri, 2015-05-01 at 13:46 -0500, Rob Herring wrote:
> On Fri, May 1, 2015 at 10:22 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Fri, 2015-05-01 at 07:54 -0500, Rob Herring wrote:
> >
> >> The difference seems to be whether you allocate space or just point to
> >> the FDT for various strings/data. Is that right?
> >>
> >> >    * of_fdt_add_subtree() is the introduced API to do the work.
> >>
> >> Have you looked at overlays and if so why do they not work for your purposes?
> >>
> >> Why do you need to do this with the flattened tree?
> >
> > The basic idea I asked Gavin to implement is that since the FW needs to
> > provide a bunch of DT updates to Linux at runtime in the form of new
> > nodes below an existing one, rather than doing it via some new/custom
> > format, instead, have it send a bit of FDT blob to expand under an
> > existing node.
> 
> Overlay = an FDT blob to graft into a live running system. Sounds like
> the same thing.
> 
> > As for the details of Gavin's implementation, I haven't looked at it in
> > detail yet, so there might be issues there. However, I don't know what
> > you mean by "overlays"; any pointers?
> 
> CONFIG_OF_OVERLAY
> 
> http://events.linuxfoundation.org/sites/events/files/slides/dynamic-dt-keynote-v3.pdf

Well, that looks horrendously complicated, poorly documented and totally
unused in-tree outside of the unittest stuff, yay! It has all sorts of
"features" that I don't really care about.

I still don't see what it buys me other than making my FW a lot more
complex by having to generate all that additional fixup etc... crap that I
don't totally get yet.

What's wrong with just unflattening the nodes in place? The DT comes
from the FW in the first place, so all the phandles are already good in
the newly added blob. Internally, the FW creates new nodes in its internal
representation, flattens the subtree and sends that subtree to
Linux.

I don't plan to play "revert" either. If you unplug, I do need to remove
what's under the slot, but that's true of boot time devices, not just
"new" ones, so the overlay stuff won't do the trick and I certainly
don't want to keep track...

Ben.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01 22:57           ` Benjamin Herrenschmidt
@ 2015-05-01 23:29             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-01 23:29 UTC (permalink / raw)
  To: Rob Herring
  Cc: Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas, Grant Likely,
	devicetree

On Sat, 2015-05-02 at 08:57 +1000, Benjamin Herrenschmidt wrote:

> > Overlay = an FDT blob to graft into a live running system. Sounds like
> > the same thing.
> > 
> > > As for the details of Gavin's implementation, I haven't looked at it in
> > > detail yet, so there might be issues there. However, I don't know what
> > > you mean by "overlays"; any pointers?
> > 
> > CONFIG_OF_OVERLAY
> > 
> > http://events.linuxfoundation.org/sites/events/files/slides/dynamic-dt-keynote-v3.pdf
> 
> Well, that looks horrendously complicated, poorly documented and totally
> unused in-tree outside of the unittest stuff, yay! It has all sorts of
> "features" that I don't really care about.

Looking a bit more at it, I don't quite see how I can attach a subtree
using that stuff.

Instead, each node in the overlay seems to need extra nodes and
properties to refer to the original.

So the FW would essentially have to create something a lot more complex
than just reflattening a bit of its internal tree. For each internal
node, it will need to add all those __overlay__ nodes and properties.

That is not going to fly for me at all. It's orders of magnitude more
complex than the solution we are pursuing.

So I think for our use case, we should continue in the direction of
having a helper to unflatten a piece of FDT underneath an existing
node. I don't like the "HYBRID" stuff though, we should not refer to
the original FDT, we should just make them normal dynamic nodes.

Ben.
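
The "helper to unflatten a piece of FDT underneath an existing node" that produces normal dynamic nodes can be sketched in user space as follows. This is an illustration under simplifying assumptions, not the kernel implementation: the real FDT structure block is replaced by a hypothetical array of BEGIN/END tokens, properties are left out, and `struct node`, `attach_subtree()` and `free_children()` are made-up names. The point it shows is that every node owns a copy of its name, so nothing refers back to the firmware blob once the call returns, and that unplug simply frees whatever sits below the slot node.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified replacement for FDT_BEGIN_NODE/FDT_END_NODE. */
enum tok_type { TOK_BEGIN, TOK_END };

struct tok {
	enum tok_type type;
	const char *name;      /* only used for TOK_BEGIN */
};

struct node {
	char *name;            /* owned copy, independent of the token buffer */
	struct node *parent;
	struct node *child;    /* first child */
	struct node *sibling;  /* next sibling */
};

/* Allocate a node, copy its name and append it to parent's child list. */
static struct node *node_new(struct node *parent, const char *name)
{
	size_t len = strlen(name) + 1;
	struct node *n = calloc(1, sizeof(*n));
	struct node **pp;

	if (!n)
		return NULL;
	n->name = malloc(len);
	if (!n->name) {
		free(n);
		return NULL;
	}
	memcpy(n->name, name, len);
	n->parent = parent;
	/* Append at the tail so children keep the order used in the blob. */
	for (pp = &parent->child; *pp; pp = &(*pp)->sibling)
		;
	*pp = n;
	return n;
}

/*
 * Attach every node described by toks[0..ntoks) below parent.
 * Returns the number of nodes created, or -1 on a malformed stream or
 * allocation failure. On error, nodes attached so far stay below parent
 * and can be dropped with free_children().
 */
static int attach_subtree(struct node *parent, const struct tok *toks,
			  size_t ntoks)
{
	struct node *cur = parent;
	int created = 0;
	size_t i;

	assert(parent);
	for (i = 0; i < ntoks; i++) {
		if (toks[i].type == TOK_BEGIN) {
			struct node *n = node_new(cur, toks[i].name);

			if (!n)
				return -1;
			cur = n;       /* descend: following nodes are children */
			created++;
		} else {
			if (cur == parent)
				return -1; /* END without matching BEGIN */
			cur = cur->parent;
		}
	}
	return cur == parent ? created : -1;
}

/* Unplug path: free everything below parent, leaving parent itself. */
static void free_children(struct node *parent)
{
	struct node *c = parent->child;

	while (c) {
		struct node *next = c->sibling;

		free_children(c);
		free(c->name);
		free(c);
		c = next;
	}
	parent->child = NULL;
}
```

Because free_children() does not care how the children were created, the same path would serve boot-time and hot-added devices alike, which matches the unplug requirement described earlier in the thread.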

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01 23:29             ` Benjamin Herrenschmidt
@ 2015-05-02  2:48               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-02  2:48 UTC (permalink / raw)
  To: Rob Herring
  Cc: Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas, Grant Likely,
	devicetree

On Sat, 2015-05-02 at 09:29 +1000, Benjamin Herrenschmidt wrote:

> Looking a bit more at it, I don't quite see how I can attach a subtree
> using that stuff.
> 
> Instead, each node in the overlay seems to need extra nodes and
> properties to refer to the original.
> 
> So the FW would essentially have to create something a lot more complex
> than just reflattening a bit of its internal tree. For each internal
> node, it will need to add all those __overlay__ nodes and properties.
> 
> That is not going to fly for me at all. It's orders of magnitude more
> complex than the solution we are pursuing.
> 
> So I think for our use case, we should continue in the direction of
> having a helper to unflatten a piece of FDT underneath an existing
> node. I don't like the "HYBRID" stuff though, we should not refer to
> the original FDT, we should just make them normal dynamic nodes.

A bit more thought... if we were to use the overlay stuff, Gavin, what
we *could* do is add a generation count to every node and property in
the OPAL FW internal representation.

That way we could essentially know whenever something's changed from
what we flattened originally for the kernel.

We can then create a generic (not PCI specific) call that generates
an overlay tree for every node and property that has a generation
count that is newer than what was flattened (or passed by the OS).

It's still a LOT more complex than what we need though...

Cheers,
Ben.
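
The generation-count idea can be sketched as below. This is a user-space illustration only, not OPAL or kernel code: `struct fw_node`, its `generation` field and `collect_newer()` are hypothetical names, and properties are left out for brevity. The sketch walks the firmware's tree in pre-order and collects every node whose generation is newer than the baseline that was last flattened for the OS; those nodes are what a generic "give me the changes" call would have to emit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical firmware-side node with a per-node generation count. */
struct fw_node {
	const char *name;
	unsigned int generation;   /* bumped whenever FW changes the node */
	struct fw_node *child;     /* first child */
	struct fw_node *sibling;   /* next sibling */
};

/*
 * Collect, in pre-order, all nodes in the list starting at n (and their
 * descendants) whose generation is newer than baseline. At most max
 * entries are stored in out[]; the return value is the total number of
 * matches, so a result larger than max means out[] was too small.
 * Pass found = 0 on the initial call.
 */
static size_t collect_newer(const struct fw_node *n, unsigned int baseline,
			    const struct fw_node **out, size_t max,
			    size_t found)
{
	assert(out || max == 0);
	for (; n; n = n->sibling) {
		if (n->generation > baseline) {
			if (found < max)
				out[found] = n;
			found++;
		}
		found = collect_newer(n->child, baseline, out, max, found);
	}
	return found;
}
```

After handing the result to the OS, the caller would record the current generation as the new baseline, so the next call only reports later changes.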

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01 12:54     ` Rob Herring
@ 2015-05-03 23:28       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-03 23:28 UTC (permalink / raw)
  To: Rob Herring
  Cc: Gavin Shan, linuxppc-dev, linux-pci, Benjamin Herrenschmidt,
	Bjorn Helgaas, Grant Likely, devicetree

On Fri, May 01, 2015 at 07:54:03AM -0500, Rob Herring wrote:
>+dt list
>
>On Fri, May 1, 2015 at 1:03 AM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>> This requirement arose while developing the PCI hotplug feature
>> for the PowerPC PowerNV platform, which runs on top of skiboot firmware.
>> When a PCI adapter is plugged into a PCI slot, the firmware rescans the
>> slot and builds an FDT (Flattened Device Tree) blob, which is sent to
>> the PowerNV PCI hotplug driver for processing. The device nodes newly
>> constructed from the FDT blob are expected to be attached to the device
>> node of the PCI slot. Unfortunately, there seems to be no API
>> to support this scenario. This patch adds one, the new function
>> of_fdt_add_subtree(). The design behind it is as follows:
>>
>>    * When the kernel receives the sub-tree FDT blob, which is owned
>>      by firmware, it copies the blob into dynamically allocated
>>      memory. From then on, the firmware-owned FDT blob isn't
>>      touched.
>>    * Rework unflatten_dt_node() so that the device nodes at the
>>      current and deeper depths are constructed from the FDT blob. All
>>      device nodes are marked with flag OF_DYNAMIC_HYBIRD, which is
>
>Perhaps you meant HYBRID?
>

Yeah, it should be "HYBRID".

>>      similar to OF_DYNAMIC. However, a device node with this flag
>>      set is freed differently from an OF_DYNAMIC device node.
>
>The difference seems to be whether you allocate space or just point to
>the FDT for various strings/data. Is that right?
>

That's correct. The FDT blob passed from firmware is copied by the kernel
into a memory chunk allocated from slab, so the FDT blob managed by
firmware can be released promptly. In the kernel, the instances of
"struct device_node" and "struct property" are allocated from slab
dynamically, but some of their fields point into the (copied) FDT
blob. This means the (copied) FDT can only be released once the sub-tree
has been detached completely.
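
The ownership rule described above can be sketched in user-space C. This
is only an illustration under simplifying assumptions: struct demo_node
and struct demo_subtree are made-up stand-ins for the kernel structures,
malloc()/free() stand in for the slab allocator, and the "blob" is just a
run of NUL-terminated names, not a real FDT.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct device_node; not the kernel's type. */
struct demo_node {
	const char *name;         /* points into the copied blob */
	struct demo_node *next;
};

struct demo_subtree {
	char *blob;               /* private copy of the firmware blob */
	struct demo_node *nodes;  /* nodes allocated one at a time */
};

/*
 * Copy the firmware blob, then create nodes whose names point into the
 * copy. Assumes the blob is a sequence of NUL-terminated names that
 * exactly fills 'len' bytes.
 */
static int demo_attach(struct demo_subtree *st, const char *fw_blob, size_t len)
{
	size_t off = 0;

	st->nodes = NULL;
	st->blob = malloc(len);
	if (!st->blob)
		return -1;
	memcpy(st->blob, fw_blob, len);

	while (off < len) {
		struct demo_node *n = calloc(1, sizeof(*n));

		if (!n)
			return -1;	/* caller cleans up with demo_detach() */
		n->name = st->blob + off;
		n->next = st->nodes;
		st->nodes = n;
		off += strlen(n->name) + 1;
	}
	return 0;
}

/* Free every node first, then the blob they pointed into. Returns the node count. */
static size_t demo_detach(struct demo_subtree *st)
{
	size_t freed = 0;

	while (st->nodes) {
		struct demo_node *next = st->nodes->next;

		free(st->nodes);	/* the name is not freed: it lives in the blob */
		st->nodes = next;
		freed++;
	}
	free(st->blob);
	st->blob = NULL;
	return freed;
}

static size_t demo_run(void)
{
	char fw_blob[] = "pci@0\0" "ethernet@0";	/* two names, both NUL-terminated */
	struct demo_subtree st;

	if (demo_attach(&st, fw_blob, sizeof(fw_blob))) {
		demo_detach(&st);
		return 0;
	}
	/* The firmware's blob can be reused now; the nodes only see the copy. */
	memset(fw_blob, 0, sizeof(fw_blob));
	assert(strcmp(st.nodes->name, "ethernet@0") == 0);
	return demo_detach(&st);
}
```

The point of the sketch is the ordering in demo_detach(): the copied blob
is the last thing to go, because every node name still points into it.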


>>    * of_fdt_add_subtree() is the introduced API to do the work.
>
>Have you looked at overlays and if so why do they not work for your purposes?
>
>Why do you need to do this with the flattened tree?
>

It seems Ben has already answered these questions. I'll reply
in the other threads if necessary. Rob, thanks for the review.

Thanks,
Gavin

>Rob
>
>>
>> Cc: Grant Likely <grant.likely@linaro.org>
>> Cc: Rob Herring <robh+dt@kernel.org>
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>>  drivers/of/dynamic.c   |  19 +++++--
>>  drivers/of/fdt.c       | 133 ++++++++++++++++++++++++++++++++++++++++---------
>>  include/linux/of.h     |   2 +
>>  include/linux/of_fdt.h |   1 +
>>  4 files changed, 127 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
>> index 3351ef4..f562080 100644
>> --- a/drivers/of/dynamic.c
>> +++ b/drivers/of/dynamic.c
>> @@ -330,13 +330,22 @@ void of_node_release(struct kobject *kobj)
>>                 return;
>>         }
>>
>> -       if (!of_node_check_flag(node, OF_DYNAMIC))
>> +       /* Release the subtree */
>> +       if (node->subtree) {
>> +               kfree(node->subtree);
>> +               node->subtree = NULL;
>> +       }
>> +
>> +       if (!of_node_check_flag(node, OF_DYNAMIC) &&
>> +           !of_node_check_flag(node, OF_DYNAMIC_HYBIRD))
>>                 return;
>>
>>         while (prop) {
>>                 struct property *next = prop->next;
>> -               kfree(prop->name);
>> -               kfree(prop->value);
>> +               if (of_node_check_flag(node, OF_DYNAMIC)) {
>> +                       kfree(prop->name);
>> +                       kfree(prop->value);
>> +               }
>>                 kfree(prop);
>>                 prop = next;
>>
>> @@ -345,7 +354,9 @@ void of_node_release(struct kobject *kobj)
>>                         node->deadprops = NULL;
>>                 }
>>         }
>> -       kfree(node->full_name);
>> +
>> +       if (of_node_check_flag(node, OF_DYNAMIC))
>> +               kfree(node->full_name);
>>         kfree(node->data);
>>         kfree(node);
>>  }
>> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
>> index cde35c5d01..7659560 100644
>> --- a/drivers/of/fdt.c
>> +++ b/drivers/of/fdt.c
>> @@ -28,6 +28,10 @@
>>  #include <asm/setup.h>  /* for COMMAND_LINE_SIZE */
>>  #include <asm/page.h>
>>
>> +#include "of_private.h"
>> +
>> +static int cur_node_depth;
>> +
>>  /*
>>   * of_fdt_limit_memory - limit the number of regions in the /memory node
>>   * @limit: maximum entries
>> @@ -168,20 +172,20 @@ static void *unflatten_dt_alloc(void **mem, unsigned long size,
>>   * @dad: Parent struct device_node
>>   * @fpsize: Size of the node path up at the current depth.
>>   */
>> -static void * unflatten_dt_node(void *blob,
>> -                               void *mem,
>> -                               int *poffset,
>> -                               struct device_node *dad,
>> -                               struct device_node **nodepp,
>> -                               unsigned long fpsize,
>> -                               bool dryrun)
>> +static void *unflatten_dt_node(void *blob,
>> +                              void *mem,
>> +                              int *poffset,
>> +                              struct device_node *dad,
>> +                              struct device_node **nodepp,
>> +                              unsigned long fpsize,
>> +                              bool dryrun,
>> +                              bool dynamic)
>>  {
>>         const __be32 *p;
>>         struct device_node *np;
>>         struct property *pp, **prev_pp = NULL;
>>         const char *pathp;
>>         unsigned int l, allocl;
>> -       static int depth = 0;
>>         int old_depth;
>>         int offset;
>>         int has_name = 0;
>> @@ -219,12 +223,18 @@ static void * unflatten_dt_node(void *blob,
>>                 }
>>         }
>>
>> -       np = unflatten_dt_alloc(&mem, sizeof(struct device_node) + allocl,
>> +       if (dynamic)
>> +               np = kzalloc(sizeof(struct device_node) + allocl, GFP_KERNEL);
>> +       else
>> +               np = unflatten_dt_alloc(&mem,
>> +                               sizeof(struct device_node) + allocl,
>>                                 __alignof__(struct device_node));
>>         if (!dryrun) {
>>                 char *fn;
>>                 of_node_init(np);
>>                 np->full_name = fn = ((char *)np) + sizeof(*np);
>> +               if (dynamic)
>> +                       of_node_set_flag(np, OF_DYNAMIC_HYBIRD);
>>                 if (new_format) {
>>                         /* rebuild full path for new format */
>>                         if (dad && dad->parent) {
>> @@ -267,8 +277,12 @@ static void * unflatten_dt_node(void *blob,
>>                 }
>>                 if (strcmp(pname, "name") == 0)
>>                         has_name = 1;
>> -               pp = unflatten_dt_alloc(&mem, sizeof(struct property),
>> -                                       __alignof__(struct property));
>> +
>> +               if (dynamic)
>> +                       pp = kzalloc(sizeof(struct property), GFP_KERNEL);
>> +               else
>> +                       pp = unflatten_dt_alloc(&mem, sizeof(struct property),
>> +                                               __alignof__(struct property));
>>                 if (!dryrun) {
>>                         /* We accept flattened tree phandles either in
>>                          * ePAPR-style "phandle" properties, or the
>> @@ -309,8 +323,13 @@ static void * unflatten_dt_node(void *blob,
>>                 if (pa < ps)
>>                         pa = p1;
>>                 sz = (pa - ps) + 1;
>> -               pp = unflatten_dt_alloc(&mem, sizeof(struct property) + sz,
>> -                                       __alignof__(struct property));
>> +
>> +               if (dynamic)
>> +                       pp = kzalloc(sizeof(struct property) + sz, GFP_KERNEL);
>> +               else
>> +                       pp = unflatten_dt_alloc(&mem,
>> +                                               sizeof(struct property) + sz,
>> +                                               __alignof__(struct property));
>>                 if (!dryrun) {
>>                         pp->name = "name";
>>                         pp->length = sz;
>> @@ -334,13 +353,21 @@ static void * unflatten_dt_node(void *blob,
>>                         np->type = "<NULL>";
>>         }
>>
>> -       old_depth = depth;
>> -       *poffset = fdt_next_node(blob, *poffset, &depth);
>> -       if (depth < 0)
>> -               depth = 0;
>> -       while (*poffset > 0 && depth > old_depth)
>> -               mem = unflatten_dt_node(blob, mem, poffset, np, NULL,
>> -                                       fpsize, dryrun);
>> +       old_depth = cur_node_depth;
>> +       *poffset = fdt_next_node(blob, *poffset, &cur_node_depth);
>> +       while (*poffset > 0) {
>> +               if (cur_node_depth < old_depth)
>> +                       break;
>> +
>> +               if (cur_node_depth == old_depth)
>> +                       mem = unflatten_dt_node(blob, mem, poffset,
>> +                                               dad, NULL, fpsize,
>> +                                               dryrun, dynamic);
>> +               else if (cur_node_depth > old_depth)
>> +                       mem = unflatten_dt_node(blob, mem, poffset,
>> +                                               np, NULL, fpsize,
>> +                                               dryrun, dynamic);
>> +       }
>>
>>         if (*poffset < 0 && *poffset != -FDT_ERR_NOTFOUND)
>>                 pr_err("unflatten: error %d processing FDT\n", *poffset);
>> @@ -379,8 +406,8 @@ static void * unflatten_dt_node(void *blob,
>>   * for the resulting tree
>>   */
>>  static void __unflatten_device_tree(void *blob,
>> -                            struct device_node **mynodes,
>> -                            void * (*dt_alloc)(u64 size, u64 align))
>> +                               struct device_node **mynodes,
>> +                               void * (*dt_alloc)(u64 size, u64 align))
>>  {
>>         unsigned long size;
>>         int start;
>> @@ -405,7 +432,9 @@ static void __unflatten_device_tree(void *blob,
>>
>>         /* First pass, scan for size */
>>         start = 0;
>> -       size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL, NULL, 0, true);
>> +       cur_node_depth = 1;
>> +       size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL,
>> +                                               NULL, 0, true, false);
>>         size = ALIGN(size, 4);
>>
>>         pr_debug("  size is %lx, allocating...\n", size);
>> @@ -420,7 +449,8 @@ static void __unflatten_device_tree(void *blob,
>>
>>         /* Second pass, do actual unflattening */
>>         start = 0;
>> -       unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false);
>> +       cur_node_depth = 1;
>> +       unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false, false);
>>         if (be32_to_cpup(mem + size) != 0xdeadbeef)
>>                 pr_warning("End of tree marker overwritten: %08x\n",
>>                            be32_to_cpup(mem + size));
>> @@ -448,6 +478,61 @@ void of_fdt_unflatten_tree(unsigned long *blob,
>>  }
>>  EXPORT_SYMBOL_GPL(of_fdt_unflatten_tree);
>>
>> +static void populate_sysfs_for_child_nodes(struct device_node *parent)
>> +{
>> +       struct device_node *child;
>> +
>> +       for_each_child_of_node(parent, child) {
>> +               __of_attach_node_sysfs(child);
>> +               populate_sysfs_for_child_nodes(child);
>> +       }
>> +}
>> +
>> +/**
>> + * of_fdt_add_subtree - Create sub-tree of device nodes
>> + * @parent: parent device node to which the sub-tree will attach
>> + * @blob: flat device tree blob representing the sub-tree
>> + *
>> + * Copy over the FDT blob, which is passed from firmware, and then
>> + * unflatten the sub-tree.
>> + */
>> +void of_fdt_add_subtree(struct device_node *parent, void *blob)
>> +{
>> +       int start = 0;
>> +
>> +       /* Validate the header */
>> +       if (!blob || fdt_check_header(blob)) {
>> +               pr_err("%s: Invalid device-tree blob header at 0x%p\n",
>> +                      __func__, blob);
>> +               return;
>> +       }
>> +
>> +       /* Free the flat blob for last time lazily */
>> +       if (parent->subtree) {
>> +               kfree(parent->subtree);
>> +               parent->subtree = NULL;
>> +       }
>> +
>> +       /* Copy over the flat blob */
>> +       parent->subtree = kzalloc(fdt_totalsize(blob), GFP_KERNEL);
>> +       if (!parent->subtree) {
>> +               pr_err("%s: Cannot copy over device-tree blob\n",
>> +                      __func__);
>> +               return;
>> +       }
>> +
>> +       memcpy(parent->subtree, blob, fdt_totalsize(blob));
>> +
>> +       /* Unflatten it */
>> +       mutex_lock(&of_mutex);
>> +       cur_node_depth = 1;
>> +       unflatten_dt_node(parent->subtree, NULL, &start, parent, NULL,
>> +                         strlen(parent->full_name), false, true);
>> +       populate_sysfs_for_child_nodes(parent);
>> +       mutex_unlock(&of_mutex);
>> +}
>> +EXPORT_SYMBOL(of_fdt_add_subtree);
>> +
>>  /* Everything below here references initial_boot_params directly. */
>>  int __initdata dt_root_addr_cells;
>>  int __initdata dt_root_size_cells;
>> diff --git a/include/linux/of.h b/include/linux/of.h
>> index ddeaae6..ac50b02 100644
>> --- a/include/linux/of.h
>> +++ b/include/linux/of.h
>> @@ -60,6 +60,7 @@ struct device_node {
>>         struct  device_node *sibling;
>>         struct  kobject kobj;
>>         unsigned long _flags;
>> +       void    *subtree;
>>         void    *data;
>>  #if defined(CONFIG_SPARC)
>>         const char *path_component_name;
>> @@ -222,6 +223,7 @@ static inline unsigned long of_read_ulong(const __be32 *cell, int size)
>>  #define OF_DETACHED    2 /* node has been detached from the device tree */
>>  #define OF_POPULATED   3 /* device already created for the node */
>>  #define OF_POPULATED_BUS       4 /* of_platform_populate recursed to children of this node */
>> +#define OF_DYNAMIC_HYBIRD      5 /* similar to OF_DYNAMIC, but partially */
>>
>>  #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags)
>>  #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags)
>> diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
>> index 587ee50..1fb47d7 100644
>> --- a/include/linux/of_fdt.h
>> +++ b/include/linux/of_fdt.h
>> @@ -39,6 +39,7 @@ extern int of_fdt_match(const void *blob, unsigned long node,
>>                         const char *const *compat);
>>  extern void of_fdt_unflatten_tree(unsigned long *blob,
>>                                struct device_node **mynodes);
>> +extern void of_fdt_add_subtree(struct device_node *parent, void *blob);
>>
>>  /* TBD: Temporary export of fdt globals - remove when code fully merged */
>>  extern int __initdata dt_root_addr_cells;
>> --
>> 2.1.0
>>
>

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01 23:29             ` Benjamin Herrenschmidt
@ 2015-05-04  0:23               ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-04  0:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

On Sat, May 02, 2015 at 09:29:36AM +1000, Benjamin Herrenschmidt wrote:
>On Sat, 2015-05-02 at 08:57 +1000, Benjamin Herrenschmidt wrote:
>
>> > Overlay = an FDT blob to graft into a live running system. Sounds like
>> > the same thing.
>> > 
>> > > As for the details of Gavin implementation, I haven't looked at it in
>> > > details yet so there might be issues there, however I don't know what
>> > > you mean by "overlays", any pointer ?
>> > 
>> > CONFIG_OF_OVERLAY
>> > 
>> > http://events.linuxfoundation.org/sites/events/files/slides/dynamic-dt-keynote-v3.pdf
>> 
>> Well, that looks horrendously complicated, poorly documented and totally
>> unused in-tree outside of the unittest stuff, yay ! It has all sort of
>> "features" that I don't really care about.
>
>Looking a bit more at it, I don't quite see how I can attach a subtree
>using that stuff.
>
>Instead, each node in the overlay seems to need extra nodes and
>properties to refer to the original.
>
>So the FW would essentially have to create something a lot more complex
>than just reflattening a bit of its internal tree. For each internal
>node, it will need to add all those __overlay__ nodes and properties.
>
>That is not going to fly for me at all. It's order of magnitudes more
>complex than the solution we are pursuing.
>
>So I think for our use case, we should continue in the direction of
>having a helper to unflatten a piece of FDT underneath an existing
>node. I don't like the "HYBRID" stuff though, we should not refer to
>the original FDT, we should just make them normal dynamic nodes.
>

The original FDT from firmware is copied into a memory chunk that the
kernel allocates from slab, so we refer to the copy of the FDT, not the
original one. Yeah, "HYBRID" wouldn't be a good idea. If we want to make
all device nodes and properties of the sub-tree "DYNAMIC", the FDT
doesn't have to be copied over from skiboot to the kernel: the dynamic
device nodes and properties in the subtree can be built directly from
the FDT blob, which is owned by firmware.

Also, that needs only small changes to the code we have.
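
To make the distinction concrete, here is a minimal userspace sketch
(not the kernel code) of what a fully "DYNAMIC" property could look
like: the name and value are duplicated when the property is created,
so no pointer into the firmware-owned blob survives. The names
struct dyn_prop, dyn_prop_from_blob() and dyn_prop_free() are
hypothetical and exist only for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the kernel's struct property. */
struct dyn_prop {
	char *name;
	void *value;
	size_t length;
};

/* Release a property created by dyn_prop_from_blob(). */
void dyn_prop_free(struct dyn_prop *p)
{
	if (!p)
		return;
	free(p->name);
	free(p->value);
	free(p);
}

/*
 * Build a property that owns its name and value. Both are duplicated,
 * so nothing points back into the flattened blob afterwards.
 * The caller must pass a non-NULL name and value.
 */
struct dyn_prop *dyn_prop_from_blob(const char *name, const void *value,
				    size_t length)
{
	size_t name_len = strlen(name) + 1;
	struct dyn_prop *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;

	p->name = malloc(name_len);
	p->value = malloc(length ? length : 1);
	if (!p->name || !p->value) {
		dyn_prop_free(p);
		return NULL;
	}

	memcpy(p->name, name, name_len);
	memcpy(p->value, value, length);
	p->length = length;
	return p;
}
```

With properties built this way, the blob can be released (or left with
firmware) as soon as unflattening is done, which is why no separate
copy of the blob would need to be kept around.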

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-02  2:48               ` Benjamin Herrenschmidt
@ 2015-05-04  1:30                 ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-04  1:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

On Sat, May 02, 2015 at 12:48:26PM +1000, Benjamin Herrenschmidt wrote:
>On Sat, 2015-05-02 at 09:29 +1000, Benjamin Herrenschmidt wrote:
>
>> Looking a bit more at it, I don't quite see how I can attach a subtree
>> using that stuff.
>> 
>> Instead, each node in the overlay seems to need extra nodes and
>> properties to refer to the original.
>> 
>> So the FW would essentially have to create something a lot more complex
>> than just reflattening a bit of its internal tree. For each internal
>> node, it will need to add all those __overlay__ nodes and properties.
>> 
>> That is not going to fly for me at all. It's order of magnitudes more
>> complex than the solution we are pursuing.
>> 
>> So I think for our use case, we should continue in the direction of
>> having a helper to unflatten a piece of FDT underneath an existing
>> node. I don't like the "HYBRID" stuff though, we should not refer to
>> the original FDT, we should just make them normal dynamic nodes.

Just took a close look at the overlay code. Hopefully I understand
how it works. Yeah, there is one concern according to my
understanding. The "overlay" device node has to already be in the child
list of a device node that also carries the reference to the "target"
node. That means someone else has to create the "overlay" node and
figure out the "target" node in advance, then invoke the overlay module
to apply the changes. From this perspective, the mechanism is used to
apply changes to the device-tree, not to parse input and create device
nodes from it. It does guarantee that either all of the changes are
applied or none of them. So I agree with what Ben suggested: continue
in the direction of having a helper to unflatten an FDT blob underneath
the existing node, and "HYBRID" should be replaced with "OF_DYNAMIC".

>
>A bit more thought... if we were to use the overlay stuff, Gavin, what
>we *could* do is add to OPAL FW internal representation a generation
>count to every node and property.
>
>That way we could essentially know whenever something's changed from
>what we flattened originally for the kernel.
>
>We can then create a generic (not PCI specific) call that generates
>an overlay tree for every node and property that has a generation
>count that is newer than what was flattened (or passed by the OS).
>
>It's still a LOT more complex than what we need though...
>

Thanks, Ben. If we really need to use overlays to support our case,
we need someone to parse the input (device-tree changes) from firmware
and create the "overlay" device node and "target" node as I mentioned
above. That's not simpler than the way we already have to support our
case. I'm not sure we really need overlays here.

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-04  1:30                 ` Gavin Shan
@ 2015-05-04  4:51                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-04  4:51 UTC (permalink / raw)
  To: Gavin Shan
  Cc: Rob Herring, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

On Mon, 2015-05-04 at 11:30 +1000, Gavin Shan wrote:
> Thanks, Ben. If we really need utilize overlay to support our case,
> we need some one to parse the input (device-tree changes) from firmware
> and create "overlay" device node and "target" node as I mentioned above.
> It's not simpler than the way we had to support our case. I'm not sure
> if we really need utilize overlay for our case.

No, if we decide to go down that path, then the FW needs to create
the overlay.

This could be done by having some kind of versioning to all nodes and
properties using a global generation count.

Ie, if we "know" what we passed to Linux, we can generate an overlay
that contains everything that changed since then using the version
numbers.

However, we should probably encode the version in the tree itself and
have specific APIs to retrieve "from" a given version to properly deal
with kexec'ing a kernel since in that case, the new kernel will have
something that isn't version 0 but version N where N is the latest
applied overlay.
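
A rough sketch of the generation-count idea, as a self-contained
userspace C fragment rather than real OPAL code: each node carries the
global generation at which it last changed, and a walk collects every
node newer than the version the caller already has. The types
struct fw_node and struct fw_changes and the function
fw_collect_changed() are hypothetical names for this illustration, and
properties are left out for brevity.

```c
#include <stddef.h>

/* Hypothetical firmware-side node carrying a generation stamp. */
struct fw_node {
	const char *name;
	unsigned int generation;	/* global count when last changed */
	struct fw_node *child;		/* first child */
	struct fw_node *sibling;	/* next sibling */
};

/* Result buffer: stores up to 'max' nodes, counts all matches. */
struct fw_changes {
	struct fw_node **out;
	size_t max;
	size_t count;
};

/*
 * Depth-first walk over 'node' and its following siblings. Every node
 * whose generation is newer than 'since' is recorded in 'c'.
 */
void fw_collect_changed(struct fw_node *node, unsigned int since,
			struct fw_changes *c)
{
	for (; node; node = node->sibling) {
		if (node->generation > since) {
			if (c->count < c->max)
				c->out[c->count] = node;
			c->count++;
		}
		fw_collect_changed(node->child, since, c);
	}
}
```

Passing the version as an argument is what would let a kexec'ed kernel
ask for changes "from" version N instead of from version 0.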

Also I don't know how removing nodes works with overlay. IE the overlay
system is designed around the idea of removing the overlay to retrieve
the original tree.

In our case, our overlays are meant to be fully committed, and I don't
know whether there's a way to keep track. On unplug, we will just remove
all the nodes below the slot.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01 22:57           ` Benjamin Herrenschmidt
  (?)
@ 2015-05-04 16:41               ` Pantelis Antoniou
  -1 siblings, 0 replies; 184+ messages in thread
From: Pantelis Antoniou @ 2015-05-04 16:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

Hi Ben,

> On May 2, 2015, at 01:57 , Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> On Fri, 2015-05-01 at 13:46 -0500, Rob Herring wrote:
>> On Fri, May 1, 2015 at 10:22 AM, Benjamin Herrenschmidt
>> <benh@kernel.crashing.org> wrote:
>>> On Fri, 2015-05-01 at 07:54 -0500, Rob Herring wrote:
>>> 
>>>> The difference seems to be whether you allocate space or just point to
>>>> the FDT for various strings/data. Is that right?
>>>> 
>>>>>   * of_fdt_add_subtree() is the introduced API to do the work.
>>>> 
>>>> Have you looked at overlays and if so why do they not work for your purposes?
>>>> 
>>>> Why do you need to do this with the flattened tree?
>>> 
>>> The basic idea I asked Gavin to implement is that since the FW needs to
>>> provide a bunch of DT updates to Linux at runtime in the form of new
>>> nodes below an existing one, rather than doing it via some new/custom
>>> format, instead, have it send a bit of FDT blob to expand under an
>>> existing node.
>> 
>> Overlay = an FDT blob to graft into a live running system. Sounds like
>> the same thing.
>> 
>>> As for the details of Gavin implementation, I haven't looked at it in
>>> details yet so there might be issues there, however I don't know what
>>> you mean by "overlays", any pointer ?
>> 
>> CONFIG_OF_OVERLAY
>> 
>> http://events.linuxfoundation.org/sites/events/files/slides/dynamic-dt-keynote-v3.pdf
> 
> Well, that looks horrendously complicated, poorly documented and totally
> unused in-tree outside of the unittest stuff, yay ! It has all sort of
> "features" that I don't really care about.
> 

If it was easy to get stuff in, it would get more of the real-use drivers
in.

> I still don't see what it buys me other than making my FW a lot more
> complex having to generate all that additional fixup etc... crap that I
> don't totally get yet.
> 

You don’t generate any additional fixups. You just compile with the option
that generates all the fixups for you.

> What's wrong with just unflattening the nodes in place ? The DT comes
> from the FW in the first place so all the phandles are already good in
> the new added blob. Internally, the FW created new nodes in its internal
> representation and flattened the subtree and sends that subtree to
> Linux.
> 
> I don't plan to play "revert" either, if you unplug, I do need to remove
> what's under the slot but that's true of boot time devices, not just
> "new" ones, so the overlay stuff won't do the trick and I certainly
> don't want to keep track…
> 

You get all of the corner cases handled for free. Perhaps it works for your
case too.

Perhaps you can educate me on what you need supported and we can make sure
it’s included.

> Ben.
> 
> 

Regards

— Pantelis

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-04 16:41               ` Pantelis Antoniou
@ 2015-05-04 21:14                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-04 21:14 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

On Mon, 2015-05-04 at 19:41 +0300, Pantelis Antoniou wrote:
> 
> You get all of the corner cases handled for free. Perhaps it works for
> your case too.
> 
> Perhaps you can educate me on what you need supported and we can make
> sure it’s included.

Which corner cases ?

IE, what I want is simply "update" the device-tree below a PCIe slot on
PCI hotplug.

The DT isn't "compiled" from a dts (it's amazing how many people seem to
believe this is the only way you get fdt's nowadays). It's dynamically
(ie programmatically) generated by firmware at boot time and contains
whatever PCIe devices happen to be plugged during boot.

When doing PCIe hotplug operations, the kernel does various FW calls
(among others to control slot power), and during these, the FW re-probes
underneath the slot and refreshes its internal representation. So the
phandles remain fully consistent, there is no fixup needed.

We want the kernel to update its copy as well in order to avoid
keeping stale nodes that don't match what's there anymore. Also, when
plugging specific kinds of IO drawers, the FW can provide additional nodes
and properties that will be used to control slots inside the drawers.

So what we need is:

  - On PCIe unplug, remove all old nodes below the slot
  - On PCIe plug, get all the new nodes from FW

Note that there is no need to do anything like platform device probing
etc... the PCI layer takes care of that, we will remove the old nodes
after the pci_dev are gone and create the new ones before Linux
re-probes the PCIe bus subtree.

So what we need is very simple: The removal can be handled without FW
help, and the plug case is a matter of just transferring all those new
nodes to Linux to re-expand.

Since the phandles etc... are all consistent with the original tree,
there are no fixups required.

So the "trivial" way to do it (and the way we have implemented the FW
side so far) is to have the FW simply "flatten" the subtree below the
slot and pass it to Linux, with the intent of expanding it back below
the slot node.

This is what Gavin's proposed patches do.

The overlay mechanism adds all sorts of features that we don't seem to
need and would make the above more complex.
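
The unplug/plug handling described above can be sketched in a few lines
of userspace C. This is only an illustration under simplifying
assumptions: struct dt_node and the helpers dt_new_node(),
slot_remove_children() and slot_attach_children() are hypothetical, the
new nodes are assumed to be already expanded from the blob, and there
is no locking, refcounting or sysfs handling.

```c
#include <stdlib.h>

/* Hypothetical, minimal live-tree node. */
struct dt_node {
	struct dt_node *parent;
	struct dt_node *child;		/* first child */
	struct dt_node *sibling;	/* next sibling */
};

/* Allocate a zeroed, detached node. */
struct dt_node *dt_new_node(void)
{
	return calloc(1, sizeof(struct dt_node));
}

/* Free a sibling list together with everything below it. */
static void dt_free_list(struct dt_node *n)
{
	while (n) {
		struct dt_node *next = n->sibling;

		dt_free_list(n->child);
		free(n);
		n = next;
	}
}

/* Unplug: drop every node below the slot; the slot node itself stays. */
void slot_remove_children(struct dt_node *slot)
{
	dt_free_list(slot->child);
	slot->child = NULL;
}

/* Plug: graft an already expanded sibling list underneath the slot. */
void slot_attach_children(struct dt_node *slot, struct dt_node *first)
{
	struct dt_node *n;

	for (n = first; n; n = n->sibling)
		n->parent = slot;
	slot->child = first;
}
```

Nothing here needs phandle fixups or a record of what was added, which
is the point: removal is unconditional and the new subtree arrives
complete.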

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 01/21] pci: Add pcibios_setup_bridge()
  2015-05-01  6:02   ` Gavin Shan
@ 2015-05-07 22:12     ` Bjorn Helgaas
  -1 siblings, 0 replies; 184+ messages in thread
From: Bjorn Helgaas @ 2015-05-07 22:12 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linuxppc-dev, linux-pci, benh

Hi Gavin,

[Please run "git log --oneline drivers/pci/setup-bus.c" and observe the
capitalization convention.]

On Fri, May 01, 2015 at 04:02:48PM +1000, Gavin Shan wrote:
> Currently, PowerPC PowerNV platform utilizes ppc_md.pcibios_fixup(),
> which is called for once after PCI probing and resource assignment
> are completed, to allocate platform required resources for PCI devices:
> PE#, IO and MMIO mapping, DMA address translation (TCE) table etc.
> Obviously, it's not hotplug friendly.
> 
> The patch adds weak function pcibios_setup_bridge(), which is called
> by pci_setup_bridge(). PowerPC PowerNV platform will reuse the function
> to assign above platform required resources to newly added PCI devices,
> in order to support PCI hotplug on PowerPC PowerNV platform.
> 
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  drivers/pci/setup-bus.c | 12 +++++++++---
>  include/linux/pci.h     |  1 +
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 4fd0cac..a7d0c3c 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -674,7 +674,8 @@ static void pci_setup_bridge_mmio_pref(struct pci_dev *bridge)
>  	pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
>  }
>  
> -static void __pci_setup_bridge(struct pci_bus *bus, unsigned long type)
> +
> +void pci_setup_bridge_resources(struct pci_bus *bus, unsigned long type)
>  {
>  	struct pci_dev *bridge = bus->self;
>  
> @@ -693,12 +694,17 @@ static void __pci_setup_bridge(struct pci_bus *bus, unsigned long type)
>  	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
>  }
>  
> +void __weak pcibios_setup_bridge(struct pci_bus *bus, unsigned long type)
> +{
> +	pci_setup_bridge_resources(bus, type);
> +}

I'm not opposed to adding a pcibios_setup_bridge(), but I would rather do
the architected updates in the generic PCI core code instead of down in the
pcibios code.  In other words, I would rather have this:

  void pci_setup_bridge(struct pci_bus *bus)
  {
    unsigned long type = IORESOURCE_IO | IORESOURCE_MEM |
                         IORESOURCE_PREFETCH;

    pcibios_setup_bridge(bus, type);
    pci_setup_bridge_resources(bus, type);
  }

That way the default pcibios hook is empty, showing that by default there's
no arch-specific code in this path, and we only have to look at the generic
core code to verify that we actually do program the bridge windows.
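
To make the layering concrete, here is a small self-contained sketch of
that arrangement. All names are toy stand-ins (toy_bus,
toy_pcibios_setup_bridge(), ...) and not the kernel symbols, and it relies
on the GCC/Clang __attribute__((weak)) extension in place of the kernel's
__weak macro.

```c
#include <assert.h>

/* Toy bridge state; the real code operates on struct pci_bus. */
struct toy_bus {
	int arch_hook_calls;	/* bumped by an arch override, if any */
	int windows_programmed;	/* set by the generic core */
};

/* Default hook: empty, so by default no arch code runs in this path. */
__attribute__((weak)) void toy_pcibios_setup_bridge(struct toy_bus *bus,
						    unsigned long type)
{
	(void)bus;
	(void)type;
}

/* Generic core: always programs the bridge windows itself. */
static void toy_setup_bridge_resources(struct toy_bus *bus, unsigned long type)
{
	assert(type != 0);
	bus->windows_programmed = 1;
}

void toy_setup_bridge(struct toy_bus *bus, unsigned long type)
{
	toy_pcibios_setup_bridge(bus, type);	/* arch hook first */
	toy_setup_bridge_resources(bus, type);	/* then the architected update */
}
```

An architecture that needs extra work would supply a non-weak
toy_pcibios_setup_bridge() in another object file; the window programming
stays in the generic function either way.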

Bjorn

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 00/21] PowerPC/PowerNV: PCI Slot Management
  2015-05-01  6:02 ` Gavin Shan
@ 2015-05-08 23:59   ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-08 23:59 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> The series of patches intend to support PCI slot for PowerPC PowerNV platform,
> which is running on top of skiboot firmware. The patchset requires corresponding
> changes from skiboot firmware, which is sent to skiboot@lists.ozlabs.org
> for review. The PCI slots are exposed by skiboot with device node properties,
> and kernel utilizes those properties to populated PCI slots accordingly.
>
> The original PCI infrastructure on PowerNV platform can't support hotplug
> because the PE is assigned during PHB fixup time, which is called for once
> during system boot time. For this, the PCI infrastructure on PowerNV platform
> has been reworked for a lot. After that, the PE and its corresponding resources
> (IODT, M32DT, M64 segments, DMA32 and bypass window) are assigned upon updating
> PCI bridge's resources, which might decide PE# assigned to the PE (e.g. M64
> resources, on P8 strictly speaking).

Out of curiosity - does this PCI scan happen once the memory subsystem is
initialized? More precisely, after these changes, won't
pnv_pci_ioda2_setup_dma_pe() be called too early during boot, so that I
won't be able to use kmalloc() to allocate iommu_tables?

Also, checkpatch.pl failed multiple times on the series. Please fix.


> Each PE will maintain a reference count,
> which is (number of child PCI devices + 1). That indicates when last child PCI
> device leaves the PE, the PE and its included resources will be relased and put
> back into free pool again. With this design, the PE will be released when EEH PE
> is released. PATCH[1 - 8] are related to this part.
>
>  From skiboot perspective, PCI slot is providing (hot/fundamental/complete)
> resets to EEH. The kernel gets to know if skiboot supports various reset on one
> particular PCI slot through device-tree node. If it does, EEH will utilize the
> functionality provided by skiboot. Besides, the device-tree nodes have to change
> in order to support PCI hotplug. For example, when one PCI adapter inserted to
> one slot, its device-tree node should be added to the system dynamically. Conversely,
> the device-tree node should be removed from the system when the PCI adapter is going
> to be offline. Since pci_dn and eeh_dev have same life cyle as PCI device nodes,
> they should be added/removed accordingly during PCI hotplug. Patch[9 - 20] are
> doing the related work.
>
> The last patch is the standalone PCI hotplug driver for PowerNV platform. When
> removing PCI adapter from one PCI slot, which is invoked by command in userland,
> the skiboot will power off the slot to save power and remove all device-tree
> nodes for all PCI devices behind the slot. Conversely, the Power to the slot
> is turned on, the PCI devices behind the slot is rescanned, and the device-tree
> nodes for those newly detected PCI devices will be built in skiboot. For both
> of cases, one message will be sent to kernel by skiboot so that the kernel
> can adjust the device-tree accordingly. At the same time, the kernel also have
> to deallocate or allocate PE# and its related resources (PE# and so on) for the
> removed/added PCI devices.
>
> Changelog
> =========
> v4:
>     * Rebased to 4.1.RC1
>     * Added API to unflatten FDT blob to device node sub-tree, which is attached
>       the indicated parent device node. The original mechanism based on formatted
>       string stream has been dropped.
>     * The PATCH[v3 09/21] ("powerpc/eeh: Delay probing EEH device during hotplug")
>       was picked up sent to linux-ppc@ separately for review as Richard's "VF EEH
>       Support" depends on that.
> v3:
>     * Rebased to 4.1.RC0
>     * PowerNV PCI infrasturcture is total refactored in order to support PCI
>       hotplug. The PowerNV hotplug driver is also reworked a lot because of
>       the changes in skiboot in order to support PCI hotplug.
>
> Gavin Shan (21):
>    pci: Add pcibios_setup_bridge()
>    powerpc/powernv: Enable M64 on P7IOC
>    powerpc/powernv: M64 support improvement
>    powerpc/powernv: Improve IO and M32 mapping
>    powerpc/powernv: Improve DMA32 segment assignment
>    powerpc/powernv: Create PEs dynamically
>    powerpc/powernv: Release PEs dynamically
>    powerpc/powernv: Drop pnv_ioda_setup_dev_PE()
>    powerpc/powernv: Use PCI slot reset infrastructure
>    powerpc/powernv: Fundamental reset for PCI bus reset
>    powerpc/pci: Don't scan empty slot
>    powerpc/pci: Move pcibios_find_pci_bus() around
>    powerpc/powernv: Introduce pnv_pci_poll()
>    powerpc/powernv: Functions to get/reset PCI slot status
>    powerpc/pci: Delay creating pci_dn
>    powerpc/pci: Create eeh_dev while creating pci_dn
>    powerpc/pci: Export traverse_pci_device_nodes()
>    powerpc/pci: Update bridge windows on PCI plugging
>    drivers/of: Support adding sub-tree
>    powerpc/powernv: Select OF_DYNAMIC
>    pci/hotplug: PowerPC PowerNV PCI hotplug driver
>
>   arch/powerpc/include/asm/eeh.h                 |    7 +-
>   arch/powerpc/include/asm/opal-api.h            |    7 +-
>   arch/powerpc/include/asm/opal.h                |    7 +-
>   arch/powerpc/include/asm/pci-bridge.h          |    7 +-
>   arch/powerpc/include/asm/pnv-pci.h             |    5 +
>   arch/powerpc/include/asm/ppc-pci.h             |    7 +-
>   arch/powerpc/kernel/eeh_dev.c                  |   20 +-
>   arch/powerpc/kernel/pci-common.c               |   18 +-
>   arch/powerpc/kernel/pci-hotplug.c              |   44 +-
>   arch/powerpc/kernel/pci_dn.c                   |  119 +-
>   arch/powerpc/platforms/maple/pci.c             |   35 +-
>   arch/powerpc/platforms/pasemi/pci.c            |    3 -
>   arch/powerpc/platforms/powermac/pci.c          |   39 +-
>   arch/powerpc/platforms/powernv/Kconfig         |    1 +
>   arch/powerpc/platforms/powernv/eeh-powernv.c   |  245 ++--
>   arch/powerpc/platforms/powernv/opal-wrappers.S |    3 +
>   arch/powerpc/platforms/powernv/pci-ioda.c      | 1657 +++++++++++++++---------
>   arch/powerpc/platforms/powernv/pci.c           |   64 +-
>   arch/powerpc/platforms/powernv/pci.h           |   52 +-
>   arch/powerpc/platforms/pseries/msi.c           |    4 +-
>   arch/powerpc/platforms/pseries/pci_dlpar.c     |   32 -
>   arch/powerpc/platforms/pseries/setup.c         |    9 +-
>   drivers/of/dynamic.c                           |   19 +-
>   drivers/of/fdt.c                               |  133 +-
>   drivers/pci/hotplug/Kconfig                    |   12 +
>   drivers/pci/hotplug/Makefile                   |    4 +
>   drivers/pci/hotplug/powernv_php.c              |  146 +++
>   drivers/pci/hotplug/powernv_php.h              |   78 ++
>   drivers/pci/hotplug/powernv_php_slot.c         |  643 +++++++++
>   drivers/pci/setup-bus.c                        |   12 +-
>   include/linux/of.h                             |    2 +
>   include/linux/of_fdt.h                         |    1 +
>   include/linux/pci.h                            |    1 +
>   33 files changed, 2473 insertions(+), 963 deletions(-)
>   create mode 100644 drivers/pci/hotplug/powernv_php.c
>   create mode 100644 drivers/pci/hotplug/powernv_php.h
>   create mode 100644 drivers/pci/hotplug/powernv_php_slot.c
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 02/21] powerpc/powernv: Enable M64 on P7IOC
  2015-05-01  6:02   ` Gavin Shan
@ 2015-05-09  0:18     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09  0:18 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> The patch enables M64 window on P7IOC, which has been enabled on
> PHB3. Comparing to PHB3, there are 16 M64 BARs and each of them
> are divided to 8 segments.

"compared to something" means you will tell about PHB3 too :)

Do I understand correctly that IODA==IODA1==P7IOC and P7IOC != IODA2? The
code does not use the "PHB3" or "P7IOC" acronyms, so it is a bit confusing.


> So each PHB can support 128 M64 segments.
> Also, P7IOC has M64DT, which helps mapping one particular M64
> segment# to arbitrary PE#. However, we just provide 128 M64 (16 BARs)
> segments and fixed mapping between PE# and M64 segment# in order
> to keep same logic to support M64 for PHB3 and P7IOC. In turn, we
> just need different phb->init_m64() hooks for P7IOC and PHB3.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/platforms/powernv/pci-ioda.c | 115 ++++++++++++++++++++++++++----
>   1 file changed, 103 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index f8bc950..646962f 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -165,6 +165,67 @@ static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
>   	clear_bit(pe, phb->ioda.pe_alloc);
>   }
>
> +static int pnv_ioda1_init_m64(struct pnv_phb *phb)
> +{
> +	struct resource *r;
> +	int seg;
> +	s64 rc;

Here @rc is of the "s64" type.

> +
> +	/* Each PHB supports 16 separate M64 BARs, each of which are
> +	 * divided into 8 segments. So there are number of M64 segments
> +	 * as total PE#, which is 128.
> +	 */

"there are as many M64 segments as the maximum number of PEs, which is 128"?

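For reference, the arithmetic behind the fixed mapping can be sketched as
below. The 16 and 8 come from the patch description, and the bar/segment
split mirrors the pe_number / 8 and pe_number % 8 arguments used later in
the patch. The helper names are made up for illustration, and the
per-segment base address is my reading of the fixed mapping (PE# n owning
global segment n), not something the patch computes per PE.

```c
#include <assert.h>

#define TOY_M64_BARS		16	/* M64 BARs per PHB, per the patch description */
#define TOY_SEGS_PER_BAR	8	/* segments per M64 BAR */
#define TOY_TOTAL_SEGS		(TOY_M64_BARS * TOY_SEGS_PER_BAR)

struct toy_m64_slot {
	int bar;	/* which M64 BAR covers the PE */
	int seg;	/* segment index inside that BAR */
};

/* Fixed mapping: PE# n owns global M64 segment n. */
static struct toy_m64_slot toy_pe_to_m64(int pe)
{
	struct toy_m64_slot s;

	assert(pe >= 0 && pe < TOY_TOTAL_SEGS);
	s.bar = pe / TOY_SEGS_PER_BAR;
	s.seg = pe % TOY_SEGS_PER_BAR;
	return s;
}

/* Start address of the segment owned by PE# pe (assumed layout). */
static unsigned long long toy_pe_m64_base(unsigned long long m64_base,
					  unsigned long long segsize, int pe)
{
	assert(pe >= 0 && pe < TOY_TOTAL_SEGS);
	return m64_base + (unsigned long long)pe * segsize;
}
```

So PE#9 would land in BAR#1, segment 1, and the last PE (127) in BAR#15,
segment 7.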

> +	for (seg = 0; seg < phb->ioda.total_pe; seg += 8) {
> +		unsigned long base;
> +
> +		base = phb->ioda.m64_base + seg * phb->ioda.m64_segsize;
> +		rc = opal_pci_set_phb_mem_window(phb->opal_id,
> +						 OPAL_M64_WINDOW_TYPE,
> +						 seg / 8,
> +						 base,
> +						 0, /* unused */
> +						 8 * phb->ioda.m64_segsize);
> +		if (rc != OPAL_SUCCESS) {
> +			pr_warn("  Failure %lld configuring M64 BAR#%d on PHB#%d\n",
> +				rc, seg / 8, phb->hose->global_number);
> +			goto fail;
> +		}
> +
> +		rc = opal_pci_phb_mmio_enable(phb->opal_id,
> +					      OPAL_M64_WINDOW_TYPE,
> +					      seg / 8,
> +					      OPAL_ENABLE_M64_SPLIT);
> +		if (rc != OPAL_SUCCESS) {
> +			pr_warn("  Failure %lld enabling M64 BAR#%d on PHB#%d\n",
> +				rc, seg / 8, phb->hose->global_number);
> +			goto fail;
> +		}
> +	}
> +
> +	/* Strip of the segment used by the reserved PE, which
> +	 * is expected to be 0 or last supported PE#
> +	 */
> +	r = &phb->hose->mem_resources[1];

mem_resources[0] is IO, mem_resources[1] is MMIO, mem_resources[2] is for 
what? Would be nice to have this commented somewhere.


> +	if (phb->ioda.reserved_pe == 0)
> +		r->start += phb->ioda.m64_segsize;
> +	else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1))
> +		r->end -= phb->ioda.m64_segsize;
> +	else
> +		pr_warn("  Cannot strip M64 segment for reserved PE#%d\n",
> +			phb->ioda.reserved_pe);
> +
> +	return 0;
> +
> +fail:
> +	for ( ; seg >= 0; seg -= 8)
> +		opal_pci_phb_mmio_enable(phb->opal_id,
> +					 OPAL_M64_WINDOW_TYPE,
> +					 seg / 8,
> +					 OPAL_DISABLE_M64);

Out of curiosity - isn't there a counterpart to
opal_pci_set_phb_mem_window() for cleanup?
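
The unwinding pattern in the fail path, reduced to a toy model, looks
roughly like this. The "firmware" here is just an array of flags and the
toy_* functions are hypothetical stand-ins, not OPAL calls; the sketch
says nothing about whether OPAL has a window-teardown counterpart.

```c
#include <assert.h>

#define TOY_NUM_BARS 16

/* Toy stand-in for firmware state: one enable flag per M64 BAR. */
static int toy_bar_enabled[TOY_NUM_BARS];

/* Hypothetical firmware call; fails for BAR index fail_at (-1: never). */
static int toy_enable_bar(int bar, int fail_at)
{
	if (bar == fail_at)
		return -1;
	toy_bar_enabled[bar] = 1;
	return 0;
}

static void toy_disable_bar(int bar)
{
	toy_bar_enabled[bar] = 0;
}

/* Enable all BARs; on failure, unwind every BAR touched so far. */
static int toy_init_m64(int fail_at)
{
	int bar;

	for (bar = 0; bar < TOY_NUM_BARS; bar++) {
		if (toy_enable_bar(bar, fail_at) != 0)
			goto fail;
	}
	return 0;

fail:
	/* Walk back from the failing BAR down to BAR#0. */
	for ( ; bar >= 0; bar--)
		toy_disable_bar(bar);
	return -1;
}

/* Number of BARs currently enabled in the toy state. */
static int toy_count_enabled(void)
{
	int bar, count = 0;

	for (bar = 0; bar < TOY_NUM_BARS; bar++)
		count += toy_bar_enabled[bar];
	return count;
}
```

Note that only the enable step is undone here; the window configuration
itself is left as it was, which is the gap the question is about.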


> +
> +	return -EIO;
> +}
> +
>   /* The default M64 BAR is shared by all PEs */
>   static int pnv_ioda2_init_m64(struct pnv_phb *phb)
>   {
> @@ -222,7 +283,7 @@ fail:
>   	return -EIO;
>   }
>
> -static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
> +static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb)
>   {
>   	resource_size_t sgsz = phb->ioda.m64_segsize;
>   	struct pci_dev *pdev;
> @@ -248,8 +309,8 @@ static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
>   	}
>   }
>
> -static int pnv_ioda2_pick_m64_pe(struct pnv_phb *phb,
> -				 struct pci_bus *bus, int all)
> +static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
> +				struct pci_bus *bus, int all)
>   {
>   	resource_size_t segsz = phb->ioda.m64_segsize;
>   	struct pci_dev *pdev;
> @@ -346,6 +407,28 @@ done:
>   			pe->master = master_pe;
>   			list_add_tail(&pe->list, &master_pe->slaves);
>   		}
> +
> +		/* P7IOC supports M64DT, which helps mapping M64 segment
> +		 * to one particular PE#. Unfortunately, PHB3 has fixed

Why is it "Unfortunately"? This is just the way it is :)


> +		 * mapping between M64 segment and PE#. In order for same
> +		 * logic for P7IOC and PHB3, we enforce fixed mapping
> +		 * between M64 segment and PE# on P7IOC.
> +		 */
> +		if (phb->type == PNV_PHB_IODA1) {
> +			int64_t rc;

Here @rc is of the "int64_t" type. Both this one and the one above hold the
return code from the OPAL API. Make them the same (int64_t or long, up to you).


> +
> +			rc = opal_pci_map_pe_mmio_window(phb->opal_id,
> +							 pe->pe_number,
> +							 OPAL_M64_WINDOW_TYPE,
> +							 pe->pe_number / 8,
> +							 pe->pe_number % 8);
> +			if (rc != OPAL_SUCCESS)
> +				pr_warn("%s: Failure %lld mapping "
> +					"M64 for PHB#%d-PE#%d\n",
> +					__func__, rc,
> +					phb->hose->global_number,
> +					pe->pe_number);
> +		}
>   	}
>
>   	kfree(pe_alloc);
> @@ -360,12 +443,6 @@ static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>   	const u32 *r;
>   	u64 pci_addr;
>
> -	/* FIXME: Support M64 for P7IOC */
> -	if (phb->type != PNV_PHB_IODA2) {
> -		pr_info("  Not support M64 window\n");
> -		return;
> -	}
> -
>   	if (!firmware_has_feature(FW_FEATURE_OPALv3)) {
>   		pr_info("  Firmware too old to support M64 window\n");
>   		return;
> @@ -394,9 +471,23 @@ static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>
>   	/* Use last M64 BAR to cover M64 window */
>   	phb->ioda.m64_bar_idx = 15;
> -	phb->init_m64 = pnv_ioda2_init_m64;
> -	phb->reserve_m64_pe = pnv_ioda2_reserve_m64_pe;
> -	phb->pick_m64_pe = pnv_ioda2_pick_m64_pe;
> +	phb->reserve_m64_pe = pnv_ioda_reserve_m64_pe;


reserve_m64_pe() is called once from pnv_pci_ioda_setup_PEs(), so it is
IODA-only; in this case reserve_m64_pe is never NULL and
pnv_ioda_reserve_m64_pe() will always be called.

In general, it feels like pnv_phb has too many callbacks while they could 
be just direct calls.



> +	phb->pick_m64_pe = pnv_ioda_pick_m64_pe;
> +	switch (phb->type) {
> +	case PNV_PHB_IODA1:
> +		phb->init_m64 = pnv_ioda1_init_m64;
> +		break;
> +	case PNV_PHB_IODA2:
> +		phb->init_m64 = pnv_ioda2_init_m64;
> +		break;
> +	default:
> +		phb->init_m64 = NULL;
> +		phb->reserve_m64_pe = NULL;
> +		phb->pick_m64_pe = NULL;
> +		phb->ioda.m64_size = 0;
> +		phb->ioda.m64_segsize = 0;
> +		phb->ioda.m64_base = 0;

There are just 2 PHB types - IODA1 and IODA2, right? And the fields you
reset after "default" - they have to be zeroes already, no? And on what
hardware would the default branch actually work? None?


> +	}
>   }
>
>   static void pnv_ioda_freeze_pe(struct pnv_phb *phb, int pe_no)
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 02/21] powerpc/powernv: Enable M64 on P7IOC
@ 2015-05-09  0:18     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09  0:18 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: bhelgaas, linux-pci

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> The patch enables M64 window on P7IOC, which has been enabled on
> PHB3. Comparing to PHB3, there are 16 M64 BARs and each of them
> are divided to 8 segments.

"compared to something" means you will tell about PHB3 too :)

Do I understand correctly that IODA==IODA1==P7IOC and P7IOC != IODA2? The
code does not use the "PHB3" or "P7IOC" acronyms, so it is a bit confusing.


> So each PHB can support 128 M64 segments.
> Also, P7IOC has M64DT, which helps mapping one particular M64
> segment# to arbitrary PE#. However, we just provide 128 M64 (16 BARs)
> segments and fixed mapping between PE# and M64 segment# in order
> to keep same logic to support M64 for PHB3 and P7IOC. In turn, we
> just need different phb->init_m64() hooks for P7IOC and PHB3.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/platforms/powernv/pci-ioda.c | 115 ++++++++++++++++++++++++++----
>   1 file changed, 103 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index f8bc950..646962f 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -165,6 +165,67 @@ static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
>   	clear_bit(pe, phb->ioda.pe_alloc);
>   }
>
> +static int pnv_ioda1_init_m64(struct pnv_phb *phb)
> +{
> +	struct resource *r;
> +	int seg;
> +	s64 rc;

Here @rc is of the "s64" type.

> +
> +	/* Each PHB supports 16 separate M64 BARs, each of which are
> +	 * divided into 8 segments. So there are number of M64 segments
> +	 * as total PE#, which is 128.
> +	 */

"there are as many M64 segments as the maximum number of PEs, which is 128"?


> +	for (seg = 0; seg < phb->ioda.total_pe; seg += 8) {
> +		unsigned long base;
> +
> +		base = phb->ioda.m64_base + seg * phb->ioda.m64_segsize;
> +		rc = opal_pci_set_phb_mem_window(phb->opal_id,
> +						 OPAL_M64_WINDOW_TYPE,
> +						 seg / 8,
> +						 base,
> +						 0, /* unused */
> +						 8 * phb->ioda.m64_segsize);
> +		if (rc != OPAL_SUCCESS) {
> +			pr_warn("  Failure %lld configuring M64 BAR#%d on PHB#%d\n",
> +				rc, seg / 8, phb->hose->global_number);
> +			goto fail;
> +		}
> +
> +		rc = opal_pci_phb_mmio_enable(phb->opal_id,
> +					      OPAL_M64_WINDOW_TYPE,
> +					      seg / 8,
> +					      OPAL_ENABLE_M64_SPLIT);
> +		if (rc != OPAL_SUCCESS) {
> +			pr_warn("  Failure %lld enabling M64 BAR#%d on PHB#%d\n",
> +				rc, seg / 8, phb->hose->global_number);
> +			goto fail;
> +		}
> +	}
> +
> +	/* Strip of the segment used by the reserved PE, which
> +	 * is expected to be 0 or last supported PE#
> +	 */
> +	r = &phb->hose->mem_resources[1];

mem_resources[0] is IO, mem_resources[1] is MMIO, mem_resources[2] is for 
what? Would be nice to have this commented somewhere.


> +	if (phb->ioda.reserved_pe == 0)
> +		r->start += phb->ioda.m64_segsize;
> +	else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1))
> +		r->end -= phb->ioda.m64_segsize;
> +	else
> +		pr_warn("  Cannot strip M64 segment for reserved PE#%d\n",
> +			phb->ioda.reserved_pe);
> +
> +	return 0;
> +
> +fail:
> +	for ( ; seg >= 0; seg -= 8)
> +		opal_pci_phb_mmio_enable(phb->opal_id,
> +					 OPAL_M64_WINDOW_TYPE,
> +					 seg / 8,
> +					 OPAL_DISABLE_M64);

Out of curiosity - isn't there a counterpart to
opal_pci_set_phb_mem_window() for cleanup?


> +
> +	return -EIO;
> +}
> +
>   /* The default M64 BAR is shared by all PEs */
>   static int pnv_ioda2_init_m64(struct pnv_phb *phb)
>   {
> @@ -222,7 +283,7 @@ fail:
>   	return -EIO;
>   }
>
> -static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
> +static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb)
>   {
>   	resource_size_t sgsz = phb->ioda.m64_segsize;
>   	struct pci_dev *pdev;
> @@ -248,8 +309,8 @@ static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
>   	}
>   }
>
> -static int pnv_ioda2_pick_m64_pe(struct pnv_phb *phb,
> -				 struct pci_bus *bus, int all)
> +static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
> +				struct pci_bus *bus, int all)
>   {
>   	resource_size_t segsz = phb->ioda.m64_segsize;
>   	struct pci_dev *pdev;
> @@ -346,6 +407,28 @@ done:
>   			pe->master = master_pe;
>   			list_add_tail(&pe->list, &master_pe->slaves);
>   		}
> +
> +		/* P7IOC supports M64DT, which helps mapping M64 segment
> +		 * to one particular PE#. Unfortunately, PHB3 has fixed

Why is it "Unfortunately"? This is just the way it is :)


> +		 * mapping between M64 segment and PE#. In order for same
> +		 * logic for P7IOC and PHB3, we enforce fixed mapping
> +		 * between M64 segment and PE# on P7IOC.
> +		 */
> +		if (phb->type == PNV_PHB_IODA1) {
> +			int64_t rc;

Here @rc is of the "int64_t" type. Both this one and the one above hold 
return codes from the OPAL API, so make them the same type (int64_t or long, 
up to you).
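
For reference, the fixed mapping described in the quoted comment (M64 
segment#x goes to PE#x) shows up in the call that follows as pe_number / 8 
and pe_number % 8. Here is a small sketch of that arithmetic, assuming the 
last two arguments of opal_pci_map_pe_mmio_window() are a window index and 
a segment index within that window, with 8 segments per window. struct 
m64_slot and both helpers are hypothetical names.

```c
#include <assert.h>

#define SEGS_PER_BAR 8  /* assumed from the "/ 8" and "% 8" in the quoted call */

struct m64_slot {
	unsigned int bar;  /* window index */
	unsigned int seg;  /* segment within that window */
};

/* Split a PE number into (window, segment), as the quoted call does. */
static struct m64_slot pe_to_m64_slot(unsigned int pe_number)
{
	struct m64_slot s = {
		.bar = pe_number / SEGS_PER_BAR,
		.seg = pe_number % SEGS_PER_BAR,
	};
	return s;
}

/* Inverse mapping: with a fixed layout the slot determines the PE number. */
static unsigned int m64_slot_to_pe(struct m64_slot s)
{
	assert(s.seg < SEGS_PER_BAR);
	return s.bar * SEGS_PER_BAR + s.seg;
}
```

Because the two functions are inverses, choosing an M64 segment and 
choosing a PE number are the same decision under this scheme.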


> +
> +			rc = opal_pci_map_pe_mmio_window(phb->opal_id,
> +							 pe->pe_number,
> +							 OPAL_M64_WINDOW_TYPE,
> +							 pe->pe_number / 8,
> +							 pe->pe_number % 8);
> +			if (rc != OPAL_SUCCESS)
> +				pr_warn("%s: Failure %lld mapping "
> +					"M64 for PHB#%d-PE#%d\n",
> +					__func__, rc,
> +					phb->hose->global_number,
> +					pe->pe_number);
> +		}
>   	}
>
>   	kfree(pe_alloc);
> @@ -360,12 +443,6 @@ static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>   	const u32 *r;
>   	u64 pci_addr;
>
> -	/* FIXME: Support M64 for P7IOC */
> -	if (phb->type != PNV_PHB_IODA2) {
> -		pr_info("  Not support M64 window\n");
> -		return;
> -	}
> -
>   	if (!firmware_has_feature(FW_FEATURE_OPALv3)) {
>   		pr_info("  Firmware too old to support M64 window\n");
>   		return;
> @@ -394,9 +471,23 @@ static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>
>   	/* Use last M64 BAR to cover M64 window */
>   	phb->ioda.m64_bar_idx = 15;
> -	phb->init_m64 = pnv_ioda2_init_m64;
> -	phb->reserve_m64_pe = pnv_ioda2_reserve_m64_pe;
> -	phb->pick_m64_pe = pnv_ioda2_pick_m64_pe;
> +	phb->reserve_m64_pe = pnv_ioda_reserve_m64_pe;


reserve_m64_pe() is called only once, from pnv_pci_ioda_setup_PEs(), so it 
is IODA-only. In that case reserve_m64_pe != NULL, and 
pnv_ioda_reserve_m64_pe() will always be called.

In general, it feels like pnv_phb has too many callbacks while they could 
be just direct calls.
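
To make the callback remark concrete, here is a sketch contrasting the two 
styles. It is a simplified userspace model: struct phb, the reserved 
counter and the setup_pes_*() helpers are hypothetical and only mirror the 
shape of the code under review.

```c
#include <assert.h>
#include <stddef.h>

enum phb_type { PHB_IODA1, PHB_IODA2 };  /* hypothetical model types */

struct phb {
	enum phb_type type;
	int reserved;                             /* counts reservations, for the demo */
	void (*reserve_m64_pe)(struct phb *phb);  /* callback variant */
};

/* The single implementation shared by both PHB types in this model. */
static void ioda_reserve_m64_pe(struct phb *phb)
{
	assert(phb != NULL);
	phb->reserved++;
}

/* Callback style: every caller has to NULL-check the hook first. */
static void setup_pes_callback(struct phb *phb)
{
	if (phb->reserve_m64_pe)
		phb->reserve_m64_pe(phb);
}

/* Direct style: there is only one implementation, so just call it. */
static void setup_pes_direct(struct phb *phb)
{
	ioda_reserve_m64_pe(phb);
}
```

When every PHB type installs the same function, the pointer adds a NULL 
check and an indirection without selecting anything.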



> +	phb->pick_m64_pe = pnv_ioda_pick_m64_pe;
> +	switch (phb->type) {
> +	case PNV_PHB_IODA1:
> +		phb->init_m64 = pnv_ioda1_init_m64;
> +		break;
> +	case PNV_PHB_IODA2:
> +		phb->init_m64 = pnv_ioda2_init_m64;
> +		break;
> +	default:
> +		phb->init_m64 = NULL;
> +		phb->reserve_m64_pe = NULL;
> +		phb->pick_m64_pe = NULL;
> +		phb->ioda.m64_size = 0;
> +		phb->ioda.m64_segsize = 0;
> +		phb->ioda.m64_base = 0;

There are just 2 PHB types - IODA1 and IODA2, right? And the fields you 
reset after "default" - they have to be zeroes already, no? And on what 
hardware would the default branch actually work? None?


> +	}
>   }
>
>   static void pnv_ioda_freeze_pe(struct pnv_phb *phb, int pe_no)
>


-- 
Alexey


* Re: [PATCH v4 03/21] powerpc/powernv: M64 support improvement
  2015-05-01  6:02   ` Gavin Shan
@ 2015-05-09 10:24     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 10:24 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> We're having the hardware or enforced (on P7IOC) limitation: M64

I would think if it is enforced, then it is enforced by hardware but you 
say "hardware OR enforced" :)


> segment#x can only be assigned to PE#x. IO and M32 segment can be
> mapped to arbitrary PE# via IODT and M32DT. It means the PE number
> should be x if M64 segment#x has been assigned to the PE. Also, each
> PE own one M64 segment at most. Currently, we are reserving PE#
> according to root port's M64 window. It won't be reliable once we
> extend M64 windows of root port, or the upstream port of the PCIE
> switch behind root port to PHB's M64 window, in order to support
> PCI hotplug in future.
>
> The patch reserves PE# for M64 segments according to the M64 resources
> of the PCI devices (not bridges) contained in the PE. Besides, it's
> always worthy to trace the M64 segments consumed by the PE, which can
> be released at PCI unplugging time.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/platforms/powernv/pci-ioda.c | 190 ++++++++++++++++++------------
>   arch/powerpc/platforms/powernv/pci.h      |  10 +-
>   2 files changed, 122 insertions(+), 78 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 646962f..a994882 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -283,28 +283,78 @@ fail:
>   	return -EIO;
>   }
>
> -static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb)
> +/* We extend the M64 window of root port, or the upstream bridge port
> + * of the PCIE switch behind root port. So we shouldn't reserve PEs
> + * for M64 resources because there are no (normal) PCI devices consuming

"PCI devices"? Not "root ports or PCI bridges"?

> + * M64 resources on the PCI buses leading from root port, or the upstream
> + * bridge port.The function returns true if the indicated PCI bus needs
> + * reserved PEs because of M64 resources in advance. Otherwise, the
> + * function returns false.
> + */
> +static bool pnv_ioda_need_m64_pe(struct pnv_phb *phb,
> +				 struct pci_bus *bus)
>   {
> -	resource_size_t sgsz = phb->ioda.m64_segsize;
> +	/* Root bus */

The comment is too obvious as the call below is called "pci_is_root_bus" :)


> +	if (!bus || pci_is_root_bus(bus))
> +		return false;
> +
> +	/* Bus leading from root port. We need check what types of PCI
> +	 * devices on the bus. If it's connecting PCI bridge, we don't
> +	 * need reserve M64 PEs for it. Otherwise, we still need to do
> +	 * that.
> +	 */
> +	if (pci_is_root_bus(bus->self->bus)) {
> +		struct pci_dev *pdev;
> +
> +		list_for_each_entry(pdev, &bus->devices, bus_list) {
> +			if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
> +				return true;
> +		}
> +
> +		return false;
> +	}
> +
> +	/* Bus leading from the upstream bridge port on top level */
> +	if (pci_is_root_bus(bus->self->bus->self->bus))


Is it for second level bridges? Like root->bridge->bridge? And for 3 levels 
you will need a PE?
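
To make the question concrete, here is how the quoted checks appear to 
classify a bus by its depth below the root bus. This is a simplified 
userspace model: struct bus is hypothetical, parent stands in for 
bus->self->bus, and has_normal_device stands in for the 
PCI_HEADER_TYPE_NORMAL scan.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct pci_bus. */
struct bus {
	struct bus *parent;      /* NULL for the root bus */
	bool has_normal_device;  /* any non-bridge (type 0 header) device */
};

static bool is_root_bus(const struct bus *b)
{
	return b->parent == NULL;
}

static bool need_m64_pe(const struct bus *b)
{
	if (!b || is_root_bus(b))
		return false;

	/* Depth 1: bus directly behind a root port. */
	if (is_root_bus(b->parent))
		return b->has_normal_device;

	/* Depth 2: bus behind the top-level upstream bridge port. */
	if (is_root_bus(b->parent->parent))
		return false;

	/* Depth 3 and deeper. */
	return true;
}
```

Under this reading, a depth-2 bus never gets M64 PEs reserved, even if 
normal devices sit on it, while every bus at depth 3 or deeper does.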


> +		return false;
> +
> +	return true;
> +}
> +
> +static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
> +				    struct pci_bus *bus)
> +{
> +	resource_size_t segsz = phb->ioda.m64_segsize;
>   	struct pci_dev *pdev;
>   	struct resource *r;
> -	int base, step, i;
> +	unsigned long pe_no, limit;
> +	int i;
>
> -	/*
> -	 * Root bus always has full M64 range and root port has
> -	 * M64 range used in reality. So we're checking root port
> -	 * instead of root bus.
> +	if (!pnv_ioda_need_m64_pe(phb, bus))
> +		return;
> +
> +	/* The bridge's M64 window might have been extended to the
> +	 * PHB's M64 window in order to support PCI hotplug. So the
> +	 * bridge's M64 window isn't reliable to be used for picking
> +	 * PE# for its leading PCI bus. We have to check the M64
> +	 * resources consumed by the PCI devices, which seat on the
> +	 * PCI bus.
>   	 */
> -	list_for_each_entry(pdev, &phb->hose->bus->devices, bus_list) {
> -		for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
> -			r = &pdev->resource[PCI_BRIDGE_RESOURCES + i];
> -			if (!r->parent ||
> -			    !pnv_pci_is_mem_pref_64(r->flags))
> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
> +		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +#ifdef CONFIG_PCI_IOV
> +			if (i >= PCI_IOV_RESOURCES && i <= PCI_IOV_RESOURCE_END)
> +				continue;
> +#endif
> +			r = &pdev->resource[i];
> +			if (!r->flags || r->start >= r->end ||
> +			    !r->parent || !pnv_pci_is_mem_pref_64(r->flags))
>   				continue;
>
> -			base = (r->start - phb->ioda.m64_base) / sgsz;
> -			for (step = 0; step < resource_size(r) / sgsz; step++)
> -				pnv_ioda_reserve_pe(phb, base + step);
> +			pe_no = (r->start - phb->ioda.m64_base) / segsz;
> +			limit = ALIGN(r->end - phb->ioda.m64_base, segsz) / segsz;
> +			for (; pe_no < limit; pe_no++)
> +				pnv_ioda_reserve_pe(phb, pe_no);
>   		}
>   	}
>   }
> @@ -316,85 +366,64 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>   	struct pci_dev *pdev;
>   	struct resource *r;
>   	struct pnv_ioda_pe *master_pe, *pe;
> -	unsigned long size, *pe_alloc;
> -	bool found;
> -	int start, i, j;
> -
> -	/* Root bus shouldn't use M64 */
> -	if (pci_is_root_bus(bus))
> -		return IODA_INVALID_PE;
> -
> -	/* We support only one M64 window on each bus */
> -	found = false;
> -	pci_bus_for_each_resource(bus, r, i) {
> -		if (r && r->parent &&
> -		    pnv_pci_is_mem_pref_64(r->flags)) {
> -			found = true;
> -			break;
> -		}
> -	}
> +	unsigned long size, *pe_bitsmap;

s/pe_bitsmap/pe_bitmap/
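
On the pe_no/limit computation used in both pnv_ioda_reserve_m64_pe() and 
pnv_ioda_pick_m64_pe(): here is a userspace sketch of the arithmetic, 
assuming segsz is a power of two (which the kernel's ALIGN() requires) and 
that start/end are inclusive bounds as in struct resource. 
ALIGN_UP_POW2(), struct seg_range and m64_seg_range() are hypothetical 
names.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP_POW2(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

struct seg_range {
	uint64_t first;  /* first segment index (inclusive) */
	uint64_t limit;  /* one past the last segment index */
};

/* start and end are inclusive resource bounds. */
static struct seg_range m64_seg_range(uint64_t start, uint64_t end,
				      uint64_t m64_base, uint64_t segsz)
{
	struct seg_range r;

	assert(segsz && (segsz & (segsz - 1)) == 0);
	assert(start >= m64_base && end >= start);

	r.first = (start - m64_base) / segsz;
	r.limit = ALIGN_UP_POW2(end - m64_base, segsz) / segsz;
	return r;
}
```

With 1MB segments, a 64KB BAR at offset 3MB gives first = 3, limit = 4, so 
one PE is reserved. The removed loop, which counted resource_size(r) / 
sgsz steps, would have reserved none for it. One edge worth checking: if 
end - m64_base is itself a multiple of segsz, rounding up leaves it 
unchanged and the segment holding the last byte is not covered. That 
should not happen with naturally aligned power-of-two BARs.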


> +	unsigned long pe_no, limit;
> +	int i;
>
> -	/* No M64 window found ? */
> -	if (!found)
> +	if (!pnv_ioda_need_m64_pe(phb, bus))
>   		return IODA_INVALID_PE;
>
> -	/* Allocate bitmap */
> +        /* Allocate bitmap */
>   	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
> -	pe_alloc = kzalloc(size, GFP_KERNEL);
> -	if (!pe_alloc) {
> -		pr_warn("%s: Out of memory !\n",
> -			__func__);
> +	pe_bitsmap = kzalloc(size, GFP_KERNEL);
> +	if (!pe_bitsmap) {
> +		pr_warn("%s: Out of memory !\n", __func__);
>   		return IODA_INVALID_PE;
>   	}
>
> -	/*
> -	 * Figure out reserved PE numbers by the PE
> -	 * the its child PEs.
> -	 */
> -	start = (r->start - phb->ioda.m64_base) / segsz;
> -	for (i = 0; i < resource_size(r) / segsz; i++)
> -		set_bit(start + i, pe_alloc);
> -
> -	if (all)
> -		goto done;
> -
> -	/*
> -	 * If the PE doesn't cover all subordinate buses,
> -	 * we need subtract from reserved PEs for children.
> +	/* The bridge's M64 window might be extended to PHB's M64
> +	 * window by intention to support PCI hotplug. So we have
> +	 * to check the M64 resources consumed by the PCI devices
> +	 * on the PCI bus.
>   	 */
>   	list_for_each_entry(pdev, &bus->devices, bus_list) {
> -		if (!pdev->subordinate)
> -			continue;
> +		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +#ifdef CONFIG_PCI_IOV
> +			if (i >= PCI_IOV_RESOURCES &&
> +			    i <= PCI_IOV_RESOURCE_END)
> +				continue;
> +#endif
> +			/* Don't scan bridge's window if the PE
> +			 * doesn't contain its subordinate bus.
> +			 */
> +			if (!all && i >= PCI_BRIDGE_RESOURCES &&
> +			    i <= PCI_BRIDGE_RESOURCE_END)
> +				continue;
>
> -		pci_bus_for_each_resource(pdev->subordinate, r, i) {
> -			if (!r || !r->parent ||
> -			    !pnv_pci_is_mem_pref_64(r->flags))
> +			r = &pdev->resource[i];
> +			if (!r->flags || r->start >= r->end ||
> +			    !r->parent || !pnv_pci_is_mem_pref_64(r->flags))
>   				continue;
>
> -			start = (r->start - phb->ioda.m64_base) / segsz;
> -			for (j = 0; j < resource_size(r) / segsz ; j++)
> -				clear_bit(start + j, pe_alloc);
> -                }
> -        }
> +			pe_no = (r->start - phb->ioda.m64_base) / segsz;
> +			limit = ALIGN(r->end - phb->ioda.m64_base, segsz) / segsz;
> +			for (; pe_no < limit; pe_no++)
> +				set_bit(pe_no, pe_bitsmap);
> +		}
> +	}
>
> -	/*
> -	 * the current bus might not own M64 window and that's all
> -	 * contributed by its child buses. For the case, we needn't
> -	 * pick M64 dependent PE#.
> -	 */
> -	if (bitmap_empty(pe_alloc, phb->ioda.total_pe)) {
> -		kfree(pe_alloc);
> +	/* No M64 window found ? */
> +	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
> +		kfree(pe_bitsmap);
>   		return IODA_INVALID_PE;
>   	}
>
> -	/*
> -	 * Figure out the master PE and put all slave PEs to master
> -	 * PE's list to form compound PE.
> +	/* Figure out the master PE and put all slave PEs
> +	 * to master PE's list to form compound PE.
>   	 */
> -done:
>   	master_pe = NULL;
>   	i = -1;
> -	while ((i = find_next_bit(pe_alloc, phb->ioda.total_pe, i + 1)) <
> +	while ((i = find_next_bit(pe_bitsmap, phb->ioda.total_pe, i + 1)) <
>   		phb->ioda.total_pe) {
>   		pe = &phb->ioda.pe_array[i];
>
> @@ -408,6 +437,13 @@ done:
>   			list_add_tail(&pe->list, &master_pe->slaves);
>   		}
>
> +		/* Pick the M64 segment, which should be available. Also,

test_and_set_bit() does not pick or choose, it just marks PE#pe_number used.
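
For clarity, here are the semantics being relied on: test_and_set_bit() 
sets the bit and returns its previous value, so a true result means the 
segment was already claimed. The following is a non-atomic userspace 
model, not the kernel implementation; model_test_and_set_bit() is a 
hypothetical name.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Non-atomic model of test_and_set_bit(): set bit nr and return
 * whether it was already set. The kernel version is atomic.
 */
static bool model_test_and_set_bit(unsigned int nr, unsigned long *map)
{
	unsigned long mask = 1UL << (nr % BITS_PER_ULONG);
	unsigned long *word = &map[nr / BITS_PER_ULONG];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}
```

So the quoted BUG_ON() lines mark PE#pe_number as used and trap only when 
the bit was set beforehand. Nothing is searched for or chosen.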

> +		 * those M64 segments consumed by slave PEs are contributed
> +		 * to the master PE.
> +		 */
> +		BUG_ON(test_and_set_bit(pe->pe_number, phb->ioda.m64_segmap));
> +		BUG_ON(test_and_set_bit(pe->pe_number, master_pe->m64_segmap));
> +
>   		/* P7IOC supports M64DT, which helps mapping M64 segment
>   		 * to one particular PE#. Unfortunately, PHB3 has fixed
>   		 * mapping between M64 segment and PE#. In order for same
> @@ -431,7 +467,7 @@ done:
>   		}
>   	}
>
> -	kfree(pe_alloc);
> +	kfree(pe_bitsmap);
>   	return master_pe->pe_number;
>   }
>
> @@ -1233,7 +1269,7 @@ static void pnv_pci_ioda_setup_PEs(void)
>
>   		/* M64 layout might affect PE allocation */
>   		if (phb->reserve_m64_pe)
> -			phb->reserve_m64_pe(phb);
> +			phb->reserve_m64_pe(phb, phb->hose->bus);
>
>   		pnv_ioda_setup_PEs(hose->bus);
>   	}
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 070ee88..19022cf 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -49,6 +49,13 @@ struct pnv_ioda_pe {
>   	/* PE number */
>   	unsigned int		pe_number;
>
> +	/* IO/M32/M64 segments consumed by the PE. Each PE can
> +	 * have one M64 segment at most, but M64 segments consumed
> +	 * by slave PEs will be contributed to the master PE. One
> +	 * PE can own multiple IO and M32 segments.
> +	 */
> +	unsigned long		m64_segmap[8];


Why 8? Is it 64*8 = 512 segments? Maybe s/8/BITS_TO_LONGS(512)/ (i.e. 
512 / BITS_PER_LONG)?
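
To spell out the arithmetic behind the question: with a 64-bit unsigned 
long, 8 array elements hold 512 bits. Note that the divisor has to be the 
bit width of unsigned long, not its sizeof() in bytes. Here is a sketch of 
sizing the map from the segment count rather than a bare 8, assuming 512 
is the intended maximum. MAX_M64_SEGS, ULONGS_FOR_BITS() and struct 
pe_model are hypothetical names; the kernel's own helper for this is 
BITS_TO_LONGS().

```c
#include <limits.h>

#define MAX_M64_SEGS 512  /* assumed from "64*8 = 512" */

#define BITS_PER_ULONG      (sizeof(unsigned long) * CHAR_BIT)
/* Number of unsigned longs needed to hold n bits, rounded up. */
#define ULONGS_FOR_BITS(n)  (((n) + BITS_PER_ULONG - 1) / BITS_PER_ULONG)

/* Hypothetical stand-in for the structure carrying the segment map. */
struct pe_model {
	unsigned long m64_segmap[ULONGS_FOR_BITS(MAX_M64_SEGS)];
};
```

Written this way the array length follows the segment count on both 
32-bit and 64-bit builds, and the 512 is visible at the declaration.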


> +
>   	/* "Weight" assigned to the PE for the sake of DMA resource
>   	 * allocations
>   	 */
> @@ -114,7 +121,7 @@ struct pnv_phb {
>   	u32 (*bdfn_to_pe)(struct pnv_phb *phb, struct pci_bus *bus, u32 devfn);
>   	void (*shutdown)(struct pnv_phb *phb);
>   	int (*init_m64)(struct pnv_phb *phb);
> -	void (*reserve_m64_pe)(struct pnv_phb *phb);
> +	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
>   	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
>   	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>   	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
> @@ -153,6 +160,7 @@ struct pnv_phb {
>   			struct mutex		pe_alloc_mutex;
>
>   			/* M32 & IO segment maps */
> +			unsigned long		m64_segmap[8];
>   			unsigned int		*m32_segmap;
>   			unsigned int		*io_segmap;
>   			struct pnv_ioda_pe	*pe_array;
>


-- 
Alexey


* Re: [PATCH v4 04/21] powerpc/powernv: Improve IO and M32 mapping
  2015-05-01  6:02   ` Gavin Shan
@ 2015-05-09 10:53     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 10:53 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> The PHB's IO or M32 window is divided evenly to segments, each of
> them can be mapped to arbitrary PE# by IODT or M32DT. Current code
> figures out the consumed IO and M32 segments by one particular PE
> from the windows of the PE's upstream bridge. It won't be reliable
> once we extend M64 windows of root port, or the upstream port of
> the PCIE switch behind root port to PHB's IO or M32 window, in order
> to support PCI hotplug in future.
>
> The patch improves pnv_ioda_setup_pe_seg() to calculate PE's consumed
> IO or M32 segments from its contained devices, no bridge involved any
> more. Also, the logic to mapping IO and M32 segments are combined to
> simplify the code. Besides, it's always worthy to trace the IO and M32
> segments consumed by one PE, which can be released at PCI unplugging
> time.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/platforms/powernv/pci-ioda.c | 150 ++++++++++++++++--------------
>   arch/powerpc/platforms/powernv/pci.h      |  13 +--
>   2 files changed, 85 insertions(+), 78 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index a994882..7e6e266 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2543,77 +2543,92 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>   }
>   #endif /* CONFIG_PCI_IOV */
>
> -/*
> - * This function is supposed to be called on basis of PE from top
> - * to bottom style. So the the I/O or MMIO segment assigned to
> - * parent PE could be overrided by its child PEs if necessary.
> - */
> -static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
> -				  struct pnv_ioda_pe *pe)
> +static int pnv_ioda_map_pe_one_res(struct pci_controller *hose,
> +				   struct pnv_ioda_pe *pe,
> +				   struct resource *res)
>   {
>   	struct pnv_phb *phb = hose->private_data;
>   	struct pci_bus_region region;
> -	struct resource *res;
> -	int i, index;
> -	int rc;
> +	unsigned int segsize, index;
> +	unsigned long *segmap, *pe_segmap;
> +	uint16_t win_type;
> +	int64_t rc;
>
> -	/*
> -	 * NOTE: We only care PCI bus based PE for now. For PCI
> -	 * device based PE, for example SRIOV sensitive VF should
> -	 * be figured out later.
> -	 */
> -	BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)));
> +	/* Check if we need map the resource */
> +	if (!res->parent || !res->flags ||
> +	    res->start > res->end ||
> +	    pnv_pci_is_mem_pref_64(res->flags))
> +		return 0;
>
> -	pci_bus_for_each_resource(pe->pbus, res, i) {
> -		if (!res || !res->flags ||
> -		    res->start > res->end)
> -			continue;
> +	if (res->flags & IORESOURCE_IO) {
> +		segmap = phb->ioda.io_segmap;
> +		pe_segmap = pe->io_segmap;
> +		region.start = res->start - phb->ioda.io_pci_base;
> +		region.end = res->end - phb->ioda.io_pci_base;
> +		segsize = phb->ioda.io_segsize;
> +		win_type = OPAL_IO_WINDOW_TYPE;
> +	} else {
> +		segmap = phb->ioda.m32_segmap;
> +		pe_segmap = pe->m32_segmap;
> +		region.start = res->start -
> +			       hose->mem_offset[0] -
> +			       phb->ioda.m32_pci_base;
> +		region.end = res->end -
> +			     hose->mem_offset[0] -
> +			     phb->ioda.m32_pci_base;
> +		segsize = phb->ioda.m32_segsize;
> +		win_type = OPAL_M32_WINDOW_TYPE;
> +	}
> +
> +	index = region.start / segsize;
> +	while (index < phb->ioda.total_pe &&
> +	       region.start <= region.end) {
> +		rc = opal_pci_map_pe_mmio_window(phb->opal_id,
> +				pe->pe_number, win_type, 0, index);
> +		if (rc != OPAL_SUCCESS) {
> +			pr_warn("%s: Error %lld mapping (%d) seg#%d to PE#%d\n",
> +				__func__, rc, win_type, index, pe->pe_number);
> +			return -EIO;
> +		}
>
> -		if (res->flags & IORESOURCE_IO) {
> -			region.start = res->start - phb->ioda.io_pci_base;
> -			region.end   = res->end - phb->ioda.io_pci_base;
> -			index = region.start / phb->ioda.io_segsize;
> +		set_bit(index, segmap);
> +		set_bit(index, pe_segmap);
> +		region.start += segsize;
> +		index++;
> +	}
>
> -			while (index < phb->ioda.total_pe &&
> -			       region.start <= region.end) {
> -				phb->ioda.io_segmap[index] = pe->pe_number;
> -				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
> -					pe->pe_number, OPAL_IO_WINDOW_TYPE, 0, index);
> -				if (rc != OPAL_SUCCESS) {
> -					pr_err("%s: OPAL error %d when mapping IO "
> -					       "segment #%d to PE#%d\n",
> -					       __func__, rc, index, pe->pe_number);
> -					break;
> -				}
> +	return 0;
> +}
>
> -				region.start += phb->ioda.io_segsize;
> -				index++;
> -			}
> -		} else if ((res->flags & IORESOURCE_MEM) &&
> -			   !pnv_pci_is_mem_pref_64(res->flags)) {
> -			region.start = res->start -
> -				       hose->mem_offset[0] -
> -				       phb->ioda.m32_pci_base;
> -			region.end   = res->end -
> -				       hose->mem_offset[0] -
> -				       phb->ioda.m32_pci_base;
> -			index = region.start / phb->ioda.m32_segsize;
> -
> -			while (index < phb->ioda.total_pe &&
> -			       region.start <= region.end) {
> -				phb->ioda.m32_segmap[index] = pe->pe_number;
> -				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
> -					pe->pe_number, OPAL_M32_WINDOW_TYPE, 0, index);
> -				if (rc != OPAL_SUCCESS) {
> -					pr_err("%s: OPAL error %d when mapping M32 "
> -					       "segment#%d to PE#%d",
> -					       __func__, rc, index, pe->pe_number);
> -					break;
> -				}
> +static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
> +				  struct pnv_ioda_pe *pe)
> +{
> +	struct pci_dev *pdev;
> +	struct resource *res;
> +	int i;
>
> -				region.start += phb->ioda.m32_segsize;
> -				index++;
> -			}
> +	/* This function only works for bus dependent PE */
> +	BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)));
> +
> +	list_for_each_entry(pdev, &pe->pbus->devices, bus_list) {
> +		for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
> +			res = &pdev->resource[i];
> +			if (pnv_ioda_map_pe_one_res(hose, pe, res))
> +				return;
> +		}
> +
> +		/* If the PE contains all subordinate PCI buses, the
> +		 * resources of the child bridges should be mapped
> +		 * to the PE as well.
> +		 */
> +		if (!(pe->flags & PNV_IODA_PE_BUS_ALL) ||
> +		    (pdev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
> +			continue;
> +
> +		for (i = 0; i <= PCI_BRIDGE_RESOURCE_NUM; i++) {
> +			res = &pdev->resource[PCI_BRIDGE_RESOURCES + i];
> +			if (pnv_ioda_map_pe_one_res(hose, pe, res))
> +				return;


This chunk is really hard to review. It looks like you completely
reimplemented the function instead of patching it. For reviewability and
bisectability, it would help to split it into several simpler patches.



>   		}
>   	}
>   }
> @@ -2780,7 +2795,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>   {
>   	struct pci_controller *hose;
>   	struct pnv_phb *phb;
> -	unsigned long size, m32map_off, pemap_off, iomap_off = 0;
> +	unsigned long size, pemap_off;
>   	const __be64 *prop64;
>   	const __be32 *prop32;
>   	int len;
> @@ -2865,19 +2880,10 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>
>   	/* Allocate aux data & arrays. We don't have IO ports on PHB3 */
>   	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
> -	m32map_off = size;
> -	size += phb->ioda.total_pe * sizeof(phb->ioda.m32_segmap[0]);
> -	if (phb->type == PNV_PHB_IODA1) {
> -		iomap_off = size;
> -		size += phb->ioda.total_pe * sizeof(phb->ioda.io_segmap[0]);
> -	}
>   	pemap_off = size;
>   	size += phb->ioda.total_pe * sizeof(struct pnv_ioda_pe);
>   	aux = memblock_virt_alloc(size, 0);
>   	phb->ioda.pe_alloc = aux;
> -	phb->ioda.m32_segmap = aux + m32map_off;
> -	if (phb->type == PNV_PHB_IODA1)
> -		phb->ioda.io_segmap = aux + iomap_off;
>   	phb->ioda.pe_array = aux + pemap_off;
>   	set_bit(phb->ioda.reserved_pe, phb->ioda.pe_alloc);
>
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 19022cf..f604bb7 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -54,6 +54,8 @@ struct pnv_ioda_pe {
>   	 * by slave PEs will be contributed to the master PE. One
>   	 * PE can own multiple IO and M32 segments.
>   	 */
> +	unsigned long		io_segmap[8];
> +	unsigned long		m32_segmap[8];
>   	unsigned long		m64_segmap[8];
>
>   	/* "Weight" assigned to the PE for the sake of DMA resource
> @@ -154,16 +156,15 @@ struct pnv_phb {
>   			unsigned int		io_segsize;
>   			unsigned int		io_pci_base;
>
> -			/* PE allocation bitmap */
> +			/* PE allocation */
> +			struct pnv_ioda_pe	*pe_array;
>   			unsigned long		*pe_alloc;
> -			/* PE allocation mutex */
>   			struct mutex		pe_alloc_mutex;
>
> -			/* M32 & IO segment maps */
> +			/* IO/M32/M64 segment bitmaps */
> +			unsigned long		io_segmap[8];
> +			unsigned long		m32_segmap[8];
>   			unsigned long		m64_segmap[8];


Is this a copy of the same-named fields above, in pnv_ioda_pe? Why 8?
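
For context on the "Why 8?" question, here is a minimal userspace sketch of
how many segment bits an array of unsigned longs can track when it is used
as a bitmap. The sketch_* names are hypothetical and only modelled on the
kernel's BITS_TO_LONGS(); nothing below is taken from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bits in one unsigned long on the build platform (64 on ppc64). */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Userspace stand-in for BITS_TO_LONGS(): longs needed to hold nbits. */
static size_t sketch_bits_to_longs(size_t nbits)
{
	return (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}

/* Number of segment bits that an array of nlongs unsigned longs can track. */
static size_t sketch_bitmap_capacity(size_t nlongs)
{
	return nlongs * SKETCH_BITS_PER_LONG;
}
```

Assuming a 64-bit unsigned long, sketch_bitmap_capacity(8) covers 512 bits,
so a fixed [8] array presumably sizes each map for up to 512 segments.
Whether that bound matches total_pe on every PHB is what the question above
is asking.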


> -			unsigned int		*m32_segmap;
> -			unsigned int		*io_segmap;
> -			struct pnv_ioda_pe	*pe_array;
>

Why was this moved?


>   			/* IRQ chip */
>   			int			irq_chip_init;
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 06/21] powerpc/powernv: Create PEs dynamically
  2015-05-01  6:02   ` Gavin Shan
@ 2015-05-09 11:43     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 11:43 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> Currently, the PEs and their associated resources are assigned
> in ppc_md.pcibios_fixup(). The function is called for once after
> PCI probing and resources assignment are finished. Obviously, it's
> not hotplug friendly. The patch creates PEs dynamically by
> ppc_md.pcibios_setup_bridge(), which is called on the event during
> system bootup and PCI hotplug: updating PCI bridge's windows after
> resource assignment/reassignment are finished. For partial hotplug
> case, where not all PCI devices belonging to the PE are unplugged
> and plugged again, we just need unbinding/binding the affected
> PCI devices with the corresponding PE without creating new one.


Some PEs are already created dynamically (SRIOV). I'd suggest making the
subject more specific.


> Besides, it might require additional resources (e.g. M32) to the
> windows of the PCI bridge when unplugging current adapter, and
> insert a different adapter if there is one PCI slot, which is
> assumed behind root port, or the downstream bridge of the PCIE
> switch behind root port. The parent bridge of the newly plugged
> adapter would reject the request to add more resources, leading
> to hotplug failure. For the issue, the patch extends the windows
> of root port, or the upstream port of the PCIe switch behind root
> port to PHB's windows when ppc_md.pcibios_setup_bridge() is called.
>
> There is no upstream bridge for root bus, so we have to reserve
> PE#, which is next to the reserved PE# in advance and fixing the
> PE for root bus in ppc_md.pcibios_setup_bridge().
>
> The patch also changes the rule assigning PE#: PE# reserved for
> prefetchable 64-bits memory resource and SRIOV VFs starts from
> zero while PE# for dynamic allocations starts from ioda.total_pe
> reversely. It's because PE# for prefetchable 64-bits memory resource,
> which is usually allocated beginning with the PHB's aperatus and PE#

s/aperatus/apertures/?

Maybe it is just me, but it looks like the patch both moves existing bits
and adds this dynamic PE creation. Can it be separated into smaller
patches? It is really hard to track all the changes you are making here.




-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
  2015-05-01  6:02   ` Gavin Shan
@ 2015-05-09 12:43     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 12:43 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> The original code doesn't support releasing PEs dynamically, meaning
> that PE and the associated resources (IO, M32, M64 and DMA) can't
> be released when unplugging a PCI adapter from one hotpluggable slot.
>
> The patch takes object oriented methodology, introduces reference
> count to PE, which is initialized to 1 and increased with 1 when a
> new PCI device joins the PE. Once the last PCI device leaves the
> PE, the PE is going to be released together with its associated
> (IO, M32, M64, DMA) resources.


The commit log is too short for a non-trivial, non-cut-and-paste 30KB patch.


>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/pci-bridge.h     |   3 +
>   arch/powerpc/kernel/pci-hotplug.c         |   5 +
>   arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>   arch/powerpc/platforms/powernv/pci.h      |   4 +-
>   4 files changed, 432 insertions(+), 238 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index 5367eb3..a6ad4b1 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -31,6 +31,9 @@ struct pci_controller_ops {
>   	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>   	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>   	void		(*reset_secondary_bus)(struct pci_dev *dev);
> +
> +	/* Called when PCI device is released */
> +	void		(*release_device)(struct pci_dev *);
>   };
>
>   /*
> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
> index 7ed85a6..0040343 100644
> --- a/arch/powerpc/kernel/pci-hotplug.c
> +++ b/arch/powerpc/kernel/pci-hotplug.c
> @@ -29,6 +29,11 @@
>    */
>   void pcibios_release_device(struct pci_dev *dev)
>   {
> +	struct pci_controller *hose = pci_bus_to_host(dev->bus);
> +
> +	if (hose->controller_ops.release_device)
> +		hose->controller_ops.release_device(dev);
> +
>   	eeh_remove_device(dev);
>   }
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 910fb67..ef8c216 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -12,6 +12,8 @@
>   #undef DEBUG
>
>   #include <linux/kernel.h>
> +#include <linux/atomic.h>
> +#include <linux/kref.h>
>   #include <linux/pci.h>
>   #include <linux/crash_dump.h>
>   #include <linux/debugfs.h>
> @@ -47,6 +49,8 @@
>   /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>   #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>
> +static void pnv_ioda_release_pe(struct kref *kref);
> +
>   static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>   			    const char *fmt, ...)
>   {
> @@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>   		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>   }
>
> -static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
> +static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>   {
> -	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
> -		pr_warn("%s: Invalid PE %d on PHB#%x\n",
> -			__func__, pe_no, phb->hose->global_number);
> +	if (!pe)
> +		return;
> +
> +	kref_get(&pe->kref);
> +}
> +
> +static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
> +{
> +	unsigned int count;
> +
> +	if (!pe)
>   		return;
> +
> +	/*
> +	 * The count is initialized to 1 and increased with 1 when
> +	 * a new PCI device is bound with the PE. Once the last PCI
> +	 * device is leaving from the PE, the PE is going to be
> +	 * released.
> +	 */
> +	count = atomic_read(&pe->kref.refcount);
> +	if (count == 2)
> +		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
> +	else
> +		kref_put(&pe->kref, pnv_ioda_release_pe);


What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
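
To illustrate the window in question, here is a minimal userspace sketch
using C11 atomics rather than struct kref. It folds "read the count" and
"subtract" into one compare-and-exchange loop, so a concurrent get cannot
slip in between the two steps. The sketch_* names are hypothetical, and this
is only one possible way to close the window, not what the patch does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PE's reference count (not struct kref). */
struct sketch_ref {
	atomic_int count;	/* 1 base reference + 1 per bound device */
};

static void sketch_ref_init(struct sketch_ref *r)
{
	atomic_init(&r->count, 1);
}

static void sketch_ref_get(struct sketch_ref *r)
{
	atomic_fetch_add(&r->count, 1);
}

/*
 * Drop one reference. If the count is 2 (base + last device), drop both
 * references in the same atomic step. Returns true when the count reached
 * zero and the caller should release the object.
 */
static bool sketch_ref_put(struct sketch_ref *r)
{
	int old = atomic_load(&r->count);
	int next;

	do {
		assert(old >= 1);
		next = (old == 2) ? 0 : old - 1;
		/* On failure, old is refreshed with the current count; retry. */
	} while (!atomic_compare_exchange_weak(&r->count, &old, next));

	return next == 0;
}
```

Even with the decision made atomic, a get that arrives after the count has
already dropped to zero would still touch a released PE, so the callers
would presumably need external serialization or a get-unless-zero check.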


> +}
> +
> +static void pnv_pci_release_device(struct pci_dev *pdev)
> +{
> +	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> +	struct pnv_phb *phb = hose->private_data;
> +	struct pci_dn *pdn = pci_get_pdn(pdev);
> +	struct pnv_ioda_pe *pe;
> +
> +	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
> +		pe = &phb->ioda.pe_array[pdn->pe_number];
> +		pnv_ioda_pe_put(pe);
> +		pdn->pe_number = IODA_INVALID_PE;
>   	}
> +}
>
> -	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
> -		pr_warn("%s: PE %d was assigned on PHB#%x\n",
> -			__func__, pe_no, phb->hose->global_number);
> +static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
> +{
> +	struct pnv_phb *phb = pe->phb;
> +	int index, count;
> +	unsigned long tbl_addr, tbl_size;
> +
> +	/* No DMA capability for slave PEs */
> +	if (pe->flags & PNV_IODA_PE_SLAVE)
> +		return;
> +
> +	/* Bypass DMA window */
> +	if (phb->type == PNV_PHB_IODA2 &&
> +	    pe->tce_bypass_enabled &&
> +	    pe->tce32_table &&
> +	    pe->tce32_table->set_bypass)
> +		pe->tce32_table->set_bypass(pe->tce32_table, false);
> +
> +	/* 32-bits DMA window */
> +	count = pe->tce32_seg_end - pe->tce32_seg_start;
> +	tbl_addr = pe->tce32_table->it_base;
> +	if (!count)
>   		return;
> +
> +	/* Free IOMMU table */
> +	iommu_free_table(pe->tce32_table,
> +			 of_node_full_name(phb->hose->dn));
> +
> +	/* Deconfigure TCE table */
> +	switch (phb->type) {
> +	case PNV_PHB_IODA1:
> +		for (index = 0; index < count; index++)
> +			opal_pci_map_pe_dma_window(phb->opal_id,
> +						   pe->pe_number,
> +						   pe->tce32_seg_start + index,
> +						   1,
> +						   __pa(tbl_addr) +
> +						   index * TCE32_TABLE_SIZE,
> +						   0,
> +						   0x1000);
> +		bitmap_clear(phb->ioda.tce32_segmap,
> +			     pe->tce32_seg_start,
> +			     count);
> +		tbl_size = TCE32_TABLE_SIZE * count;
> +		break;
> +	case PNV_PHB_IODA2:
> +		opal_pci_map_pe_dma_window(phb->opal_id,
> +					   pe->pe_number,
> +					   pe->pe_number << 1,
> +					   1,
> +					   __pa(tbl_addr),
> +					   0,
> +					   0x1000);
> +		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
> +		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
> +		break;
> +	default:
> +		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
> +		return;
> +	}
> +
> +	/* Free memory of IOMMU table */
> +	free_pages(tbl_addr, get_order(tbl_size));


You just programmed the table address into the TVT and then you release the
pages. That does not seem right: it will leave garbage in the TVT. Also, I
am adding helpers to alloc/free TCE pages in the DDW patchset; you could
reuse bits from there (I'll post v10 soon, you'll be in copy and you'll
have to review that ;) ).


> +	pe->tce32_table = NULL;
> +	pe->tce32_seg_start = 0;
> +	pe->tce32_seg_end = 0;
> +}
> +
> +static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
> +{
> +	struct pnv_phb *phb = pe->phb;
> +	unsigned long *segmap = NULL, *pe_segmap = NULL;
> +	int i;
> +	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
> +				     OPAL_M32_WINDOW_TYPE,
> +				     OPAL_M64_WINDOW_TYPE };
> +
> +	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
> +		switch (win_type[win]) {
> +		case OPAL_IO_WINDOW_TYPE:
> +			segmap = phb->ioda.io_segmap;
> +			pe_segmap = pe->io_segmap;
> +			break;
> +		case OPAL_M32_WINDOW_TYPE:
> +			segmap = phb->ioda.m32_segmap;
> +			pe_segmap = pe->m32_segmap;
> +			break;
> +		case OPAL_M64_WINDOW_TYPE:
> +			segmap = phb->ioda.m64_segmap;
> +			pe_segmap = pe->m64_segmap;
> +			break;
> +		}
> +		i = -1;
> +		while ((i = find_next_bit(pe_segmap,
> +			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
> +			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
> +			    win_type[win] == OPAL_M32_WINDOW_TYPE)
> +				opal_pci_map_pe_mmio_window(phb->opal_id,
> +						phb->ioda.reserved_pe,
> +						win_type[win], 0, i);
> +			else if (phb->type == PNV_PHB_IODA1)
> +				opal_pci_map_pe_mmio_window(phb->opal_id,
> +						phb->ioda.reserved_pe,
> +						win_type[win],
> +						i / 8, i % 8);

The function is called "release" but it programs something that looks like
reasonable values. Is that correct?



> +
> +			clear_bit(i, pe_segmap);
> +			clear_bit(i, segmap);
> +		}
> +	}
> +}
> +
> +static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
> +				  struct pnv_ioda_pe *parent,
> +				  struct pnv_ioda_pe *child,
> +				  bool is_add)
> +{
> +	const char *desc = is_add ? "adding" : "removing";
> +	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
> +			      OPAL_REMOVE_PE_FROM_DOMAIN;
> +	struct pnv_ioda_pe *slave;
> +	long rc;
> +
> +	/* Parent PE affects child PE */
> +	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
> +				child->pe_number, op);
> +	if (rc != OPAL_SUCCESS) {
> +		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
> +			rc, desc);
> +		return -ENXIO;
> +	}
> +
> +	if (!(child->flags & PNV_IODA_PE_MASTER))
> +		return 0;
> +
> +	/* Compound case: parent PE affects slave PEs */
> +	list_for_each_entry(slave, &child->slaves, list) {
> +		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
> +					slave->pe_number, op);
> +		if (rc != OPAL_SUCCESS) {
> +			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
> +				rc, desc);
> +			return -ENXIO;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
> +{
> +	struct pnv_phb *phb = pe->phb;
> +	struct pnv_ioda_pe *slave;
> +	struct pci_dev *pdev = NULL;
> +	int ret;
> +
> +	/*
> +	 * Clear PE frozen state. If it's master PE, we need
> +	 * clear slave PE frozen state as well.
> +	 */
> +	opal_pci_eeh_freeze_clear(phb->opal_id,
> +				  pe->pe_number,
> +				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> +	if (pe->flags & PNV_IODA_PE_MASTER) {
> +		list_for_each_entry(slave, &pe->slaves, list) {
> +			opal_pci_eeh_freeze_clear(phb->opal_id,
> +						  slave->pe_number,
> +						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> +		}
> +	}
> +
> +	/*
> +	 * Associate PE in PELT. We need add the PE into the
> +	 * corresponding PELT-V as well. Otherwise, the error
> +	 * originated from the PE might contribute to other
> +	 * PEs.
> +	 */
> +	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
> +	if (ret)
> +		return ret;
> +
> +	/* For compound PEs, any one affects all of them */
> +	if (pe->flags & PNV_IODA_PE_MASTER) {
> +		list_for_each_entry(slave, &pe->slaves, list) {
> +			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
> +		pdev = pe->pbus->self;
> +	else if (pe->flags & PNV_IODA_PE_DEV)
> +		pdev = pe->pdev->bus->self;
> +#ifdef CONFIG_PCI_IOV
> +	else if (pe->flags & PNV_IODA_PE_VF)
> +		pdev = pe->parent_dev->bus->self;
> +#endif /* CONFIG_PCI_IOV */
> +
> +	while (pdev) {
> +		struct pci_dn *pdn = pci_get_pdn(pdev);
> +		struct pnv_ioda_pe *parent;
> +
> +		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
> +			parent = &phb->ioda.pe_array[pdn->pe_number];
> +			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
> +			if (ret)
> +				return ret;
> +		}
> +
> +		pdev = pdev->bus->self;
> +	}
> +
> +	return 0;
> +}
> +
> +static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)


It used to be under #ifdef CONFIG_PCI_IOV, now it is not. Just moving this
function to a different place deserves a separate patch with a comment
explaining why ("it is now going to be used for the non-SRIOV case too",
maybe?).



> +{
> +	struct pnv_phb *phb = pe->phb;
> +	struct pci_dev *parent;
> +	uint8_t bcomp, dcomp, fcomp;
> +	long rid_end, rid;
> +	int64_t rc;
> +
> +	/* Tear down MVE */
> +	if (phb->type == PNV_PHB_IODA1 &&
> +	    pe->mve_number != -1) {
> +		rc = opal_pci_set_mve(phb->opal_id,
> +				      pe->mve_number,
> +				      phb->ioda.reserved_pe);
> +		if (rc != OPAL_SUCCESS)
> +			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
> +				rc, pe->mve_number);
> +		rc = opal_pci_set_mve_enable(phb->opal_id,
> +					     pe->mve_number,
> +					     OPAL_DISABLE_MVE);
> +		if (rc != OPAL_SUCCESS)
> +			pe_warn(pe, "Error %lld disabling MVE#%d\n",
> +				rc, pe->mve_number);
> +		pe->mve_number = -1;
> +	}
> +
> +	/* Unmapping PELTV */
> +	pnv_ioda_set_peltv(pe, false);
> +
> +	/* To unmap PELTM */
> +	if (pe->pbus) {
> +		int count;
> +
> +		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
> +		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
> +		parent = pe->pbus->self;
> +		if (pe->flags & PNV_IODA_PE_BUS_ALL)
> +			count = pe->pbus->busn_res.end -
> +				pe->pbus->busn_res.start + 1;
> +		else
> +			count = 1;
> +
> +		switch(count) {
> +		case  1: bcomp = OpalPciBusAll;   break;
> +		case  2: bcomp = OpalPciBus7Bits; break;
> +		case  4: bcomp = OpalPciBus6Bits; break;
> +		case  8: bcomp = OpalPciBus5Bits; break;
> +		case 16: bcomp = OpalPciBus4Bits; break;
> +		case 32: bcomp = OpalPciBus3Bits; break;
> +		default:
> +			/* Fail back to case of one bus */
> +			pe_warn(pe, "Cannot support %d buses\n", count);
> +			bcomp = OpalPciBusAll;
> +		}
> +		rid_end = pe->rid + (count << 8);
> +	} else {
> +#ifdef CONFIG_PCI_IOV
> +		if (pe->flags & PNV_IODA_PE_VF)
> +			parent = pe->parent_dev;
> +		else
> +#endif
> +			parent = pe->pdev->bus->self;
> +		bcomp = OpalPciBusAll;
> +		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
> +		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
> +		rid_end = pe->rid + 1;
> +	}
> +
> +	/* Clear RID mapping */
> +	for (rid = pe->rid; rid < rid_end; rid++)
> +		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
> +
> +	/* Unmapping PELTM */
> +	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
> +			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
> +	if (rc)
> +		pe_warn(pe, "Error %ld unmapping PELTM\n", rc);
> +}
> +
> +static void pnv_ioda_release_pe(struct kref *kref)
> +{
> +	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
> +	struct pnv_ioda_pe *tmp, *slave;
> +	struct pnv_phb *phb = pe->phb;
> +
> +	pnv_ioda_release_pe_dma(pe);
> +	pnv_ioda_release_pe_seg(pe);
> +	pnv_ioda_deconfigure_pe(pe);
> +
> +	/* Release slave PEs for compound PE */
> +	if (pe->flags & PNV_IODA_PE_MASTER) {
> +		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
> +			pnv_ioda_pe_put(slave);
> +	}
> +
> +	/* Remove the PE from various list. We need remove slave
> +	 * PE from master's list.
> +	 */
> +	list_del(&pe->dma_link);
> +	list_del(&pe->list);
> +
> +	/* Free PE number */
> +	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
> +}
> +
> +static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
> +					    int pe_no)
> +{
> +	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
> +
> +	kref_init(&pe->kref);
> +	pe->phb = phb;
> +	pe->pe_number = pe_no;
> +	INIT_LIST_HEAD(&pe->dma_link);
> +	INIT_LIST_HEAD(&pe->list);
> +
> +	return pe;
> +}
> +
> +static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
> +					       int pe_no)
> +{
> +	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
> +		pr_warn("%s: Invalid PE %d on PHB#%x\n",
> +			__func__, pe_no, phb->hose->global_number);
> +		return NULL;
>   	}
>
> -	phb->ioda.pe_array[pe_no].phb = phb;
> -	phb->ioda.pe_array[pe_no].pe_number = pe_no;
> +	/*
> +	 * Same PE might be reserved for multiple times, which
> +	 * is out of problem actually.
> +	 */
> +	set_bit(pe_no, phb->ioda.pe_alloc);
> +	return pnv_ioda_init_pe(phb, pe_no);
>   }
>
> -static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
> +static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
>   {
>   	unsigned long pe_no;
>   	unsigned long limit = phb->ioda.total_pe - 1;
> @@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>   			break;
>
>   		if (--limit >= phb->ioda.total_pe)
> -			return IODA_INVALID_PE;
> +			return NULL;
>   	} while(1);
>
> -	phb->ioda.pe_array[pe_no].phb = phb;
> -	phb->ioda.pe_array[pe_no].pe_number = pe_no;
> -	return pe_no;
> -}
> -
> -static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
> -{
> -	WARN_ON(phb->ioda.pe_array[pe].pdev);
> -
> -	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
> -	clear_bit(pe, phb->ioda.pe_alloc);
> +	return pnv_ioda_init_pe(phb, pe_no);
>   }
>
>   static int pnv_ioda1_init_m64(struct pnv_phb *phb)
> @@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
>   	}
>   }
>
> -static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
> -				struct pci_bus *bus, int all)
> +static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
> +						struct pci_bus *bus,
> +						int all)


Mechanical changes like this could easily go into a separate patch.


>   {
>   	resource_size_t segsz = phb->ioda.m64_segsize;
>   	struct pci_dev *pdev;
> @@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>   	int i;
>
>   	if (!pnv_ioda_need_m64_pe(phb, bus))
> -		return IODA_INVALID_PE;
> +		return NULL;
>
>           /* Allocate bitmap */
>   	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>   	pe_bitsmap = kzalloc(size, GFP_KERNEL);
>   	if (!pe_bitsmap) {
>   		pr_warn("%s: Out of memory !\n", __func__);
> -		return IODA_INVALID_PE;
> +		return NULL;
>   	}
>
>   	/* The bridge's M64 window might be extended to PHB's M64
> @@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>   	/* No M64 window found ? */
>   	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
>   		kfree(pe_bitsmap);
> -		return IODA_INVALID_PE;
> +		return NULL;
>   	}
>
>   	/* Figure out the master PE and put all slave PEs
> @@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>   	}
>
>   	kfree(pe_bitsmap);
> -	return master_pe->pe_number;
> +	return master_pe;
>   }
>
>   static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
> @@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>    * but in the meantime, we need to protect them to avoid warnings
>    */
>   #ifdef CONFIG_PCI_MSI
> -static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
> +static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
>   {
>   	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>   	struct pnv_phb *phb = hose->private_data;
> @@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>   }
>   #endif /* CONFIG_PCI_MSI */
>
> -static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
> -				  struct pnv_ioda_pe *parent,
> -				  struct pnv_ioda_pe *child,
> -				  bool is_add)
> -{
> -	const char *desc = is_add ? "adding" : "removing";
> -	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
> -			      OPAL_REMOVE_PE_FROM_DOMAIN;
> -	struct pnv_ioda_pe *slave;
> -	long rc;
> -
> -	/* Parent PE affects child PE */
> -	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
> -				child->pe_number, op);
> -	if (rc != OPAL_SUCCESS) {
> -		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
> -			rc, desc);
> -		return -ENXIO;
> -	}
> -
> -	if (!(child->flags & PNV_IODA_PE_MASTER))
> -		return 0;
> -
> -	/* Compound case: parent PE affects slave PEs */
> -	list_for_each_entry(slave, &child->slaves, list) {
> -		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
> -					slave->pe_number, op);
> -		if (rc != OPAL_SUCCESS) {
> -			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
> -				rc, desc);
> -			return -ENXIO;
> -		}
> -	}
> -
> -	return 0;
> -}
> -
> -static int pnv_ioda_set_peltv(struct pnv_phb *phb,
> -			      struct pnv_ioda_pe *pe,
> -			      bool is_add)
> -{
> -	struct pnv_ioda_pe *slave;
> -	struct pci_dev *pdev = NULL;
> -	int ret;
> -
> -	/*
> -	 * Clear PE frozen state. If it's master PE, we need
> -	 * clear slave PE frozen state as well.
> -	 */
> -	if (is_add) {
> -		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
> -					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> -		if (pe->flags & PNV_IODA_PE_MASTER) {
> -			list_for_each_entry(slave, &pe->slaves, list)
> -				opal_pci_eeh_freeze_clear(phb->opal_id,
> -							  slave->pe_number,
> -							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> -		}
> -	}
> -
> -	/*
> -	 * Associate PE in PELT. We need add the PE into the
> -	 * corresponding PELT-V as well. Otherwise, the error
> -	 * originated from the PE might contribute to other
> -	 * PEs.
> -	 */
> -	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
> -	if (ret)
> -		return ret;
> -
> -	/* For compound PEs, any one affects all of them */
> -	if (pe->flags & PNV_IODA_PE_MASTER) {
> -		list_for_each_entry(slave, &pe->slaves, list) {
> -			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
> -			if (ret)
> -				return ret;
> -		}
> -	}
> -
> -	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
> -		pdev = pe->pbus->self;
> -	else if (pe->flags & PNV_IODA_PE_DEV)
> -		pdev = pe->pdev->bus->self;
> -#ifdef CONFIG_PCI_IOV
> -	else if (pe->flags & PNV_IODA_PE_VF)
> -		pdev = pe->parent_dev->bus->self;
> -#endif /* CONFIG_PCI_IOV */
> -	while (pdev) {
> -		struct pci_dn *pdn = pci_get_pdn(pdev);
> -		struct pnv_ioda_pe *parent;
> -
> -		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
> -			parent = &phb->ioda.pe_array[pdn->pe_number];
> -			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
> -			if (ret)
> -				return ret;
> -		}
> -
> -		pdev = pdev->bus->self;
> -	}
> -
> -	return 0;
> -}
> -
> -#ifdef CONFIG_PCI_IOV
> -static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
> -{
> -	struct pci_dev *parent;
> -	uint8_t bcomp, dcomp, fcomp;
> -	int64_t rc;
> -	long rid_end, rid;
> -
> -	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
> -	if (pe->pbus) {
> -		int count;
> -
> -		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
> -		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
> -		parent = pe->pbus->self;
> -		if (pe->flags & PNV_IODA_PE_BUS_ALL)
> -			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
> -		else
> -			count = 1;
> -
> -		switch(count) {
> -		case  1: bcomp = OpalPciBusAll;         break;
> -		case  2: bcomp = OpalPciBus7Bits;       break;
> -		case  4: bcomp = OpalPciBus6Bits;       break;
> -		case  8: bcomp = OpalPciBus5Bits;       break;
> -		case 16: bcomp = OpalPciBus4Bits;       break;
> -		case 32: bcomp = OpalPciBus3Bits;       break;
> -		default:
> -			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
> -			        count);
> -			/* Do an exact match only */
> -			bcomp = OpalPciBusAll;
> -		}
> -		rid_end = pe->rid + (count << 8);
> -	} else {
> -		if (pe->flags & PNV_IODA_PE_VF)
> -			parent = pe->parent_dev;
> -		else
> -			parent = pe->pdev->bus->self;
> -		bcomp = OpalPciBusAll;
> -		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
> -		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
> -		rid_end = pe->rid + 1;
> -	}
> -
> -	/* Clear the reverse map */
> -	for (rid = pe->rid; rid < rid_end; rid++)
> -		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
> -
> -	/* Release from all parents PELT-V */
> -	while (parent) {
> -		struct pci_dn *pdn = pci_get_pdn(parent);
> -		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
> -			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
> -						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
> -			/* XXX What to do in case of error ? */


Not much :) Free the associated memory and mark it "dead" so it won't be 
used again until reboot. In what circumstances can this 
opal_pci_set_peltv() fail at all?


> -		}
> -		parent = parent->bus->self;
> -	}
> -
> -	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
> -				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> -
> -	/* Disassociate PE in PELT */
> -	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
> -				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
> -	if (rc)
> -		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
> -	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
> -			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
> -	if (rc)
> -		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
> -
> -	pe->pbus = NULL;
> -	pe->pdev = NULL;
> -	pe->parent_dev = NULL;
> -
> -	return 0;
> -}
> -#endif /* CONFIG_PCI_IOV */
> -
>   static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>   {
>   	struct pci_dev *parent;
> @@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>   	}
>
>   	/* Configure PELTV */
> -	pnv_ioda_set_peltv(phb, pe, true);
> +	pnv_ioda_set_peltv(pe, true);
>
>   	/* Setup reverse map */
>   	for (rid = pe->rid; rid < rid_end; rid++)
> @@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>   		if (pdn->pe_number != IODA_INVALID_PE)
>   			continue;
>
> +		/* Increase reference count of the parent PE */

When you comment like this, I read the comment as applying to the whole 
next chunk up to the first empty line, i.e. to all 5 lines below, which is 
not the case. I'd remove the comment because 1) the "pe_get" in 
pnv_ioda_pe_get() already suggests incrementing the reference counter and 
2) "pe" is always the parent in this function. I do not insist though.


> +		pnv_ioda_pe_get(pe);
>   		pdn->pe_number = pe->pe_number;
>   		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
>   		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
> @@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>   {
>   	struct pci_controller *hose = pci_bus_to_host(bus);
>   	struct pnv_phb *phb = hose->private_data;
> -	struct pnv_ioda_pe *pe;
> +	struct pnv_ioda_pe *pe = NULL;
>   	int pe_num = IODA_INVALID_PE;
>
>   	/* For partial hotplug case, the PE instance hasn't been destroyed
> @@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>   	}
>
>   	/* PE number for root bus should have been reserved */
> -	if (pci_is_root_bus(bus))
> -		pe_num = phb->ioda.root_pe_no;
> +	if (pci_is_root_bus(bus) &&
> +	    phb->ioda.root_pe_no != IODA_INVALID_PE)
> +		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
>
>   	/* Check if PE is determined by M64 */
> -	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
> -		pe_num = phb->pick_m64_pe(phb, bus, all);
> +	if (!pe && phb->pick_m64_pe)
> +		pe = phb->pick_m64_pe(phb, bus, all);
>
>   	/* The PE number isn't pinned by M64 */
> -	if (pe_num == IODA_INVALID_PE)
> -		pe_num = pnv_ioda_alloc_pe(phb);
> +	if (!pe)
> +		pe = pnv_ioda_alloc_pe(phb);
>
> -	if (pe_num == IODA_INVALID_PE) {
> -		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
> +	if (!pe) {
> +		pr_warn("%s: No enough PE# available for PCI bus %04x:%02x\n",
>   			__func__, pci_domain_nr(bus), bus->number);
>   		return NULL;
>   	}
>
> -	pe = &phb->ioda.pe_array[pe_num];
>   	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
>   	pe->pbus = bus;
>   	pe->pdev = NULL;
> @@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>
>   	if (pnv_ioda_configure_pe(phb, pe)) {
>   		/* XXX What do we do here ? */
> -		if (pe_num)
> -			pnv_ioda_free_pe(phb, pe_num);
> -		pe->pbus = NULL;
> +		pnv_ioda_pe_put(pe);
>   		return NULL;
>   	}
>
>   	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
> -			GFP_KERNEL, hose->node);
> +				       GFP_KERNEL, hose->node);

Seems like a whitespace-only change - if you really want this change (which 
I hate - it makes the code look inaccurate to my taste, but it seems I am in 
the minority here :) ), please put it in a separate patch.


>   	pe->tce32_table->data = pe;
>
>   	/* Associate it with all child devices */
> @@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>   		list_del(&pe->list);
>   		mutex_unlock(&phb->ioda.pe_list_mutex);
>
> -		pnv_ioda_deconfigure_pe(phb, pe);
> +		pnv_ioda_deconfigure_pe(pe);


Is this change necessary to get "Release PEs dynamically" working? Maybe 
move it to the mechanical changes patch?


>
> -		pnv_ioda_free_pe(phb, pe->pe_number);
> +		pnv_ioda_pe_put(pe);
>   	}
>   }
>
> @@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>
>   		if (pnv_ioda_configure_pe(phb, pe)) {
>   			/* XXX What do we do here ? */
> -			if (pe_num)
> -				pnv_ioda_free_pe(phb, pe_num);
> -			pe->pdev = NULL;
> +			pnv_ioda_pe_put(pe);
>   			continue;
>   		}
>
> @@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
>   	struct pnv_ioda_pe *pe;
>   	int rc;
>
> -	pe = pnv_ioda_get_pe(dev);
> +	pe = pnv_ioda_pci_dev_to_pe(dev);


And this change could go separately. It is not clear how it helps to 
"Release PEs dynamically".


>   	if (!pe)
>   		return -ENODEV;
>
> @@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>   	struct pnv_ioda_pe *pe;
>   	int rc;
>
> -	if (!(pe = pnv_ioda_get_pe(dev)))
> +	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
>   		return -ENODEV;
>
>   	/* Assign XIVE to PE */
> @@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
>   				  unsigned int hwirq, unsigned int virq,
>   				  unsigned int is_64, struct msi_msg *msg)
>   {
> -	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
> +	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
>   	unsigned int xive_num = hwirq - phb->msi_base;
>   	__be32 data;
>   	int rc;
> @@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>   	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
>   	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
>   	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
> +	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
>   	hose->controller_ops = pnv_pci_controller_ops;
>
>   #ifdef CONFIG_PCI_IOV
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 1bea3a8..8b10f01 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -28,6 +28,7 @@ enum pnv_phb_model {
>   /* Data associated with a PE, including IOMMU tracking etc.. */
>   struct pnv_phb;
>   struct pnv_ioda_pe {
> +	struct kref		kref;
>   	unsigned long		flags;
>   	struct pnv_phb		*phb;
>
> @@ -120,7 +121,8 @@ struct pnv_phb {
>   	void (*shutdown)(struct pnv_phb *phb);
>   	int (*init_m64)(struct pnv_phb *phb);
>   	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
> -	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
> +	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
> +					   struct pci_bus *bus, int all);
>   	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>   	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
>   	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
>


-- 
Alexey


* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
@ 2015-05-09 12:43     ` Alexey Kardashevskiy
From: Alexey Kardashevskiy @ 2015-05-09 12:43 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: bhelgaas, linux-pci

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> The original code doesn't support releasing PEs dynamically, meaning
> that PE and the associated resources (IO, M32, M64 and DMA) can't
> be released when unplugging a PCI adapter from one hotpluggable slot.
>
> The patch takes object oriented methodology, introducs reference
> count to PE, which is initialized to 1 and increased with 1 when a
> new PCI device joins the PE. Once the last PCI device leaves the
> PE, the PE is going to be release together with its associated
> (IO, M32, M64, DMA) resources.


Too short a commit log for a non-trivial, non-cut-n-paste 30KB patch...


>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/pci-bridge.h     |   3 +
>   arch/powerpc/kernel/pci-hotplug.c         |   5 +
>   arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>   arch/powerpc/platforms/powernv/pci.h      |   4 +-
>   4 files changed, 432 insertions(+), 238 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index 5367eb3..a6ad4b1 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -31,6 +31,9 @@ struct pci_controller_ops {
>   	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>   	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>   	void		(*reset_secondary_bus)(struct pci_dev *dev);
> +
> +	/* Called when PCI device is released */
> +	void		(*release_device)(struct pci_dev *);
>   };
>
>   /*
> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
> index 7ed85a6..0040343 100644
> --- a/arch/powerpc/kernel/pci-hotplug.c
> +++ b/arch/powerpc/kernel/pci-hotplug.c
> @@ -29,6 +29,11 @@
>    */
>   void pcibios_release_device(struct pci_dev *dev)
>   {
> +	struct pci_controller *hose = pci_bus_to_host(dev->bus);
> +
> +	if (hose->controller_ops.release_device)
> +		hose->controller_ops.release_device(dev);
> +
>   	eeh_remove_device(dev);
>   }
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 910fb67..ef8c216 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -12,6 +12,8 @@
>   #undef DEBUG
>
>   #include <linux/kernel.h>
> +#include <linux/atomic.h>
> +#include <linux/kref.h>
>   #include <linux/pci.h>
>   #include <linux/crash_dump.h>
>   #include <linux/debugfs.h>
> @@ -47,6 +49,8 @@
>   /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>   #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>
> +static void pnv_ioda_release_pe(struct kref *kref);
> +
>   static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>   			    const char *fmt, ...)
>   {
> @@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>   		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>   }
>
> -static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
> +static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>   {
> -	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
> -		pr_warn("%s: Invalid PE %d on PHB#%x\n",
> -			__func__, pe_no, phb->hose->global_number);
> +	if (!pe)
> +		return;
> +
> +	kref_get(&pe->kref);
> +}
> +
> +static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
> +{
> +	unsigned int count;
> +
> +	if (!pe)
>   		return;
> +
> +	/*
> +	 * The count is initialized to 1 and increased with 1 when
> +	 * a new PCI device is bound with the PE. Once the last PCI
> +	 * device is leaving from the PE, the PE is going to be
> +	 * released.
> +	 */
> +	count = atomic_read(&pe->kref.refcount);
> +	if (count == 2)
> +		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
> +	else
> +		kref_put(&pe->kref, pnv_ioda_release_pe);


What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
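
To make the concern concrete, here is a minimal user-space sketch, not 
kernel code. It assumes C11 atomics as a stand-in for kref, and every 
model_* name is hypothetical. It replays, in a single thread, one 
interleaving where a "get" lands between the read and the subtract, and 
reports how many devices are still bound when the release fires.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PE: count is 1 (base) + bound devices */
struct model_pe {
	atomic_int refcount;
	bool released;
};

static void model_pe_init(struct model_pe *pe)
{
	atomic_init(&pe->refcount, 1);
	pe->released = false;
}

static void model_pe_get(struct model_pe *pe)
{
	atomic_fetch_add(&pe->refcount, 1);
}

/*
 * Second half of the quoted put(): the decision between "drop 2" and
 * "drop 1" is taken on a snapshot that may already be stale.
 */
static void model_pe_put_with_snapshot(struct model_pe *pe, int snapshot)
{
	int dec = (snapshot == 2) ? 2 : 1;

	/* atomic_fetch_sub() returns the old value; old == dec means 0 now */
	if (atomic_fetch_sub(&pe->refcount, dec) == dec)
		pe->released = true;
}

/*
 * Replay one bad interleaving. Returns the number of devices still
 * bound at the moment of release, or -1 if no release happened.
 */
static int model_replay_race(void)
{
	struct model_pe pe;
	int bound = 0;
	int snapshot;

	model_pe_init(&pe);

	model_pe_get(&pe);				/* A joins: count 2 */
	bound++;
	snapshot = atomic_load(&pe.refcount);		/* A leaving reads 2 */
	model_pe_get(&pe);				/* B joins in the window: 3 */
	bound++;
	model_pe_put_with_snapshot(&pe, snapshot);	/* A subtracts 2: count 1 */
	bound--;

	model_pe_get(&pe);				/* C joins: count 2 */
	bound++;
	snapshot = atomic_load(&pe.refcount);		/* B leaving reads 2 */
	model_pe_put_with_snapshot(&pe, snapshot);	/* B subtracts 2: count 0 */
	bound--;

	return pe.released ? bound : -1;
}
```

In this replay the release fires while one device (C) is still bound, 
because A's stale snapshot already consumed the base reference. The sketch 
only models the counting; whether the real callers can interleave this way 
depends on locking that is not visible in this hunk.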


> +}
> +
> +static void pnv_pci_release_device(struct pci_dev *pdev)
> +{
> +	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> +	struct pnv_phb *phb = hose->private_data;
> +	struct pci_dn *pdn = pci_get_pdn(pdev);
> +	struct pnv_ioda_pe *pe;
> +
> +	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
> +		pe = &phb->ioda.pe_array[pdn->pe_number];
> +		pnv_ioda_pe_put(pe);
> +		pdn->pe_number = IODA_INVALID_PE;
>   	}
> +}
>
> -	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
> -		pr_warn("%s: PE %d was assigned on PHB#%x\n",
> -			__func__, pe_no, phb->hose->global_number);
> +static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
> +{
> +	struct pnv_phb *phb = pe->phb;
> +	int index, count;
> +	unsigned long tbl_addr, tbl_size;
> +
> +	/* No DMA capability for slave PEs */
> +	if (pe->flags & PNV_IODA_PE_SLAVE)
> +		return;
> +
> +	/* Bypass DMA window */
> +	if (phb->type == PNV_PHB_IODA2 &&
> +	    pe->tce_bypass_enabled &&
> +	    pe->tce32_table &&
> +	    pe->tce32_table->set_bypass)
> +		pe->tce32_table->set_bypass(pe->tce32_table, false);
> +
> +	/* 32-bits DMA window */
> +	count = pe->tce32_seg_end - pe->tce32_seg_start;
> +	tbl_addr = pe->tce32_table->it_base;
> +	if (!count)
>   		return;
> +
> +	/* Free IOMMU table */
> +	iommu_free_table(pe->tce32_table,
> +			 of_node_full_name(phb->hose->dn));
> +
> +	/* Deconfigure TCE table */
> +	switch (phb->type) {
> +	case PNV_PHB_IODA1:
> +		for (index = 0; index < count; index++)
> +			opal_pci_map_pe_dma_window(phb->opal_id,
> +						   pe->pe_number,
> +						   pe->tce32_seg_start + index,
> +						   1,
> +						   __pa(tbl_addr) +
> +						   index * TCE32_TABLE_SIZE,
> +						   0,
> +						   0x1000);
> +		bitmap_clear(phb->ioda.tce32_segmap,
> +			     pe->tce32_seg_start,
> +			     count);
> +		tbl_size = TCE32_TABLE_SIZE * count;
> +		break;
> +	case PNV_PHB_IODA2:
> +		opal_pci_map_pe_dma_window(phb->opal_id,
> +					   pe->pe_number,
> +					   pe->pe_number << 1,
> +					   1,
> +					   __pa(tbl_addr),
> +					   0,
> +					   0x1000);
> +		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
> +		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
> +		break;
> +	default:
> +		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
> +		return;
> +	}
> +
> +	/* Free memory of IOMMU table */
> +	free_pages(tbl_addr, get_order(tbl_size));


You just programmed the table address into the TVT and then you release the 
pages. That does not seem right; it will leave garbage in the TVT. Also, I 
am adding helpers to alloc/free TCE pages in the DDW patchset, so you could 
reuse bits from there (I'll post v10 soon, you'll be in copy and you'll 
have to review that ;) ).
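
The ordering I would expect is "invalidate the entry, then free the 
memory". A minimal user-space sketch of that ordering follows; it assumes 
a plain array of pointers as a stand-in for the TVT, all model_* names are 
hypothetical, and it says nothing about the actual OPAL call arguments.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_TVT_ENTRIES 4

/* Hypothetical stand-in for a hardware table holding TCE table addresses */
struct model_tvt {
	void *entry[MODEL_TVT_ENTRIES];
};

/* Install a freshly allocated table; returns 0 on success, -1 on failure */
static int model_tvt_map(struct model_tvt *tvt, int idx, size_t size)
{
	void *tbl;

	if (idx < 0 || idx >= MODEL_TVT_ENTRIES)
		return -1;

	tbl = calloc(1, size);
	if (!tbl)
		return -1;

	tvt->entry[idx] = tbl;
	return 0;
}

/* Tear down in the safe order: clear the entry, then free the memory */
static void model_tvt_unmap_and_free(struct model_tvt *tvt, int idx)
{
	void *tbl;

	if (idx < 0 || idx >= MODEL_TVT_ENTRIES)
		return;

	tbl = tvt->entry[idx];
	tvt->entry[idx] = NULL;	/* nothing points at the table any more */
	free(tbl);		/* free(NULL) is a no-op, so this is safe */
}
```

With this order no entry ever refers to freed memory, which is the 
property the quoted code appears to lose by programming tbl_addr and then 
calling free_pages().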


> +	pe->tce32_table = NULL;
> +	pe->tce32_seg_start = 0;
> +	pe->tce32_seg_end = 0;
> +}
> +
> +static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
> +{
> +	struct pnv_phb *phb = pe->phb;
> +	unsigned long *segmap = NULL, *pe_segmap = NULL;
> +	int i;
> +	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
> +				     OPAL_M32_WINDOW_TYPE,
> +				     OPAL_M64_WINDOW_TYPE };
> +
> +	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
> +		switch (win_type[win]) {
> +		case OPAL_IO_WINDOW_TYPE:
> +			segmap = phb->ioda.io_segmap;
> +			pe_segmap = pe->io_segmap;
> +			break;
> +		case OPAL_M32_WINDOW_TYPE:
> +			segmap = phb->ioda.m32_segmap;
> +			pe_segmap = pe->m32_segmap;
> +			break;
> +		case OPAL_M64_WINDOW_TYPE:
> +			segmap = phb->ioda.m64_segmap;
> +			pe_segmap = pe->m64_segmap;
> +			break;
> +		}
> +		i = -1;
> +		while ((i = find_next_bit(pe_segmap,
> +			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
> +			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
> +			    win_type[win] == OPAL_M32_WINDOW_TYPE)
> +				opal_pci_map_pe_mmio_window(phb->opal_id,
> +						phb->ioda.reserved_pe,
> +						win_type[win], 0, i);
> +			else if (phb->type == PNV_PHB_IODA1)
> +				opal_pci_map_pe_mmio_window(phb->opal_id,
> +						phb->ioda.reserved_pe,
> +						win_type[win],
> +						i / 8, i % 8);

The function is called "release" but it programs what look like reasonable 
values; is that correct?



> +
> +			clear_bit(i, pe_segmap);
> +			clear_bit(i, segmap);
> +		}
> +	}
> +}
> +
> +static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
> +				  struct pnv_ioda_pe *parent,
> +				  struct pnv_ioda_pe *child,
> +				  bool is_add)
> +{
> +	const char *desc = is_add ? "adding" : "removing";
> +	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
> +			      OPAL_REMOVE_PE_FROM_DOMAIN;
> +	struct pnv_ioda_pe *slave;
> +	long rc;
> +
> +	/* Parent PE affects child PE */
> +	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
> +				child->pe_number, op);
> +	if (rc != OPAL_SUCCESS) {
> +		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
> +			rc, desc);
> +		return -ENXIO;
> +	}
> +
> +	if (!(child->flags & PNV_IODA_PE_MASTER))
> +		return 0;
> +
> +	/* Compound case: parent PE affects slave PEs */
> +	list_for_each_entry(slave, &child->slaves, list) {
> +		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
> +					slave->pe_number, op);
> +		if (rc != OPAL_SUCCESS) {
> +			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
> +				rc, desc);
> +			return -ENXIO;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
> +{
> +	struct pnv_phb *phb = pe->phb;
> +	struct pnv_ioda_pe *slave;
> +	struct pci_dev *pdev = NULL;
> +	int ret;
> +
> +	/*
> +	 * Clear PE frozen state. If it's master PE, we need
> +	 * clear slave PE frozen state as well.
> +	 */
> +	opal_pci_eeh_freeze_clear(phb->opal_id,
> +				  pe->pe_number,
> +				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> +	if (pe->flags & PNV_IODA_PE_MASTER) {
> +		list_for_each_entry(slave, &pe->slaves, list) {
> +			opal_pci_eeh_freeze_clear(phb->opal_id,
> +						  slave->pe_number,
> +						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> +		}
> +	}
> +
> +	/*
> +	 * Associate PE in PELT. We need add the PE into the
> +	 * corresponding PELT-V as well. Otherwise, the error
> +	 * originated from the PE might contribute to other
> +	 * PEs.
> +	 */
> +	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
> +	if (ret)
> +		return ret;
> +
> +	/* For compound PEs, any one affects all of them */
> +	if (pe->flags & PNV_IODA_PE_MASTER) {
> +		list_for_each_entry(slave, &pe->slaves, list) {
> +			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
> +		pdev = pe->pbus->self;
> +	else if (pe->flags & PNV_IODA_PE_DEV)
> +		pdev = pe->pdev->bus->self;
> +#ifdef CONFIG_PCI_IOV
> +	else if (pe->flags & PNV_IODA_PE_VF)
> +		pdev = pe->parent_dev->bus->self;
> +#endif /* CONFIG_PCI_IOV */
> +
> +	while (pdev) {
> +		struct pci_dn *pdn = pci_get_pdn(pdev);
> +		struct pnv_ioda_pe *parent;
> +
> +		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
> +			parent = &phb->ioda.pe_array[pdn->pe_number];
> +			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
> +			if (ret)
> +				return ret;
> +		}
> +
> +		pdev = pdev->bus->self;
> +	}
> +
> +	return 0;
> +}
> +
> +static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)


It used to be under #ifdef CONFIG_PCI_IOV, now it is not. Just moving this 
function to a different place looks like it deserves a separate patch with 
a comment explaining why (maybe "it is going to be used for the non-SRIOV 
case too"?).



> +{
> +	struct pnv_phb *phb = pe->phb;
> +	struct pci_dev *parent;
> +	uint8_t bcomp, dcomp, fcomp;
> +	long rid_end, rid;
> +	int64_t rc;
> +
> +	/* Tear down MVE */
> +	if (phb->type == PNV_PHB_IODA1 &&
> +	    pe->mve_number != -1) {
> +		rc = opal_pci_set_mve(phb->opal_id,
> +				      pe->mve_number,
> +				      phb->ioda.reserved_pe);
> +		if (rc != OPAL_SUCCESS)
> +			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
> +				rc, pe->mve_number);
> +		rc = opal_pci_set_mve_enable(phb->opal_id,
> +					     pe->mve_number,
> +					     OPAL_DISABLE_MVE);
> +		if (rc != OPAL_SUCCESS)
> +			pe_warn(pe, "Error %lld disabling MVE#%d\n",
> +				rc, pe->mve_number);
> +		pe->mve_number = -1;
> +	}
> +
> +	/* Unmapping PELTV */
> +	pnv_ioda_set_peltv(pe, false);
> +
> +	/* To unmap PELTM */
> +	if (pe->pbus) {
> +		int count;
> +
> +		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
> +		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
> +		parent = pe->pbus->self;
> +		if (pe->flags & PNV_IODA_PE_BUS_ALL)
> +			count = pe->pbus->busn_res.end -
> +				pe->pbus->busn_res.start + 1;
> +		else
> +			count = 1;
> +
> +		switch(count) {
> +		case  1: bcomp = OpalPciBusAll;   break;
> +		case  2: bcomp = OpalPciBus7Bits; break;
> +		case  4: bcomp = OpalPciBus6Bits; break;
> +		case  8: bcomp = OpalPciBus5Bits; break;
> +		case 16: bcomp = OpalPciBus4Bits; break;
> +		case 32: bcomp = OpalPciBus3Bits; break;
> +		default:
> +			/* Fail back to case of one bus */
> +			pe_warn(pe, "Cannot support %d buses\n", count);
> +			bcomp = OpalPciBusAll;
> +		}
> +		rid_end = pe->rid + (count << 8);
> +	} else {
> +#ifdef CONFIG_PCI_IOV
> +		if (pe->flags & PNV_IODA_PE_VF)
> +			parent = pe->parent_dev;
> +		else
> +#endif
> +			parent = pe->pdev->bus->self;
> +		bcomp = OpalPciBusAll;
> +		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
> +		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
> +		rid_end = pe->rid + 1;
> +	}
> +
> +	/* Clear RID mapping */
> +	for (rid = pe->rid; rid < rid_end; rid++)
> +		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
> +
> +	/* Unmapping PELTM */
> +	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
> +			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
> +	if (rc)
> +		pe_warn(pe, "Error %ld unmapping PELTM\n", rc);
> +}
> +
> +static void pnv_ioda_release_pe(struct kref *kref)
> +{
> +	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
> +	struct pnv_ioda_pe *tmp, *slave;
> +	struct pnv_phb *phb = pe->phb;
> +
> +	pnv_ioda_release_pe_dma(pe);
> +	pnv_ioda_release_pe_seg(pe);
> +	pnv_ioda_deconfigure_pe(pe);
> +
> +	/* Release slave PEs for compound PE */
> +	if (pe->flags & PNV_IODA_PE_MASTER) {
> +		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
> +			pnv_ioda_pe_put(slave);
> +	}
> +
> +	/* Remove the PE from various list. We need remove slave
> +	 * PE from master's list.
> +	 */
> +	list_del(&pe->dma_link);
> +	list_del(&pe->list);
> +
> +	/* Free PE number */
> +	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
> +}
> +
> +static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
> +					    int pe_no)
> +{
> +	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
> +
> +	kref_init(&pe->kref);
> +	pe->phb = phb;
> +	pe->pe_number = pe_no;
> +	INIT_LIST_HEAD(&pe->dma_link);
> +	INIT_LIST_HEAD(&pe->list);
> +
> +	return pe;
> +}
> +
> +static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
> +					       int pe_no)
> +{
> +	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
> +		pr_warn("%s: Invalid PE %d on PHB#%x\n",
> +			__func__, pe_no, phb->hose->global_number);
> +		return NULL;
>   	}
>
> -	phb->ioda.pe_array[pe_no].phb = phb;
> -	phb->ioda.pe_array[pe_no].pe_number = pe_no;
> +	/*
> +	 * Same PE might be reserved for multiple times, which
> +	 * is out of problem actually.
> +	 */
> +	set_bit(pe_no, phb->ioda.pe_alloc);
> +	return pnv_ioda_init_pe(phb, pe_no);
>   }
>
> -static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
> +static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
>   {
>   	unsigned long pe_no;
>   	unsigned long limit = phb->ioda.total_pe - 1;
> @@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>   			break;
>
>   		if (--limit >= phb->ioda.total_pe)
> -			return IODA_INVALID_PE;
> +			return NULL;
>   	} while(1);
>
> -	phb->ioda.pe_array[pe_no].phb = phb;
> -	phb->ioda.pe_array[pe_no].pe_number = pe_no;
> -	return pe_no;
> -}
> -
> -static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
> -{
> -	WARN_ON(phb->ioda.pe_array[pe].pdev);
> -
> -	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
> -	clear_bit(pe, phb->ioda.pe_alloc);
> +	return pnv_ioda_init_pe(phb, pe_no);
>   }
>
>   static int pnv_ioda1_init_m64(struct pnv_phb *phb)
> @@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
>   	}
>   }
>
> -static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
> -				struct pci_bus *bus, int all)
> +static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
> +						struct pci_bus *bus,
> +						int all)


Mechanical changes like this could easily go into a separate patch.


>   {
>   	resource_size_t segsz = phb->ioda.m64_segsize;
>   	struct pci_dev *pdev;
> @@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>   	int i;
>
>   	if (!pnv_ioda_need_m64_pe(phb, bus))
> -		return IODA_INVALID_PE;
> +		return NULL;
>
>           /* Allocate bitmap */
>   	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>   	pe_bitsmap = kzalloc(size, GFP_KERNEL);
>   	if (!pe_bitsmap) {
>   		pr_warn("%s: Out of memory !\n", __func__);
> -		return IODA_INVALID_PE;
> +		return NULL;
>   	}
>
>   	/* The bridge's M64 window might be extended to PHB's M64
> @@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>   	/* No M64 window found ? */
>   	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
>   		kfree(pe_bitsmap);
> -		return IODA_INVALID_PE;
> +		return NULL;
>   	}
>
>   	/* Figure out the master PE and put all slave PEs
> @@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>   	}
>
>   	kfree(pe_bitsmap);
> -	return master_pe->pe_number;
> +	return master_pe;
>   }
>
>   static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
> @@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>    * but in the meantime, we need to protect them to avoid warnings
>    */
>   #ifdef CONFIG_PCI_MSI
> -static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
> +static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
>   {
>   	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>   	struct pnv_phb *phb = hose->private_data;
> @@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>   }
>   #endif /* CONFIG_PCI_MSI */
>
> -static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
> -				  struct pnv_ioda_pe *parent,
> -				  struct pnv_ioda_pe *child,
> -				  bool is_add)
> -{
> -	const char *desc = is_add ? "adding" : "removing";
> -	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
> -			      OPAL_REMOVE_PE_FROM_DOMAIN;
> -	struct pnv_ioda_pe *slave;
> -	long rc;
> -
> -	/* Parent PE affects child PE */
> -	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
> -				child->pe_number, op);
> -	if (rc != OPAL_SUCCESS) {
> -		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
> -			rc, desc);
> -		return -ENXIO;
> -	}
> -
> -	if (!(child->flags & PNV_IODA_PE_MASTER))
> -		return 0;
> -
> -	/* Compound case: parent PE affects slave PEs */
> -	list_for_each_entry(slave, &child->slaves, list) {
> -		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
> -					slave->pe_number, op);
> -		if (rc != OPAL_SUCCESS) {
> -			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
> -				rc, desc);
> -			return -ENXIO;
> -		}
> -	}
> -
> -	return 0;
> -}
> -
> -static int pnv_ioda_set_peltv(struct pnv_phb *phb,
> -			      struct pnv_ioda_pe *pe,
> -			      bool is_add)
> -{
> -	struct pnv_ioda_pe *slave;
> -	struct pci_dev *pdev = NULL;
> -	int ret;
> -
> -	/*
> -	 * Clear PE frozen state. If it's master PE, we need
> -	 * clear slave PE frozen state as well.
> -	 */
> -	if (is_add) {
> -		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
> -					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> -		if (pe->flags & PNV_IODA_PE_MASTER) {
> -			list_for_each_entry(slave, &pe->slaves, list)
> -				opal_pci_eeh_freeze_clear(phb->opal_id,
> -							  slave->pe_number,
> -							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> -		}
> -	}
> -
> -	/*
> -	 * Associate PE in PELT. We need add the PE into the
> -	 * corresponding PELT-V as well. Otherwise, the error
> -	 * originated from the PE might contribute to other
> -	 * PEs.
> -	 */
> -	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
> -	if (ret)
> -		return ret;
> -
> -	/* For compound PEs, any one affects all of them */
> -	if (pe->flags & PNV_IODA_PE_MASTER) {
> -		list_for_each_entry(slave, &pe->slaves, list) {
> -			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
> -			if (ret)
> -				return ret;
> -		}
> -	}
> -
> -	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
> -		pdev = pe->pbus->self;
> -	else if (pe->flags & PNV_IODA_PE_DEV)
> -		pdev = pe->pdev->bus->self;
> -#ifdef CONFIG_PCI_IOV
> -	else if (pe->flags & PNV_IODA_PE_VF)
> -		pdev = pe->parent_dev->bus->self;
> -#endif /* CONFIG_PCI_IOV */
> -	while (pdev) {
> -		struct pci_dn *pdn = pci_get_pdn(pdev);
> -		struct pnv_ioda_pe *parent;
> -
> -		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
> -			parent = &phb->ioda.pe_array[pdn->pe_number];
> -			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
> -			if (ret)
> -				return ret;
> -		}
> -
> -		pdev = pdev->bus->self;
> -	}
> -
> -	return 0;
> -}
> -
> -#ifdef CONFIG_PCI_IOV
> -static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
> -{
> -	struct pci_dev *parent;
> -	uint8_t bcomp, dcomp, fcomp;
> -	int64_t rc;
> -	long rid_end, rid;
> -
> -	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
> -	if (pe->pbus) {
> -		int count;
> -
> -		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
> -		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
> -		parent = pe->pbus->self;
> -		if (pe->flags & PNV_IODA_PE_BUS_ALL)
> -			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
> -		else
> -			count = 1;
> -
> -		switch(count) {
> -		case  1: bcomp = OpalPciBusAll;         break;
> -		case  2: bcomp = OpalPciBus7Bits;       break;
> -		case  4: bcomp = OpalPciBus6Bits;       break;
> -		case  8: bcomp = OpalPciBus5Bits;       break;
> -		case 16: bcomp = OpalPciBus4Bits;       break;
> -		case 32: bcomp = OpalPciBus3Bits;       break;
> -		default:
> -			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
> -			        count);
> -			/* Do an exact match only */
> -			bcomp = OpalPciBusAll;
> -		}
> -		rid_end = pe->rid + (count << 8);
> -	} else {
> -		if (pe->flags & PNV_IODA_PE_VF)
> -			parent = pe->parent_dev;
> -		else
> -			parent = pe->pdev->bus->self;
> -		bcomp = OpalPciBusAll;
> -		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
> -		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
> -		rid_end = pe->rid + 1;
> -	}
> -
> -	/* Clear the reverse map */
> -	for (rid = pe->rid; rid < rid_end; rid++)
> -		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
> -
> -	/* Release from all parents PELT-V */
> -	while (parent) {
> -		struct pci_dn *pdn = pci_get_pdn(parent);
> -		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
> -			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
> -						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
> -			/* XXX What to do in case of error ? */


Not much :) Free the associated memory and mark the PE "dead" so it won't be
used again until reboot. Under what circumstances can this
opal_pci_set_peltv() fail at all?
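
A rough illustration of the "mark it dead" idea, written as a standalone
userspace mock rather than kernel code. All names here (mock_pe,
MOCK_PE_DEAD, mock_pe_alloc, mock_pe_release) are hypothetical and do not
exist in pci-ioda.c; the only assumption is that a PE whose firmware unmap
failed keeps its number reserved so the allocator never hands it out again.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MOCK_TOTAL_PE 4
#define MOCK_PE_DEAD  0x1u	/* hypothetical flag, not defined in pci.h */

struct mock_pe {
	unsigned int flags;
	bool allocated;
	void *data;		/* stands in for memory owned by the PE */
};

static struct mock_pe mock_pes[MOCK_TOTAL_PE];

/* Return the first free PE number, or -1 if none is left. */
static int mock_pe_alloc(void)
{
	for (int i = 0; i < MOCK_TOTAL_PE; i++) {
		if (!mock_pes[i].allocated) {
			mock_pes[i].allocated = true;
			return i;
		}
	}
	return -1;
}

/*
 * Tear down a PE. fw_rc is the (mock) firmware return code from the
 * unmap step; non-zero means the hardware tables are in an unknown state.
 */
static void mock_pe_release(int pe_no, long fw_rc)
{
	struct mock_pe *pe = &mock_pes[pe_no];

	/* The associated memory is freed in both cases. */
	free(pe->data);
	pe->data = NULL;

	if (fw_rc != 0) {
		/* Leave 'allocated' set so this number is never reused. */
		pe->flags |= MOCK_PE_DEAD;
		return;
	}

	pe->flags = 0;
	pe->allocated = false;
}
```

With this shape the allocator needs no extra check: a dead PE simply never
returns to the free pool, which matches "won't be used again until reboot".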


> -		}
> -		parent = parent->bus->self;
> -	}
> -
> -	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
> -				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
> -
> -	/* Disassociate PE in PELT */
> -	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
> -				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
> -	if (rc)
> -		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
> -	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
> -			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
> -	if (rc)
> -		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
> -
> -	pe->pbus = NULL;
> -	pe->pdev = NULL;
> -	pe->parent_dev = NULL;
> -
> -	return 0;
> -}
> -#endif /* CONFIG_PCI_IOV */
> -
>   static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>   {
>   	struct pci_dev *parent;
> @@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>   	}
>
>   	/* Configure PELTV */
> -	pnv_ioda_set_peltv(phb, pe, true);
> +	pnv_ioda_set_peltv(pe, true);
>
>   	/* Setup reverse map */
>   	for (rid = pe->rid; rid < rid_end; rid++)
> @@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>   		if (pdn->pe_number != IODA_INVALID_PE)
>   			continue;
>
> +		/* Increase reference count of the parent PE */

When you comment like this, I read the comment as applying to the whole
next chunk up to the first empty line, i.e. to all 5 lines below, which is
not the case. I'd remove the comment because 1) the "pe_get" in the
pnv_ioda_pe_get() name already suggests incrementing the reference counter
and 2) "pe" is always the parent in this function. I do not insist though.
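
For readers following the reference counting, here is a minimal userspace
sketch of the get/put pairing. It is only a stand-in for struct kref: the
names mock_ref_pe, mock_ref_init, mock_ref_get and mock_ref_put are
hypothetical, and the count of "number of child devices + 1" is taken from
the cover letter. In the patch itself the corresponding calls are
kref_init(), pnv_ioda_pe_get() and pnv_ioda_pe_put().

```c
#include <stdbool.h>

struct mock_ref_pe {
	int refcount;	/* stands in for struct kref */
	bool released;	/* set where the release callback would run */
};

/* Like kref_init(): the PE starts with one reference of its own. */
static void mock_ref_init(struct mock_ref_pe *pe)
{
	pe->refcount = 1;
	pe->released = false;
}

/* Taken once for every child device attached to the PE. */
static void mock_ref_get(struct mock_ref_pe *pe)
{
	pe->refcount++;
}

/* Dropping the last reference triggers the release step. */
static void mock_ref_put(struct mock_ref_pe *pe)
{
	if (--pe->refcount == 0)
		pe->released = true;
}
```

A PE with two child devices therefore holds three references and is only
released after three matching puts.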


> +		pnv_ioda_pe_get(pe);
>   		pdn->pe_number = pe->pe_number;
>   		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
>   		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
> @@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>   {
>   	struct pci_controller *hose = pci_bus_to_host(bus);
>   	struct pnv_phb *phb = hose->private_data;
> -	struct pnv_ioda_pe *pe;
> +	struct pnv_ioda_pe *pe = NULL;
>   	int pe_num = IODA_INVALID_PE;
>
>   	/* For partial hotplug case, the PE instance hasn't been destroyed
> @@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>   	}
>
>   	/* PE number for root bus should have been reserved */
> -	if (pci_is_root_bus(bus))
> -		pe_num = phb->ioda.root_pe_no;
> +	if (pci_is_root_bus(bus) &&
> +	    phb->ioda.root_pe_no != IODA_INVALID_PE)
> +		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
>
>   	/* Check if PE is determined by M64 */
> -	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
> -		pe_num = phb->pick_m64_pe(phb, bus, all);
> +	if (!pe && phb->pick_m64_pe)
> +		pe = phb->pick_m64_pe(phb, bus, all);
>
>   	/* The PE number isn't pinned by M64 */
> -	if (pe_num == IODA_INVALID_PE)
> -		pe_num = pnv_ioda_alloc_pe(phb);
> +	if (!pe)
> +		pe = pnv_ioda_alloc_pe(phb);
>
> -	if (pe_num == IODA_INVALID_PE) {
> -		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
> +	if (!pe) {
> +		pr_warn("%s: No enough PE# available for PCI bus %04x:%02x\n",
>   			__func__, pci_domain_nr(bus), bus->number);
>   		return NULL;
>   	}
>
> -	pe = &phb->ioda.pe_array[pe_num];
>   	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
>   	pe->pbus = bus;
>   	pe->pdev = NULL;
> @@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>
>   	if (pnv_ioda_configure_pe(phb, pe)) {
>   		/* XXX What do we do here ? */
> -		if (pe_num)
> -			pnv_ioda_free_pe(phb, pe_num);
> -		pe->pbus = NULL;
> +		pnv_ioda_pe_put(pe);
>   		return NULL;
>   	}
>
>   	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
> -			GFP_KERNEL, hose->node);
> +				       GFP_KERNEL, hose->node);

This seems to be a whitespace-only change. If you really want it (I hate it,
as it makes the code look untidy to my taste, but it seems I am in the
minority here :) ), please put it into a separate patch.


>   	pe->tce32_table->data = pe;
>
>   	/* Associate it with all child devices */
> @@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>   		list_del(&pe->list);
>   		mutex_unlock(&phb->ioda.pe_list_mutex);
>
> -		pnv_ioda_deconfigure_pe(phb, pe);
> +		pnv_ioda_deconfigure_pe(pe);


Is this change necessary to get "Release PEs dynamically" working? Maybe
move it to the mechanical-changes patch?


>
> -		pnv_ioda_free_pe(phb, pe->pe_number);
> +		pnv_ioda_pe_put(pe);
>   	}
>   }
>
> @@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>
>   		if (pnv_ioda_configure_pe(phb, pe)) {
>   			/* XXX What do we do here ? */
> -			if (pe_num)
> -				pnv_ioda_free_pe(phb, pe_num);
> -			pe->pdev = NULL;
> +			pnv_ioda_pe_put(pe);
>   			continue;
>   		}
>
> @@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
>   	struct pnv_ioda_pe *pe;
>   	int rc;
>
> -	pe = pnv_ioda_get_pe(dev);
> +	pe = pnv_ioda_pci_dev_to_pe(dev);


And this change could go in separately. It is not clear how it helps to
"Release PEs dynamically".


>   	if (!pe)
>   		return -ENODEV;
>
> @@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>   	struct pnv_ioda_pe *pe;
>   	int rc;
>
> -	if (!(pe = pnv_ioda_get_pe(dev)))
> +	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
>   		return -ENODEV;
>
>   	/* Assign XIVE to PE */
> @@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
>   				  unsigned int hwirq, unsigned int virq,
>   				  unsigned int is_64, struct msi_msg *msg)
>   {
> -	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
> +	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
>   	unsigned int xive_num = hwirq - phb->msi_base;
>   	__be32 data;
>   	int rc;
> @@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>   	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
>   	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
>   	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
> +	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
>   	hose->controller_ops = pnv_pci_controller_ops;
>
>   #ifdef CONFIG_PCI_IOV
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 1bea3a8..8b10f01 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -28,6 +28,7 @@ enum pnv_phb_model {
>   /* Data associated with a PE, including IOMMU tracking etc.. */
>   struct pnv_phb;
>   struct pnv_ioda_pe {
> +	struct kref		kref;
>   	unsigned long		flags;
>   	struct pnv_phb		*phb;
>
> @@ -120,7 +121,8 @@ struct pnv_phb {
>   	void (*shutdown)(struct pnv_phb *phb);
>   	int (*init_m64)(struct pnv_phb *phb);
>   	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
> -	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
> +	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
> +					   struct pci_bus *bus, int all);
>   	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>   	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
>   	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
>


-- 
Alexey


* Re: [PATCH v4 08/21] powerpc/powernv: Drop pnv_ioda_setup_dev_PE()
  2015-05-01  6:02   ` Gavin Shan
@ 2015-05-09 12:45     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 12:45 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> Nobody is using this function. The patch drops it.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>

Yay! :)

I would move this patch, along with the other mechanical changes, to the
beginning of the patchset.

Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>


> ---
>   arch/powerpc/platforms/powernv/pci-ioda.c | 71 -------------------------------
>   1 file changed, 71 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index ef8c216..5cd8298 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1302,77 +1302,6 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>   }
>   #endif /* CONFIG_PCI_IOV */
>
> -#if 0
> -static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
> -{
> -	struct pci_controller *hose = pci_bus_to_host(dev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> -	struct pci_dn *pdn = pci_get_pdn(dev);
> -	struct pnv_ioda_pe *pe;
> -	int pe_num;
> -
> -	if (!pdn) {
> -		pr_err("%s: Device tree node not associated properly\n",
> -			   pci_name(dev));
> -		return NULL;
> -	}
> -	if (pdn->pe_number != IODA_INVALID_PE)
> -		return NULL;
> -
> -	/* PE#0 has been pre-set */
> -	if (dev->bus->number == 0)
> -		pe_num = 0;
> -	else
> -		pe_num = pnv_ioda_alloc_pe(phb);
> -	if (pe_num == IODA_INVALID_PE) {
> -		pr_warning("%s: Not enough PE# available, disabling device\n",
> -			   pci_name(dev));
> -		return NULL;
> -	}
> -
> -	/* NOTE: We get only one ref to the pci_dev for the pdn, not for the
> -	 * pointer in the PE data structure, both should be destroyed at the
> -	 * same time. However, this needs to be looked at more closely again
> -	 * once we actually start removing things (Hotplug, SR-IOV, ...)
> -	 *
> -	 * At some point we want to remove the PDN completely anyways
> -	 */
> -	pe = &phb->ioda.pe_array[pe_num];
> -	pci_dev_get(dev);
> -	pdn->pcidev = dev;
> -	pdn->pe_number = pe_num;
> -	pe->pdev = dev;
> -	pe->pbus = NULL;
> -	pe->tce32_seg = -1;
> -	pe->mve_number = -1;
> -	pe->rid = dev->bus->number << 8 | pdn->devfn;
> -
> -	pe_info(pe, "Associated device to PE\n");
> -
> -	if (pnv_ioda_configure_pe(phb, pe)) {
> -		/* XXX What do we do here ? */
> -		if (pe_num)
> -			pnv_ioda_free_pe(phb, pe_num);
> -		pdn->pe_number = IODA_INVALID_PE;
> -		pe->pdev = NULL;
> -		pci_dev_put(dev);
> -		return NULL;
> -	}
> -
> -	/* Assign a DMA weight to the device */
> -	pe->dma_weight = pnv_ioda_dma_weight(dev);
> -	if (pe->dma_weight != 0) {
> -		phb->ioda.dma_weight += pe->dma_weight;
> -		phb->ioda.dma_pe_count++;
> -	}
> -
> -	/* Link the PE */
> -	pnv_ioda_link_pe_by_weight(phb, pe);
> -
> -	return pe;
> -}
> -#endif /* Useful for SRIOV case */
> -
>   static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>   {
>   	struct pci_dev *dev;
>


-- 
Alexey


* Re: [PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure
  2015-05-01  6:02   ` Gavin Shan
@ 2015-05-09 13:41     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 13:41 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> For PowerNV platform, running on top of skiboot, all PE level reset
> should be routed to firmware if the bridge of the PE primary bus has
> device-node property "ibm,reset-by-firmware". Otherwise, the kernel
> has to issue hot reset on PE's primary bus despite the requested reset
> types, which is the behaviour before the firmware supports PCI slot
> reset. So the changes don't depend on the PCI slot reset capability
> exposed from the firmware.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/eeh.h               |   1 +
>   arch/powerpc/include/asm/opal.h              |   4 +-
>   arch/powerpc/platforms/powernv/eeh-powernv.c | 206 +++++++++++++--------------
>   3 files changed, 102 insertions(+), 109 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index c5eb86f..2793d24 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -190,6 +190,7 @@ enum {
>   #define EEH_RESET_DEACTIVATE	0	/* Deactivate the PE reset	*/
>   #define EEH_RESET_HOT		1	/* Hot reset			*/
>   #define EEH_RESET_FUNDAMENTAL	3	/* Fundamental reset		*/
> +#define EEH_RESET_COMPLETE	4	/* PHB complete reset           */
>   #define EEH_LOG_TEMP		1	/* EEH temporary error log	*/
>   #define EEH_LOG_PERM		2	/* EEH permanent error log	*/
>
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 042af1a..6d467df 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -129,7 +129,7 @@ int64_t opal_pci_map_pe_dma_window(uint64_t phb_id, uint16_t pe_number, uint16_t
>   int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id, uint16_t pe_number,
>   					uint16_t dma_window_number, uint64_t pci_start_addr,
>   					uint64_t pci_mem_size);
> -int64_t opal_pci_reset(uint64_t phb_id, uint8_t reset_scope, uint8_t assert_state);
> +int64_t opal_pci_reset(uint64_t id, uint8_t reset_scope, uint8_t assert_state);
>
>   int64_t opal_pci_get_hub_diag_data(uint64_t hub_id, void *diag_buffer,
>   				   uint64_t diag_buffer_len);
> @@ -145,7 +145,7 @@ int64_t opal_get_epow_status(__be64 *status);
>   int64_t opal_set_system_attention_led(uint8_t led_action);
>   int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe,
>   			    __be16 *pci_error_type, __be16 *severity);
> -int64_t opal_pci_poll(uint64_t phb_id);
> +int64_t opal_pci_poll(uint64_t id, uint8_t *val);
>   int64_t opal_return_cpu(void);
>   int64_t opal_check_token(uint64_t token);
>   int64_t opal_reinit_cpus(uint64_t flags);
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index ce738ab..3c01095 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -742,12 +742,12 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
>   	return ret;
>   }
>
> -static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
> +static s64 pnv_eeh_poll(uint64_t id)
>   {
>   	s64 rc = OPAL_HARDWARE;
>
>   	while (1) {
> -		rc = opal_pci_poll(phb->opal_id);
> +		rc = opal_pci_poll(id, NULL);
>   		if (rc <= 0)
>   			break;
>
> @@ -763,84 +763,38 @@ static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
>   int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
>   {
>   	struct pnv_phb *phb = hose->private_data;
> +	uint8_t scope;
>   	s64 rc = OPAL_HARDWARE;
>
>   	pr_debug("%s: Reset PHB#%x, option=%d\n",
>   		 __func__, hose->global_number, option);
> -
> -	/* Issue PHB complete reset request */
> -	if (option == EEH_RESET_FUNDAMENTAL ||
> -	    option == EEH_RESET_HOT)
> -		rc = opal_pci_reset(phb->opal_id,
> -				    OPAL_RESET_PHB_COMPLETE,
> -				    OPAL_ASSERT_RESET);
> -	else if (option == EEH_RESET_DEACTIVATE)
> -		rc = opal_pci_reset(phb->opal_id,
> -				    OPAL_RESET_PHB_COMPLETE,
> -				    OPAL_DEASSERT_RESET);
> -	if (rc < 0)
> -		goto out;
> -
> -	/*
> -	 * Poll state of the PHB until the request is done
> -	 * successfully. The PHB reset is usually PHB complete
> -	 * reset followed by hot reset on root bus. So we also
> -	 * need the PCI bus settlement delay.
> -	 */
> -	rc = pnv_eeh_phb_poll(phb);
> -	if (option == EEH_RESET_DEACTIVATE) {
> -		if (system_state < SYSTEM_RUNNING)
> -			udelay(1000 * EEH_PE_RST_SETTLE_TIME);
> -		else
> -			msleep(EEH_PE_RST_SETTLE_TIME);


These udelay() and msleep() calls are gone. How come they are not needed
anymore? This is worth explaining in the commit log, or remove them in a
separate patch.

I remember you mentioning some missing delays somewhere which caused an
NVIDIA device to trigger EEH, and I do not want those delays to disappear :)


> +	switch (option) {
> +	case EEH_RESET_HOT:
> +		scope = OPAL_RESET_PCI_HOT;
> +		break;
> +	case EEH_RESET_FUNDAMENTAL:
> +		scope = OPAL_RESET_PCI_FUNDAMENTAL;
> +		break;
> +	case EEH_RESET_COMPLETE:
> +		scope = OPAL_RESET_PHB_COMPLETE;
> +		break;
> +	case EEH_RESET_DEACTIVATE:
> +		return 0;
> +	default:
> +		pr_warn("%s: Unsupported option %d\n",
> +			__func__, option);
> +		return -EINVAL;
>   	}
> -out:
> -	if (rc != OPAL_SUCCESS)
> -		return -EIO;
>
> -	return 0;
> -}
> -
> -static int pnv_eeh_root_reset(struct pci_controller *hose, int option)
> -{
> -	struct pnv_phb *phb = hose->private_data;
> -	s64 rc = OPAL_HARDWARE;
> +	/* Issue reset and poll until it's completed */
> +	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
> +	if (rc > 0)
> +		rc = pnv_eeh_poll(phb->opal_id);
>
> -	pr_debug("%s: Reset PHB#%x, option=%d\n",
> -		 __func__, hose->global_number, option);
> -
> -	/*
> -	 * During the reset deassert time, we needn't care
> -	 * the reset scope because the firmware does nothing
> -	 * for fundamental or hot reset during deassert phase.
> -	 */
> -	if (option == EEH_RESET_FUNDAMENTAL)
> -		rc = opal_pci_reset(phb->opal_id,
> -				    OPAL_RESET_PCI_FUNDAMENTAL,
> -				    OPAL_ASSERT_RESET);
> -	else if (option == EEH_RESET_HOT)
> -		rc = opal_pci_reset(phb->opal_id,
> -				    OPAL_RESET_PCI_HOT,
> -				    OPAL_ASSERT_RESET);
> -	else if (option == EEH_RESET_DEACTIVATE)
> -		rc = opal_pci_reset(phb->opal_id,
> -				    OPAL_RESET_PCI_HOT,
> -				    OPAL_DEASSERT_RESET);
> -	if (rc < 0)
> -		goto out;
> -
> -	/* Poll state of the PHB until the request is done */
> -	rc = pnv_eeh_phb_poll(phb);
> -	if (option == EEH_RESET_DEACTIVATE)
> -		msleep(EEH_PE_RST_SETTLE_TIME);
> -out:
> -	if (rc != OPAL_SUCCESS)
> -		return -EIO;
> -
> -	return 0;
> +	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>   }
>
> -static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
> +static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>   {
>   	struct pci_dn *pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
>   	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
> @@ -891,14 +845,57 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>   	return 0;
>   }
>
> +static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
> +{
> +	struct pci_controller *hose;
> +	struct pnv_phb *phb;
> +	struct device_node *dn = dev ? pci_device_to_OF_node(dev) : NULL;
> +	uint64_t id = (0x1ul << 60);
> +	uint8_t scope;
> +	s64 rc;


int64_t for @rc?


> +
> +	/*
> +	 * If the firmware can't handle it, we will issue hot reset
> +	 * on the secondary bus despite the requested reset type
> +	 */
> +	if (!dn || !of_get_property(dn, "ibm,reset-by-firmware", NULL))
> +		return __pnv_eeh_bridge_reset(dev, option);
> +
> +	/* The firmware can handle the request */
> +	switch (option) {
> +	case EEH_RESET_HOT:
> +		scope = OPAL_RESET_PCI_HOT;
> +		break;
> +	case EEH_RESET_FUNDAMENTAL:
> +		scope = OPAL_RESET_PCI_FUNDAMENTAL;
> +		break;
> +	case EEH_RESET_DEACTIVATE:
> +		return 0;
> +	case EEH_RESET_COMPLETE:
> +	default:
> +		pr_warn("%s: Unsupported option %d on device %s\n",
> +			__func__, option, pci_name(dev));
> +		return -EINVAL;
> +	}


This is the same switch as earlier in this patch (slightly different 
order). Move it and opal_pci_reset() into a helper and call it 
pnv_opal_pci_reset()?
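
To make the suggestion concrete, here is a userspace sketch of the shared
option-to-scope mapping such a helper could contain. The EEH_RESET_* values
are copied from the eeh.h hunk quoted above; everything else (enum
mock_scope, mock_reset_scope, the is_phb flag) is hypothetical, and the
scope values are placeholders rather than the real OPAL_RESET_* constants.
The flag is there because the two switches differ in one respect: only the
PHB path accepts EEH_RESET_COMPLETE.

```c
#include <errno.h>
#include <stdbool.h>

/* EEH option values, as defined in the quoted eeh.h hunk */
#define EEH_RESET_DEACTIVATE	0
#define EEH_RESET_HOT		1
#define EEH_RESET_FUNDAMENTAL	3
#define EEH_RESET_COMPLETE	4

/* Placeholder scopes; the kernel code uses OPAL_RESET_* instead */
enum mock_scope {
	MOCK_SCOPE_NONE,		/* deactivate: nothing to send */
	MOCK_SCOPE_HOT,
	MOCK_SCOPE_FUNDAMENTAL,
	MOCK_SCOPE_PHB_COMPLETE,
};

/*
 * Translate an EEH reset option into a firmware reset scope.
 * Returns 0 on success or -EINVAL for an unsupported option.
 */
static int mock_reset_scope(int option, bool is_phb, enum mock_scope *scope)
{
	switch (option) {
	case EEH_RESET_HOT:
		*scope = MOCK_SCOPE_HOT;
		return 0;
	case EEH_RESET_FUNDAMENTAL:
		*scope = MOCK_SCOPE_FUNDAMENTAL;
		return 0;
	case EEH_RESET_COMPLETE:
		/* A complete reset only makes sense for a whole PHB. */
		if (!is_phb)
			return -EINVAL;
		*scope = MOCK_SCOPE_PHB_COMPLETE;
		return 0;
	case EEH_RESET_DEACTIVATE:
		*scope = MOCK_SCOPE_NONE;
		return 0;
	default:
		return -EINVAL;
	}
}
```

Both callers would then skip the firmware call when the scope is
MOCK_SCOPE_NONE and otherwise issue the reset and poll, which is the part
that is already identical in the two functions.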


> +
> +	hose = pci_bus_to_host(dev->bus);
> +	phb = hose->private_data;

Previously you initialized @hose and @phb where you declared them, but not
here. If you had done the same as before, the patch would have been
smaller and easier to read.



> +	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
> +	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
> +	if (rc > 0)
> +		rc = pnv_eeh_poll(id);
> +
> +	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
> +}
> +
>   void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>   {
>   	struct pci_controller *hose;
>
>   	if (pci_is_root_bus(dev->bus)) {
>   		hose = pci_bus_to_host(dev->bus);
> -		pnv_eeh_root_reset(hose, EEH_RESET_HOT);
> -		pnv_eeh_root_reset(hose, EEH_RESET_DEACTIVATE);
> +		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
> +		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
>   	} else {
>   		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>   		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
> @@ -920,8 +917,9 @@ void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>   static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>   {
>   	struct pci_controller *hose = pe->phb;
> +	struct pnv_phb *phb;
>   	struct pci_bus *bus;
> -	int ret;
> +	s64 rc;
>
>   	/*
>   	 * For PHB reset, we always have complete reset. For those PEs whose
> @@ -937,43 +935,37 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>   	 * reset. The side effect is that EEH core has to clear the frozen
>   	 * state explicitly after BAR restore.
>   	 */
> -	if (pe->type & EEH_PE_PHB) {
> -		ret = pnv_eeh_phb_reset(hose, option);
> -	} else {
> -		struct pnv_phb *phb;
> -		s64 rc;
> +	if (pe->type & EEH_PE_PHB)

I would keep "{" in the line above ....

> +		return pnv_eeh_phb_reset(hose, EEH_RESET_COMPLETE);

...put "} else {" here...

and the chunk below would become 1) very small and 2) trivial. Then you
could add a separate trivial patch which removes the scope without
functional changes. Or vice versa.

>
> -		/*
> -		 * The frozen PE might be caused by PAPR error injection
> -		 * registers, which are expected to be cleared after hitting
> -		 * frozen PE as stated in the hardware spec. Unfortunately,
> -		 * that's not true on P7IOC. So we have to clear it manually
> -		 * to avoid recursive EEH errors during recovery.
> -		 */
> -		phb = hose->private_data;
> -		if (phb->model == PNV_PHB_MODEL_P7IOC &&
> -		    (option == EEH_RESET_HOT ||
> -		    option == EEH_RESET_FUNDAMENTAL)) {
> -			rc = opal_pci_reset(phb->opal_id,
> -					    OPAL_RESET_PHB_ERROR,
> -					    OPAL_ASSERT_RESET);
> -			if (rc != OPAL_SUCCESS) {
> -				pr_warn("%s: Failure %lld clearing "
> -					"error injection registers\n",
> -					__func__, rc);
> -				return -EIO;
> -			}
> +	/*
> +	 * The frozen PE might be caused by PAPR error injection
> +	 * registers, which are expected to be cleared after hitting
> +	 * frozen PE as stated in the hardware spec. Unfortunately,
> +	 * that's not true on P7IOC. So we have to clear it manually
> +	 * to avoid recursive EEH errors during recovery.
> +	 */
> +	phb = hose->private_data;
> +	if (phb->model == PNV_PHB_MODEL_P7IOC &&
> +	    (option == EEH_RESET_HOT ||
> +	    option == EEH_RESET_FUNDAMENTAL)) {
> +		rc = opal_pci_reset(phb->opal_id,
> +				    OPAL_RESET_PHB_ERROR,
> +				    OPAL_ASSERT_RESET);
> +		if (rc != OPAL_SUCCESS) {
> +			pr_warn("%s: Failure %lld clearing error "
> +				"injection registers on PHB#%d\n",
> +				__func__, rc, hose->global_number);
> +			return -EIO;
>   		}
> -
> -		bus = eeh_pe_bus_get(pe);
> -		if (pci_is_root_bus(bus) ||
> -			pci_is_root_bus(bus->parent))
> -			ret = pnv_eeh_root_reset(hose, option);
> -		else
> -			ret = pnv_eeh_bridge_reset(bus->self, option);
>   	}
>
> -	return ret;
> +	/* Route the reset request to PHB or upstream bridge */
> +	bus = eeh_pe_bus_get(pe);
> +	if (pci_is_root_bus(bus))
> +		return pnv_eeh_phb_reset(hose, option);
> +
> +	return pnv_eeh_bridge_reset(bus->self, option);
>   }
>
>   /**
>


-- 
Alexey

* Re: [PATCH v4 10/21] powerpc/powernv: Fundamental reset for PCI bus reset
  2015-05-01  6:02   ` Gavin Shan
@ 2015-05-09 14:12     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 14:12 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:02 PM, Gavin Shan wrote:
> Function pnv_pci_reset_secondary_bus() is used to reset specified
> PCI bus, which is leaded by root complex or PCI bridge. That means
> the function shouldn't be called on PCI root bus and the patch
> removes the logic for that case.
>
> Also, some adapters beneath the indicated PCI bus may require
> fundamental reset in order to successfully reload their firmwares
> after the reset. The patch translates hot reset to fundamental reset
> for that case.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/platforms/powernv/eeh-powernv.c | 35 +++++++++++++++++++++-------
>   1 file changed, 26 insertions(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 3c01095..58e4dcf 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -888,18 +888,35 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>   	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>   }
>
> -void pnv_pci_reset_secondary_bus(struct pci_dev *dev)


Why change dev to pdev? Keeping "dev" would make the patch simpler.


> +static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
>   {
> -	struct pci_controller *hose;
> +	int *freset = data;
>
> -	if (pci_is_root_bus(dev->bus)) {
> -		hose = pci_bus_to_host(dev->bus);
> -		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
> -		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
> -	} else {
> -		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
> -		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
> +	/*
> +	 * Stop the iteration immediately if there is any
> +	 * one PCI device requesting fundamental reset
> +	 */
> +	*freset |= pdev->needs_freset;
> +	return *freset;
> +}
> +
> +void pnv_pci_reset_secondary_bus(struct pci_dev *pdev)
> +{
> +	int option = EEH_RESET_HOT;
> +	int freset = 0;
> +
> +	/* Check if there're any PCI devices asking for fundamental reset */
> +	if (pdev->subordinate) {
> +		pci_walk_bus(pdev->subordinate,
> +			     pnv_pci_dev_reset_type,
> +			     &freset);
> +		if (freset)
> +			option = EEH_RESET_FUNDAMENTAL;
>   	}
> +
> +	/* Issue the requested type of reset */
> +	pnv_eeh_bridge_reset(pdev, option);
> +	pnv_eeh_bridge_reset(pdev, EEH_RESET_DEACTIVATE);
>   }
>
>   /**
>
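
For anyone following along: the new callback relies on pci_walk_bus()
stopping as soon as the callback returns non-zero. A flat userspace sketch
of that pattern is below. struct fake_dev and walk_devs() are made up for
illustration and are not kernel APIs; the real walk also descends into
subordinate buses, which this sketch does not.

```c
#include <stddef.h>

/* Stand-in for struct pci_dev, keeping only the field the callback reads */
struct fake_dev {
	int needs_freset;
};

/* Same shape as pnv_pci_dev_reset_type(): non-zero return stops the walk */
static int dev_reset_type(struct fake_dev *dev, void *data)
{
	int *freset = data;

	*freset |= dev->needs_freset;
	return *freset;
}

/* Flat stand-in for pci_walk_bus(); returns how many devices were visited */
static size_t walk_devs(struct fake_dev *devs, size_t n,
			int (*cb)(struct fake_dev *, void *), void *data)
{
	size_t visited = 0;

	for (size_t i = 0; i < n; i++) {
		visited++;
		if (cb(&devs[i], data))
			break;
	}

	return visited;
}
```

With three devices where only the second sets needs_freset, the walk visits
two devices and leaves freset set, so the caller would pick
EEH_RESET_FUNDAMENTAL.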


-- 
Alexey

* Re: [PATCH v4 13/21] powerpc/powernv: Introduce pnv_pci_poll()
  2015-05-01  6:03   ` Gavin Shan
@ 2015-05-09 14:30     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 14:30 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:03 PM, Gavin Shan wrote:
> We might not get some PCI slot information (e.g. power status)
> immediately by OPAL API. Instead, opal_pci_poll() need to be called
> for the required information.
>
> The patch introduces pnv_pci_poll(), which bases on original
> pnv_eeh_poll(), to cover the above case
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/platforms/powernv/eeh-powernv.c | 28 ++--------------------------
>   arch/powerpc/platforms/powernv/pci.c         | 16 ++++++++++++++++
>   arch/powerpc/platforms/powernv/pci.h         |  1 +
>   3 files changed, 19 insertions(+), 26 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 58e4dcf..9253b9e 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -742,24 +742,6 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
>   	return ret;
>   }
>
> -static s64 pnv_eeh_poll(uint64_t id)
> -{
> -	s64 rc = OPAL_HARDWARE;
> -
> -	while (1) {
> -		rc = opal_pci_poll(id, NULL);
> -		if (rc <= 0)
> -			break;
> -
> -		if (system_state < SYSTEM_RUNNING)
> -			udelay(1000 * rc);
> -		else
> -			msleep(rc);
> -	}
> -
> -	return rc;
> -}
> -
>   int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
>   {
>   	struct pnv_phb *phb = hose->private_data;
> @@ -788,10 +770,7 @@ int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
>
>   	/* Issue reset and poll until it's completed */
>   	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
> -	if (rc > 0)
> -		rc = pnv_eeh_poll(phb->opal_id);
> -
> -	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
> +	return pnv_pci_poll(phb->opal_id, rc, NULL);


You are carrying a negative value to the new helper too? Looks complicated.

Also, before you only cared whether opal_pci_reset() returned a negative
value; now you treat its return value as a timeout. Is this a new change to
OPAL, or has it always been there?
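
To make the question concrete, this is how I read the contract of the new
helper: a positive value is a delay in milliseconds before polling again,
zero is success, and a negative value falls straight through to -EIO.
A userspace sketch of just that control flow is below. opal_pci_poll() is
stubbed with a scripted sequence, the delay is skipped, and the second
opal_pci_poll() call for the pval case is left out, so treat it as my
reading rather than the real thing.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Scripted return values handed out by the stubbed firmware call */
static const int64_t *poll_script;
static size_t poll_calls;

/* Stub standing in for the firmware call; ignores its arguments */
static int64_t opal_pci_poll(uint64_t id, uint8_t *val)
{
	(void)id;
	(void)val;
	return poll_script[poll_calls++];
}

/* Simplified copy of the loop in pnv_pci_poll() */
static int pnv_pci_poll_sketch(uint64_t id, int64_t rval, uint8_t *pval)
{
	while (rval > 0) {
		/* the patch waits for rval milliseconds here; skipped */
		rval = opal_pci_poll(id, pval);
	}

	/* zero is success; a negative value never entered the loop */
	return rval ? -EIO : 0;
}
```

So a negative return from opal_pci_reset() is handled correctly, but only
implicitly, because the loop condition happens to skip it.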


>   }
>
>   static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
> @@ -882,10 +861,7 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>   	phb = hose->private_data;
>   	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
>   	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
> -	if (rc > 0)
> -		rc = pnv_eeh_poll(id);
> -
> -	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
> +	return pnv_pci_poll(id, rc, NULL);
>   }
>
>   static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index bca2aeb..a2da9a3 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -44,6 +44,22 @@
>   #define cfg_dbg(fmt...)	do { } while(0)
>   //#define cfg_dbg(fmt...)	printk(fmt)
>
> +int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval)
> +{
> +	while (rval > 0) {
> +		if (system_state < SYSTEM_RUNNING)
> +			udelay(1000 * rval);
> +		else
> +			msleep(rval);

Are these delays the ones removed by "[PATCH v4 09/21] powerpc/powernv: Use
PCI slot reset infrastructure"? If so, I would merge this patch into 09/21
or move this one before that one, for bisectability.


> +
> +		rval = opal_pci_poll(id, pval);
> +		if (rval == OPAL_SUCCESS && pval)
> +			rval = opal_pci_poll(id, pval);

Why call it twice?


> +	}
> +
> +	return rval ? -EIO : 0;
> +}
> +
>   #ifdef CONFIG_PCI_MSI
>   static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
>   {
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 8b10f01..82c5539 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -202,6 +202,7 @@ struct pnv_phb {
>
>   extern struct pci_ops pnv_pci_ops;
>
> +int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval);
>   void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
>   				unsigned char *log_buff);
>   int pnv_pci_cfg_read(struct pci_dn *pdn,
>


-- 
Alexey

* Re: [PATCH v4 14/21] powerpc/powernv: Functions to get/reset PCI slot status
  2015-05-01  6:03   ` Gavin Shan
@ 2015-05-09 14:44     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 14:44 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:03 PM, Gavin Shan wrote:
> The patch exports 3 functions, which base on corresponding OPAL
> APIs to get or set PCI slot status. Those functions are going to
> be used by PCI hotplug module in subsequent patches:
>
>     pnv_pci_get_presence_status()  opal_pci_get_presence_status()
>     pnv_pci_get_power_status()     opal_pci_get_power_status()
>     pnv_pci_set_power_status()     opal_pci_set_power_status()
>
> Besides, the patch also exports pnv_pci_hotplug_notifier() to allow
> registering PCI hotplug notifier, which will be used to receive PCI
> hotplug message from skiboot firmware.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/opal-api.h            |  7 +++-
>   arch/powerpc/include/asm/opal.h                |  3 ++
>   arch/powerpc/include/asm/pnv-pci.h             |  5 +++
>   arch/powerpc/platforms/powernv/opal-wrappers.S |  3 ++
>   arch/powerpc/platforms/powernv/pci.c           | 45 ++++++++++++++++++++++++++
>   5 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
> index 0321a90..29b407d 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -153,7 +153,10 @@
>   #define OPAL_FLASH_READ				110
>   #define OPAL_FLASH_WRITE			111
>   #define OPAL_FLASH_ERASE			112
> -#define OPAL_LAST				112
> +#define OPAL_PCI_GET_PRESENCE_STATUS		116
> +#define OPAL_PCI_GET_POWER_STATUS		117
> +#define OPAL_PCI_SET_POWER_STATUS		118
> +#define OPAL_LAST				118
>
>   /* Device tree flags */
>
> @@ -352,6 +355,8 @@ enum opal_msg_type {
>   	OPAL_MSG_SHUTDOWN,		/* params[0] = 1 reboot, 0 shutdown */
>   	OPAL_MSG_HMI_EVT,
>   	OPAL_MSG_DPO,
> +	OPAL_MSG_PRD,
> +	OPAL_MSG_PCI_HOTPLUG,
>   	OPAL_MSG_TYPE_MAX,
>   };
>
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 6d467df..a0eb206 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -200,6 +200,9 @@ int64_t opal_flash_write(uint64_t id, uint64_t offset, uint64_t buf,
>   		uint64_t size, uint64_t token);
>   int64_t opal_flash_erase(uint64_t id, uint64_t offset, uint64_t size,
>   		uint64_t token);
> +int64_t opal_pci_get_presence_status(uint64_t id, uint8_t *status);
> +int64_t opal_pci_get_power_status(uint64_t id, uint8_t *status);
> +int64_t opal_pci_set_power_status(uint64_t id, uint8_t status);
>
>   /* Internal functions */
>   extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
> diff --git a/arch/powerpc/include/asm/pnv-pci.h b/arch/powerpc/include/asm/pnv-pci.h
> index f9b4982..50d92a4 100644
> --- a/arch/powerpc/include/asm/pnv-pci.h
> +++ b/arch/powerpc/include/asm/pnv-pci.h
> @@ -13,6 +13,11 @@
>   #include <linux/pci.h>
>   #include <misc/cxl.h>
>
> +extern int pnv_pci_get_presence_status(uint64_t id, uint8_t *status);
> +extern int pnv_pci_get_power_status(uint64_t id, uint8_t *status);
> +extern int pnv_pci_set_power_status(uint64_t id, uint8_t status);
> +extern int pnv_pci_hotplug_notifier(struct notifier_block *nb, bool enable);
> +
>   int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode);
>   int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>   			   unsigned int virq);
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index a7ade94..aa95dcb 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -295,3 +295,6 @@ OPAL_CALL(opal_i2c_request,			OPAL_I2C_REQUEST);
>   OPAL_CALL(opal_flash_read,			OPAL_FLASH_READ);
>   OPAL_CALL(opal_flash_write,			OPAL_FLASH_WRITE);
>   OPAL_CALL(opal_flash_erase,			OPAL_FLASH_ERASE);
> +OPAL_CALL(opal_pci_get_presence_status,		OPAL_PCI_GET_PRESENCE_STATUS);
> +OPAL_CALL(opal_pci_get_power_status,		OPAL_PCI_GET_POWER_STATUS);
> +OPAL_CALL(opal_pci_set_power_status,		OPAL_PCI_SET_POWER_STATUS);
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index a2da9a3..60e6d65 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -60,6 +60,51 @@ int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval)
>   	return rval ? -EIO : 0;
>   }
>
> +int pnv_pci_get_presence_status(uint64_t id, uint8_t *status)
> +{
> +	long rc;
> +
> +	if (!opal_check_token(OPAL_PCI_GET_PRESENCE_STATUS))


I have a question about the style (i.e. I do not mean the patch is wrong :) )

Everywhere else you use int64_t or s64 for the value returned by OPAL, but
not with opal_check_token(). Elsewhere you would also compare the result to
OPAL_SUCCESS rather than to plain zero. What does opal_check_token() return
on success? 1, -1, ...? Here OPAL_SUCCESS would mean an error, right?
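
To make the question concrete, here is a minimal userspace sketch of the
two conventions. It assumes that opal_check_token() returns a non-zero
"present" value when the firmware implements the token and zero otherwise;
that is an assumption, not something checked against skiboot here.
mock_check_token(), fw_last and get_presence() are made-up names for
illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define OPAL_SUCCESS			0
#define OPAL_TOKEN_ABSENT		0
#define OPAL_TOKEN_PRESENT		1
#define OPAL_PCI_GET_PRESENCE_STATUS	116

/*
 * Hypothetical stand-in for opal_check_token(): the mock firmware
 * implements every token up to and including fw_last.
 */
static int64_t mock_check_token(uint64_t token, uint64_t fw_last)
{
	return token <= fw_last ? OPAL_TOKEN_PRESENT : OPAL_TOKEN_ABSENT;
}

/*
 * Same shape as the check in the patch: the result is treated as a
 * boolean, not as an OPAL status code.
 */
static int get_presence(uint64_t fw_last)
{
	if (!mock_check_token(OPAL_PCI_GET_PRESENCE_STATUS, fw_last))
		return -ENXIO;

	return 0;
}
```

With mock firmware that only knows tokens up to 112, get_presence(112)
returns -ENXIO; with 118 it returns 0. Under this assumption, a check
result equal to OPAL_SUCCESS would read as "token absent", which is why
comparing it to OPAL_SUCCESS would be misleading.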


> +		return -ENXIO;
> +
> +	rc = opal_pci_get_presence_status(id, status);
> +	return pnv_pci_poll(id, rc, status);
> +}
> +EXPORT_SYMBOL_GPL(pnv_pci_get_presence_status);
> +
> +int pnv_pci_get_power_status(uint64_t id, uint8_t *status)
> +{
> +	long rc;
> +
> +	if (!opal_check_token(OPAL_PCI_GET_POWER_STATUS))
> +		return -ENXIO;
> +
> +	rc = opal_pci_get_power_status(id, status);
> +	return pnv_pci_poll(id, rc, status);
> +}
> +EXPORT_SYMBOL_GPL(pnv_pci_get_power_status);
> +
> +int pnv_pci_set_power_status(uint64_t id, uint8_t status)
> +{
> +	long rc;
> +
> +	if (!opal_check_token(OPAL_PCI_SET_POWER_STATUS))
> +		return -ENXIO;
> +
> +	rc = opal_pci_set_power_status(id, status);
> +	return pnv_pci_poll(id, rc, NULL);
> +}
> +EXPORT_SYMBOL_GPL(pnv_pci_set_power_status);
> +
> +int pnv_pci_hotplug_notifier(struct notifier_block *nb, bool enable)
> +{
> +	if (enable)
> +		return opal_message_notifier_register(OPAL_MSG_PCI_HOTPLUG, nb);
> +
> +	return opal_message_notifier_unregister(OPAL_MSG_PCI_HOTPLUG, nb);
> +}
> +EXPORT_SYMBOL_GPL(pnv_pci_hotplug_notifier);
> +
>   #ifdef CONFIG_PCI_MSI
>   static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
>   {
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 15/21] powerpc/pci: Delay creating pci_dn
  2015-05-01  6:03   ` Gavin Shan
@ 2015-05-09 14:55     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 14:55 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:03 PM, Gavin Shan wrote:
> The pci_dn instances are allocated from memblock or bootmem when
> creating PCI controllers (hoses) in setup_arch(). PCI hotplug,
> which will be supported by subsequent patches, will release PCI
> device nodes and their corresponding pci_dn on unplug events.
> The pci_dn memory chunks allocated from memblock or bootmem
> are hard to reuse after being released.
>
> The patch delays creating pci_dn so that they can be allocated from
> slab. In turn, their memory chunks can be reused after being
> released without problems. The creation of eeh_dev instances, which
> depends on pci_dn, is delayed a bit as well.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/ppc-pci.h     |  1 -
>   arch/powerpc/kernel/eeh_dev.c          |  2 +-
>   arch/powerpc/kernel/pci_dn.c           | 40 +++++++++++++++++++---------------
>   arch/powerpc/platforms/maple/pci.c     | 35 +++++++++++++++++------------
>   arch/powerpc/platforms/pasemi/pci.c    |  3 ---
>   arch/powerpc/platforms/powermac/pci.c  | 39 ++++++++++++++++++++-------------
>   arch/powerpc/platforms/powernv/pci.c   |  3 ---
>   arch/powerpc/platforms/pseries/setup.c |  1 -
>   8 files changed, 68 insertions(+), 56 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
> index 4122a86..7388316 100644
> --- a/arch/powerpc/include/asm/ppc-pci.h
> +++ b/arch/powerpc/include/asm/ppc-pci.h
> @@ -40,7 +40,6 @@ void *traverse_pci_dn(struct pci_dn *root,
>   		      void *(*fn)(struct pci_dn *, void *),
>   		      void *data);
>
> -extern void pci_devs_phb_init(void);
>   extern void pci_devs_phb_init_dynamic(struct pci_controller *phb);
>
>   /* From rtas_pci.h */
> diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
> index aabba94..f33ce5b 100644
> --- a/arch/powerpc/kernel/eeh_dev.c
> +++ b/arch/powerpc/kernel/eeh_dev.c
> @@ -110,4 +110,4 @@ static int __init eeh_dev_phb_init(void)
>   	return 0;
>   }
>
> -core_initcall(eeh_dev_phb_init);
> +core_initcall_sync(eeh_dev_phb_init);
> diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
> index b3b4df9..d3833af 100644
> --- a/arch/powerpc/kernel/pci_dn.c
> +++ b/arch/powerpc/kernel/pci_dn.c
> @@ -277,7 +277,7 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
>   	struct device_node *parent;
>   	struct pci_dn *pdn;
>
> -	pdn = zalloc_maybe_bootmem(sizeof(*pdn), GFP_KERNEL);
> +	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
>   	if (pdn == NULL)
>   		return NULL;
>   	dn->data = pdn;
> @@ -442,33 +442,37 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
>   	traverse_pci_devices(dn, update_dn_pci_info, phb);
>   }
>
> -/**
> +static void pci_dev_pdn_setup(struct pci_dev *pdev)
> +{
> +	struct pci_dn *pdn;
> +
> +	if (pdev->dev.archdata.pci_data)
> +		return;
> +
> +	/* Setup the fast path */
> +	pdn = pci_get_pdn(pdev);
> +	pdev->dev.archdata.pci_data = pdn;
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);


How does moving the chunk above help to "Delay creating pci_dn"?


> +
> +/*
>    * pci_devs_phb_init - Initialize phbs and pci devs under them.
> - *
> - * This routine walks over all phb's (pci-host bridges) on the
> - * system, and sets up assorted pci-related structures
> + *
> + * This routine walks over all phb's (pci-host bridges) on
> + * the system, and sets up assorted pci-related structures
>    * (including pci info in the device node structs) for each
>    * pci device found underneath.  This routine runs once,
>    * early in the boot sequence.
>    */
> -void __init pci_devs_phb_init(void)
> +static int __init pci_devs_phb_init(void)
>   {
>   	struct pci_controller *phb, *tmp;
>
>   	/* This must be done first so the device nodes have valid pci info! */
>   	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
>   		pci_devs_phb_init_dynamic(phb);
> -}
> -
> -static void pci_dev_pdn_setup(struct pci_dev *pdev)
> -{
> -	struct pci_dn *pdn;
>
> -	if (pdev->dev.archdata.pci_data)
> -		return;
> -
> -	/* Setup the fast path */
> -	pdn = pci_get_pdn(pdev);
> -	pdev->dev.archdata.pci_data = pdn;
> +	return 0;
>   }
> -DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
> +
> +core_initcall(pci_devs_phb_init);
> diff --git a/arch/powerpc/platforms/maple/pci.c b/arch/powerpc/platforms/maple/pci.c
> index a923230..04a69a8 100644
> --- a/arch/powerpc/platforms/maple/pci.c
> +++ b/arch/powerpc/platforms/maple/pci.c
> @@ -568,6 +568,26 @@ void maple_pci_irq_fixup(struct pci_dev *dev)
>   	DBG(" <- maple_pci_irq_fixup\n");
>   }
>
> +static int maple_pci_root_bridge_prepare(struct pci_host_bridge *bridge)
> +{
> +	struct pci_controller *hose = pci_bus_to_host(bridge->bus);
> +	struct device_node *np, *child;
> +
> +	if (hose != u3_agp)
> +		return 0;
> +
> +	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
> +	 * assume there is no P2P bridge on the AGP bus, which should be a
> +	 * safe assumptions hopefully.
> +	 */
> +	np = hose->dn;
> +	PCI_DN(np)->busno = 0xf0;
> +	for_each_child_of_node(np, child)
> +		PCI_DN(child)->busno = 0xf0;
> +
> +	return 0;
> +}
> +
>   void __init maple_pci_init(void)
>   {
>   	struct device_node *np, *root;
> @@ -605,20 +625,7 @@ void __init maple_pci_init(void)
>   	if (ht && maple_add_bridge(ht) != 0)
>   		of_node_put(ht);
>
> -	/* Setup the linkage between OF nodes and PHBs */
> -	pci_devs_phb_init();
> -
> -	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
> -	 * assume there is no P2P bridge on the AGP bus, which should be a
> -	 * safe assumptions hopefully.
> -	 */
> -	if (u3_agp) {
> -		struct device_node *np = u3_agp->dn;
> -		PCI_DN(np)->busno = 0xf0;
> -		for (np = np->child; np; np = np->sibling)
> -			PCI_DN(np)->busno = 0xf0;
> -	}
> -
> +	ppc_md.pcibios_root_bridge_prepare = maple_pci_root_bridge_prepare;
>   	/* Tell pci.c to not change any resource allocations.  */
>   	pci_add_flags(PCI_PROBE_ONLY);
>   }
> diff --git a/arch/powerpc/platforms/pasemi/pci.c b/arch/powerpc/platforms/pasemi/pci.c
> index f3a68a0..10c4e8f 100644
> --- a/arch/powerpc/platforms/pasemi/pci.c
> +++ b/arch/powerpc/platforms/pasemi/pci.c
> @@ -229,9 +229,6 @@ void __init pas_pci_init(void)
>   			of_node_get(np);
>
>   	of_node_put(root);
> -
> -	/* Setup the linkage between OF nodes and PHBs */
> -	pci_devs_phb_init();
>   }
>
>   void __iomem *pasemi_pci_getcfgaddr(struct pci_dev *dev, int offset)
> diff --git a/arch/powerpc/platforms/powermac/pci.c b/arch/powerpc/platforms/powermac/pci.c
> index 59ab16f..368716f 100644
> --- a/arch/powerpc/platforms/powermac/pci.c
> +++ b/arch/powerpc/platforms/powermac/pci.c
> @@ -878,6 +878,29 @@ void pmac_pci_irq_fixup(struct pci_dev *dev)
>   #endif /* CONFIG_PPC32 */
>   }
>
> +#ifdef CONFIG_PPC64
> +static int pmac_pci_root_bridge_prepare(struct pci_host_bridge *bridge)
> +{
> +	struct pci_controller *hose = pci_bus_to_host(bridge->bus);
> +	struct device_node *np, *child;
> +
> +	if (hose != u3_agp)
> +		return 0;
> +
> +	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
> +	 * assume there is no P2P bridge on the AGP bus, which should be a
> +	 * safe assumptions for now. We should do something better in the
> +	 * future though
> +	 */
> +	np = hose->dn;
> +	PCI_DN(np)->busno = 0xf0;
> +	for_each_child_of_node(np, child)
> +		PCI_DN(child)->busno = 0xf0;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_PPC64 */
> +
>   void __init pmac_pci_init(void)
>   {
>   	struct device_node *np, *root;
> @@ -914,22 +937,8 @@ void __init pmac_pci_init(void)
>   	if (ht && pmac_add_bridge(ht) != 0)
>   		of_node_put(ht);
>
> -	/* Setup the linkage between OF nodes and PHBs */
> -	pci_devs_phb_init();
> -
> -	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
> -	 * assume there is no P2P bridge on the AGP bus, which should be a
> -	 * safe assumptions for now. We should do something better in the
> -	 * future though
> -	 */
> -	if (u3_agp) {
> -		struct device_node *np = u3_agp->dn;
> -		PCI_DN(np)->busno = 0xf0;
> -		for (np = np->child; np; np = np->sibling)
> -			PCI_DN(np)->busno = 0xf0;
> -	}
>   	/* pmac_check_ht_link(); */
> -
> +	ppc_md.pcibios_root_bridge_prepare = pmac_pci_root_bridge_prepare;
>   #else /* CONFIG_PPC64 */
>   	init_p2pbridge();
>   	init_second_ohare();
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 60e6d65..21a4eb3 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -819,9 +819,6 @@ void __init pnv_pci_init(void)
>   	for_each_compatible_node(np, NULL, "ibm,ioda2-phb")
>   		pnv_pci_init_ioda2_phb(np);
>
> -	/* Setup the linkage between OF nodes and PHBs */
> -	pci_devs_phb_init();
> -
>   	/* Configure IOMMU DMA hooks */
>   	ppc_md.tce_build = pnv_tce_build_vm;
>   	ppc_md.tce_free = pnv_tce_free_vm;
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index df6a704..5f80758 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -482,7 +482,6 @@ static void __init find_and_init_phbs(void)
>   	}
>
>   	of_node_put(root);
> -	pci_devs_phb_init();
>
>   	/*
>   	 * PCI_PROBE_ONLY and PCI_REASSIGN_ALL_BUS can be set via properties
>
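
For reference, the ordering that the eeh_dev.c and pci_dn.c hunks rely on
can be sketched as below: core_initcall() entries (level 1) run before
core_initcall_sync() entries (level 1s), so pci_devs_phb_init() runs
before eeh_dev_phb_init() regardless of link order. This is a simplified
userspace model; struct initcall, enum level and run_initcalls() are
hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of two adjacent initcall levels. */
enum level { LEVEL_CORE, LEVEL_CORE_SYNC, LEVEL_MAX };

struct initcall {
	enum level level;
	const char *name;
};

/*
 * Record the names in the order they would run: all entries of a
 * lower level first, entries within one level in table (link) order.
 * Returns the number of entries recorded in order[].
 */
static size_t run_initcalls(const struct initcall *tbl, size_t n,
			    const char **order)
{
	size_t done = 0;
	size_t i;
	int lvl;

	for (lvl = LEVEL_CORE; lvl < LEVEL_MAX; lvl++)
		for (i = 0; i < n; i++)
			if ((int)tbl[i].level == lvl)
				order[done++] = tbl[i].name;

	return done;
}
```

Even if eeh_dev_phb_init (LEVEL_CORE_SYNC) is listed before
pci_devs_phb_init (LEVEL_CORE) in the table, the latter comes out first,
which matches the "eeh_dev depends on pci_dn" requirement in the commit
message.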


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 16/21] powerpc/pci: Create eeh_dev while creating pci_dn
  2015-05-01  6:03   ` Gavin Shan
@ 2015-05-09 15:08     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 15:08 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:03 PM, Gavin Shan wrote:
> The eeh_dev is always created based on pci_dn, but currently from a
> separate initcall registered with core_initcall_sync(). The patch creates
> eeh_dev when pci_dn is created, so that they have the same life cycle.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/eeh.h         |  6 ++++--
>   arch/powerpc/kernel/eeh_dev.c          | 18 ++++--------------
>   arch/powerpc/kernel/pci_dn.c           | 12 ++++++++++++
>   arch/powerpc/platforms/pseries/setup.c |  6 +-----
>   4 files changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 2793d24..4ed88f6 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -269,7 +269,8 @@ void eeh_pe_restore_bars(struct eeh_pe *pe);
>   const char *eeh_pe_loc_get(struct eeh_pe *pe);
>   struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe);
>
> -void *eeh_dev_init(struct pci_dn *pdn, void *data);
> +struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
> +			     struct pci_controller *phb);


Everywhere else (?) you name these pci_controller pointer variables "hose" 
but not in this patch.


>   void eeh_dev_phb_init_dynamic(struct pci_controller *phb);
>   int eeh_init(void);
>   int __init eeh_ops_register(struct eeh_ops *ops);
> @@ -322,7 +323,8 @@ static inline int eeh_init(void)
>   	return 0;
>   }
>
> -static inline void *eeh_dev_init(struct pci_dn *pdn, void *data)
> +static inline struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
> +					   struct pci_controller *phb)
>   {
>   	return NULL;
>   }
> diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
> index f33ce5b..7486932 100644
> --- a/arch/powerpc/kernel/eeh_dev.c
> +++ b/arch/powerpc/kernel/eeh_dev.c
> @@ -44,14 +44,14 @@
>   /**
>    * eeh_dev_init - Create EEH device according to OF node
>    * @pdn: PCI device node
> - * @data: PHB
> + * @phb: PCI controller
>    *
>    * It will create EEH device according to the given OF node. The function
>    * might be called by PCI emunation, DR, PHB hotplug.
>    */
> -void *eeh_dev_init(struct pci_dn *pdn, void *data)
> +struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
> +			     struct pci_controller *phb)
>   {
> -	struct pci_controller *phb = data;
>   	struct eeh_dev *edev;
>
>   	/* Allocate EEH device */
> @@ -68,7 +68,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
>   	edev->phb = phb;
>   	INIT_LIST_HEAD(&edev->list);
>
> -	return NULL;
> +	return edev;
>   }
>
>   /**
> @@ -80,16 +80,8 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
>    */
>   void eeh_dev_phb_init_dynamic(struct pci_controller *phb)
>   {
> -	struct pci_dn *root = phb->pci_data;
> -
>   	/* EEH PE for PHB */
>   	eeh_phb_pe_create(phb);
> -
> -	/* EEH device for PHB */
> -	eeh_dev_init(root, phb);
> -
> -	/* EEH devices for children OF nodes */
> -	traverse_pci_dn(root, eeh_dev_init, phb);
>   }
>
>   /**
> @@ -105,8 +97,6 @@ static int __init eeh_dev_phb_init(void)
>   	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
>   		eeh_dev_phb_init_dynamic(phb);
>
> -	pr_info("EEH: devices created\n");
> -
>   	return 0;
>   }
>
> diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
> index d3833af..abc81fa 100644
> --- a/arch/powerpc/kernel/pci_dn.c
> +++ b/arch/powerpc/kernel/pci_dn.c
> @@ -276,6 +276,9 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
>   	const __be32 *regs;
>   	struct device_node *parent;
>   	struct pci_dn *pdn;
> +#ifdef CONFIG_EEH
> +	struct eeh_dev *edev;
> +#endif
>
>   	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
>   	if (pdn == NULL)
> @@ -306,6 +309,15 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
>   	/* Extended config space */
>   	pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
>
> +	/* Initialize EEH device */
> +#ifdef CONFIG_EEH

You do not need this #ifdef - you have a stub for eeh_dev_init() in 
arch/powerpc/include/asm/eeh.h


> +	edev = eeh_dev_init(pdn, phb);
> +	if (!edev) {


s/!edev/!eeh_dev_init(pdn, phb)/ and get rid of the @edev local variable
altogether - you do not use it anyway?
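
To illustrate the two comments above, here is a minimal user-space sketch,
not kernel code. The struct layouts, pci_dn_create() and the HAVE_EEH macro
(standing in for CONFIG_EEH) are hypothetical. One caveat it tries to show:
since the stub returns NULL, a caller without #ifdef must not treat NULL as
a failure when the feature is compiled out, so the check is guarded by a
compile-time constant (in the kernel, IS_ENABLED(CONFIG_EEH) could play
that role).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct eeh_dev { int unused; };
struct pci_dn { struct eeh_dev *edev; };

/* Stand-in for CONFIG_EEH: build with -DHAVE_EEH=1 for the "real" variant. */
#ifndef HAVE_EEH
#define HAVE_EEH 0
#endif

#if HAVE_EEH
static struct eeh_dev *eeh_dev_init(struct pci_dn *pdn)
{
	struct eeh_dev *edev = calloc(1, sizeof(*edev));

	if (edev)
		pdn->edev = edev;
	return edev;
}
#else
/* Stub for the disabled case: there is nothing to create. */
static inline struct eeh_dev *eeh_dev_init(struct pci_dn *pdn)
{
	(void)pdn;
	return NULL;
}
#endif

/*
 * Caller with neither an #ifdef nor a local variable. The HAVE_EEH test
 * is a compile-time constant, so the whole branch is dead code when the
 * feature is off and the stub's NULL is never mistaken for an error.
 */
static struct pci_dn *pci_dn_create(void)
{
	struct pci_dn *pdn = calloc(1, sizeof(*pdn));

	if (!pdn)
		return NULL;

	if (HAVE_EEH && !eeh_dev_init(pdn)) {
		free(pdn);
		return NULL;
	}

	return pdn;
}
```

With HAVE_EEH left at 0, pci_dn_create() still succeeds and simply leaves
the edev pointer NULL.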


> +		kfree(pdn);
> +		return NULL;
> +	}
> +#endif
> +
>   	/* Attach to parent node */
>   	INIT_LIST_HEAD(&pdn->child_list);
>   	INIT_LIST_HEAD(&pdn->list);
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 5f80758..92974aa 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -261,12 +261,8 @@ static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long act
>   	switch (action) {
>   	case OF_RECONFIG_ATTACH_NODE:
>   		pci = np->parent->data;
> -		if (pci) {
> +		if (pci)
>   			update_dn_pci_info(np, pci->phb);
> -
> -			/* Create EEH device for the OF node */
> -			eeh_dev_init(PCI_DN(np), pci->phb);
> -		}
>   		break;
>   	default:
>   		err = NOTIFY_DONE;
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 21/21] pci/hotplug: PowerPC PowerNV PCI hotplug driver
  2015-05-01  6:03   ` Gavin Shan
@ 2015-05-09 15:54     ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 15:54 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: linux-pci, benh, bhelgaas

On 05/01/2015 04:03 PM, Gavin Shan wrote:
> The patch intends to add standalone driver to support PCI hotplug
> for PowerPC PowerNV platform, which runs on top of skiboot firmware.
> The firmware identified hotpluggable slots and marked their device
> tree node with proper "ibm,slot-pluggable" and "ibm,reset-by-firmware".
> The driver simply scans device-tree to create/register PCI hotplug slot
> accordingly.
>
> If the skiboot firmware doesn't support slot status retrieval, the PCI
> slot device node shouldn't have property "ibm,reset-by-firmware". In
> that case, none of valid PCI slots will be detected from device tree.
> The skiboot firmware doesn't export the capability to access attention
> LEDs yet and it's something for TBD.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   drivers/pci/hotplug/Kconfig            |  12 +
>   drivers/pci/hotplug/Makefile           |   4 +
>   drivers/pci/hotplug/powernv_php.c      | 146 ++++++++
>   drivers/pci/hotplug/powernv_php.h      |  78 ++++
>   drivers/pci/hotplug/powernv_php_slot.c | 643 +++++++++++++++++++++++++++++++++
>   5 files changed, 883 insertions(+)
>   create mode 100644 drivers/pci/hotplug/powernv_php.c
>   create mode 100644 drivers/pci/hotplug/powernv_php.h
>   create mode 100644 drivers/pci/hotplug/powernv_php_slot.c
>
> diff --git a/drivers/pci/hotplug/Kconfig b/drivers/pci/hotplug/Kconfig
> index df8caec..ef55dae 100644
> --- a/drivers/pci/hotplug/Kconfig
> +++ b/drivers/pci/hotplug/Kconfig
> @@ -113,6 +113,18 @@ config HOTPLUG_PCI_SHPC
>
>   	  When in doubt, say N.
>
> +config HOTPLUG_PCI_POWERNV
> +	tristate "PowerPC PowerNV PCI Hotplug driver"
> +	depends on PPC_POWERNV && EEH
> +	help
> +	  Say Y here if you run PowerPC PowerNV platform that supports
> +          PCI Hotplug
> +
> +	  To compile this driver as a module, choose M here: the
> +	  module will be called powernv-php.
> +
> +	  When in doubt, say N.
> +
>   config HOTPLUG_PCI_RPA
>   	tristate "RPA PCI Hotplug driver"
>   	depends on PPC_PSERIES && EEH
> diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
> index 4a9aa08..a69665e 100644
> --- a/drivers/pci/hotplug/Makefile
> +++ b/drivers/pci/hotplug/Makefile
> @@ -14,6 +14,7 @@ obj-$(CONFIG_HOTPLUG_PCI_PCIE)		+= pciehp.o
>   obj-$(CONFIG_HOTPLUG_PCI_CPCI_ZT5550)	+= cpcihp_zt5550.o
>   obj-$(CONFIG_HOTPLUG_PCI_CPCI_GENERIC)	+= cpcihp_generic.o
>   obj-$(CONFIG_HOTPLUG_PCI_SHPC)		+= shpchp.o
> +obj-$(CONFIG_HOTPLUG_PCI_POWERNV)	+= powernv-php.o
>   obj-$(CONFIG_HOTPLUG_PCI_RPA)		+= rpaphp.o
>   obj-$(CONFIG_HOTPLUG_PCI_RPA_DLPAR)	+= rpadlpar_io.o
>   obj-$(CONFIG_HOTPLUG_PCI_SGI)		+= sgi_hotplug.o
> @@ -50,6 +51,9 @@ ibmphp-objs		:=	ibmphp_core.o	\
>   acpiphp-objs		:=	acpiphp_core.o	\
>   				acpiphp_glue.o
>
> +powernv-php-objs	:=	powernv_php.o	\
> +				powernv_php_slot.o
> +
>   rpaphp-objs		:=	rpaphp_core.o	\
>   				rpaphp_pci.o	\
>   				rpaphp_slot.o
> diff --git a/drivers/pci/hotplug/powernv_php.c b/drivers/pci/hotplug/powernv_php.c
> new file mode 100644
> index 0000000..5cf9e717
> --- /dev/null
> +++ b/drivers/pci/hotplug/powernv_php.c
> @@ -0,0 +1,146 @@
> +/*
> + * PCI Hotplug Driver for PowerPC PowerNV platform.
> + *
> + * Copyright Gavin Shan, IBM Corporation 2015.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/sysfs.h>
> +#include <linux/pci.h>
> +#include <linux/pci_hotplug.h>
> +#include <linux/string.h>
> +#include <linux/slab.h>
> +#include <asm/opal.h>
> +#include <asm/pnv-pci.h>
> +
> +#include "powernv_php.h"

This compiles without linux/kernel.h, linux/sysfs.h, linux/string.h and
linux/slab.h. Are you sure you need all of these?


> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"Gavin Shan, IBM Corporation"
> +#define DRIVER_DESC	"PowerPC PowerNV PCI Hotplug Driver"
> +
> +static struct notifier_block php_msg_nb = {
> +	.notifier_call	= powernv_php_msg_handler,
> +	.next		= NULL,
> +	.priority	= 0,
> +};
> +
> +static int powernv_php_register_one(struct device_node *dn)
> +{
> +	struct powernv_php_slot *slot;
> +	const __be32 *prop32;
> +	int ret;
> +
> +	/* Check if it's hotpluggable slot */
> +	prop32 = of_get_property(dn, "ibm,slot-pluggable", NULL);
> +	if (!prop32 || !of_read_number(prop32, 1))
> +		return 0;

Although the return code is hardly used (powernv_php_register() only breaks
out of its loop on it), this should be -ENXIO or something, not zero. The
same goes for the check below.
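
A rough user-space sketch of what I mean, with hypothetical names (struct
node, register_one(), register_all()) standing in for the device node and
the two registration functions. Since the loop in powernv_php_register()
breaks on any non-zero return, a "not a slot" code such as -ENXIO would
have to be skipped explicitly so that scanning continues with the siblings:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device-tree node. */
struct node {
	bool pluggable;          /* models "ibm,slot-pluggable" */
	bool reset_by_firmware;  /* models "ibm,reset-by-firmware" */
	bool registered;
};

/* Returns 0 on success, -ENXIO if the node is not a hotpluggable slot. */
static int register_one(struct node *n)
{
	if (!n->pluggable || !n->reset_by_firmware)
		return -ENXIO;

	n->registered = true;
	return 0;
}

/* Skips nodes that are not slots, stops on any other error. */
static int register_all(struct node *nodes, size_t count)
{
	size_t i;
	int ret;

	for (i = 0; i < count; i++) {
		ret = register_one(&nodes[i]);
		if (ret == -ENXIO)
			continue;
		if (ret)
			return ret;
	}

	return 0;
}
```

This way the caller can tell "nothing to do here" apart from a real
failure instead of both looking like success.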


> +
> +	prop32 = of_get_property(dn, "ibm,reset-by-firmware", NULL);
> +	if (!prop32 || !of_read_number(prop32, 1))
> +		return 0;
> +
> +	/* Allocate slot */
> +	slot = powernv_php_slot_alloc(dn);
> +	if (!slot)
> +		return -ENODEV;
> +
> +	/* Register it */
> +	ret = powernv_php_slot_register(slot);
> +	if (ret) {
> +		powernv_php_slot_put(slot);
> +		return ret;
> +	}
> +
> +	return powernv_php_slot_enable(slot->php_slot, false, false);
> +}
> +
> +int powernv_php_register(struct device_node *dn)
> +{
> +	struct device_node *child;
> +	int ret = 0;
> +
> +	/*
> +	 * The parent slots should be registered before their
> +	 * child slots.
> +	 */
> +	for_each_child_of_node(dn, child) {
> +		ret = powernv_php_register_one(child);
> +		if (ret)
> +			break;
> +
> +		powernv_php_register(child);
> +	}
> +
> +	return ret;
> +}
> +
> +static void powernv_php_unregister_one(struct device_node *dn)
> +{
> +	struct powernv_php_slot *slot;
> +
> +	slot = powernv_php_slot_find(dn);
> +	if (!slot)
> +		return;
> +
> +	pci_hp_deregister(slot->php_slot);
> +}
> +
> +void powernv_php_unregister(struct device_node *dn)
> +{
> +	struct device_node *child;
> +
> +	/* The child slots should go before their parent slots */
> +	for_each_child_of_node(dn, child) {
> +		powernv_php_unregister(child);
> +		powernv_php_unregister_one(child);
> +	}
> +}
> +
> +static int __init powernv_php_init(void)
> +{
> +	struct device_node *dn;
> +
> +	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> +
> +	/* Register hotplug message handler */
> +	if (pnv_pci_hotplug_notifier(&php_msg_nb, true)) {

If you called the function "pnv_pci_hotplug_notifier_register", you would 
not need the comment above.
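
Something along these lines, as a user-space sketch only: struct notifier,
the chain and all three function names are made up for illustration, and
the real notifier code is not modelled. The point is just that two thin,
self-describing wrappers can sit on top of the bool-flag entry point:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified notifier: only the list linkage is modelled. */
struct notifier {
	struct notifier *next;
};

static struct notifier *hotplug_chain;

/* Single entry point with a bool flag, similar in shape to the patch. */
static int hotplug_notifier(struct notifier *nb, bool enable)
{
	struct notifier **pp;

	for (pp = &hotplug_chain; *pp; pp = &(*pp)->next) {
		if (*pp != nb)
			continue;
		if (enable)
			return -EEXIST;	/* already on the chain */
		*pp = nb->next;		/* unlink */
		nb->next = NULL;
		return 0;
	}

	if (!enable)
		return -ENOENT;		/* was never registered */

	nb->next = hotplug_chain;	/* push to the head */
	hotplug_chain = nb;
	return 0;
}

/* Wrappers whose names say what the call site does. */
static inline int hotplug_notifier_register(struct notifier *nb)
{
	return hotplug_notifier(nb, true);
}

static inline int hotplug_notifier_unregister(struct notifier *nb)
{
	return hotplug_notifier(nb, false);
}
```

A call site then reads hotplug_notifier_register(&nb) and needs no comment
explaining what "true" means.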


> +		pr_warn("%s: Cannot register hotplug message notifier\n",
> +			__func__);
> +		return -EIO;
> +	}
> +
> +	/* Scan PHB nodes and their children */
> +	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
> +		powernv_php_register(dn);
> +	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
> +		powernv_php_register(dn);
> +
> +	return 0;
> +}
> +
> +static void __exit powernv_php_exit(void)
> +{
> +	struct device_node *dn;
> +
> +	pnv_pci_hotplug_notifier(&php_msg_nb, false);
> +
> +	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
> +		powernv_php_unregister(dn);
> +	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
> +		powernv_php_unregister(dn);
> +}
> +
> +module_init(powernv_php_init);
> +module_exit(powernv_php_exit);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/pci/hotplug/powernv_php.h b/drivers/pci/hotplug/powernv_php.h
> new file mode 100644
> index 0000000..87ba0d0
> --- /dev/null
> +++ b/drivers/pci/hotplug/powernv_php.h
> @@ -0,0 +1,78 @@
> +/*
> + * PCI Hotplug Driver for PowerPC PowerNV platform.
> + *
> + * Copyright Gavin Shan, IBM Corporation 2015.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#ifndef _POWERNV_PHP_H
> +#define _POWERNV_PHP_H

I would put these (and dependencies if any) here:

#include <linux/kref.h>
#include <linux/pci.h>
#include <linux/pci_hotplug.h>

and remove them from .c files.


> +
> +/* Slot power status */
> +#define POWERNV_PHP_SLOT_POWER_OFF	0
> +#define POWERNV_PHP_SLOT_POWER_ON	1
> +
> +/* Slot presence status */
> +#define POWERNV_PHP_SLOT_EMPTY		0
> +#define POWERNV_PHP_SLOT_PRESENT	1
> +
> +/* Slot attention status */
> +#define POWERNV_PHP_SLOT_ATTEN_OFF	0
> +#define POWERNV_PHP_SLOT_ATTEN_ON	1
> +#define POWERNV_PHP_SLOT_ATTEN_IND	2
> +#define POWERNV_PHP_SLOT_ATTEN_ACT	3
> +
> +struct powernv_php_slot {
> +	struct kref		kref;
> +	int			state;
> +#define POWERNV_PHP_SLOT_STATE_INIT		0x0
> +#define POWERNV_PHP_SLOT_STATE_REGISTER		0x1
> +#define POWERNV_PHP_SLOT_STATE_POPULATED	0x2

I believe these are not bitmasks but bit numbers, right? Decimal values are 
normally used for them.
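
For example, since slot->state is only ever compared for equality in the
quoted code, an enum with decimal values would make that intent obvious.
This is only a sketch: the shortened names and the slot_state_name() helper
are hypothetical, the three values are the ones from the patch.

```c
/* Hypothetical names; the values match the three states in the patch. */
enum slot_state {
	SLOT_STATE_INIT      = 0,
	SLOT_STATE_REGISTER  = 1,
	SLOT_STATE_POPULATED = 2,
};

/* Illustrative helper, e.g. for debug messages. */
static const char *slot_state_name(enum slot_state state)
{
	switch (state) {
	case SLOT_STATE_INIT:
		return "init";
	case SLOT_STATE_REGISTER:
		return "register";
	case SLOT_STATE_POPULATED:
		return "populated";
	}

	return "unknown";
}
```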


> +	char			*name;
> +	struct device_node	*dn;
> +	struct pci_bus		*bus;
> +	uint64_t		id;
> +	int			slot_no;
> +	int			check_power_status;
> +	int			status_confirmed;
> +	struct opal_msg		*msg;
> +	struct work_struct	work;
> +	wait_queue_head_t	queue;
> +	struct hotplug_slot	*php_slot;
> +	struct powernv_php_slot	*parent;
> +	void (*release)(struct kref *kref);

What is the point of this callback? Just use php_slot_free() directly in
powernv_php_slot_put(), no?
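
Roughly like the sketch below, which uses a plain integer count instead of
a kref and hypothetical names throughout (struct slot, slot_alloc(),
slot_free(), slots_freed). In the driver this would presumably mean passing
php_slot_free to kref_put() and making it visible to the put helper, for
example by moving the helper out of the header:

```c
#include <stdlib.h>

/* Hypothetical slot with a plain reference count instead of a kref. */
struct slot {
	int refcount;
	char *name;
};

static int slots_freed;	/* only here so the effect can be observed */

/* The one and only release function; no per-object pointer needed. */
static void slot_free(struct slot *slot)
{
	free(slot->name);
	free(slot);
	slots_freed++;
}

static struct slot *slot_alloc(void)
{
	struct slot *slot = calloc(1, sizeof(*slot));

	if (slot)
		slot->refcount = 1;
	return slot;
}

static void slot_get(struct slot *slot)
{
	if (slot)
		slot->refcount++;
}

/* Returns 1 if the object was freed, 0 otherwise (like kref_put()). */
static int slot_put(struct slot *slot)
{
	if (!slot)
		return 0;

	if (--slot->refcount)
		return 0;

	slot_free(slot);
	return 1;
}
```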


> +	struct list_head	children;
> +	struct list_head	link;
> +};
> +
> +#define to_powernv_php_slot(kref) container_of(kref, struct powernv_php_slot, kref)
> +
> +static inline void powernv_php_slot_get(struct powernv_php_slot *slot)
> +{
> +	if (slot)
> +		kref_get(&slot->kref);
> +}
> +
> +static inline int powernv_php_slot_put(struct powernv_php_slot *slot)
> +{
> +	if (slot)
> +		return kref_put(&slot->kref, slot->release);
> +
> +	return 0;
> +}
> +
> +int powernv_php_msg_handler(struct notifier_block *nb,
> +			    unsigned long type, void *message);
> +struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn);
> +struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn);
> +int powernv_php_slot_register(struct powernv_php_slot *slot);
> +int powernv_php_slot_enable(struct hotplug_slot *php_slot,
> +			    bool rescan_bus, bool rescan_slot);

Just an observation - rescan_bus and rescan_slot are always both true or
both false, never different. The only caller requesting a rescan is in the
same file as powernv_php_slot_enable(), and it could do the rescan itself if
powernv_php_slot_enable() signalled that a rescan is needed (return 1?).

Then no "goto" would be needed in powernv_php_slot_enable(). I do not
insist, though.
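
A sketch of the shape I have in mind, in plain user-space C with made-up
names (struct slot, slot_enable(), the rescans counter); power handling and
the real rescan are left out. The helper returns 1 when a rescan is needed
and the hotplug callback acts on it:

```c
#include <errno.h>

enum { SLOT_EMPTY = 0, SLOT_PRESENT = 1 };

/* Hypothetical, heavily reduced slot. */
struct slot {
	int presence;	/* what the firmware reports */
	int populated;	/* models the POPULATED state */
	int rescans;	/* counts rescans, for illustration only */
};

/*
 * Returns a negative errno on failure, 0 if nothing else is to be done,
 * and 1 if the caller should rescan the bus and the child slots.
 */
static int slot_enable(struct slot *slot)
{
	if (slot->populated)
		return 0;

	if (slot->presence != SLOT_EMPTY && slot->presence != SLOT_PRESENT)
		return -EINVAL;

	slot->populated = 1;
	return slot->presence == SLOT_PRESENT;
}

/* Hotplug callback: the only place that actually rescans. */
static int enable_slot(struct slot *slot)
{
	int ret = slot_enable(slot);

	if (ret <= 0)
		return ret;

	slot->rescans++;	/* stands in for the bus and slot rescan */
	return 0;
}
```

The boot-time caller would simply ignore a return value of 1.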



> +int powernv_php_register(struct device_node *dn);
> +void powernv_php_unregister(struct device_node *dn);
> +
> +#endif /* !_POWERNV_PHP_H */
> diff --git a/drivers/pci/hotplug/powernv_php_slot.c b/drivers/pci/hotplug/powernv_php_slot.c
> new file mode 100644
> index 0000000..fc82355
> --- /dev/null
> +++ b/drivers/pci/hotplug/powernv_php_slot.c
> @@ -0,0 +1,643 @@
> +/*
> + * PCI Hotplug Driver for PowerPC PowerNV platform.
> + *
> + * Copyright Gavin Shan, IBM Corporation 2015.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/sysfs.h>
> +#include <linux/pci.h>
> +#include <linux/pci_hotplug.h>
> +#include <linux/string.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/wait.h>
> +#include <linux/workqueue.h>
> +
> +#include <asm/opal.h>
> +#include <asm/pnv-pci.h>
> +#include <asm/ppc-pci.h>
> +
> +#include "powernv_php.h"

I have a suspicion you do not need all of these headers here either ;)


> +
> +static LIST_HEAD(php_slot_list);
> +static DEFINE_SPINLOCK(php_slot_lock);
> +
> +/*
> + * Release firmware data for all child device nodes of the
> + * indicated one.
> + */
> +static void release_device_nodes_info(struct device_node *np)
> +{
> +	struct device_node *child;
> +
> +	for_each_child_of_node(np, child) {
> +		/* In depth first */
> +		release_device_nodes_info(child);

Why is this "release" and not "remove"? That is what it does - it calls
remove_pci_device_node_info() in a loop.

> +
> +		remove_pci_device_node_info(child);
> +	}
> +}
> +
> +/*
> + * Release all subordinate device nodes of the indicated one.
> + * Those device nodes in deepest path should be released firstly.
> + */
> +static int release_device_nodes(struct device_node *parent)
> +{
> +	struct device_node *np, *child;
> +	int ret = 0;
> +
> +	/* If the device node has children, remove them firstly */
> +	for_each_child_of_node(parent, np) {
> +		ret = release_device_nodes(np);
> +		if (ret)
> +			return ret;
> +
> +		/* The device shouldn't have alive children */
> +		child = of_get_next_child(np, NULL);
> +		if (child) {
> +			of_node_put(child);
> +			of_node_put(np);
> +			pr_err("%s: Alive children of node <%s>\n",
> +			       __func__, of_node_full_name(np));
> +			return -EBUSY;
> +		}
> +
> +		/* Detach the device node */
> +		of_detach_node(np);
> +		of_node_put(np);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * The function processes the message sent by firmware
> + * to remove all device tree nodes beneath the slot's
> + * nodes, and the associated auxillary data.
> + */
> +static void slot_power_off_handler(struct powernv_php_slot *slot)
> +{
> +	int ret;
> +
> +	/* Release the firmware data for the child device nodes */
> +	release_device_nodes_info(slot->dn);
> +
> +	/* Release the child device nodes */
> +	ret = release_device_nodes(slot->dn);
> +	if (ret)
> +		pr_warn("%s: Error %d releasing children of <%s>\n",
> +			__func__, ret, of_node_full_name(slot->dn));
> +
> +	/* Confirm status change */
> +	slot->status_confirmed = 1;
> +	wake_up_interruptible(&slot->queue);
> +}
> +
> +static void slot_power_on_handler(struct powernv_php_slot *slot)
> +{
> +	struct opal_msg *msg = slot->msg;
> +	unsigned long phys = be64_to_cpu(msg->params[2]);
> +	unsigned long len = be64_to_cpu(msg->params[3]);
> +	void *blob = (phys && len > 0) ? __va(phys) : NULL;
> +
> +	/* There might have nothing behind the slot yet */
> +	if (!blob || !len)

"!len" is redundant here - blob is already NULL when len is 0.

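To spell out why the "!len" test adds nothing, here is a small user-space
sketch. phys_to_virt_stub() is a hypothetical replacement for __va() and
assumes a non-zero physical address never maps to NULL; the other function
names are made up as well.

```c
#include <stddef.h>

static char fake_mem[16];

/* Hypothetical stand-in for __va(): never returns NULL. */
static void *phys_to_virt_stub(unsigned long phys)
{
	(void)phys;
	return fake_mem;
}

/* Same expression as in the patch: NULL unless phys and len are non-zero. */
static void *blob_from_msg(unsigned long phys, unsigned long len)
{
	return (phys && len > 0) ? phys_to_virt_stub(phys) : NULL;
}

/* The check as written in the patch. */
static int nothing_behind_slot_long(unsigned long phys, unsigned long len)
{
	void *blob = blob_from_msg(phys, len);

	return !blob || !len;
}

/* The shorter check: len == 0 already forces blob == NULL. */
static int nothing_behind_slot_short(unsigned long phys, unsigned long len)
{
	return !blob_from_msg(phys, len);
}
```

Both variants give the same answer for every combination of zero and
non-zero phys and len.
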
> +		goto out;
> +
> +	/* Copy the FDT blob and parse it */
> +	of_fdt_add_subtree(slot->dn, blob);
> +
> +	/* Add device node firmware data */
> +	traverse_pci_device_nodes(slot->dn,
> +				  add_pci_device_node_info,
> +				  pci_bus_to_host(slot->bus));
> +
> +out:
> +	/* Confirm status change */
> +	slot->status_confirmed = 1;
> +	wake_up_interruptible(&slot->queue);
> +}
> +
> +static void powernv_php_slot_work(struct work_struct *data)
> +{
> +	struct powernv_php_slot *slot = container_of(data,
> +						     struct powernv_php_slot,
> +						     work);
> +	uint64_t php_event = be64_to_cpu(slot->msg->params[0]);
> +
> +	switch (php_event) {
> +	case 0: /* Slot power off */
> +		slot_power_off_handler(slot);
> +		break;
> +	case 1: /* Slot power on */
> +		slot_power_on_handler(slot);
> +		break;
> +	default:
> +		pr_warn("%s: Unsupported hotplug event %lld\n",
> +			__func__, php_event);
> +	}
> +
> +	of_node_put(slot->dn);
> +}
> +
> +int powernv_php_msg_handler(struct notifier_block *nb,
> +			    unsigned long type, void *message)
> +{
> +	phandle h;
> +	struct device_node *np;
> +	struct powernv_php_slot *slot;
> +	struct opal_msg *msg = message;
> +
> +	/* Check the message type */
> +	if (type != OPAL_MSG_PCI_HOTPLUG) {
> +		pr_warn("%s: Wrong message type %ld received!\n",
> +			__func__, type);
> +		return 0;
> +	}
> +
> +	/* Find the device node */
> +	h = (phandle)be64_to_cpu(msg->params[1]);
> +	np = of_find_node_by_phandle(h);
> +	if (!np) {
> +		pr_warn("%s: No device node for phandle 0x%08x\n",
> +			__func__, h);
> +		return 0;
> +	}
> +
> +	/* Find the slot */
> +	slot = powernv_php_slot_find(np);
> +	if (!slot) {
> +		pr_warn("%s: No slot found for node <%s>\n",
> +			__func__, of_node_full_name(np));
> +		of_node_put(np);
> +		return 0;
> +	}
> +
> +	/* Schedule the work */
> +	slot->msg = msg;
> +	schedule_work(&slot->work);
> +	return 0;
> +}
> +
> +static int set_power_status(struct hotplug_slot *php_slot, u8 val)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	int ret;
> +
> +	/* Set power status */
> +	slot->status_confirmed = 0;
> +	ret = pnv_pci_set_power_status(slot->id, val);
> +	if (ret) {
> +		pr_warn("%s: Error %d powering %s slot %016llx\n",
> +			__func__, ret, val ? "on" : "off", slot->id);
> +		return ret;
> +	}
> +
> +	/* Waiting until the device tree is updated */
> +	ret = wait_event_timeout(slot->queue,
> +				 !slot->status_confirmed,
> +				 10 * HZ);
> +	if (ret) {
> +		pr_warn("%s: Error %d completing power-%s slot %016llx\n",
> +			__func__, ret, val ? "on" : "off", slot->id);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int get_power_status(struct hotplug_slot *php_slot, u8 *val)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	uint8_t state;
> +	int ret;
> +
> +	/*
> +	 * Retrieve power status from firmware. If we fail
> +	 * getting that, the power status fails back to
> +	 * be on.
> +	 */
> +	ret = pnv_pci_get_power_status(slot->id, &state);
> +	if (ret) {
> +		*val = POWERNV_PHP_SLOT_POWER_ON;
> +		pr_warn("%s: Error %d getting power status of slot %016llx\n",
> +			__func__, ret, slot->id);
> +	} else {
> +		*val = state ? POWERNV_PHP_SLOT_POWER_ON :
> +			       POWERNV_PHP_SLOT_POWER_OFF;
> +		php_slot->info->power_status = *val;
> +	}
> +
> +	return 0;
> +}
> +
> +static int get_adapter_status(struct hotplug_slot *php_slot, u8 *val)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	uint8_t state;
> +	int ret;
> +
> +	/*
> +	 * Retrieve presence status from firmware. If we can't
> +	 * get that, it will fail back to be empty.
> +	 */
> +	ret = pnv_pci_get_presence_status(slot->id, &state);
> +	if (ret >= 0) {
> +                *val = state ? POWERNV_PHP_SLOT_PRESENT :
> +                               POWERNV_PHP_SLOT_EMPTY;
> +                php_slot->info->adapter_status = *val;

ret = 0;


> +	} else {
> +		*val = POWERNV_PHP_SLOT_EMPTY;
> +		pr_warn("%s: Error %d getting presence of slot %016llx\n",
> +			__func__, ret, slot->id);
> +	}
> +
> +	return ret < 0 ? ret : 0;


return ret;
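
Put together, the two suggestions above would look roughly like this. It
is a user-space sketch: fw_get_presence() and the two fake_fw_* variables
are hypothetical replacements for the firmware call, and the hotplug_slot
argument is dropped.

```c
enum { SLOT_EMPTY = 0, SLOT_PRESENT = 1 };

/* Hypothetical firmware query: <0 on error, >=0 on success. */
static int fake_fw_result;
static unsigned char fake_fw_state;

static int fw_get_presence(unsigned char *state)
{
	*state = fake_fw_state;
	return fake_fw_result;
}

static int get_adapter_status(unsigned char *val)
{
	unsigned char state;
	int ret;

	ret = fw_get_presence(&state);
	if (ret >= 0) {
		*val = state ? SLOT_PRESENT : SLOT_EMPTY;
		ret = 0;	/* normalise any positive value to success */
	} else {
		*val = SLOT_EMPTY;	/* fall back to "empty" on error */
	}

	return ret;
}
```

The conditional expression in the return statement is then no longer
needed.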


> +}
> +
> +static int set_attention_status(struct hotplug_slot *php_slot, u8 val)
> +{
> +	/* The default operation would to turn on the attention */
> +	switch (val) {
> +	case POWERNV_PHP_SLOT_ATTEN_OFF:
> +	case POWERNV_PHP_SLOT_ATTEN_ON:
> +	case POWERNV_PHP_SLOT_ATTEN_IND:
> +	case POWERNV_PHP_SLOT_ATTEN_ACT:
> +		break;
> +	default:
> +		val = POWERNV_PHP_SLOT_ATTEN_ON;

Isn't @val garbage in this case?


> +	}
> +
> +	/* FIXME: Make it real once firmware supports it */
> +	php_slot->info->attention_status = val;
> +
> +	return 0;
> +}
> +
> +int powernv_php_slot_enable(struct hotplug_slot *php_slot,
> +			    bool rescan_bus, bool rescan_slot)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	uint8_t presence, power_status;
> +	int ret;
> +
> +	/* Check if the slot has been configured */
> +	if (slot->state != POWERNV_PHP_SLOT_STATE_REGISTER)
> +		return 0;
> +
> +	/* Retrieve slot presence status */
> +	ret = php_slot->ops->get_adapter_status(php_slot, &presence);
> +	if (ret) {
> +		pr_warn("%s: Error %d getting presence of slot %016llx\n",
> +			__func__, ret, slot->id);
> +		return ret;
> +	}
> +
> +	/* Proceed if there have nothing behind the slot */
> +	if (presence == POWERNV_PHP_SLOT_EMPTY)
> +		goto scan;
> +
> +	/*
> +	 * If we don't detect something behind the slot, we need
> +	 * make sure the power suply to the slot is on. Otherwise,
> +	 * the slot downstream PCIe linkturn should be down.
> +	 *
> +	 * On the first time, we don't change the power status to
> +	 * boost system boot with assumption that the firmware
> +	 * supplies consistent slot power status: empty slot always
> +	 * has its power off and non-empty slot has its power on.
> +	 */
> +	if (!slot->check_power_status) {
> +		slot->check_power_status = 1;
> +		goto scan;
> +	}
> +
> +	/* Check the power status. Scan the slot if that's already on */
> +	ret = php_slot->ops->get_power_status(php_slot, &power_status);
> +	if (ret) {
> +		pr_warn("%s: Error %d getting power status of slot %016llx\n",
> +			__func__, ret, slot->id);
> +		return ret;
> +	}
> +	if (power_status == POWERNV_PHP_SLOT_POWER_ON)
> +		goto scan;
> +
> +	/* Power is off, turn it on and then scan the slot */
> +	ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_ON);
> +	if (ret) {
> +		pr_warn("%s: Error %d powering on slot %016llx\n",
> +			__func__, ret, slot->id);
> +		return ret;
> +	}
> +
> +scan:
> +	switch (presence) {
> +	case POWERNV_PHP_SLOT_PRESENT:
> +		if (rescan_bus) {
> +			pci_lock_rescan_remove();
> +			pcibios_add_pci_devices(slot->bus);
> +			pci_unlock_rescan_remove();
> +		}
> +
> +		/* Rescan for child hotpluggable slots */
> +		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
> +		if (rescan_slot)
> +			powernv_php_register(slot->dn);
> +		break;
> +	case POWERNV_PHP_SLOT_EMPTY:
> +		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
> +		break;
> +	default:
> +		pr_warn("%s: Invalid presence status %d of slot %016llx\n",
> +			__func__, presence, slot->id);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int enable_slot(struct hotplug_slot *php_slot)
> +{
> +	return powernv_php_slot_enable(php_slot, true, true);
> +}
> +
> +static int disable_slot(struct hotplug_slot *php_slot)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	uint8_t power_status;
> +	int ret;
> +
> +	if (slot->state != POWERNV_PHP_SLOT_STATE_POPULATED)
> +		return 0;
> +
> +	/* Remove all devices behind the slot */
> +	pci_lock_rescan_remove();
> +	pcibios_remove_pci_devices(slot->bus);
> +	pci_unlock_rescan_remove();
> +
> +	/* Detach the child hotpluggable slots */
> +	powernv_php_unregister(slot->dn);
> +
> +	/*
> +	 * Check the power status and turn it off if necessary. If we
> +	 * fail to get the power status, the power will be forced to
> +	 * be off.
> +	 */
> +	ret = php_slot->ops->get_power_status(php_slot, &power_status);
> +	if (ret || power_status == POWERNV_PHP_SLOT_POWER_ON) {
> +		ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_OFF);
> +		if (ret)
> +			pr_warn("%s: Error %d powering off slot %016llx\n",
> +				__func__, ret, slot->id);
> +	}
> +
> +	/* Update slot state */
> +	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
> +	return 0;
> +}
> +
> +static struct hotplug_slot_ops php_slot_ops = {
> +	.get_power_status	= get_power_status,
> +	.get_adapter_status	= get_adapter_status,
> +	.set_attention_status	= set_attention_status,
> +	.enable_slot		= enable_slot,
> +	.disable_slot		= disable_slot,
> +};
> +
> +static struct powernv_php_slot *php_slot_match(struct device_node *dn,
> +					       struct powernv_php_slot *slot)
> +{
> +	struct powernv_php_slot *target, *tmp;
> +
> +	if (slot->dn == dn)
> +		return slot;
> +
> +	list_for_each_entry(tmp, &slot->children, link) {
> +		target = php_slot_match(dn, tmp);
> +		if (target)
> +			return target;
> +	}
> +
> +	return NULL;
> +}
> +
> +struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn)
> +{
> +	struct powernv_php_slot *slot, *tmp;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&php_slot_lock, flags);
> +	list_for_each_entry(tmp, &php_slot_list, link) {
> +		slot = php_slot_match(dn, tmp);
> +		if (slot) {
> +			spin_unlock_irqrestore(&php_slot_lock, flags);
> +			return slot;
> +		}
> +	}
> +	spin_unlock_irqrestore(&php_slot_lock, flags);
> +
> +	return NULL;
> +}
> +
> +static void php_slot_free(struct kref *kref)
> +{
> +	struct powernv_php_slot *slot = to_powernv_php_slot(kref);
> +
> +	WARN_ON(!list_empty(&slot->children));
> +	kfree(slot->name);
> +	kfree(slot);
> +}
> +
> +static void php_slot_release(struct hotplug_slot *hp_slot)
> +{
> +	struct powernv_php_slot *slot = hp_slot->private;
> +	unsigned long flags;
> +
> +	/* Remove from global or child list */
> +	spin_lock_irqsave(&php_slot_lock, flags);
> +	list_del(&slot->link);
> +	spin_unlock_irqrestore(&php_slot_lock, flags);
> +
> +	/* Detach from parent */
> +	powernv_php_slot_put(slot);
> +	powernv_php_slot_put(slot->parent);
> +}
> +
> +static bool php_slot_get_id(struct device_node *dn,
> +			    uint64_t *id)
> +{
> +	struct device_node *parent = dn;
> +	const __be64 *prop64;
> +	const __be32 *prop32;
> +
> +	/*
> +	 * The hotpluggable slot always has a compound Id, which
> +	 * consists of a 16-bit PHB Id, a 16-bit bus/slot/function
> +	 * number, and a compound indicator.
> +	 */
> +	*id = (0x1ul << 63);
> +
> +	/* Bus/Slot/Function number */
> +	prop32 = of_get_property(dn, "reg", NULL);
> +	if (!prop32)
> +		return false;
> +	*id |= ((of_read_number(prop32, 1) & 0x00ffff00) << 8);
> +
> +	/* PHB Id */
> +	while ((parent = of_get_parent(parent))) {
> +		if (!PCI_DN(parent)) {
> +			of_node_put(parent);
> +			break;
> +		}
> +
> +		if (!of_device_is_compatible(parent, "ibm,ioda2-phb") &&
> +		    !of_device_is_compatible(parent, "ibm,ioda-phb")) {
> +			of_node_put(parent);
> +			continue;
> +		}
> +
> +		prop64 = of_get_property(parent, "ibm,opal-phbid", NULL);
> +		if (!prop64) {
> +			of_node_put(parent);
> +			return false;
> +		}
> +
> +		*id |= be64_to_cpup(prop64);
> +		of_node_put(parent);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn)
> +{
> +	struct pci_bus *bus;
> +	struct powernv_php_slot *slot;
> +	const char *label;
> +	uint64_t id;
> +	int slot_no;
> +	size_t size;
> +	void *pmem;
> +
> +	/* Slot name */
> +	label = of_get_property(dn, "ibm,slot-label", NULL);
> +	if (!label)
> +		return NULL;
> +
> +	/* Slot identifier */
> +	if (!php_slot_get_id(dn, &id))
> +		return NULL;
> +
> +	/* PCI bus */
> +	bus = pcibios_find_pci_bus(dn);
> +	if (!bus)
> +		return NULL;
> +
> +	/* Slot number */
> +	if (dn->child && PCI_DN(dn->child))
> +		slot_no = PCI_SLOT(PCI_DN(dn->child)->devfn);
> +	else
> +		slot_no = -1;
> +
> +	/* Allocate slot */
> +	size = sizeof(struct powernv_php_slot) +
> +	       sizeof(struct hotplug_slot) +
> +	       sizeof(struct hotplug_slot_info);
> +	pmem = kzalloc(size, GFP_KERNEL);
> +	if (!pmem) {
> +		pr_warn("%s: Cannot allocate slot for node %s\n",
> +			__func__, dn->full_name);
> +		return NULL;
> +	}
> +
> +	/* Assign memory blocks */
> +	slot = pmem;
> +	slot->php_slot = pmem + sizeof(struct powernv_php_slot);
> +	slot->php_slot->info = pmem + sizeof(struct powernv_php_slot) +
> +			      sizeof(struct hotplug_slot);
> +	slot->name = kstrdup(label, GFP_KERNEL);
> +	if (!slot->name) {
> +		pr_warn("%s: Cannot populate name for node %s\n",
> +			__func__, dn->full_name);
> +		kfree(pmem);
> +		return NULL;
> +	}
> +
> +	/* Initialize slot */
> +	kref_init(&slot->kref);
> +	slot->state = POWERNV_PHP_SLOT_STATE_INIT;
> +	slot->dn = dn;
> +	slot->bus = bus;
> +	slot->id = id;
> +	slot->slot_no = slot_no;
> +	INIT_WORK(&slot->work, powernv_php_slot_work);
> +	init_waitqueue_head(&slot->queue);
> +	slot->check_power_status = 0;
> +	slot->status_confirmed = 0;
> +	slot->release = php_slot_free;
> +	slot->php_slot->ops = &php_slot_ops;
> +	slot->php_slot->release = php_slot_release;
> +	slot->php_slot->private = slot;
> +	INIT_LIST_HEAD(&slot->children);
> +	INIT_LIST_HEAD(&slot->link);
> +
> +	return slot;
> +}
> +
> +int powernv_php_slot_register(struct powernv_php_slot *slot)
> +{
> +	struct powernv_php_slot *parent;
> +	struct device_node *dn = slot->dn;
> +	unsigned long flags;
> +	int ret;
> +
> +	/* Avoid registering the same slot twice */
> +	if (powernv_php_slot_find(slot->dn))
> +		return -EEXIST;
> +
> +	/* Register slot */
> +	ret = pci_hp_register(slot->php_slot, slot->bus,
> +			      slot->slot_no, slot->name);
> +	if (ret) {
> +		pr_warn("%s: Cannot register slot %s (%d)\n",
> +			__func__, slot->name, ret);
> +		return ret;
> +	}
> +
> +	/* Put into global or parent list */
> +	while ((dn = of_get_parent(dn))) {
> +		if (!PCI_DN(dn)) {
> +			of_node_put(dn);
> +			break;
> +		}
> +
> +		parent = powernv_php_slot_find(dn);
> +		if (parent) {
> +			of_node_put(dn);
> +			break;
> +		}
> +	}
> +
> +	spin_lock_irqsave(&php_slot_lock, flags);
> +	if (parent) {
> +		powernv_php_slot_get(parent);
> +		slot->parent = parent;
> +		list_add_tail(&slot->link, &parent->children);
> +	} else {
> +		list_add_tail(&slot->link, &php_slot_list);
> +	}
> +	spin_unlock_irqrestore(&php_slot_lock, flags);
> +
> +	/* Update slot state */
> +	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
> +	return 0;
> +}
>


-- 
Alexey


* Re: [PATCH v4 21/21] pci/hotplug: PowerPC PowerNV PCI hotplug driver
@ 2015-05-09 15:54     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-09 15:54 UTC (permalink / raw)
  To: Gavin Shan, linuxppc-dev; +Cc: bhelgaas, linux-pci

On 05/01/2015 04:03 PM, Gavin Shan wrote:
> The patch adds a standalone driver to support PCI hotplug on the
> PowerPC PowerNV platform, which runs on top of skiboot firmware.
> The firmware identifies hotpluggable slots and marks their device
> tree nodes with the "ibm,slot-pluggable" and "ibm,reset-by-firmware"
> properties. The driver simply scans the device tree to create and
> register PCI hotplug slots accordingly.
>
> If the skiboot firmware doesn't support slot status retrieval, the PCI
> slot device node shouldn't have the property "ibm,reset-by-firmware". In
> that case, no valid PCI slots will be detected from the device tree.
> The skiboot firmware doesn't export the capability to access attention
> LEDs yet; that remains to be done.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>   drivers/pci/hotplug/Kconfig            |  12 +
>   drivers/pci/hotplug/Makefile           |   4 +
>   drivers/pci/hotplug/powernv_php.c      | 146 ++++++++
>   drivers/pci/hotplug/powernv_php.h      |  78 ++++
>   drivers/pci/hotplug/powernv_php_slot.c | 643 +++++++++++++++++++++++++++++++++
>   5 files changed, 883 insertions(+)
>   create mode 100644 drivers/pci/hotplug/powernv_php.c
>   create mode 100644 drivers/pci/hotplug/powernv_php.h
>   create mode 100644 drivers/pci/hotplug/powernv_php_slot.c
>
> diff --git a/drivers/pci/hotplug/Kconfig b/drivers/pci/hotplug/Kconfig
> index df8caec..ef55dae 100644
> --- a/drivers/pci/hotplug/Kconfig
> +++ b/drivers/pci/hotplug/Kconfig
> @@ -113,6 +113,18 @@ config HOTPLUG_PCI_SHPC
>
>   	  When in doubt, say N.
>
> +config HOTPLUG_PCI_POWERNV
> +	tristate "PowerPC PowerNV PCI Hotplug driver"
> +	depends on PPC_POWERNV && EEH
> +	help
> +	  Say Y here if you run a PowerPC PowerNV platform that supports
> +	  PCI Hotplug.
> +
> +	  To compile this driver as a module, choose M here: the
> +	  module will be called powernv-php.
> +
> +	  When in doubt, say N.
> +
>   config HOTPLUG_PCI_RPA
>   	tristate "RPA PCI Hotplug driver"
>   	depends on PPC_PSERIES && EEH
> diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
> index 4a9aa08..a69665e 100644
> --- a/drivers/pci/hotplug/Makefile
> +++ b/drivers/pci/hotplug/Makefile
> @@ -14,6 +14,7 @@ obj-$(CONFIG_HOTPLUG_PCI_PCIE)		+= pciehp.o
>   obj-$(CONFIG_HOTPLUG_PCI_CPCI_ZT5550)	+= cpcihp_zt5550.o
>   obj-$(CONFIG_HOTPLUG_PCI_CPCI_GENERIC)	+= cpcihp_generic.o
>   obj-$(CONFIG_HOTPLUG_PCI_SHPC)		+= shpchp.o
> +obj-$(CONFIG_HOTPLUG_PCI_POWERNV)	+= powernv-php.o
>   obj-$(CONFIG_HOTPLUG_PCI_RPA)		+= rpaphp.o
>   obj-$(CONFIG_HOTPLUG_PCI_RPA_DLPAR)	+= rpadlpar_io.o
>   obj-$(CONFIG_HOTPLUG_PCI_SGI)		+= sgi_hotplug.o
> @@ -50,6 +51,9 @@ ibmphp-objs		:=	ibmphp_core.o	\
>   acpiphp-objs		:=	acpiphp_core.o	\
>   				acpiphp_glue.o
>
> +powernv-php-objs	:=	powernv_php.o	\
> +				powernv_php_slot.o
> +
>   rpaphp-objs		:=	rpaphp_core.o	\
>   				rpaphp_pci.o	\
>   				rpaphp_slot.o
> diff --git a/drivers/pci/hotplug/powernv_php.c b/drivers/pci/hotplug/powernv_php.c
> new file mode 100644
> index 0000000..5cf9e717
> --- /dev/null
> +++ b/drivers/pci/hotplug/powernv_php.c
> @@ -0,0 +1,146 @@
> +/*
> + * PCI Hotplug Driver for PowerPC PowerNV platform.
> + *
> + * Copyright Gavin Shan, IBM Corporation 2015.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/sysfs.h>
> +#include <linux/pci.h>
> +#include <linux/pci_hotplug.h>
> +#include <linux/string.h>
> +#include <linux/slab.h>
> +#include <asm/opal.h>
> +#include <asm/pnv-pci.h>
> +
> +#include "powernv_php.h"

This compiles without linux/kernel.h, linux/sysfs.h, linux/string.h and
linux/slab.h. Are you sure you need all of these?


> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"Gavin Shan, IBM Corporation"
> +#define DRIVER_DESC	"PowerPC PowerNV PCI Hotplug Driver"
> +
> +static struct notifier_block php_msg_nb = {
> +	.notifier_call	= powernv_php_msg_handler,
> +	.next		= NULL,
> +	.priority	= 0,
> +};
> +
> +static int powernv_php_register_one(struct device_node *dn)
> +{
> +	struct powernv_php_slot *slot;
> +	const __be32 *prop32;
> +	int ret;
> +
> +	/* Check if it's hotpluggable slot */
> +	prop32 = of_get_property(dn, "ibm,slot-pluggable", NULL);
> +	if (!prop32 || !of_read_number(prop32, 1))
> +		return 0;

Although nobody checks the return code, this should be -ENXIO or something
other than zero. The same applies to the check below.
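
The return-code convention suggested above can be sketched as a small
userspace mock. This is only an illustration under stated assumptions:
struct mock_node and register_one() are hypothetical stand-ins for the
device node and powernv_php_register_one(), not kernel code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device node; not a kernel structure. */
struct mock_node {
	bool slot_pluggable;	/* models "ibm,slot-pluggable" */
	bool reset_by_firmware;	/* models "ibm,reset-by-firmware" */
};

/*
 * Return 0 only when a slot was actually registered; a node that is
 * not a hotpluggable slot is reported with -ENXIO instead of 0.
 */
static int register_one(const struct mock_node *dn)
{
	if (!dn->slot_pluggable)
		return -ENXIO;

	if (!dn->reset_by_firmware)
		return -ENXIO;

	/* Allocation and registration would go here. */
	return 0;
}
```

Note that powernv_php_register() in the patch breaks out of its loop on
any non-zero return, so with this convention it would presumably have to
treat -ENXIO as "skip this node" rather than stop scanning the siblings.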


> +
> +	prop32 = of_get_property(dn, "ibm,reset-by-firmware", NULL);
> +	if (!prop32 || !of_read_number(prop32, 1))
> +		return 0;
> +
> +	/* Allocate slot */
> +	slot = powernv_php_slot_alloc(dn);
> +	if (!slot)
> +		return -ENODEV;
> +
> +	/* Register it */
> +	ret = powernv_php_slot_register(slot);
> +	if (ret) {
> +		powernv_php_slot_put(slot);
> +		return ret;
> +	}
> +
> +	return powernv_php_slot_enable(slot->php_slot, false, false);
> +}
> +
> +int powernv_php_register(struct device_node *dn)
> +{
> +	struct device_node *child;
> +	int ret = 0;
> +
> +	/*
> +	 * The parent slots should be registered before their
> +	 * child slots.
> +	 */
> +	for_each_child_of_node(dn, child) {
> +		ret = powernv_php_register_one(child);
> +		if (ret)
> +			break;
> +
> +		powernv_php_register(child);
> +	}
> +
> +	return ret;
> +}
> +
> +static void powernv_php_unregister_one(struct device_node *dn)
> +{
> +	struct powernv_php_slot *slot;
> +
> +	slot = powernv_php_slot_find(dn);
> +	if (!slot)
> +		return;
> +
> +	pci_hp_deregister(slot->php_slot);
> +}
> +
> +void powernv_php_unregister(struct device_node *dn)
> +{
> +	struct device_node *child;
> +
> +	/* The child slots should go before their parent slots */
> +	for_each_child_of_node(dn, child) {
> +		powernv_php_unregister(child);
> +		powernv_php_unregister_one(child);
> +	}
> +}
> +
> +static int __init powernv_php_init(void)
> +{
> +	struct device_node *dn;
> +
> +	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> +
> +	/* Register hotplug message handler */
> +	if (pnv_pci_hotplug_notifier(&php_msg_nb, true)) {

If you called the function "pnv_pci_hotplug_notifier_register", you would 
not need the comment above.


> +		pr_warn("%s: Cannot register hotplug message notifier\n",
> +			__func__);
> +		return -EIO;
> +	}
> +
> +	/* Scan PHB nodes and their children */
> +	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
> +		powernv_php_register(dn);
> +	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
> +		powernv_php_register(dn);
> +
> +	return 0;
> +}
> +
> +static void __exit powernv_php_exit(void)
> +{
> +	struct device_node *dn;
> +
> +	pnv_pci_hotplug_notifier(&php_msg_nb, false);
> +
> +	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
> +		powernv_php_unregister(dn);
> +	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
> +		powernv_php_unregister(dn);
> +}
> +
> +module_init(powernv_php_init);
> +module_exit(powernv_php_exit);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/pci/hotplug/powernv_php.h b/drivers/pci/hotplug/powernv_php.h
> new file mode 100644
> index 0000000..87ba0d0
> --- /dev/null
> +++ b/drivers/pci/hotplug/powernv_php.h
> @@ -0,0 +1,78 @@
> +/*
> + * PCI Hotplug Driver for PowerPC PowerNV platform.
> + *
> + * Copyright Gavin Shan, IBM Corporation 2015.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#ifndef _POWERNV_PHP_H
> +#define _POWERNV_PHP_H

I would put these (and dependencies if any) here:

#include <linux/kref.h>
#include <linux/pci.h>
#include <linux/pci_hotplug.h>

and remove them from .c files.


> +
> +/* Slot power status */
> +#define POWERNV_PHP_SLOT_POWER_OFF	0
> +#define POWERNV_PHP_SLOT_POWER_ON	1
> +
> +/* Slot presence status */
> +#define POWERNV_PHP_SLOT_EMPTY		0
> +#define POWERNV_PHP_SLOT_PRESENT	1
> +
> +/* Slot attention status */
> +#define POWERNV_PHP_SLOT_ATTEN_OFF	0
> +#define POWERNV_PHP_SLOT_ATTEN_ON	1
> +#define POWERNV_PHP_SLOT_ATTEN_IND	2
> +#define POWERNV_PHP_SLOT_ATTEN_ACT	3
> +
> +struct powernv_php_slot {
> +	struct kref		kref;
> +	int			state;
> +#define POWERNV_PHP_SLOT_STATE_INIT		0x0
> +#define POWERNV_PHP_SLOT_STATE_REGISTER		0x1
> +#define POWERNV_PHP_SLOT_STATE_POPULATED	0x2

I believe these are not bitmasks but bit numbers, right? Decimal values are 
normally used for them.
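
One way to write the states in decimal is an enum. This is a sketch,
assuming the constants are only ever compared with == and != (which is
how the quoted code uses slot->state); the enum type name is
hypothetical, the constant names are the ones from the patch.

```c
/* Hypothetical enum form of the three slot states, numbered in decimal. */
enum powernv_php_slot_state {
	POWERNV_PHP_SLOT_STATE_INIT      = 0,
	POWERNV_PHP_SLOT_STATE_REGISTER  = 1,
	POWERNV_PHP_SLOT_STATE_POPULATED = 2,
};
```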


> +	char			*name;
> +	struct device_node	*dn;
> +	struct pci_bus		*bus;
> +	uint64_t		id;
> +	int			slot_no;
> +	int			check_power_status;
> +	int			status_confirmed;
> +	struct opal_msg		*msg;
> +	struct work_struct	work;
> +	wait_queue_head_t	queue;
> +	struct hotplug_slot	*php_slot;
> +	struct powernv_php_slot	*parent;
> +	void (*release)(struct kref *kref);

What is the point in this? Just use php_slot_free() directly in 
powernv_php_slot_put, no?
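
A rough userspace sketch of what passing the release function directly
could look like. Everything prefixed mock_ is a hypothetical stand-in
(struct mock_kref imitates struct kref, mock_kref_put() imitates
kref_put() returning 1 when the object is released); it is not the
kernel API itself.

```c
#include <stdlib.h>

/* Minimal userspace stand-in for the kernel's struct kref. */
struct mock_kref {
	int refcount;
};

struct mock_slot {
	struct mock_kref kref;	/* must stay the first member, see below */
	char *name;
};

static int freed_slots;	/* counts releases, for illustration only */

static void mock_slot_free(struct mock_kref *kref)
{
	/* kref is the first member, so the cast recovers the slot. */
	struct mock_slot *slot = (struct mock_slot *)kref;

	free(slot->name);
	free(slot);
	freed_slots++;
}

/* Returns 1 if the object was released, 0 otherwise. */
static int mock_kref_put(struct mock_kref *kref,
			 void (*release)(struct mock_kref *kref))
{
	if (--kref->refcount == 0) {
		release(kref);
		return 1;
	}

	return 0;
}

/* No per-slot function pointer: the release function is named directly. */
static int mock_slot_put(struct mock_slot *slot)
{
	if (slot)
		return mock_kref_put(&slot->kref, mock_slot_free);

	return 0;
}
```

In the patch php_slot_free() is static in powernv_php_slot.c while
powernv_php_slot_put() is an inline in the header, so doing this for
real would also mean exporting the function or moving the put helper.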


> +	struct list_head	children;
> +	struct list_head	link;
> +};
> +
> +#define to_powernv_php_slot(kref) container_of(kref, struct powernv_php_slot, kref)
> +
> +static inline void powernv_php_slot_get(struct powernv_php_slot *slot)
> +{
> +	if (slot)
> +		kref_get(&slot->kref);
> +}
> +
> +static inline int powernv_php_slot_put(struct powernv_php_slot *slot)
> +{
> +	if (slot)
> +		return kref_put(&slot->kref, slot->release);
> +
> +	return 0;
> +}
> +
> +int powernv_php_msg_handler(struct notifier_block *nb,
> +			    unsigned long type, void *message);
> +struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn);
> +struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn);
> +int powernv_php_slot_register(struct powernv_php_slot *slot);
> +int powernv_php_slot_enable(struct hotplug_slot *php_slot,
> +			    bool rescan_bus, bool rescan_slot);

Just an observation: rescan_bus and rescan_slot are always both true or
both false, never different. The only caller requesting a rescan is in the
same file as powernv_php_slot_enable(), and it could do the rescan itself
if powernv_php_slot_enable() signalled that a rescan is needed (return 1?).

Then no "goto" would be needed in powernv_php_slot_enable(). I do not
insist, though.
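
The "return 1 to request a rescan" idea can be sketched as follows. It
is a userspace mock under stated assumptions: mock_slot_enable() and
mock_enable_slot() are hypothetical names, and the power-status handling
of the real function is left out entirely.

```c
#include <errno.h>

enum mock_presence {
	MOCK_SLOT_EMPTY   = 0,
	MOCK_SLOT_PRESENT = 1,
};

/*
 * Hypothetical enable helper: negative errno on failure, 0 when
 * nothing further is needed, 1 when the caller should rescan.
 */
static int mock_slot_enable(int presence)
{
	switch (presence) {
	case MOCK_SLOT_PRESENT:
		return 1;
	case MOCK_SLOT_EMPTY:
		return 0;
	default:
		return -EINVAL;
	}
}

static int rescans;	/* counts how often the caller rescanned */

/* Caller in the style of enable_slot(): rescan only when asked to. */
static int mock_enable_slot(int presence)
{
	int ret = mock_slot_enable(presence);

	if (ret < 0)
		return ret;
	if (ret > 0)
		rescans++;	/* bus and child-slot rescan would go here */

	return 0;
}
```

A boot-time caller that does not want a rescan would simply ignore the
positive return value.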



> +int powernv_php_register(struct device_node *dn);
> +void powernv_php_unregister(struct device_node *dn);
> +
> +#endif /* !_POWERNV_PHP_H */
> diff --git a/drivers/pci/hotplug/powernv_php_slot.c b/drivers/pci/hotplug/powernv_php_slot.c
> new file mode 100644
> index 0000000..fc82355
> --- /dev/null
> +++ b/drivers/pci/hotplug/powernv_php_slot.c
> @@ -0,0 +1,643 @@
> +/*
> + * PCI Hotplug Driver for PowerPC PowerNV platform.
> + *
> + * Copyright Gavin Shan, IBM Corporation 2015.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/sysfs.h>
> +#include <linux/pci.h>
> +#include <linux/pci_hotplug.h>
> +#include <linux/string.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/wait.h>
> +#include <linux/workqueue.h>
> +
> +#include <asm/opal.h>
> +#include <asm/pnv-pci.h>
> +#include <asm/ppc-pci.h>
> +
> +#include "powernv_php.h"

I suspect you don't need all of these headers here either ;)


> +
> +static LIST_HEAD(php_slot_list);
> +static DEFINE_SPINLOCK(php_slot_lock);
> +
> +/*
> + * Release firmware data for all child device nodes of the
> + * indicated one.
> + */
> +static void release_device_nodes_info(struct device_node *np)
> +{
> +	struct device_node *child;
> +
> +	for_each_child_of_node(np, child) {
> +		/* In depth first */
> +		release_device_nodes_info(child);

Why is this "release", not "remove" (as this is what it does - calling 
remove_lalala in a loop)?

> +
> +		remove_pci_device_node_info(child);
> +	}
> +}
> +
> +/*
> + * Release all subordinate device nodes of the indicated one.
> + * Those device nodes in deepest path should be released firstly.
> + */
> +static int release_device_nodes(struct device_node *parent)
> +{
> +	struct device_node *np, *child;
> +	int ret = 0;
> +
> +	/* If the device node has children, remove them firstly */
> +	for_each_child_of_node(parent, np) {
> +		ret = release_device_nodes(np);
> +		if (ret)
> +			return ret;
> +
> +		/* The device shouldn't have alive children */
> +		child = of_get_next_child(np, NULL);
> +		if (child) {
> +			of_node_put(child);
> +			of_node_put(np);
> +			pr_err("%s: Alive children of node <%s>\n",
> +			       __func__, of_node_full_name(np));
> +			return -EBUSY;
> +		}
> +
> +		/* Detach the device node */
> +		of_detach_node(np);
> +		of_node_put(np);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * The function processes the message sent by firmware
> + * to remove all device tree nodes beneath the slot's
> +	 * nodes, and the associated auxiliary data.
> + */
> +static void slot_power_off_handler(struct powernv_php_slot *slot)
> +{
> +	int ret;
> +
> +	/* Release the firmware data for the child device nodes */
> +	release_device_nodes_info(slot->dn);
> +
> +	/* Release the child device nodes */
> +	ret = release_device_nodes(slot->dn);
> +	if (ret)
> +		pr_warn("%s: Error %d releasing children of <%s>\n",
> +			__func__, ret, of_node_full_name(slot->dn));
> +
> +	/* Confirm status change */
> +	slot->status_confirmed = 1;
> +	wake_up_interruptible(&slot->queue);
> +}
> +
> +static void slot_power_on_handler(struct powernv_php_slot *slot)
> +{
> +	struct opal_msg *msg = slot->msg;
> +	unsigned long phys = be64_to_cpu(msg->params[2]);
> +	unsigned long len = be64_to_cpu(msg->params[3]);
> +	void *blob = (phys && len > 0) ? __va(phys) : NULL;
> +
> +	/* There might be nothing behind the slot yet */
> +	if (!blob || !len)

"!len" is redundant here: blob will be NULL if len <= 0.
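
A small userspace sketch of why the extra test adds nothing. The helper
names are hypothetical, and a caller-supplied pointer stands in for
__va(phys); the selection expression itself is the one from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same selection as the patch, with "mapped" replacing __va(phys). */
static void *pick_blob(unsigned long phys, unsigned long len, void *mapped)
{
	return (phys && len > 0) ? mapped : NULL;
}

/* The test as written in the patch. */
static bool nothing_behind_slot_long(void *blob, unsigned long len)
{
	return !blob || !len;
}

/* The shortened test. */
static bool nothing_behind_slot_short(void *blob)
{
	return !blob;
}
```

Whenever blob is non-NULL, len is already known to be non-zero, so both
helpers agree for every input that pick_blob() can produce.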

> +		goto out;
> +
> +	/* Copy the FDT blob and parse it */
> +	of_fdt_add_subtree(slot->dn, blob);
> +
> +	/* Add device node firmware data */
> +	traverse_pci_device_nodes(slot->dn,
> +				  add_pci_device_node_info,
> +				  pci_bus_to_host(slot->bus));
> +
> +out:
> +	/* Confirm status change */
> +	slot->status_confirmed = 1;
> +	wake_up_interruptible(&slot->queue);
> +}
> +
> +static void powernv_php_slot_work(struct work_struct *data)
> +{
> +	struct powernv_php_slot *slot = container_of(data,
> +						     struct powernv_php_slot,
> +						     work);
> +	uint64_t php_event = be64_to_cpu(slot->msg->params[0]);
> +
> +	switch (php_event) {
> +	case 0: /* Slot power off */
> +		slot_power_off_handler(slot);
> +		break;
> +	case 1: /* Slot power on */
> +		slot_power_on_handler(slot);
> +		break;
> +	default:
> +		pr_warn("%s: Unsupported hotplug event %lld\n",
> +			__func__, php_event);
> +	}
> +
> +	of_node_put(slot->dn);
> +}
> +
> +int powernv_php_msg_handler(struct notifier_block *nb,
> +			    unsigned long type, void *message)
> +{
> +	phandle h;
> +	struct device_node *np;
> +	struct powernv_php_slot *slot;
> +	struct opal_msg *msg = message;
> +
> +	/* Check the message type */
> +	if (type != OPAL_MSG_PCI_HOTPLUG) {
> +		pr_warn("%s: Wrong message type %ld received!\n",
> +			__func__, type);
> +		return 0;
> +	}
> +
> +	/* Find the device node */
> +	h = (phandle)be64_to_cpu(msg->params[1]);
> +	np = of_find_node_by_phandle(h);
> +	if (!np) {
> +		pr_warn("%s: No device node for phandle 0x%08x\n",
> +			__func__, h);
> +		return 0;
> +	}
> +
> +	/* Find the slot */
> +	slot = powernv_php_slot_find(np);
> +	if (!slot) {
> +		pr_warn("%s: No slot found for node <%s>\n",
> +			__func__, of_node_full_name(np));
> +		of_node_put(np);
> +		return 0;
> +	}
> +
> +	/* Schedule the work */
> +	slot->msg = msg;
> +	schedule_work(&slot->work);
> +	return 0;
> +}
> +
> +static int set_power_status(struct hotplug_slot *php_slot, u8 val)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	int ret;
> +
> +	/* Set power status */
> +	slot->status_confirmed = 0;
> +	ret = pnv_pci_set_power_status(slot->id, val);
> +	if (ret) {
> +		pr_warn("%s: Error %d powering %s slot %016llx\n",
> +			__func__, ret, val ? "on" : "off", slot->id);
> +		return ret;
> +	}
> +
> +	/* Waiting until the device tree is updated */
> +	ret = wait_event_timeout(slot->queue,
> +				 !slot->status_confirmed,
> +				 10 * HZ);
> +	if (ret) {
> +		pr_warn("%s: Error %d completing power-%s slot %016llx\n",
> +			__func__, ret, val ? "on" : "off", slot->id);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int get_power_status(struct hotplug_slot *php_slot, u8 *val)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	uint8_t state;
> +	int ret;
> +
> +	/*
> +	 * Retrieve power status from firmware. If we fail
> +	 * getting that, the power status falls back to
> +	 * being on.
> +	 */
> +	ret = pnv_pci_get_power_status(slot->id, &state);
> +	if (ret) {
> +		*val = POWERNV_PHP_SLOT_POWER_ON;
> +		pr_warn("%s: Error %d getting power status of slot %016llx\n",
> +			__func__, ret, slot->id);
> +	} else {
> +		*val = state ? POWERNV_PHP_SLOT_POWER_ON :
> +			       POWERNV_PHP_SLOT_POWER_OFF;
> +		php_slot->info->power_status = *val;
> +	}
> +
> +	return 0;
> +}
> +
> +static int get_adapter_status(struct hotplug_slot *php_slot, u8 *val)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	uint8_t state;
> +	int ret;
> +
> +	/*
> +	 * Retrieve presence status from firmware. If we can't
> +	 * get that, it will fall back to empty.
> +	 */
> +	ret = pnv_pci_get_presence_status(slot->id, &state);
> +	if (ret >= 0) {
> +		*val = state ? POWERNV_PHP_SLOT_PRESENT :
> +			       POWERNV_PHP_SLOT_EMPTY;
> +		php_slot->info->adapter_status = *val;

ret = 0;


> +	} else {
> +		*val = POWERNV_PHP_SLOT_EMPTY;
> +		pr_warn("%s: Error %d getting presence of slot %016llx\n",
> +			__func__, ret, slot->id);
> +	}
> +
> +	return ret < 0 ? ret : 0;


return ret;
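
Taken together, the two suggestions above ("ret = 0;" in the success
branch and a plain "return ret;" at the end) would give a function shaped
roughly like this. It is a userspace sketch: mock_get_adapter_status()
and its fw_ret parameter are hypothetical, with fw_ret modelling the
value returned by pnv_pci_get_presence_status(), which the patch treats
as success whenever it is non-negative.

```c
#include <errno.h>
#include <stdint.h>

#define MOCK_SLOT_EMPTY		0
#define MOCK_SLOT_PRESENT	1

/* Hypothetical stand-in for get_adapter_status(). */
static int mock_get_adapter_status(int fw_ret, uint8_t state, uint8_t *val)
{
	int ret = fw_ret;

	if (ret >= 0) {
		*val = state ? MOCK_SLOT_PRESENT : MOCK_SLOT_EMPTY;
		ret = 0;	/* collapse any non-negative value to success */
	} else {
		*val = MOCK_SLOT_EMPTY;
	}

	return ret;
}
```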


> +}
> +
> +static int set_attention_status(struct hotplug_slot *php_slot, u8 val)
> +{
> +	/* The default operation would be to turn on the attention */
> +	switch (val) {
> +	case POWERNV_PHP_SLOT_ATTEN_OFF:
> +	case POWERNV_PHP_SLOT_ATTEN_ON:
> +	case POWERNV_PHP_SLOT_ATTEN_IND:
> +	case POWERNV_PHP_SLOT_ATTEN_ACT:
> +		break;
> +	default:
> +		val = POWERNV_PHP_SLOT_ATTEN_ON;

Isn't @val garbage in this case?


> +	}
> +
> +	/* FIXME: Make it real once firmware supports it */
> +	php_slot->info->attention_status = val;
> +
> +	return 0;
> +}
> +
> +int powernv_php_slot_enable(struct hotplug_slot *php_slot,
> +			    bool rescan_bus, bool rescan_slot)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	uint8_t presence, power_status;
> +	int ret;
> +
> +	/* Check if the slot has been configured */
> +	if (slot->state != POWERNV_PHP_SLOT_STATE_REGISTER)
> +		return 0;
> +
> +	/* Retrieve slot presence status */
> +	ret = php_slot->ops->get_adapter_status(php_slot, &presence);
> +	if (ret) {
> +		pr_warn("%s: Error %d getting presence of slot %016llx\n",
> +			__func__, ret, slot->id);
> +		return ret;
> +	}
> +
> +	/* Proceed if there is nothing behind the slot */
> +	if (presence == POWERNV_PHP_SLOT_EMPTY)
> +		goto scan;
> +
> +	/*
> +	 * If we detect something behind the slot, we need to
> +	 * make sure the power supply to the slot is on. Otherwise,
> +	 * the slot's downstream PCIe link should be down.
> +	 *
> +	 * On the first time, we don't change the power status to
> +	 * boost system boot with assumption that the firmware
> +	 * supplies consistent slot power status: empty slot always
> +	 * has its power off and non-empty slot has its power on.
> +	 */
> +	if (!slot->check_power_status) {
> +		slot->check_power_status = 1;
> +		goto scan;
> +	}
> +
> +	/* Check the power status. Scan the slot if that's already on */
> +	ret = php_slot->ops->get_power_status(php_slot, &power_status);
> +	if (ret) {
> +		pr_warn("%s: Error %d getting power status of slot %016llx\n",
> +			__func__, ret, slot->id);
> +		return ret;
> +	}
> +	if (power_status == POWERNV_PHP_SLOT_POWER_ON)
> +		goto scan;
> +
> +	/* Power is off, turn it on and then scan the slot */
> +	ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_ON);
> +	if (ret) {
> +		pr_warn("%s: Error %d powering on slot %016llx\n",
> +			__func__, ret, slot->id);
> +		return ret;
> +	}
> +
> +scan:
> +	switch (presence) {
> +	case POWERNV_PHP_SLOT_PRESENT:
> +		if (rescan_bus) {
> +			pci_lock_rescan_remove();
> +			pcibios_add_pci_devices(slot->bus);
> +			pci_unlock_rescan_remove();
> +		}
> +
> +		/* Rescan for child hotpluggable slots */
> +		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
> +		if (rescan_slot)
> +			powernv_php_register(slot->dn);
> +		break;
> +	case POWERNV_PHP_SLOT_EMPTY:
> +		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
> +		break;
> +	default:
> +		pr_warn("%s: Invalid presence status %d of slot %016llx\n",
> +			__func__, presence, slot->id);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int enable_slot(struct hotplug_slot *php_slot)
> +{
> +	return powernv_php_slot_enable(php_slot, true, true);
> +}
> +
> +static int disable_slot(struct hotplug_slot *php_slot)
> +{
> +	struct powernv_php_slot *slot = php_slot->private;
> +	uint8_t power_status;
> +	int ret;
> +
> +	if (slot->state != POWERNV_PHP_SLOT_STATE_POPULATED)
> +		return 0;
> +
> +	/* Remove all devices behind the slot */
> +	pci_lock_rescan_remove();
> +	pcibios_remove_pci_devices(slot->bus);
> +	pci_unlock_rescan_remove();
> +
> +	/* Detach the child hotpluggable slots */
> +	powernv_php_unregister(slot->dn);
> +
> +	/*
> +	 * Check the power status and turn it off if necessary. If we
> +	 * fail to get the power status, the power will be forced to
> +	 * be off.
> +	 */
> +	ret = php_slot->ops->get_power_status(php_slot, &power_status);
> +	if (ret || power_status == POWERNV_PHP_SLOT_POWER_ON) {
> +		ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_OFF);
> +		if (ret)
> +			pr_warn("%s: Error %d powering off slot %016llx\n",
> +				__func__, ret, slot->id);
> +	}
> +
> +	/* Update slot state */
> +	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
> +	return 0;
> +}
> +
> +static struct hotplug_slot_ops php_slot_ops = {
> +	.get_power_status	= get_power_status,
> +	.get_adapter_status	= get_adapter_status,
> +	.set_attention_status	= set_attention_status,
> +	.enable_slot		= enable_slot,
> +	.disable_slot		= disable_slot,
> +};
> +
> +static struct powernv_php_slot *php_slot_match(struct device_node *dn,
> +					       struct powernv_php_slot *slot)
> +{
> +	struct powernv_php_slot *target, *tmp;
> +
> +	if (slot->dn == dn)
> +		return slot;
> +
> +	list_for_each_entry(tmp, &slot->children, link) {
> +		target = php_slot_match(dn, tmp);
> +		if (target)
> +			return target;
> +	}
> +
> +	return NULL;
> +}
> +
> +struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn)
> +{
> +	struct powernv_php_slot *slot, *tmp;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&php_slot_lock, flags);
> +	list_for_each_entry(tmp, &php_slot_list, link) {
> +		slot = php_slot_match(dn, tmp);
> +		if (slot) {
> +			spin_unlock_irqrestore(&php_slot_lock, flags);
> +			return slot;
> +		}
> +	}
> +	spin_unlock_irqrestore(&php_slot_lock, flags);
> +
> +	return NULL;
> +}
> +
> +static void php_slot_free(struct kref *kref)
> +{
> +	struct powernv_php_slot *slot = to_powernv_php_slot(kref);
> +
> +	WARN_ON(!list_empty(&slot->children));
> +	kfree(slot->name);
> +	kfree(slot);
> +}
> +
> +static void php_slot_release(struct hotplug_slot *hp_slot)
> +{
> +	struct powernv_php_slot *slot = hp_slot->private;
> +	unsigned long flags;
> +
> +	/* Remove from global or child list */
> +	spin_lock_irqsave(&php_slot_lock, flags);
> +	list_del(&slot->link);
> +	spin_unlock_irqrestore(&php_slot_lock, flags);
> +
> +	/* Detach from parent */
> +	powernv_php_slot_put(slot);
> +	powernv_php_slot_put(slot->parent);
> +}
> +
> +static bool php_slot_get_id(struct device_node *dn,
> +			    uint64_t *id)
> +{
> +	struct device_node *parent = dn;
> +	const __be64 *prop64;
> +	const __be32 *prop32;
> +
> +	/*
> +	 * The hotpluggable slot always has a compound Id, which
> +	 * consists of a 16-bit PHB Id, a 16-bit bus/slot/function
> +	 * number, and a compound indicator.
> +	 */
> +	*id = (0x1ul << 63);
> +
> +	/* Bus/Slot/Function number */
> +	prop32 = of_get_property(dn, "reg", NULL);
> +	if (!prop32)
> +		return false;
> +	*id |= ((of_read_number(prop32, 1) & 0x00ffff00) << 8);
> +
> +	/* PHB Id */
> +	while ((parent = of_get_parent(parent))) {
> +		if (!PCI_DN(parent)) {
> +			of_node_put(parent);
> +			break;
> +		}
> +
> +		if (!of_device_is_compatible(parent, "ibm,ioda2-phb") &&
> +		    !of_device_is_compatible(parent, "ibm,ioda-phb")) {
> +			of_node_put(parent);
> +			continue;
> +		}
> +
> +		prop64 = of_get_property(parent, "ibm,opal-phbid", NULL);
> +		if (!prop64) {
> +			of_node_put(parent);
> +			return false;
> +		}
> +
> +		*id |= be64_to_cpup(prop64);
> +		of_node_put(parent);
> +		return true;
> +	}
> +
> +        return false;
> +}
> +
> +struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn)
> +{
> +	struct pci_bus *bus;
> +	struct powernv_php_slot *slot;
> +	const char *label;
> +	uint64_t id;
> +	int slot_no;
> +	size_t size;
> +	void *pmem;
> +
> +	/* Slot name */
> +	label = of_get_property(dn, "ibm,slot-label", NULL);
> +	if (!label)
> +		return NULL;
> +
> +	/* Slot indentifier */
> +	if (!php_slot_get_id(dn, &id))
> +		return NULL;
> +
> +	/* PCI bus */
> +	bus = pcibios_find_pci_bus(dn);
> +	if (!bus)
> +		return NULL;
> +
> +	/* Slot number */
> +	if (dn->child && PCI_DN(dn->child))
> +		slot_no = PCI_SLOT(PCI_DN(dn->child)->devfn);
> +	else
> +		slot_no = -1;
> +
> +	/* Allocate slot */
> +	size = sizeof(struct powernv_php_slot) +
> +	       sizeof(struct hotplug_slot) +
> +	       sizeof(struct hotplug_slot_info);
> +	pmem = kzalloc(size, GFP_KERNEL);
> +	if (!pmem) {
> +		pr_warn("%s: Cannot allocate slot for node %s\n",
> +			__func__, dn->full_name);
> +		return NULL;
> +	}
> +
> +	/* Assign memory blocks */
> +	slot = pmem;
> +	slot->php_slot = pmem + sizeof(struct powernv_php_slot);
> +	slot->php_slot->info = pmem + sizeof(struct powernv_php_slot) +
> +			      sizeof(struct hotplug_slot);
> +	slot->name = kstrdup(label, GFP_KERNEL);
> +	if (!slot->name) {
> +		pr_warn("%s: Cannot populate name for node %s\n",
> +			__func__, dn->full_name);
> +		kfree(pmem);
> +		return NULL;
> +	}
> +
> +	/* Initialize slot */
> +	kref_init(&slot->kref);
> +	slot->state = POWERNV_PHP_SLOT_STATE_INIT;
> +	slot->dn = dn;
> +	slot->bus = bus;
> +	slot->id = id;
> +	slot->slot_no = slot_no;
> +	INIT_WORK(&slot->work, powernv_php_slot_work);
> +	init_waitqueue_head(&slot->queue);
> +	slot->check_power_status = 0;
> +	slot->status_confirmed = 0;
> +	slot->release = php_slot_free;
> +	slot->php_slot->ops = &php_slot_ops;
> +	slot->php_slot->release = php_slot_release;
> +	slot->php_slot->private = slot;
> +	INIT_LIST_HEAD(&slot->children);
> +	INIT_LIST_HEAD(&slot->link);
> +
> +	return slot;
> +}
> +
> +int powernv_php_slot_register(struct powernv_php_slot *slot)
> +{
> +	struct powernv_php_slot *parent;
> +	struct device_node *dn = slot->dn;
> +	unsigned long flags;
> +	int ret;
> +
> +	/* Avoid register same slot for twice */
> +	if (powernv_php_slot_find(slot->dn))
> +		return -EEXIST;
> +
> +	/* Register slot */
> +	ret = pci_hp_register(slot->php_slot, slot->bus,
> +			      slot->slot_no, slot->name);
> +	if (ret) {
> +		pr_warn("%s: Cannot register slot %s (%d)\n",
> +			__func__, slot->name, ret);
> +		return ret;
> +	}
> +
> +	/* Put into global or parent list */
> +	while ((dn = of_get_parent(dn))) {
> +		if (!PCI_DN(dn)) {
> +			of_node_put(dn);
> +			break;
> +		}
> +
> +		parent = powernv_php_slot_find(dn);
> +		if (parent) {
> +			of_node_put(dn);
> +			break;
> +		}
> +	}
> +
> +	spin_lock_irqsave(&php_slot_lock, flags);
> +	if (parent) {
> +		powernv_php_slot_get(parent);
> +		slot->parent = parent;
> +		list_add_tail(&slot->link, &parent->children);
> +	} else {
> +		list_add_tail(&slot->link, &php_slot_list);
> +	}
> +	spin_unlock_irqrestore(&php_slot_lock, flags);
> +
> +	/* Update slot state */
> +	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
> +	return 0;
> +}
>
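
For reference, the compound Id built by php_slot_get_id() above can be
sketched as standalone C. This is only an illustration: compose_slot_id()
is a hypothetical helper, not part of the patch, and it takes the values
the kernel would read from the "reg" and "ibm,opal-phbid" properties as
plain arguments. The 16-bit limit on the PHB Id is an assumption taken
from the comment in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): compose the slot Id the same
 * way php_slot_get_id() does, from already-decoded property values.
 */
static uint64_t compose_slot_id(uint64_t phb_id, uint32_t reg)
{
	uint64_t id = UINT64_C(1) << 63;	/* compound indicator */

	/* Assumption from the patch comment: PHB Id fits in 16 bits */
	assert((phb_id >> 16) == 0);

	/* Bus/slot/function sits in bits 8..23 of "reg"; move it to 16..31 */
	id |= (uint64_t)(reg & 0x00ffff00u) << 8;

	/* PHB Id occupies the low bits */
	id |= phb_id;

	return id;
}
```

With PHB Id 1 and reg 0x00012300 this yields 0x8000000001230001.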


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 01/21] pci: Add pcibios_setup_bridge()
  2015-05-07 22:12     ` Bjorn Helgaas
@ 2015-05-11  1:59       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  1:59 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh

On Thu, May 07, 2015 at 05:12:24PM -0500, Bjorn Helgaas wrote:
>Hi Gavin,
>
>[Please run "git log --oneline drivers/pci/setup-bus.c" and observe the
>capitalization convention.]
>

Bjorn, thanks for the review. Yeah, it should be something like
"PCI: xxxx" for the subject. I'll fix it up in the next revision.

>On Fri, May 01, 2015 at 04:02:48PM +1000, Gavin Shan wrote:
>> Currently, PowerPC PowerNV platform utilizes ppc_md.pcibios_fixup(),
>> which is called for once after PCI probing and resource assignment
>> are completed, to allocate platform required resources for PCI devices:
>> PE#, IO and MMIO mapping, DMA address translation (TCE) table etc.
>> Obviously, it's not hotplug friendly.
>> 
>> The patch adds weak function pcibios_setup_bridge(), which is called
>> by pci_setup_bridge(). PowerPC PowerNV platform will reuse the function
>> to assign above platform required resources to newly added PCI devices,
>> in order to support PCI hotplug on PowerPC PowerNV platform.
>> 
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>>  drivers/pci/setup-bus.c | 12 +++++++++---
>>  include/linux/pci.h     |  1 +
>>  2 files changed, 10 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
>> index 4fd0cac..a7d0c3c 100644
>> --- a/drivers/pci/setup-bus.c
>> +++ b/drivers/pci/setup-bus.c
>> @@ -674,7 +674,8 @@ static void pci_setup_bridge_mmio_pref(struct pci_dev *bridge)
>>  	pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, lu);
>>  }
>>  
>> -static void __pci_setup_bridge(struct pci_bus *bus, unsigned long type)
>> +
>> +void pci_setup_bridge_resources(struct pci_bus *bus, unsigned long type)
>>  {
>>  	struct pci_dev *bridge = bus->self;
>>  
>> @@ -693,12 +694,17 @@ static void __pci_setup_bridge(struct pci_bus *bus, unsigned long type)
>>  	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
>>  }
>>  
>> +void __weak pcibios_setup_bridge(struct pci_bus *bus, unsigned long type)
>> +{
>> +	pci_setup_bridge_resources(bus, type);
>> +}
>
>I'm not opposed to adding a pcibios_setup_bridge(), but I would rather do
>the architected updates in the generic PCI core code instead of down in the
>pcibios code.  In other words, I would rather have this:
>
>  void pci_setup_bridge(struct pci_bus *bus)
>  {
>    pcibios_setup_bridge(bus, type);
>    pci_setup_bridge_resources(bus, type);
>  }
>
>That way the default pcibios hook is empty, showing that by default there's
>no arch-specific code in this path, and we only have to look at the generic
>core code to verify that we actually do program the bridge windows.
>

Ok. I'll change the code accordingly in the next revision. Thanks for
the suggestion.
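
A minimal userspace sketch of that arrangement, assuming GCC or Clang
(for the weak attribute). All names here (struct bus, arch_setup_bridge,
core_setup_bridge_resources, setup_bridge) are stand-ins invented for
illustration, not the kernel's symbols:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct pci_bus; the counters only exist for illustration */
struct bus {
	int arch_calls;
	int core_calls;
};

/* Default arch hook: weak and empty, so a platform can override it */
__attribute__((weak)) void arch_setup_bridge(struct bus *bus,
					     unsigned long type)
{
	(void)bus;
	(void)type;
}

/* Generic code that always programs the bridge windows */
static void core_setup_bridge_resources(struct bus *bus, unsigned long type)
{
	(void)type;
	bus->core_calls++;
}

/* Generic entry point: arch hook first, then the architected update */
void setup_bridge(struct bus *bus, unsigned long type)
{
	assert(bus != NULL);
	arch_setup_bridge(bus, type);
	core_setup_bridge_resources(bus, type);
}
```

A platform that needs extra work would supply a non-weak
arch_setup_bridge() in its own file; the generic path stays unchanged.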

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 02/21] powerpc/powernv: Enable M64 on P7IOC
  2015-05-09  0:18     ` Alexey Kardashevskiy
@ 2015-05-11  4:37       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  4:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sat, May 09, 2015 at 10:18:42AM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>The patch enables M64 window on P7IOC, which has been enabled on
>>PHB3. Comparing to PHB3, there are 16 M64 BARs and each of them
>>are divided to 8 segments.
>
>"compared to something" means you will tell about PHB3 too :)
>

Ok. I'll add something about PHB3 in the next revision.

>Do I understand correctly that IODA==IODA1==P7IOC  and P7IOC != IODA2? The
>code does not use "PHB3" or "P7IOC" acronym so it is a bit confusing.
>
>

Your understanding is correct.

>>So each PHB can support 128 M64 segments.
>>Also, P7IOC has M64DT, which helps mapping one particular M64
>>segment# to arbitrary PE#. However, we just provide 128 M64 (16 BARs)
>>segments and fixed mapping between PE# and M64 segment# in order
>>to keep same logic to support M64 for PHB3 and P7IOC. In turn, we
>>just need different phb->init_m64() hooks for P7IOC and PHB3.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 115 ++++++++++++++++++++++++++----
>>  1 file changed, 103 insertions(+), 12 deletions(-)
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>index f8bc950..646962f 100644
>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>@@ -165,6 +165,67 @@ static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
>>  	clear_bit(pe, phb->ioda.pe_alloc);
>>  }
>>
>>+static int pnv_ioda1_init_m64(struct pnv_phb *phb)
>>+{
>>+	struct resource *r;
>>+	int seg;
>>+	s64 rc;
>
>Here @rc is of the "s64" type.
>

Ok. I'll use "int64_t rc", as you point out in the comments below.

>>+
>>+	/* Each PHB supports 16 separate M64 BARs, each of which are
>>+	 * divided into 8 segments. So there are number of M64 segments
>>+	 * as total PE#, which is 128.
>>+	 */
>
>"there are as many M64 segments as a maximum number of PEs which is 128"?
>

Thanks, your description is clearer. I will use it in the
next revision.
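
To make the arithmetic concrete, here is a small sketch of the layout
described in the comment: 16 BARs of 8 segments each, with PE#x tied to
segment#x. The helper names and macros are hypothetical; only the
numbers (16, 8) and the seg / 8, seg % 8 split come from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define M64_BARS		16	/* M64 BARs per PHB, per the patch */
#define M64_SEGS_PER_BAR	8	/* segments per BAR, per the patch */

/* Base address covered by one M64 BAR (hypothetical helper) */
static uint64_t m64_bar_base(uint64_t m64_base, uint64_t segsize,
			     unsigned int bar)
{
	assert(bar < M64_BARS);
	return m64_base + (uint64_t)bar * M64_SEGS_PER_BAR * segsize;
}

/* With the fixed mapping, PE#x uses segment#x: BAR x / 8, slot x % 8 */
static unsigned int pe_to_bar(unsigned int pe)
{
	return pe / M64_SEGS_PER_BAR;
}

static unsigned int pe_to_seg(unsigned int pe)
{
	return pe % M64_SEGS_PER_BAR;
}
```

So 16 * 8 = 128 segments in total, and PE#127 lands in BAR 15,
segment 7.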

>>+	for (seg = 0; seg < phb->ioda.total_pe; seg += 8) {
>>+		unsigned long base;
>>+
>>+		base = phb->ioda.m64_base + seg * phb->ioda.m64_segsize;
>>+		rc = opal_pci_set_phb_mem_window(phb->opal_id,
>>+						 OPAL_M64_WINDOW_TYPE,
>>+						 seg / 8,
>>+						 base,
>>+						 0, /* unused */
>>+						 8 * phb->ioda.m64_segsize);
>>+		if (rc != OPAL_SUCCESS) {
>>+			pr_warn("  Failure %lld configuring M64 BAR#%d on PHB#%d\n",
>>+				rc, seg / 8, phb->hose->global_number);
>>+			goto fail;
>>+		}
>>+
>>+		rc = opal_pci_phb_mmio_enable(phb->opal_id,
>>+					      OPAL_M64_WINDOW_TYPE,
>>+					      seg / 8,
>>+					      OPAL_ENABLE_M64_SPLIT);
>>+		if (rc != OPAL_SUCCESS) {
>>+			pr_warn("  Failure %lld enabling M64 BAR#%d on PHB#%d\n",
>>+				rc, seg / 8, phb->hose->global_number);
>>+			goto fail;
>>+		}
>>+	}
>>+
>>+	/* Strip of the segment used by the reserved PE, which
>>+	 * is expected to be 0 or last supported PE#
>>+	 */
>>+	r = &phb->hose->mem_resources[1];
>
>mem_resources[0] is IO, mem_resources[1] is MMIO, mem_resources[2] is for
>what? Would be nice to have this commented somewhere.
>

The fixed layout is determined by skiboot firmware. mem_resources[2] is for
64-bit prefetchable MMIO. I'll see if I can put some comments about them
somewhere in the next revision.

>>+	if (phb->ioda.reserved_pe == 0)
>>+		r->start += phb->ioda.m64_segsize;
>>+	else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1))
>>+		r->end -= phb->ioda.m64_segsize;
>>+	else
>>+		pr_warn("  Cannot strip M64 segment for reserved PE#%d\n",
>>+			phb->ioda.reserved_pe);
>>+
>>+	return 0;
>>+
>>+fail:
>>+	for ( ; seg >= 0; seg -= 8)
>>+		opal_pci_phb_mmio_enable(phb->opal_id,
>>+					 OPAL_M64_WINDOW_TYPE,
>>+					 seg / 8,
>>+					 OPAL_DISABLE_M64);
>
>Out of curiosity - is not there a counterpart for
>opal_pci_set_phb_mem_window() for cleanup?
>
>

No.
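
Since there is no "unset window" call, the cleanup can only disable the
BARs. A rough sketch of that unwind pattern follows; fw_enable_bar() and
fw_disable_bar() are made-up stand-ins for the firmware calls, and
fail_at is a knob that exists purely to exercise the error path.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_BARS 16

static bool bar_enabled[NUM_BARS];
static int fail_at = -1;	/* BAR whose enable fails, -1 for none */

/* Stand-in for the firmware "enable" call (assumption) */
static int fw_enable_bar(int bar)
{
	if (bar == fail_at)
		return -1;
	bar_enabled[bar] = true;
	return 0;
}

/* Stand-in for the firmware "disable" call (assumption) */
static void fw_disable_bar(int bar)
{
	assert(bar >= 0 && bar < NUM_BARS);
	bar_enabled[bar] = false;
}

static int enable_all_bars(void)
{
	int bar;

	for (bar = 0; bar < NUM_BARS; bar++) {
		if (fw_enable_bar(bar) != 0)
			goto fail;
	}
	return 0;

fail:
	/* Walk back from the failing BAR down to BAR 0, disabling each */
	for (; bar >= 0; bar--)
		fw_disable_bar(bar);
	return -1;
}
```

Disabling the BAR that failed is harmless here, which is why the loop
can start at the failing index instead of the one before it.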

>>+
>>+	return -EIO;
>>+}
>>+
>>  /* The default M64 BAR is shared by all PEs */
>>  static int pnv_ioda2_init_m64(struct pnv_phb *phb)
>>  {
>>@@ -222,7 +283,7 @@ fail:
>>  	return -EIO;
>>  }
>>
>>-static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
>>+static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb)
>>  {
>>  	resource_size_t sgsz = phb->ioda.m64_segsize;
>>  	struct pci_dev *pdev;
>>@@ -248,8 +309,8 @@ static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb)
>>  	}
>>  }
>>
>>-static int pnv_ioda2_pick_m64_pe(struct pnv_phb *phb,
>>-				 struct pci_bus *bus, int all)
>>+static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>+				struct pci_bus *bus, int all)
>>  {
>>  	resource_size_t segsz = phb->ioda.m64_segsize;
>>  	struct pci_dev *pdev;
>>@@ -346,6 +407,28 @@ done:
>>  			pe->master = master_pe;
>>  			list_add_tail(&pe->list, &master_pe->slaves);
>>  		}
>>+
>>+		/* P7IOC supports M64DT, which helps mapping M64 segment
>>+		 * to one particular PE#. Unfortunately, PHB3 has fixed
>
>Why is it "Unfortunately"? This is just the way it is :)
>

It's true that PHB3 is designed without M64DT while P7IOC has one. I think
M64DT is a nice thing that provides more flexibility: an arbitrary M64
segment can be mapped to any PE# with its help. So I said "unfortunately" :-)

>>+		 * mapping between M64 segment and PE#. In order for same
>>+		 * logic for P7IOC and PHB3, we enforce fixed mapping
>>+		 * between M64 segment and PE# on P7IOC.
>>+		 */
>>+		if (phb->type == PNV_PHB_IODA1) {
>>+			int64_t rc;
>
>Here @rc is of the "int64_t" type. And this one and the one above are used
>for return code from OPAL API. Make them the same (int64_t or long, up to
>you).
>

Yep. It will be "int64_t rc" as I said above.

>>+
>>+			rc = opal_pci_map_pe_mmio_window(phb->opal_id,
>>+							 pe->pe_number,
>>+							 OPAL_M64_WINDOW_TYPE,
>>+							 pe->pe_number / 8,
>>+							 pe->pe_number % 8);
>>+			if (rc != OPAL_SUCCESS)
>>+				pr_warn("%s: Failure %lld mapping "
>>+					"M64 for PHB#%d-PE#%d\n",
>>+					__func__, rc,
>>+					phb->hose->global_number,
>>+					pe->pe_number);
>>+		}
>>  	}
>>
>>  	kfree(pe_alloc);
>>@@ -360,12 +443,6 @@ static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>>  	const u32 *r;
>>  	u64 pci_addr;
>>
>>-	/* FIXME: Support M64 for P7IOC */
>>-	if (phb->type != PNV_PHB_IODA2) {
>>-		pr_info("  Not support M64 window\n");
>>-		return;
>>-	}
>>-
>>  	if (!firmware_has_feature(FW_FEATURE_OPALv3)) {
>>  		pr_info("  Firmware too old to support M64 window\n");
>>  		return;
>>@@ -394,9 +471,23 @@ static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>>
>>  	/* Use last M64 BAR to cover M64 window */
>>  	phb->ioda.m64_bar_idx = 15;
>>-	phb->init_m64 = pnv_ioda2_init_m64;
>>-	phb->reserve_m64_pe = pnv_ioda2_reserve_m64_pe;
>>-	phb->pick_m64_pe = pnv_ioda2_pick_m64_pe;
>>+	phb->reserve_m64_pe = pnv_ioda_reserve_m64_pe;
>
>
>reserve_m64_pe() is called once from pnv_pci_ioda_setup_PEs() so it is
>IODA-only and in this case reserve_m64_pe != NULL and
>pnv_ioda_reserve_m64_pe() will be called always.
>
>In general, it feels like pnv_phb has too many callbacks while they could be
>just direct calls.
>

We will have another type of IODA-compatible PHB soon. I'm not sure if I'm
allowed to reveal its name now. The new PHB won't have M64 support. I do think
callbacks give us more flexibility (for supporting M64 or not). Let's keep them.

>>+	phb->pick_m64_pe = pnv_ioda_pick_m64_pe;
>>+	switch (phb->type) {
>>+	case PNV_PHB_IODA1:
>>+		phb->init_m64 = pnv_ioda1_init_m64;
>>+		break;
>>+	case PNV_PHB_IODA2:
>>+		phb->init_m64 = pnv_ioda2_init_m64;
>>+		break;
>>+	default:
>>+		phb->init_m64 = NULL;
>>+		phb->reserve_m64_pe = NULL;
>>+		phb->pick_m64_pe = NULL;
>>+		phb->ioda.m64_size = 0;
>>+		phb->ioda.m64_segsize = 0;
>>+		phb->ioda.m64_base = 0;
>
>There are just 2 PHB types - IODA1 and IODA2, right? And the fields you reset
>after "default" - they have to be zeroes already, no? And on what hardware
>would the default branch actuall work? None?
>

Yeah, you're right. That garbage can be removed in the next revision.

Thanks,
Gavin

>
>>+	}
>>  }
>>
>>  static void pnv_ioda_freeze_pe(struct pnv_phb *phb, int pe_no)
>>
>
>
>-- 
>Alexey
>


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 03/21] powerpc/powernv: M64 support improvement
  2015-05-09 10:24     ` Alexey Kardashevskiy
@ 2015-05-11  4:47       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  4:47 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sat, May 09, 2015 at 08:24:14PM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>We're having the hardware or enforced (on P7IOC) limitation: M64
>
>I would think if it is enforced, then it is enforced by hardware but you say
>"hardware OR enforced" :)
>

PHB3 doesn't have M64DT in hardware. P7IOC supports it, but I
don't utilize the capability, so I called it enforced. Maybe
"software enforced" would be clearer? :-)

>
>>segment#x can only be assigned to PE#x. IO and M32 segment can be
>>mapped to arbitrary PE# via IODT and M32DT. It means the PE number
>>should be x if M64 segment#x has been assigned to the PE. Also, each
>>PE own one M64 segment at most. Currently, we are reserving PE#
>>according to root port's M64 window. It won't be reliable once we
>>extend M64 windows of root port, or the upstream port of the PCIE
>>switch behind root port to PHB's M64 window, in order to support
>>PCI hotplug in future.
>>
>>The patch reserves PE# for M64 segments according to the M64 resources
>>of the PCI devices (not bridges) contained in the PE. Besides, it's
>>always worthy to trace the M64 segments consumed by the PE, which can
>>be released at PCI unplugging time.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 190 ++++++++++++++++++------------
>>  arch/powerpc/platforms/powernv/pci.h      |  10 +-
>>  2 files changed, 122 insertions(+), 78 deletions(-)
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>index 646962f..a994882 100644
>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>@@ -283,28 +283,78 @@ fail:
>>  	return -EIO;
>>  }
>>
>>-static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb)
>>+/* We extend the M64 window of root port, or the upstream bridge port
>>+ * of the PCIE switch behind root port. So we shouldn't reserve PEs
>>+ * for M64 resources because there are no (normal) PCI devices consuming
>
>"PCI devices"? Not "root ports or PCI bridges"?
>

I have "no (normal) PCI devices" here, which means root ports and PCI bridges
are excluded.

>>+ * M64 resources on the PCI buses leading from root port, or the upstream
>>+ * bridge port.The function returns true if the indicated PCI bus needs
>>+ * reserved PEs because of M64 resources in advance. Otherwise, the
>>+ * function returns false.
>>+ */
>>+static bool pnv_ioda_need_m64_pe(struct pnv_phb *phb,
>>+				 struct pci_bus *bus)
>>  {
>>-	resource_size_t sgsz = phb->ioda.m64_segsize;
>>+	/* Root bus */
>
>The comment is too obvious as the call below is called "pci_is_root_bus" :)
>

Indeed, it will be dropped in the next revision.

>>+	if (!bus || pci_is_root_bus(bus))
>>+		return false;
>>+
>>+	/* Bus leading from root port. We need check what types of PCI
>>+	 * devices on the bus. If it's connecting PCI bridge, we don't
>>+	 * need reserve M64 PEs for it. Otherwise, we still need to do
>>+	 * that.
>>+	 */
>>+	if (pci_is_root_bus(bus->self->bus)) {
>>+		struct pci_dev *pdev;
>>+
>>+		list_for_each_entry(pdev, &bus->devices, bus_list) {
>>+			if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
>>+				return true;
>>+		}
>>+
>>+		return false;
>>+	}
>>+
>>+	/* Bus leading from the upstream bridge port on top level */
>>+	if (pci_is_root_bus(bus->self->bus->self->bus))
>
>
>Is it for second level bridges? Like root->bridge->bridge? And for 3 levels
>you will need a PE?
>

It's for the upstream port of a PCIe switch behind the root port (a bit
complicated). Yes, the bus leading from the downstream port will need a PE,
as you said.

>>+		return false;
>>+
>>+	return true;
>>+}
>>+
>>+static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
>>+				    struct pci_bus *bus)
>>+{
>>+	resource_size_t segsz = phb->ioda.m64_segsize;
>>  	struct pci_dev *pdev;
>>  	struct resource *r;
>>-	int base, step, i;
>>+	unsigned long pe_no, limit;
>>+	int i;
>>
>>-	/*
>>-	 * Root bus always has full M64 range and root port has
>>-	 * M64 range used in reality. So we're checking root port
>>-	 * instead of root bus.
>>+	if (!pnv_ioda_need_m64_pe(phb, bus))
>>+		return;
>>+
>>+	/* The bridge's M64 window might have been extended to the
>>+	 * PHB's M64 window in order to support PCI hotplug. So the
>>+	 * bridge's M64 window isn't reliable to be used for picking
>>+	 * PE# for its leading PCI bus. We have to check the M64
>>+	 * resources consumed by the PCI devices, which seat on the
>>+	 * PCI bus.
>>  	 */
>>-	list_for_each_entry(pdev, &phb->hose->bus->devices, bus_list) {
>>-		for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
>>-			r = &pdev->resource[PCI_BRIDGE_RESOURCES + i];
>>-			if (!r->parent ||
>>-			    !pnv_pci_is_mem_pref_64(r->flags))
>>+	list_for_each_entry(pdev, &bus->devices, bus_list) {
>>+		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
>>+#ifdef CONFIG_PCI_IOV
>>+			if (i >= PCI_IOV_RESOURCES && i <= PCI_IOV_RESOURCE_END)
>>+				continue;
>>+#endif
>>+			r = &pdev->resource[i];
>>+			if (!r->flags || r->start >= r->end ||
>>+			    !r->parent || !pnv_pci_is_mem_pref_64(r->flags))
>>  				continue;
>>
>>-			base = (r->start - phb->ioda.m64_base) / sgsz;
>>-			for (step = 0; step < resource_size(r) / sgsz; step++)
>>-				pnv_ioda_reserve_pe(phb, base + step);
>>+			pe_no = (r->start - phb->ioda.m64_base) / segsz;
>>+			limit = ALIGN(r->end - phb->ioda.m64_base, segsz) / segsz;
>>+			for (; pe_no < limit; pe_no++)
>>+				pnv_ioda_reserve_pe(phb, pe_no);
>>  		}
>>  	}
>>  }
>>@@ -316,85 +366,64 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>  	struct pci_dev *pdev;
>>  	struct resource *r;
>>  	struct pnv_ioda_pe *master_pe, *pe;
>>-	unsigned long size, *pe_alloc;
>>-	bool found;
>>-	int start, i, j;
>>-
>>-	/* Root bus shouldn't use M64 */
>>-	if (pci_is_root_bus(bus))
>>-		return IODA_INVALID_PE;
>>-
>>-	/* We support only one M64 window on each bus */
>>-	found = false;
>>-	pci_bus_for_each_resource(bus, r, i) {
>>-		if (r && r->parent &&
>>-		    pnv_pci_is_mem_pref_64(r->flags)) {
>>-			found = true;
>>-			break;
>>-		}
>>-	}
>>+	unsigned long size, *pe_bitsmap;
>
>s/pe_bitsmap/pe_bitmap/
>

Yeah, will fix it up. Thanks!

>>+	unsigned long pe_no, limit;
>>+	int i;
>>
>>-	/* No M64 window found ? */
>>-	if (!found)
>>+	if (!pnv_ioda_need_m64_pe(phb, bus))
>>  		return IODA_INVALID_PE;
>>
>>-	/* Allocate bitmap */
>>+        /* Allocate bitmap */
>>  	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>>-	pe_alloc = kzalloc(size, GFP_KERNEL);
>>-	if (!pe_alloc) {
>>-		pr_warn("%s: Out of memory !\n",
>>-			__func__);
>>+	pe_bitsmap = kzalloc(size, GFP_KERNEL);
>>+	if (!pe_bitsmap) {
>>+		pr_warn("%s: Out of memory !\n", __func__);
>>  		return IODA_INVALID_PE;
>>  	}
>>
>>-	/*
>>-	 * Figure out reserved PE numbers by the PE
>>-	 * the its child PEs.
>>-	 */
>>-	start = (r->start - phb->ioda.m64_base) / segsz;
>>-	for (i = 0; i < resource_size(r) / segsz; i++)
>>-		set_bit(start + i, pe_alloc);
>>-
>>-	if (all)
>>-		goto done;
>>-
>>-	/*
>>-	 * If the PE doesn't cover all subordinate buses,
>>-	 * we need subtract from reserved PEs for children.
>>+	/* The bridge's M64 window might be extended to PHB's M64
>>+	 * window by intention to support PCI hotplug. So we have
>>+	 * to check the M64 resources consumed by the PCI devices
>>+	 * on the PCI bus.
>>  	 */
>>  	list_for_each_entry(pdev, &bus->devices, bus_list) {
>>-		if (!pdev->subordinate)
>>-			continue;
>>+		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
>>+#ifdef CONFIG_PCI_IOV
>>+			if (i >= PCI_IOV_RESOURCES &&
>>+			    i <= PCI_IOV_RESOURCE_END)
>>+				continue;
>>+#endif
>>+			/* Don't scan bridge's window if the PE
>>+			 * doesn't contain its subordinate bus.
>>+			 */
>>+			if (!all && i >= PCI_BRIDGE_RESOURCES &&
>>+			    i <= PCI_BRIDGE_RESOURCE_END)
>>+				continue;
>>
>>-		pci_bus_for_each_resource(pdev->subordinate, r, i) {
>>-			if (!r || !r->parent ||
>>-			    !pnv_pci_is_mem_pref_64(r->flags))
>>+			r = &pdev->resource[i];
>>+			if (!r->flags || r->start >= r->end ||
>>+			    !r->parent || !pnv_pci_is_mem_pref_64(r->flags))
>>  				continue;
>>
>>-			start = (r->start - phb->ioda.m64_base) / segsz;
>>-			for (j = 0; j < resource_size(r) / segsz ; j++)
>>-				clear_bit(start + j, pe_alloc);
>>-                }
>>-        }
>>+			pe_no = (r->start - phb->ioda.m64_base) / segsz;
>>+			limit = ALIGN(r->end - phb->ioda.m64_base, segsz) / segsz;
>>+			for (; pe_no < limit; pe_no++)
>>+				set_bit(pe_no, pe_bitsmap);
>>+		}
>>+	}
>>
>>-	/*
>>-	 * the current bus might not own M64 window and that's all
>>-	 * contributed by its child buses. For the case, we needn't
>>-	 * pick M64 dependent PE#.
>>-	 */
>>-	if (bitmap_empty(pe_alloc, phb->ioda.total_pe)) {
>>-		kfree(pe_alloc);
>>+	/* No M64 window found ? */
>>+	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
>>+		kfree(pe_bitsmap);
>>  		return IODA_INVALID_PE;
>>  	}
>>
>>-	/*
>>-	 * Figure out the master PE and put all slave PEs to master
>>-	 * PE's list to form compound PE.
>>+	/* Figure out the master PE and put all slave PEs
>>+	 * to master PE's list to form compound PE.
>>  	 */
>>-done:
>>  	master_pe = NULL;
>>  	i = -1;
>>-	while ((i = find_next_bit(pe_alloc, phb->ioda.total_pe, i + 1)) <
>>+	while ((i = find_next_bit(pe_bitsmap, phb->ioda.total_pe, i + 1)) <
>>  		phb->ioda.total_pe) {
>>  		pe = &phb->ioda.pe_array[i];
>>
>>@@ -408,6 +437,13 @@ done:
>>  			list_add_tail(&pe->list, &master_pe->slaves);
>>  		}
>>
>>+		/* Pick the M64 segment, which should be available. Also,
>
>test_and_set_bit() does not pick or choose, it just marks PE#pe_number used.
>

That's true. I will change "Pick the M64 segment" to "Reserve the M64
segment" in the next revision. If that's still not what you're suggesting,
please let me know :-)
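
For reference, the "mark and report the old value" behaviour can be sketched
in userspace as below. This is a non-atomic stand-in written for illustration
(the kernel's test_and_set_bit() is atomic); the mock_ name and MAP_WORDS are
assumptions, not kernel identifiers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAP_WORDS	8
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* Non-atomic stand-in: sets bit nr and returns whether it was already set. */
static bool mock_test_and_set_bit(unsigned int nr, unsigned long *map)
{
	unsigned long mask = 1UL << (nr % WORD_BITS);
	unsigned long *word = &map[nr / WORD_BITS];
	bool was_set = (*word & mask) != 0;

	assert(nr < MAP_WORDS * WORD_BITS);
	*word |= mask;
	return was_set;
}
```

A non-zero return means the segment was already marked, which is the
condition the BUG_ON() in the patch trips on.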

>>+		 * those M64 segments consumed by slave PEs are contributed
>>+		 * to the master PE.
>>+		 */
>>+		BUG_ON(test_and_set_bit(pe->pe_number, phb->ioda.m64_segmap));
>>+		BUG_ON(test_and_set_bit(pe->pe_number, master_pe->m64_segmap));
>>+
>>  		/* P7IOC supports M64DT, which helps mapping M64 segment
>>  		 * to one particular PE#. Unfortunately, PHB3 has fixed
>>  		 * mapping between M64 segment and PE#. In order for same
>>@@ -431,7 +467,7 @@ done:
>>  		}
>>  	}
>>
>>-	kfree(pe_alloc);
>>+	kfree(pe_bitsmap);
>>  	return master_pe->pe_number;
>>  }
>>
>>@@ -1233,7 +1269,7 @@ static void pnv_pci_ioda_setup_PEs(void)
>>
>>  		/* M64 layout might affect PE allocation */
>>  		if (phb->reserve_m64_pe)
>>-			phb->reserve_m64_pe(phb);
>>+			phb->reserve_m64_pe(phb, phb->hose->bus);
>>
>>  		pnv_ioda_setup_PEs(hose->bus);
>>  	}
>>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>index 070ee88..19022cf 100644
>>--- a/arch/powerpc/platforms/powernv/pci.h
>>+++ b/arch/powerpc/platforms/powernv/pci.h
>>@@ -49,6 +49,13 @@ struct pnv_ioda_pe {
>>  	/* PE number */
>>  	unsigned int		pe_number;
>>
>>+	/* IO/M32/M64 segments consumed by the PE. Each PE can
>>+	 * have one M64 segment at most, but M64 segments consumed
>>+	 * by slave PEs will be contributed to the master PE. One
>>+	 * PE can own multiple IO and M32 segments.
>>+	 */
>>+	unsigned long		m64_segmap[8];
>
>
>Why 8? 64*8 = 512 segments?  s'8'512/sizeof(unsigned long)' may be?
>

There are 128 M64 segments on P7IOC, but 256 M64 segments on PHB3.
512 is a number bigger than both 128 and 256. I still prefer
m64_segmap[8] :-)
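
The arithmetic behind the array size can be written out as a small helper.
This is only a sketch: words_for_bits() is a hypothetical name, and the "8"
corresponds to 512 bits only when unsigned long is 64 bits wide, as on ppc64.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: words of word_bits bits needed to hold nbits bits. */
static size_t words_for_bits(size_t nbits, size_t word_bits)
{
	assert(word_bits > 0);
	return (nbits + word_bits - 1) / word_bits;
}
```

With 64-bit words, 128 segments need 2 words, 256 need 4, and 8 words cover
512 segments.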

>
>>+
>>  	/* "Weight" assigned to the PE for the sake of DMA resource
>>  	 * allocations
>>  	 */
>>@@ -114,7 +121,7 @@ struct pnv_phb {
>>  	u32 (*bdfn_to_pe)(struct pnv_phb *phb, struct pci_bus *bus, u32 devfn);
>>  	void (*shutdown)(struct pnv_phb *phb);
>>  	int (*init_m64)(struct pnv_phb *phb);
>>-	void (*reserve_m64_pe)(struct pnv_phb *phb);
>>+	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
>>  	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
>>  	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>>  	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
>>@@ -153,6 +160,7 @@ struct pnv_phb {
>>  			struct mutex		pe_alloc_mutex;
>>
>>  			/* M32 & IO segment maps */
>>+			unsigned long		m64_segmap[8];
>>  			unsigned int		*m32_segmap;
>>  			unsigned int		*io_segmap;
>>  			struct pnv_ioda_pe	*pe_array;
>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 04/21] powerpc/powernv: Improve IO and M32 mapping
  2015-05-09 10:53     ` Alexey Kardashevskiy
@ 2015-05-11  4:52       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  4:52 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sat, May 09, 2015 at 08:53:38PM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>The PHB's IO or M32 window is divided evenly to segments, each of
>>them can be mapped to arbitrary PE# by IODT or M32DT. Current code
>>figures out the consumed IO and M32 segments by one particular PE
>>from the windows of the PE's upstream bridge. It won't be reliable
>>once we extend M64 windows of root port, or the upstream port of
>>the PCIE switch behind root port to PHB's IO or M32 window, in order
>>to support PCI hotplug in future.
>>
>>The patch improves pnv_ioda_setup_pe_seg() to calculate PE's consumed
>>IO or M32 segments from its contained devices, no bridge involved any
>>more. Also, the logic to mapping IO and M32 segments are combined to
>>simplify the code. Besides, it's always worthy to trace the IO and M32
>>segments consumed by one PE, which can be released at PCI unplugging
>>time.
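
As a rough illustration of "divided evenly to segments", the segment indexes
touched by one resource can be computed as below. This is a sketch under
assumptions: the names are hypothetical, the end address is inclusive, and
the last index is taken as end / segsize, which is a simplification and not
the loop form used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive range of segment indexes covered by one resource. */
struct seg_range {
	uint64_t first;
	uint64_t last;
};

/*
 * win_base is the start of the PHB window, segsize the size of one
 * segment, and [start, end] the resource with an inclusive end.
 */
static struct seg_range segments_for_region(uint64_t start, uint64_t end,
					    uint64_t win_base, uint64_t segsize)
{
	struct seg_range r;

	assert(segsize > 0 && start >= win_base && end >= start);
	r.first = (start - win_base) / segsize;
	r.last = (end - win_base) / segsize;
	return r;
}
```

Each index in the returned range would then be mapped to the PE and recorded
in the segment maps.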
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 150 ++++++++++++++++--------------
>>  arch/powerpc/platforms/powernv/pci.h      |  13 +--
>>  2 files changed, 85 insertions(+), 78 deletions(-)
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>index a994882..7e6e266 100644
>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>@@ -2543,77 +2543,92 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>  }
>>  #endif /* CONFIG_PCI_IOV */
>>
>>-/*
>>- * This function is supposed to be called on basis of PE from top
>>- * to bottom style. So the the I/O or MMIO segment assigned to
>>- * parent PE could be overrided by its child PEs if necessary.
>>- */
>>-static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
>>-				  struct pnv_ioda_pe *pe)
>>+static int pnv_ioda_map_pe_one_res(struct pci_controller *hose,
>>+				   struct pnv_ioda_pe *pe,
>>+				   struct resource *res)
>>  {
>>  	struct pnv_phb *phb = hose->private_data;
>>  	struct pci_bus_region region;
>>-	struct resource *res;
>>-	int i, index;
>>-	int rc;
>>+	unsigned int segsize, index;
>>+	unsigned long *segmap, *pe_segmap;
>>+	uint16_t win_type;
>>+	int64_t rc;
>>
>>-	/*
>>-	 * NOTE: We only care PCI bus based PE for now. For PCI
>>-	 * device based PE, for example SRIOV sensitive VF should
>>-	 * be figured out later.
>>-	 */
>>-	BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)));
>>+	/* Check if we need map the resource */
>>+	if (!res->parent || !res->flags ||
>>+	    res->start > res->end ||
>>+	    pnv_pci_is_mem_pref_64(res->flags))
>>+		return 0;
>>
>>-	pci_bus_for_each_resource(pe->pbus, res, i) {
>>-		if (!res || !res->flags ||
>>-		    res->start > res->end)
>>-			continue;
>>+	if (res->flags & IORESOURCE_IO) {
>>+		segmap = phb->ioda.io_segmap;
>>+		pe_segmap = pe->io_segmap;
>>+		region.start = res->start - phb->ioda.io_pci_base;
>>+		region.end = res->end - phb->ioda.io_pci_base;
>>+		segsize = phb->ioda.io_segsize;
>>+		win_type = OPAL_IO_WINDOW_TYPE;
>>+	} else {
>>+		segmap = phb->ioda.m32_segmap;
>>+		pe_segmap = pe->m32_segmap;
>>+		region.start = res->start -
>>+			       hose->mem_offset[0] -
>>+			       phb->ioda.m32_pci_base;
>>+		region.end = res->end -
>>+			     hose->mem_offset[0] -
>>+			     phb->ioda.m32_pci_base;
>>+		segsize = phb->ioda.m32_segsize;
>>+		win_type = OPAL_M32_WINDOW_TYPE;
>>+	}
>>+
>>+	index = region.start / segsize;
>>+	while (index < phb->ioda.total_pe &&
>>+	       region.start <= region.end) {
>>+		rc = opal_pci_map_pe_mmio_window(phb->opal_id,
>>+				pe->pe_number, win_type, 0, index);
>>+		if (rc != OPAL_SUCCESS) {
>>+			pr_warn("%s: Error %lld mapping (%d) seg#%d to PE#%d\n",
>>+				__func__, rc, win_type, index, pe->pe_number);
>>+			return -EIO;
>>+		}
>>
>>-		if (res->flags & IORESOURCE_IO) {
>>-			region.start = res->start - phb->ioda.io_pci_base;
>>-			region.end   = res->end - phb->ioda.io_pci_base;
>>-			index = region.start / phb->ioda.io_segsize;
>>+		set_bit(index, segmap);
>>+		set_bit(index, pe_segmap);
>>+		region.start += segsize;
>>+		index++;
>>+	}
>>
>>-			while (index < phb->ioda.total_pe &&
>>-			       region.start <= region.end) {
>>-				phb->ioda.io_segmap[index] = pe->pe_number;
>>-				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
>>-					pe->pe_number, OPAL_IO_WINDOW_TYPE, 0, index);
>>-				if (rc != OPAL_SUCCESS) {
>>-					pr_err("%s: OPAL error %d when mapping IO "
>>-					       "segment #%d to PE#%d\n",
>>-					       __func__, rc, index, pe->pe_number);
>>-					break;
>>-				}
>>+	return 0;
>>+}
>>
>>-				region.start += phb->ioda.io_segsize;
>>-				index++;
>>-			}
>>-		} else if ((res->flags & IORESOURCE_MEM) &&
>>-			   !pnv_pci_is_mem_pref_64(res->flags)) {
>>-			region.start = res->start -
>>-				       hose->mem_offset[0] -
>>-				       phb->ioda.m32_pci_base;
>>-			region.end   = res->end -
>>-				       hose->mem_offset[0] -
>>-				       phb->ioda.m32_pci_base;
>>-			index = region.start / phb->ioda.m32_segsize;
>>-
>>-			while (index < phb->ioda.total_pe &&
>>-			       region.start <= region.end) {
>>-				phb->ioda.m32_segmap[index] = pe->pe_number;
>>-				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
>>-					pe->pe_number, OPAL_M32_WINDOW_TYPE, 0, index);
>>-				if (rc != OPAL_SUCCESS) {
>>-					pr_err("%s: OPAL error %d when mapping M32 "
>>-					       "segment#%d to PE#%d",
>>-					       __func__, rc, index, pe->pe_number);
>>-					break;
>>-				}
>>+static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
>>+				  struct pnv_ioda_pe *pe)
>>+{
>>+	struct pci_dev *pdev;
>>+	struct resource *res;
>>+	int i;
>>
>>-				region.start += phb->ioda.m32_segsize;
>>-				index++;
>>-			}
>>+	/* This function only works for bus dependent PE */
>>+	BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)));
>>+
>>+	list_for_each_entry(pdev, &pe->pbus->devices, bus_list) {
>>+		for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
>>+			res = &pdev->resource[i];
>>+			if (pnv_ioda_map_pe_one_res(hose, pe, res))
>>+				return;
>>+		}
>>+
>>+		/* If the PE contains all subordinate PCI buses, the
>>+		 * resources of the child bridges should be mapped
>>+		 * to the PE as well.
>>+		 */
>>+		if (!(pe->flags & PNV_IODA_PE_BUS_ALL) ||
>>+		    (pdev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
>>+			continue;
>>+
>>+		for (i = 0; i <= PCI_BRIDGE_RESOURCE_NUM; i++) {
>>+			res = &pdev->resource[PCI_BRIDGE_RESOURCES + i];
>>+			if (pnv_ioda_map_pe_one_res(hose, pe, res))
>>+				return;
>
>
>This chunk is really hard to review. Looks like you completely reimplemented
>the function instead of patching it. For review-ability and bisect-ability it
>would help to split it to several simpler patches.
>
>

Yep, it's a good suggestion. I'll check if I can split it up in the next
revision.

>
>>  		}
>>  	}
>>  }
>>@@ -2780,7 +2795,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>  {
>>  	struct pci_controller *hose;
>>  	struct pnv_phb *phb;
>>-	unsigned long size, m32map_off, pemap_off, iomap_off = 0;
>>+	unsigned long size, pemap_off;
>>  	const __be64 *prop64;
>>  	const __be32 *prop32;
>>  	int len;
>>@@ -2865,19 +2880,10 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>
>>  	/* Allocate aux data & arrays. We don't have IO ports on PHB3 */
>>  	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>>-	m32map_off = size;
>>-	size += phb->ioda.total_pe * sizeof(phb->ioda.m32_segmap[0]);
>>-	if (phb->type == PNV_PHB_IODA1) {
>>-		iomap_off = size;
>>-		size += phb->ioda.total_pe * sizeof(phb->ioda.io_segmap[0]);
>>-	}
>>  	pemap_off = size;
>>  	size += phb->ioda.total_pe * sizeof(struct pnv_ioda_pe);
>>  	aux = memblock_virt_alloc(size, 0);
>>  	phb->ioda.pe_alloc = aux;
>>-	phb->ioda.m32_segmap = aux + m32map_off;
>>-	if (phb->type == PNV_PHB_IODA1)
>>-		phb->ioda.io_segmap = aux + iomap_off;
>>  	phb->ioda.pe_array = aux + pemap_off;
>>  	set_bit(phb->ioda.reserved_pe, phb->ioda.pe_alloc);
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>index 19022cf..f604bb7 100644
>>--- a/arch/powerpc/platforms/powernv/pci.h
>>+++ b/arch/powerpc/platforms/powernv/pci.h
>>@@ -54,6 +54,8 @@ struct pnv_ioda_pe {
>>  	 * by slave PEs will be contributed to the master PE. One
>>  	 * PE can own multiple IO and M32 segments.
>>  	 */
>>+	unsigned long		io_segmap[8];
>>+	unsigned long		m32_segmap[8];
>>  	unsigned long		m64_segmap[8];
>>
>>  	/* "Weight" assigned to the PE for the sake of DMA resource
>>@@ -154,16 +156,15 @@ struct pnv_phb {
>>  			unsigned int		io_segsize;
>>  			unsigned int		io_pci_base;
>>
>>-			/* PE allocation bitmap */
>>+			/* PE allocation */
>>+			struct pnv_ioda_pe	*pe_array;
>>  			unsigned long		*pe_alloc;
>>-			/* PE allocation mutex */
>>  			struct mutex		pe_alloc_mutex;
>>
>>-			/* M32 & IO segment maps */
>>+			/* IO/M32/M64 segment bitmaps */
>>+			unsigned long		io_segmap[8];
>>+			unsigned long		m32_segmap[8];
>>  			unsigned long		m64_segmap[8];
>
>
>Is this a copy of the same name fields above, in pnv_ioda_pe? Why 8?
>

Yes, but the fields you pointed out in pnv_ioda_pe and pnv_phb serve
different purposes: the former track the segments that one particular PE
consumes, while the latter track the segments on one particular PHB that
have been assigned or reserved.
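
A minimal sketch of how the two levels of maps could work together at unplug
time, assuming simplified mock structures (the mock_* names and helpers are
made up for illustration and are not part of the patch):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SEGMAP_WORDS	8
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-ins holding only the M64 segment bitmaps. */
struct mock_pe  { unsigned long m64_segmap[SEGMAP_WORDS]; };
struct mock_phb { unsigned long m64_segmap[SEGMAP_WORDS]; };

static bool mock_segment_in_use(const unsigned long *map, unsigned int seg)
{
	assert(seg < SEGMAP_WORDS * WORD_BITS);
	return (map[seg / WORD_BITS] >> (seg % WORD_BITS)) & 1UL;
}

/* Record the segment in both the PHB-wide map and the PE's own map. */
static void mock_claim_segment(struct mock_phb *phb, struct mock_pe *pe,
			       unsigned int seg)
{
	unsigned long mask = 1UL << (seg % WORD_BITS);

	assert(seg < SEGMAP_WORDS * WORD_BITS);
	phb->m64_segmap[seg / WORD_BITS] |= mask;
	pe->m64_segmap[seg / WORD_BITS] |= mask;
}

/* On unplug: give back exactly the segments this PE consumed. */
static void mock_release_pe_segments(struct mock_phb *phb, struct mock_pe *pe)
{
	size_t i;

	for (i = 0; i < SEGMAP_WORDS; i++) {
		phb->m64_segmap[i] &= ~pe->m64_segmap[i];
		pe->m64_segmap[i] = 0;
	}
}
```

The per-PE map is what makes the release possible without rescanning the
devices: the PHB-wide map alone doesn't say which PE owns a segment.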

>
>>-			unsigned int		*m32_segmap;
>>-			unsigned int		*io_segmap;
>>-			struct pnv_ioda_pe	*pe_array;
>>
>
>Why moved this?
>

The only reason is that I think pe_array and pe_alloc should be placed
close together, as they depend on each other in the code: when allocating
a PE instance, the PE# is picked first and then the PE instance :-)
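
The dependency can be sketched as a two-step allocation. This is an
illustrative userspace model, not the kernel allocator: TOTAL_PE, the mock_*
types and mock_alloc_pe() are assumptions, and no locking is shown (the real
structure carries pe_alloc_mutex).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOTAL_PE	256
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define ALLOC_WORDS	((TOTAL_PE + WORD_BITS - 1) / WORD_BITS)

struct mock_pe {
	unsigned int pe_number;
};

struct mock_ioda {
	struct mock_pe pe_array[TOTAL_PE];	/* PE instances, indexed by PE# */
	unsigned long pe_alloc[ALLOC_WORDS];	/* one bit per allocated PE# */
};

/* Returns the PE instance for the first free PE#, or NULL if none is left. */
static struct mock_pe *mock_alloc_pe(struct mock_ioda *ioda)
{
	unsigned int pe_no;

	assert(ioda != NULL);
	for (pe_no = 0; pe_no < TOTAL_PE; pe_no++) {
		unsigned long mask = 1UL << (pe_no % WORD_BITS);
		unsigned long *word = &ioda->pe_alloc[pe_no / WORD_BITS];

		if (*word & mask)
			continue;

		*word |= mask;				/* step 1: pick the PE# */
		ioda->pe_array[pe_no].pe_number = pe_no;
		return &ioda->pe_array[pe_no];		/* step 2: the instance */
	}

	return NULL;
}
```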

>>  			/* IRQ chip */
>>  			int			irq_chip_init;
>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 04/21] powerpc/powernv: Improve IO and M32 mapping
@ 2015-05-11  4:52       ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  4:52 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: bhelgaas, linux-pci, linuxppc-dev, Gavin Shan

On Sat, May 09, 2015 at 08:53:38PM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>The PHB's IO or M32 window is divided evenly to segments, each of
>>them can be mapped to arbitrary PE# by IODT or M32DT. Current code
>>figures out the consumed IO and M32 segments by one particular PE
>>from the windows of the PE's upstream bridge. It won't be reliable
>>once we extend M64 windows of root port, or the upstream port of
>>the PCIE switch behind root port to PHB's IO or M32 window, in order
>>to support PCI hotplug in future.
>>
>>The patch improves pnv_ioda_setup_pe_seg() to calculate PE's consumed
>>IO or M32 segments from its contained devices, no bridge involved any
>>more. Also, the logic to mapping IO and M32 segments are combined to
>>simplify the code. Besides, it's always worthy to trace the IO and M32
>>segments consumed by one PE, which can be released at PCI unplugging
>>time.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 150 ++++++++++++++++--------------
>>  arch/powerpc/platforms/powernv/pci.h      |  13 +--
>>  2 files changed, 85 insertions(+), 78 deletions(-)
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>index a994882..7e6e266 100644
>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>@@ -2543,77 +2543,92 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>  }
>>  #endif /* CONFIG_PCI_IOV */
>>
>>-/*
>>- * This function is supposed to be called on basis of PE from top
>>- * to bottom style. So the the I/O or MMIO segment assigned to
>>- * parent PE could be overrided by its child PEs if necessary.
>>- */
>>-static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
>>-				  struct pnv_ioda_pe *pe)
>>+static int pnv_ioda_map_pe_one_res(struct pci_controller *hose,
>>+				   struct pnv_ioda_pe *pe,
>>+				   struct resource *res)
>>  {
>>  	struct pnv_phb *phb = hose->private_data;
>>  	struct pci_bus_region region;
>>-	struct resource *res;
>>-	int i, index;
>>-	int rc;
>>+	unsigned int segsize, index;
>>+	unsigned long *segmap, *pe_segmap;
>>+	uint16_t win_type;
>>+	int64_t rc;
>>
>>-	/*
>>-	 * NOTE: We only care PCI bus based PE for now. For PCI
>>-	 * device based PE, for example SRIOV sensitive VF should
>>-	 * be figured out later.
>>-	 */
>>-	BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)));
>>+	/* Check if we need map the resource */
>>+	if (!res->parent || !res->flags ||
>>+	    res->start > res->end ||
>>+	    pnv_pci_is_mem_pref_64(res->flags))
>>+		return 0;
>>
>>-	pci_bus_for_each_resource(pe->pbus, res, i) {
>>-		if (!res || !res->flags ||
>>-		    res->start > res->end)
>>-			continue;
>>+	if (res->flags & IORESOURCE_IO) {
>>+		segmap = phb->ioda.io_segmap;
>>+		pe_segmap = pe->io_segmap;
>>+		region.start = res->start - phb->ioda.io_pci_base;
>>+		region.end = res->end - phb->ioda.io_pci_base;
>>+		segsize = phb->ioda.io_segsize;
>>+		win_type = OPAL_IO_WINDOW_TYPE;
>>+	} else {
>>+		segmap = phb->ioda.m32_segmap;
>>+		pe_segmap = pe->m32_segmap;
>>+		region.start = res->start -
>>+			       hose->mem_offset[0] -
>>+			       phb->ioda.m32_pci_base;
>>+		region.end = res->end -
>>+			     hose->mem_offset[0] -
>>+			     phb->ioda.m32_pci_base;
>>+		segsize = phb->ioda.m32_segsize;
>>+		win_type = OPAL_M32_WINDOW_TYPE;
>>+	}
>>+
>>+	index = region.start / segsize;
>>+	while (index < phb->ioda.total_pe &&
>>+	       region.start <= region.end) {
>>+		rc = opal_pci_map_pe_mmio_window(phb->opal_id,
>>+				pe->pe_number, win_type, 0, index);
>>+		if (rc != OPAL_SUCCESS) {
>>+			pr_warn("%s: Error %lld mapping (%d) seg#%d to PE#%d\n",
>>+				__func__, rc, win_type, index, pe->pe_number);
>>+			return -EIO;
>>+		}
>>
>>-		if (res->flags & IORESOURCE_IO) {
>>-			region.start = res->start - phb->ioda.io_pci_base;
>>-			region.end   = res->end - phb->ioda.io_pci_base;
>>-			index = region.start / phb->ioda.io_segsize;
>>+		set_bit(index, segmap);
>>+		set_bit(index, pe_segmap);
>>+		region.start += segsize;
>>+		index++;
>>+	}
>>
>>-			while (index < phb->ioda.total_pe &&
>>-			       region.start <= region.end) {
>>-				phb->ioda.io_segmap[index] = pe->pe_number;
>>-				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
>>-					pe->pe_number, OPAL_IO_WINDOW_TYPE, 0, index);
>>-				if (rc != OPAL_SUCCESS) {
>>-					pr_err("%s: OPAL error %d when mapping IO "
>>-					       "segment #%d to PE#%d\n",
>>-					       __func__, rc, index, pe->pe_number);
>>-					break;
>>-				}
>>+	return 0;
>>+}
>>
>>-				region.start += phb->ioda.io_segsize;
>>-				index++;
>>-			}
>>-		} else if ((res->flags & IORESOURCE_MEM) &&
>>-			   !pnv_pci_is_mem_pref_64(res->flags)) {
>>-			region.start = res->start -
>>-				       hose->mem_offset[0] -
>>-				       phb->ioda.m32_pci_base;
>>-			region.end   = res->end -
>>-				       hose->mem_offset[0] -
>>-				       phb->ioda.m32_pci_base;
>>-			index = region.start / phb->ioda.m32_segsize;
>>-
>>-			while (index < phb->ioda.total_pe &&
>>-			       region.start <= region.end) {
>>-				phb->ioda.m32_segmap[index] = pe->pe_number;
>>-				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
>>-					pe->pe_number, OPAL_M32_WINDOW_TYPE, 0, index);
>>-				if (rc != OPAL_SUCCESS) {
>>-					pr_err("%s: OPAL error %d when mapping M32 "
>>-					       "segment#%d to PE#%d",
>>-					       __func__, rc, index, pe->pe_number);
>>-					break;
>>-				}
>>+static void pnv_ioda_setup_pe_seg(struct pci_controller *hose,
>>+				  struct pnv_ioda_pe *pe)
>>+{
>>+	struct pci_dev *pdev;
>>+	struct resource *res;
>>+	int i;
>>
>>-				region.start += phb->ioda.m32_segsize;
>>-				index++;
>>-			}
>>+	/* This function only works for bus dependent PE */
>>+	BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)));
>>+
>>+	list_for_each_entry(pdev, &pe->pbus->devices, bus_list) {
>>+		for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
>>+			res = &pdev->resource[i];
>>+			if (pnv_ioda_map_pe_one_res(hose, pe, res))
>>+				return;
>>+		}
>>+
>>+		/* If the PE contains all subordinate PCI buses, the
>>+		 * resources of the child bridges should be mapped
>>+		 * to the PE as well.
>>+		 */
>>+		if (!(pe->flags & PNV_IODA_PE_BUS_ALL) ||
>>+		    (pdev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
>>+			continue;
>>+
>>+		for (i = 0; i <= PCI_BRIDGE_RESOURCE_NUM; i++) {
>>+			res = &pdev->resource[PCI_BRIDGE_RESOURCES + i];
>>+			if (pnv_ioda_map_pe_one_res(hose, pe, res))
>>+				return;
>
>
>This chunk is really hard to review. Looks like you completely reimplemented
>the function instead of patching it. For review-ability and bisect-ability it
>would help to split it to several simpler patches.
>
>

Yep, it's a good suggestion. I'll check if I can split it up in the next revision.

>
>>  		}
>>  	}
>>  }
>>@@ -2780,7 +2795,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>  {
>>  	struct pci_controller *hose;
>>  	struct pnv_phb *phb;
>>-	unsigned long size, m32map_off, pemap_off, iomap_off = 0;
>>+	unsigned long size, pemap_off;
>>  	const __be64 *prop64;
>>  	const __be32 *prop32;
>>  	int len;
>>@@ -2865,19 +2880,10 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>
>>  	/* Allocate aux data & arrays. We don't have IO ports on PHB3 */
>>  	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>>-	m32map_off = size;
>>-	size += phb->ioda.total_pe * sizeof(phb->ioda.m32_segmap[0]);
>>-	if (phb->type == PNV_PHB_IODA1) {
>>-		iomap_off = size;
>>-		size += phb->ioda.total_pe * sizeof(phb->ioda.io_segmap[0]);
>>-	}
>>  	pemap_off = size;
>>  	size += phb->ioda.total_pe * sizeof(struct pnv_ioda_pe);
>>  	aux = memblock_virt_alloc(size, 0);
>>  	phb->ioda.pe_alloc = aux;
>>-	phb->ioda.m32_segmap = aux + m32map_off;
>>-	if (phb->type == PNV_PHB_IODA1)
>>-		phb->ioda.io_segmap = aux + iomap_off;
>>  	phb->ioda.pe_array = aux + pemap_off;
>>  	set_bit(phb->ioda.reserved_pe, phb->ioda.pe_alloc);
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>index 19022cf..f604bb7 100644
>>--- a/arch/powerpc/platforms/powernv/pci.h
>>+++ b/arch/powerpc/platforms/powernv/pci.h
>>@@ -54,6 +54,8 @@ struct pnv_ioda_pe {
>>  	 * by slave PEs will be contributed to the master PE. One
>>  	 * PE can own multiple IO and M32 segments.
>>  	 */
>>+	unsigned long		io_segmap[8];
>>+	unsigned long		m32_segmap[8];
>>  	unsigned long		m64_segmap[8];
>>
>>  	/* "Weight" assigned to the PE for the sake of DMA resource
>>@@ -154,16 +156,15 @@ struct pnv_phb {
>>  			unsigned int		io_segsize;
>>  			unsigned int		io_pci_base;
>>
>>-			/* PE allocation bitmap */
>>+			/* PE allocation */
>>+			struct pnv_ioda_pe	*pe_array;
>>  			unsigned long		*pe_alloc;
>>-			/* PE allocation mutex */
>>  			struct mutex		pe_alloc_mutex;
>>
>>-			/* M32 & IO segment maps */
>>+			/* IO/M32/M64 segment bitmaps */
>>+			unsigned long		io_segmap[8];
>>+			unsigned long		m32_segmap[8];
>>  			unsigned long		m64_segmap[8];
>
>
>Is this a copy of the same name fields above, in pnv_ioda_pe? Why 8?
>

Yes, the fields you pointed out in pnv_ioda_pe and pnv_phb serve different
purposes: the former track the M64 segments that one particular PE
consumes, while the latter track the M64 segments on one particular
PHB that have been assigned/reserved.
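
To illustrate the distinction, here is a minimal user-space sketch (not
kernel code). The struct and helper names (sketch_phb, sketch_pe,
seg_set, seg_test, sketch_map_segment) are hypothetical; only the idea
of keeping one bitmap per PHB and one per PE comes from the patch. The
thread does not say why the array length is 8; the sketch only notes
that 8 unsigned longs give 512 bits where long is 64 bits wide.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SEGMAP_WORDS	8
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define SEGMAP_BITS	(SEGMAP_WORDS * BITS_PER_WORD)

/* Hypothetical stand-ins for the two structures touched by the patch */
struct sketch_phb { unsigned long m32_segmap[SEGMAP_WORDS]; };
struct sketch_pe  { unsigned long m32_segmap[SEGMAP_WORDS]; };

/* Set bit idx in a segment bitmap */
static void seg_set(unsigned long *map, unsigned int idx)
{
	assert(idx < SEGMAP_BITS);
	map[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

/* Test bit idx in a segment bitmap */
static bool seg_test(const unsigned long *map, unsigned int idx)
{
	assert(idx < SEGMAP_BITS);
	return (map[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL;
}

/*
 * Record that segment idx now belongs to pe. The PHB-wide map answers
 * "is this segment taken by anyone?", the per-PE map answers "which
 * segments does this PE own?".
 */
static void sketch_map_segment(struct sketch_phb *phb,
			       struct sketch_pe *pe, unsigned int idx)
{
	seg_set(phb->m32_segmap, idx);
	seg_set(pe->m32_segmap, idx);
}
```

With two PEs sharing one PHB, the PHB map ends up as the union of the
per-PE maps, which is what lets a later release walk only the segments
owned by the departing PE.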

>
>>-			unsigned int		*m32_segmap;
>>-			unsigned int		*io_segmap;
>>-			struct pnv_ioda_pe	*pe_array;
>>
>
>Why moved this?
>

The only reason is that I think pe_array and pe_alloc should be kept close
together, as they depend on each other in the code: when allocating a PE
instance, the PE# is picked first and then the PE instance :-)
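
The two-step dependency can be sketched as follows. This is a
simplified user-space illustration, not the kernel implementation:
sketch_phb, sketch_pe, sketch_pick_pe_number and sketch_alloc_pe are
hypothetical names, a bool array stands in for the pe_alloc bitmap, and
the top-down search order is an assumption taken from the series
description of dynamic PE# allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_TOTAL_PE	16

struct sketch_pe {
	int pe_number;
};

struct sketch_phb {
	bool pe_alloc[SKETCH_TOTAL_PE];			/* stands in for the bitmap */
	struct sketch_pe pe_array[SKETCH_TOTAL_PE];	/* one slot per PE# */
};

/* Step 1: pick a free PE#, searching from the highest number down */
static int sketch_pick_pe_number(struct sketch_phb *phb)
{
	assert(phb != NULL);

	for (int pe_no = SKETCH_TOTAL_PE - 1; pe_no >= 0; pe_no--) {
		if (!phb->pe_alloc[pe_no]) {
			phb->pe_alloc[pe_no] = true;
			return pe_no;
		}
	}

	return -1;	/* no free PE# left */
}

/* Step 2: the PE# picked above indexes pe_array to find the instance */
static struct sketch_pe *sketch_alloc_pe(struct sketch_phb *phb)
{
	int pe_no = sketch_pick_pe_number(phb);
	struct sketch_pe *pe;

	if (pe_no < 0)
		return NULL;

	pe = &phb->pe_array[pe_no];
	pe->pe_number = pe_no;
	return pe;
}
```

Because the index produced by the first step is the only way to reach
the instance in the second, keeping the two fields adjacent in the
structure mirrors how they are used.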

>>  			/* IRQ chip */
>>  			int			irq_chip_init;
>>

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 06/21] powerpc/powernv: Create PEs dynamically
  2015-05-09 11:43     ` Alexey Kardashevskiy
@ 2015-05-11  4:55       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  4:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sat, May 09, 2015 at 09:43:16PM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>Currently, the PEs and their associated resources are assigned
>>in ppc_md.pcibios_fixup(). The function is called for once after
>>PCI probing and resources assignment are finished. Obviously, it's
>>not hotplug friendly. The patch creates PEs dynamically by
>>ppc_md.pcibios_setup_bridge(), which is called on the event during
>>system bootup and PCI hotplug: updating PCI bridge's windows after
>>resource assignment/reassignment are finished. For partial hotplug
>>case, where not all PCI devices belonging to the PE are unplugged
>>and plugged again, we just need unbinding/binding the affected
>>PCI devices with the corresponding PE without creating new one.
>
>
>Some PEs are already created dynamically (SRIOV). I'd suggest to make subject
>more specific.
>

Sure, will do.

>>Besides, it might require addtional resources (e.g. M32) to the
>>windows of the PCI bridge when unplugging current adapter, and
>>insert a different adapter if there is one PCI slot, which is
>>assumed behind root port, or the downstream bridge of the PCIE
>>switch behind root port. The parent bridge of the newly plugged
>>adapter would reject the request to add more resources, leading
>>to hotplug failure. For the issue, the patch extends the windows
>>of root port, or the upstream port of the PCIe switch behind root
>>port to PHB's windows when ppc_md.pcibios_setup_bridge() is called.
>>
>>There is no upstream bridge for root bus, so we have to reserve
>>PE#, which is next to the reserved PE# in advance and fixing the
>>PE for root bus in ppc_md.pcibios_setup_bridge().
>>
>>The patch also changes the rule assigning PE#: PE# reserved for
>>prefetchable 64-bits memory resource and SRIOV VFs starts from
>>zero while PE# for dynamic allocations starts from ioda.total_pe
>>reversely. It's because PE# for prefetchable 64-bits memory resource,
>>which is ually allocated begining with the PHB's aperatus and PE#
>
>s/aperatus/apertures/?
>

I need to look into a Chinese-English dictionary to confirm :-)

>May be it is just me but it looks like the patch moves existing bits and also
>adds this dynamic PE creation, cannot it be separated somehow into smaller
>patches as it is really hard to track all the changes you are making here?
>

It's a good suggestion, as I said in previous replies. Yeah, I'll see
if I can split it up to help review and bisection.

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
  2015-05-09 12:43     ` Alexey Kardashevskiy
@ 2015-05-11  6:25       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  6:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sat, May 09, 2015 at 10:43:23PM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>The original code doesn't support releasing PEs dynamically, meaning
>>that PE and the associated resources (IO, M32, M64 and DMA) can't
>>be released when unplugging a PCI adapter from one hotpluggable slot.
>>
>>The patch takes object oriented methodology, introducs reference
>>count to PE, which is initialized to 1 and increased with 1 when a
>>new PCI device joins the PE. Once the last PCI device leaves the
>>PE, the PE is going to be release together with its associated
>>(IO, M32, M64, DMA) resources.
>
>
>Too little commit log for non-trivial non-cut-n-paste 30KB patch...
>

Ok. I'll add more details in the next revision.

>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/pci-bridge.h     |   3 +
>>  arch/powerpc/kernel/pci-hotplug.c         |   5 +
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>>  arch/powerpc/platforms/powernv/pci.h      |   4 +-
>>  4 files changed, 432 insertions(+), 238 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>index 5367eb3..a6ad4b1 100644
>>--- a/arch/powerpc/include/asm/pci-bridge.h
>>+++ b/arch/powerpc/include/asm/pci-bridge.h
>>@@ -31,6 +31,9 @@ struct pci_controller_ops {
>>  	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>>  	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>>  	void		(*reset_secondary_bus)(struct pci_dev *dev);
>>+
>>+	/* Called when PCI device is released */
>>+	void		(*release_device)(struct pci_dev *);
>>  };
>>
>>  /*
>>diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>>index 7ed85a6..0040343 100644
>>--- a/arch/powerpc/kernel/pci-hotplug.c
>>+++ b/arch/powerpc/kernel/pci-hotplug.c
>>@@ -29,6 +29,11 @@
>>   */
>>  void pcibios_release_device(struct pci_dev *dev)
>>  {
>>+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>+
>>+	if (hose->controller_ops.release_device)
>>+		hose->controller_ops.release_device(dev);
>>+
>>  	eeh_remove_device(dev);
>>  }
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>index 910fb67..ef8c216 100644
>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>@@ -12,6 +12,8 @@
>>  #undef DEBUG
>>
>>  #include <linux/kernel.h>
>>+#include <linux/atomic.h>
>>+#include <linux/kref.h>
>>  #include <linux/pci.h>
>>  #include <linux/crash_dump.h>
>>  #include <linux/debugfs.h>
>>@@ -47,6 +49,8 @@
>>  /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>>  #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>>
>>+static void pnv_ioda_release_pe(struct kref *kref);
>>+
>>  static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>>  			    const char *fmt, ...)
>>  {
>>@@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>>  		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>>  }
>>
>>-static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
>>+static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>>  {
>>-	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>-		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>-			__func__, pe_no, phb->hose->global_number);
>>+	if (!pe)
>>+		return;
>>+
>>+	kref_get(&pe->kref);
>>+}
>>+
>>+static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
>>+{
>>+	unsigned int count;
>>+
>>+	if (!pe)
>>  		return;
>>+
>>+	/*
>>+	 * The count is initialized to 1 and increased with 1 when
>>+	 * a new PCI device is bound with the PE. Once the last PCI
>>+	 * device is leaving from the PE, the PE is going to be
>>+	 * released.
>>+	 */
>>+	count = atomic_read(&pe->kref.refcount);
>>+	if (count == 2)
>>+		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
>>+	else
>>+		kref_put(&pe->kref, pnv_ioda_release_pe);
>
>
>What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
>

Yeah, that would be a problem. But it shouldn't happen because PCI
devices join the parent PE in a strictly serialized manner. The same
applies when detaching PCI devices from their parent PE.
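
For reference, the counting rule under discussion (start at 1, add 1
per bound device, release when the last device leaves) can be sketched
in user-space C11. This is not what the patch does: the patch reads the
count and then drops it in two separate steps and relies on
serialization, whereas the sketch below swaps in a compare-and-exchange
loop so that the read and the drop form one atomic step. All names
(sketch_pe, sketch_pe_init, sketch_pe_get, sketch_pe_put) are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_pe {
	atomic_uint refcount;
	bool released;
};

/* The count starts at 1, before any device is bound */
static void sketch_pe_init(struct sketch_pe *pe)
{
	atomic_init(&pe->refcount, 1);
	pe->released = false;
}

/* Each device joining the PE adds one reference */
static void sketch_pe_get(struct sketch_pe *pe)
{
	atomic_fetch_add(&pe->refcount, 1);
}

/*
 * Drop one device reference. When the last device leaves (count == 2),
 * the initial reference is dropped as well and the PE is released.
 * Returns true if this call released the PE.
 */
static bool sketch_pe_put(struct sketch_pe *pe)
{
	unsigned int old = atomic_load(&pe->refcount);
	unsigned int next;

	do {
		assert(old >= 2);	/* at least one device must be bound */
		next = (old == 2) ? 0 : old - 1;
		/* on failure, old is reloaded with the current value */
	} while (!atomic_compare_exchange_weak(&pe->refcount, &old, next));

	if (next == 0) {
		pe->released = true;
		return true;
	}

	return false;
}
```

The loop only closes the window between reading and dropping the
count. It does not by itself handle a get that arrives after the PE
has already been released, so some form of serialization is still
needed for that case.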

>>+}
>>+
>>+static void pnv_pci_release_device(struct pci_dev *pdev)
>>+{
>>+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>+	struct pnv_phb *phb = hose->private_data;
>>+	struct pci_dn *pdn = pci_get_pdn(pdev);
>>+	struct pnv_ioda_pe *pe;
>>+
>>+	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>+		pe = &phb->ioda.pe_array[pdn->pe_number];
>>+		pnv_ioda_pe_put(pe);
>>+		pdn->pe_number = IODA_INVALID_PE;
>>  	}
>>+}
>>
>>-	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
>>-		pr_warn("%s: PE %d was assigned on PHB#%x\n",
>>-			__func__, pe_no, phb->hose->global_number);
>>+static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
>>+{
>>+	struct pnv_phb *phb = pe->phb;
>>+	int index, count;
>>+	unsigned long tbl_addr, tbl_size;
>>+
>>+	/* No DMA capability for slave PEs */
>>+	if (pe->flags & PNV_IODA_PE_SLAVE)
>>+		return;
>>+
>>+	/* Bypass DMA window */
>>+	if (phb->type == PNV_PHB_IODA2 &&
>>+	    pe->tce_bypass_enabled &&
>>+	    pe->tce32_table &&
>>+	    pe->tce32_table->set_bypass)
>>+		pe->tce32_table->set_bypass(pe->tce32_table, false);
>>+
>>+	/* 32-bits DMA window */
>>+	count = pe->tce32_seg_end - pe->tce32_seg_start;
>>+	tbl_addr = pe->tce32_table->it_base;
>>+	if (!count)
>>  		return;
>>+
>>+	/* Free IOMMU table */
>>+	iommu_free_table(pe->tce32_table,
>>+			 of_node_full_name(phb->hose->dn));
>>+
>>+	/* Deconfigure TCE table */
>>+	switch (phb->type) {
>>+	case PNV_PHB_IODA1:
>>+		for (index = 0; index < count; index++)
>>+			opal_pci_map_pe_dma_window(phb->opal_id,
>>+						   pe->pe_number,
>>+						   pe->tce32_seg_start + index,
>>+						   1,
>>+						   __pa(tbl_addr) +
>>+						   index * TCE32_TABLE_SIZE,
>>+						   0,
>>+						   0x1000);
>>+		bitmap_clear(phb->ioda.tce32_segmap,
>>+			     pe->tce32_seg_start,
>>+			     count);
>>+		tbl_size = TCE32_TABLE_SIZE * count;
>>+		break;
>>+	case PNV_PHB_IODA2:
>>+		opal_pci_map_pe_dma_window(phb->opal_id,
>>+					   pe->pe_number,
>>+					   pe->pe_number << 1,
>>+					   1,
>>+					   __pa(tbl_addr),
>>+					   0,
>>+					   0x1000);
>>+		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
>>+		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
>>+		break;
>>+	default:
>>+		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
>>+		return;
>>+	}
>>+
>>+	/* Free memory of IOMMU table */
>>+	free_pages(tbl_addr, get_order(tbl_size));
>
>
>You just programmed the table address to TVT and then you are releasing the
>pages. It does not seem right, it will leave garbage in TVT. Also, I am
>adding helpers to alloc/free TCE pages in DDW patchset, you could reuse bits
>from there (I'll post v10 soon, you'll be in copy and you'll have to review
>that ;) ).
>

I assume you're talking about the TVE. I don't understand how garbage would
be left in the TVE. opal_pci_map_pe_dma_window(), which is handled by skiboot,
clears the TVE when "tce_table_size" is zero. The pages previously allocated
for the TCE table are released to the buddy system, and can then be allocated
by somebody else (from buddy or slab).

Ok. Please put me on the cc list. I guess the whole series of patches had
better be rebased on your DDW patchset, which is to be merged first, I believe.

>
>>+	pe->tce32_table = NULL;
>>+	pe->tce32_seg_start = 0;
>>+	pe->tce32_seg_end = 0;
>>+}
>>+
>>+static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
>>+{
>>+	struct pnv_phb *phb = pe->phb;
>>+	unsigned long *segmap = NULL, *pe_segmap = NULL;
>>+	int i;
>>+	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
>>+				     OPAL_M32_WINDOW_TYPE,
>>+				     OPAL_M64_WINDOW_TYPE };
>>+
>>+	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
>>+		switch (win_type[win]) {
>>+		case OPAL_IO_WINDOW_TYPE:
>>+			segmap = phb->ioda.io_segmap;
>>+			pe_segmap = pe->io_segmap;
>>+			break;
>>+		case OPAL_M32_WINDOW_TYPE:
>>+			segmap = phb->ioda.m32_segmap;
>>+			pe_segmap = pe->m32_segmap;
>>+			break;
>>+		case OPAL_M64_WINDOW_TYPE:
>>+			segmap = phb->ioda.m64_segmap;
>>+			pe_segmap = pe->m64_segmap;
>>+			break;
>>+		}
>>+		i = -1;
>>+		while ((i = find_next_bit(pe_segmap,
>>+			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
>>+			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
>>+			    win_type[win] == OPAL_M32_WINDOW_TYPE)
>>+				opal_pci_map_pe_mmio_window(phb->opal_id,
>>+						phb->ioda.reserved_pe,
>>+						win_type[win], 0, i);
>>+			else if (phb->type == PNV_PHB_IODA1)
>>+				opal_pci_map_pe_mmio_window(phb->opal_id,
>>+						phb->ioda.reserved_pe,
>>+						win_type[win],
>>+						i / 8, i % 8);
>
>The function is called "release" but it programs something that looks like
>reasonable values, is it correct?
>

That's not a problem: when the segment is deallocated, it's mapped to the
reserved PE#.
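
A minimal user-space sketch of that release rule, assuming a plain
array (seg_owner) as a stand-in for the hardware segment mapping and a
reserved PE# of 0. Both are assumptions for illustration, as are all
the names; real code goes through opal_pci_map_pe_mmio_window().

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NUM_SEGS		32
#define SKETCH_RESERVED_PE	0	/* assumed value, for illustration */

struct sketch_phb {
	int seg_owner[SKETCH_NUM_SEGS];	/* stand-in for the hardware table */
	bool segmap[SKETCH_NUM_SEGS];	/* segments assigned on this PHB */
};

struct sketch_pe {
	int pe_number;
	bool segmap[SKETCH_NUM_SEGS];	/* segments owned by this PE */
};

/* Initially every segment points at the reserved PE and is unassigned */
static void sketch_phb_init(struct sketch_phb *phb)
{
	for (int i = 0; i < SKETCH_NUM_SEGS; i++) {
		phb->seg_owner[i] = SKETCH_RESERVED_PE;
		phb->segmap[i] = false;
	}
}

/* Map segment idx to pe and record it in both maps */
static void sketch_assign_seg(struct sketch_phb *phb,
			      struct sketch_pe *pe, int idx)
{
	assert(idx >= 0 && idx < SKETCH_NUM_SEGS);
	phb->seg_owner[idx] = pe->pe_number;
	phb->segmap[idx] = true;
	pe->segmap[idx] = true;
}

/*
 * Release every segment owned by pe: point it back at the reserved PE
 * and clear it from both maps. Returns the number of segments released.
 */
static int sketch_release_pe_segs(struct sketch_phb *phb,
				  struct sketch_pe *pe)
{
	int released = 0;

	for (int i = 0; i < SKETCH_NUM_SEGS; i++) {
		if (!pe->segmap[i])
			continue;

		phb->seg_owner[i] = SKETCH_RESERVED_PE;
		pe->segmap[i] = false;
		phb->segmap[i] = false;
		released++;
	}

	return released;
}
```

So "release" still writes a mapping, but the value written is the
reserved PE#, which marks the segment as belonging to no real PE.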

>
>
>>+
>>+			clear_bit(i, pe_segmap);
>>+			clear_bit(i, segmap);
>>+		}
>>+	}
>>+}
>>+
>>+static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>+				  struct pnv_ioda_pe *parent,
>>+				  struct pnv_ioda_pe *child,
>>+				  bool is_add)
>>+{
>>+	const char *desc = is_add ? "adding" : "removing";
>>+	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>+			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>+	struct pnv_ioda_pe *slave;
>>+	long rc;
>>+
>>+	/* Parent PE affects child PE */
>>+	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>+				child->pe_number, op);
>>+	if (rc != OPAL_SUCCESS) {
>>+		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>+			rc, desc);
>>+		return -ENXIO;
>>+	}
>>+
>>+	if (!(child->flags & PNV_IODA_PE_MASTER))
>>+		return 0;
>>+
>>+	/* Compound case: parent PE affects slave PEs */
>>+	list_for_each_entry(slave, &child->slaves, list) {
>>+		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>+					slave->pe_number, op);
>>+		if (rc != OPAL_SUCCESS) {
>>+			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>+				rc, desc);
>>+			return -ENXIO;
>>+		}
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
>>+{
>>+	struct pnv_phb *phb = pe->phb;
>>+	struct pnv_ioda_pe *slave;
>>+	struct pci_dev *pdev = NULL;
>>+	int ret;
>>+
>>+	/*
>>+	 * Clear PE frozen state. If it's master PE, we need
>>+	 * clear slave PE frozen state as well.
>>+	 */
>>+	opal_pci_eeh_freeze_clear(phb->opal_id,
>>+				  pe->pe_number,
>>+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>+		list_for_each_entry(slave, &pe->slaves, list) {
>>+			opal_pci_eeh_freeze_clear(phb->opal_id,
>>+						  slave->pe_number,
>>+						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>+		}
>>+	}
>>+
>>+	/*
>>+	 * Associate PE in PELT. We need add the PE into the
>>+	 * corresponding PELT-V as well. Otherwise, the error
>>+	 * originated from the PE might contribute to other
>>+	 * PEs.
>>+	 */
>>+	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>+	if (ret)
>>+		return ret;
>>+
>>+	/* For compound PEs, any one affects all of them */
>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>+		list_for_each_entry(slave, &pe->slaves, list) {
>>+			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>+			if (ret)
>>+				return ret;
>>+		}
>>+	}
>>+
>>+	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>+		pdev = pe->pbus->self;
>>+	else if (pe->flags & PNV_IODA_PE_DEV)
>>+		pdev = pe->pdev->bus->self;
>>+#ifdef CONFIG_PCI_IOV
>>+	else if (pe->flags & PNV_IODA_PE_VF)
>>+		pdev = pe->parent_dev->bus->self;
>>+#endif /* CONFIG_PCI_IOV */
>>+
>>+	while (pdev) {
>>+		struct pci_dn *pdn = pci_get_pdn(pdev);
>>+		struct pnv_ioda_pe *parent;
>>+
>>+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>+			parent = &phb->ioda.pe_array[pdn->pe_number];
>>+			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>+			if (ret)
>>+				return ret;
>>+		}
>>+
>>+		pdev = pdev->bus->self;
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)
>
>
>It used to be under #ifdef CONFIG_PCI_IOV, now it is not. Looks like just
>moving of this function to a different place deserves a separate patch with a
>comment why ("it is going to be used now for non-SRIOV case too" may be?).
>

Yeah, it makes sense to me. Will fix it up.

>
>>+{
>>+	struct pnv_phb *phb = pe->phb;
>>+	struct pci_dev *parent;
>>+	uint8_t bcomp, dcomp, fcomp;
>>+	long rid_end, rid;
>>+	int64_t rc;
>>+
>>+	/* Tear down MVE */
>>+	if (phb->type == PNV_PHB_IODA1 &&
>>+	    pe->mve_number != -1) {
>>+		rc = opal_pci_set_mve(phb->opal_id,
>>+				      pe->mve_number,
>>+				      phb->ioda.reserved_pe);
>>+		if (rc != OPAL_SUCCESS)
>>+			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
>>+				rc, pe->mve_number);
>>+		rc = opal_pci_set_mve_enable(phb->opal_id,
>>+					     pe->mve_number,
>>+					     OPAL_DISABLE_MVE);
>>+		if (rc != OPAL_SUCCESS)
>>+			pe_warn(pe, "Error %lld disabling MVE#%d\n",
>>+				rc, pe->mve_number);
>>+		pe->mve_number = -1;
>>+	}
>>+
>>+	/* Unmapping PELTV */
>>+	pnv_ioda_set_peltv(pe, false);
>>+
>>+	/* To unmap PELTM */
>>+	if (pe->pbus) {
>>+		int count;
>>+
>>+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>+		parent = pe->pbus->self;
>>+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>+			count = pe->pbus->busn_res.end -
>>+				pe->pbus->busn_res.start + 1;
>>+		else
>>+			count = 1;
>>+
>>+		switch(count) {
>>+		case  1: bcomp = OpalPciBusAll;   break;
>>+		case  2: bcomp = OpalPciBus7Bits; break;
>>+		case  4: bcomp = OpalPciBus6Bits; break;
>>+		case  8: bcomp = OpalPciBus5Bits; break;
>>+		case 16: bcomp = OpalPciBus4Bits; break;
>>+		case 32: bcomp = OpalPciBus3Bits; break;
>>+		default:
>>+			/* Fail back to case of one bus */
>>+			pe_warn(pe, "Cannot support %d buses\n", count);
>>+			bcomp = OpalPciBusAll;
>>+		}
>>+		rid_end = pe->rid + (count << 8);
>>+	} else {
>>+#ifdef CONFIG_PCI_IOV
>>+		if (pe->flags & PNV_IODA_PE_VF)
>>+			parent = pe->parent_dev;
>>+		else
>>+#endif
>>+			parent = pe->pdev->bus->self;
>>+		bcomp = OpalPciBusAll;
>>+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>+		rid_end = pe->rid + 1;
>>+	}
>>+
>>+	/* Clear RID mapping */
>>+	for (rid = pe->rid; rid < rid_end; rid++)
>>+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>+
>>+	/* Unmapping PELTM */
>>+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>+	if (rc)
>>+		pe_warn(pe, "Error %ld unmapping PELTM\n", rc);
>>+}
>>+
>>+static void pnv_ioda_release_pe(struct kref *kref)
>>+{
>>+	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
>>+	struct pnv_ioda_pe *tmp, *slave;
>>+	struct pnv_phb *phb = pe->phb;
>>+
>>+	pnv_ioda_release_pe_dma(pe);
>>+	pnv_ioda_release_pe_seg(pe);
>>+	pnv_ioda_deconfigure_pe(pe);
>>+
>>+	/* Release slave PEs for compound PE */
>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>+		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
>>+			pnv_ioda_pe_put(slave);
>>+	}
>>+
>>+	/* Remove the PE from various list. We need remove slave
>>+	 * PE from master's list.
>>+	 */
>>+	list_del(&pe->dma_link);
>>+	list_del(&pe->list);
>>+
>>+	/* Free PE number */
>>+	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
>>+}
>>+
>>+static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
>>+					    int pe_no)
>>+{
>>+	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
>>+
>>+	kref_init(&pe->kref);
>>+	pe->phb = phb;
>>+	pe->pe_number = pe_no;
>>+	INIT_LIST_HEAD(&pe->dma_link);
>>+	INIT_LIST_HEAD(&pe->list);
>>+
>>+	return pe;
>>+}
>>+
>>+static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
>>+					       int pe_no)
>>+{
>>+	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>+		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>+			__func__, pe_no, phb->hose->global_number);
>>+		return NULL;
>>  	}
>>
>>-	phb->ioda.pe_array[pe_no].phb = phb;
>>-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>+	/*
>>+	 * Same PE might be reserved for multiple times, which
>>+	 * is out of problem actually.
>>+	 */
>>+	set_bit(pe_no, phb->ioda.pe_alloc);
>>+	return pnv_ioda_init_pe(phb, pe_no);
>>  }
>>
>>-static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>+static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>  {
>>  	unsigned long pe_no;
>>  	unsigned long limit = phb->ioda.total_pe - 1;
>>@@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>  			break;
>>
>>  		if (--limit >= phb->ioda.total_pe)
>>-			return IODA_INVALID_PE;
>>+			return NULL;
>>  	} while(1);
>>
>>-	phb->ioda.pe_array[pe_no].phb = phb;
>>-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>-	return pe_no;
>>-}
>>-
>>-static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
>>-{
>>-	WARN_ON(phb->ioda.pe_array[pe].pdev);
>>-
>>-	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
>>-	clear_bit(pe, phb->ioda.pe_alloc);
>>+	return pnv_ioda_init_pe(phb, pe_no);
>>  }
>>
>>  static int pnv_ioda1_init_m64(struct pnv_phb *phb)
>>@@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
>>  	}
>>  }
>>
>>-static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>-				struct pci_bus *bus, int all)
>>+static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>+						struct pci_bus *bus,
>>+						int all)
>
>
>Mechanic changes like this could easily go to a separate patch.
>

Indeed. I'll see how I can split the patches up in the next revision.
Thanks for the suggestion.

>>  {
>>  	resource_size_t segsz = phb->ioda.m64_segsize;
>>  	struct pci_dev *pdev;
>>@@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>  	int i;
>>
>>  	if (!pnv_ioda_need_m64_pe(phb, bus))
>>-		return IODA_INVALID_PE;
>>+		return NULL;
>>
>>          /* Allocate bitmap */
>>  	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>>  	pe_bitsmap = kzalloc(size, GFP_KERNEL);
>>  	if (!pe_bitsmap) {
>>  		pr_warn("%s: Out of memory !\n", __func__);
>>-		return IODA_INVALID_PE;
>>+		return NULL;
>>  	}
>>
>>  	/* The bridge's M64 window might be extended to PHB's M64
>>@@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>  	/* No M64 window found ? */
>>  	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
>>  		kfree(pe_bitsmap);
>>-		return IODA_INVALID_PE;
>>+		return NULL;
>>  	}
>>
>>  	/* Figure out the master PE and put all slave PEs
>>@@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>  	}
>>
>>  	kfree(pe_bitsmap);
>>-	return master_pe->pe_number;
>>+	return master_pe;
>>  }
>>
>>  static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>>@@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>>   * but in the meantime, we need to protect them to avoid warnings
>>   */
>>  #ifdef CONFIG_PCI_MSI
>>-static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>+static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
>>  {
>>  	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>  	struct pnv_phb *phb = hose->private_data;
>>@@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>  }
>>  #endif /* CONFIG_PCI_MSI */
>>
>>-static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>-				  struct pnv_ioda_pe *parent,
>>-				  struct pnv_ioda_pe *child,
>>-				  bool is_add)
>>-{
>>-	const char *desc = is_add ? "adding" : "removing";
>>-	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>-			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>-	struct pnv_ioda_pe *slave;
>>-	long rc;
>>-
>>-	/* Parent PE affects child PE */
>>-	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>-				child->pe_number, op);
>>-	if (rc != OPAL_SUCCESS) {
>>-		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>-			rc, desc);
>>-		return -ENXIO;
>>-	}
>>-
>>-	if (!(child->flags & PNV_IODA_PE_MASTER))
>>-		return 0;
>>-
>>-	/* Compound case: parent PE affects slave PEs */
>>-	list_for_each_entry(slave, &child->slaves, list) {
>>-		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>-					slave->pe_number, op);
>>-		if (rc != OPAL_SUCCESS) {
>>-			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>-				rc, desc);
>>-			return -ENXIO;
>>-		}
>>-	}
>>-
>>-	return 0;
>>-}
>>-
>>-static int pnv_ioda_set_peltv(struct pnv_phb *phb,
>>-			      struct pnv_ioda_pe *pe,
>>-			      bool is_add)
>>-{
>>-	struct pnv_ioda_pe *slave;
>>-	struct pci_dev *pdev = NULL;
>>-	int ret;
>>-
>>-	/*
>>-	 * Clear PE frozen state. If it's master PE, we need
>>-	 * clear slave PE frozen state as well.
>>-	 */
>>-	if (is_add) {
>>-		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
>>-					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>-		if (pe->flags & PNV_IODA_PE_MASTER) {
>>-			list_for_each_entry(slave, &pe->slaves, list)
>>-				opal_pci_eeh_freeze_clear(phb->opal_id,
>>-							  slave->pe_number,
>>-							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>-		}
>>-	}
>>-
>>-	/*
>>-	 * Associate PE in PELT. We need add the PE into the
>>-	 * corresponding PELT-V as well. Otherwise, the error
>>-	 * originated from the PE might contribute to other
>>-	 * PEs.
>>-	 */
>>-	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>-	if (ret)
>>-		return ret;
>>-
>>-	/* For compound PEs, any one affects all of them */
>>-	if (pe->flags & PNV_IODA_PE_MASTER) {
>>-		list_for_each_entry(slave, &pe->slaves, list) {
>>-			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>-			if (ret)
>>-				return ret;
>>-		}
>>-	}
>>-
>>-	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>-		pdev = pe->pbus->self;
>>-	else if (pe->flags & PNV_IODA_PE_DEV)
>>-		pdev = pe->pdev->bus->self;
>>-#ifdef CONFIG_PCI_IOV
>>-	else if (pe->flags & PNV_IODA_PE_VF)
>>-		pdev = pe->parent_dev->bus->self;
>>-#endif /* CONFIG_PCI_IOV */
>>-	while (pdev) {
>>-		struct pci_dn *pdn = pci_get_pdn(pdev);
>>-		struct pnv_ioda_pe *parent;
>>-
>>-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>-			parent = &phb->ioda.pe_array[pdn->pe_number];
>>-			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>-			if (ret)
>>-				return ret;
>>-		}
>>-
>>-		pdev = pdev->bus->self;
>>-	}
>>-
>>-	return 0;
>>-}
>>-
>>-#ifdef CONFIG_PCI_IOV
>>-static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>-{
>>-	struct pci_dev *parent;
>>-	uint8_t bcomp, dcomp, fcomp;
>>-	int64_t rc;
>>-	long rid_end, rid;
>>-
>>-	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
>>-	if (pe->pbus) {
>>-		int count;
>>-
>>-		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>-		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>-		parent = pe->pbus->self;
>>-		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>-			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
>>-		else
>>-			count = 1;
>>-
>>-		switch(count) {
>>-		case  1: bcomp = OpalPciBusAll;         break;
>>-		case  2: bcomp = OpalPciBus7Bits;       break;
>>-		case  4: bcomp = OpalPciBus6Bits;       break;
>>-		case  8: bcomp = OpalPciBus5Bits;       break;
>>-		case 16: bcomp = OpalPciBus4Bits;       break;
>>-		case 32: bcomp = OpalPciBus3Bits;       break;
>>-		default:
>>-			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
>>-			        count);
>>-			/* Do an exact match only */
>>-			bcomp = OpalPciBusAll;
>>-		}
>>-		rid_end = pe->rid + (count << 8);
>>-	} else {
>>-		if (pe->flags & PNV_IODA_PE_VF)
>>-			parent = pe->parent_dev;
>>-		else
>>-			parent = pe->pdev->bus->self;
>>-		bcomp = OpalPciBusAll;
>>-		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>-		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>-		rid_end = pe->rid + 1;
>>-	}
>>-
>>-	/* Clear the reverse map */
>>-	for (rid = pe->rid; rid < rid_end; rid++)
>>-		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>-
>>-	/* Release from all parents PELT-V */
>>-	while (parent) {
>>-		struct pci_dn *pdn = pci_get_pdn(parent);
>>-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>-			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
>>-						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>-			/* XXX What to do in case of error ? */
>
>
>Not much :) Free associated memory and mark it "dead" so it won't be used
>again till reboot. In what circumstance can this opal_pci_set_peltv() fail at
>all?
>

Yeah, maybe. So far I haven't seen this call fail. Note that the code has
been there for almost 4 years, since the day Ben wrote it.

>
>>-		}
>>-		parent = parent->bus->self;
>>-	}
>>-
>>-	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
>>-				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>-
>>-	/* Disassociate PE in PELT */
>>-	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
>>-				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>-	if (rc)
>>-		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
>>-	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>-			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>-	if (rc)
>>-		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
>>-
>>-	pe->pbus = NULL;
>>-	pe->pdev = NULL;
>>-	pe->parent_dev = NULL;
>>-
>>-	return 0;
>>-}
>>-#endif /* CONFIG_PCI_IOV */
>>-
>>  static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>  {
>>  	struct pci_dev *parent;
>>@@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>  	}
>>
>>  	/* Configure PELTV */
>>-	pnv_ioda_set_peltv(phb, pe, true);
>>+	pnv_ioda_set_peltv(pe, true);
>>
>>  	/* Setup reverse map */
>>  	for (rid = pe->rid; rid < rid_end; rid++)
>>@@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>>  		if (pdn->pe_number != IODA_INVALID_PE)
>>  			continue;
>>
>>+		/* Increase reference count of the parent PE */
>
>When you comment like this, I read it as the comment belongs to the whole
>next chunk till the first empty line, i.e. to all 5 lines below, which is not
>the case. I'd remove the comment as 1) "pe_get" in pnv_ioda_pe_get() name
>suggests incrementing the reference counter 2) "pe" is always parent in this
>function. I do not insist though.
>

I agree with your explanation. I'll remove this unhelpful comment.

>
>>+		pnv_ioda_pe_get(pe);
>>  		pdn->pe_number = pe->pe_number;
>>  		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
>>  		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
>>@@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>  {
>>  	struct pci_controller *hose = pci_bus_to_host(bus);
>>  	struct pnv_phb *phb = hose->private_data;
>>-	struct pnv_ioda_pe *pe;
>>+	struct pnv_ioda_pe *pe = NULL;
>>  	int pe_num = IODA_INVALID_PE;
>>
>>  	/* For partial hotplug case, the PE instance hasn't been destroyed
>>@@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>  	}
>>
>>  	/* PE number for root bus should have been reserved */
>>-	if (pci_is_root_bus(bus))
>>-		pe_num = phb->ioda.root_pe_no;
>>+	if (pci_is_root_bus(bus) &&
>>+	    phb->ioda.root_pe_no != IODA_INVALID_PE)
>>+		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
>>
>>  	/* Check if PE is determined by M64 */
>>-	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
>>-		pe_num = phb->pick_m64_pe(phb, bus, all);
>>+	if (!pe && phb->pick_m64_pe)
>>+		pe = phb->pick_m64_pe(phb, bus, all);
>>
>>  	/* The PE number isn't pinned by M64 */
>>-	if (pe_num == IODA_INVALID_PE)
>>-		pe_num = pnv_ioda_alloc_pe(phb);
>>+	if (!pe)
>>+		pe = pnv_ioda_alloc_pe(phb);
>>
>>-	if (pe_num == IODA_INVALID_PE) {
>>-		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
>>+	if (!pe) {
>>+		pr_warn("%s: No enough PE# available for PCI bus %04x:%02x\n",
>>  			__func__, pci_domain_nr(bus), bus->number);
>>  		return NULL;
>>  	}
>>
>>-	pe = &phb->ioda.pe_array[pe_num];
>>  	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
>>  	pe->pbus = bus;
>>  	pe->pdev = NULL;
>>@@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>
>>  	if (pnv_ioda_configure_pe(phb, pe)) {
>>  		/* XXX What do we do here ? */
>>-		if (pe_num)
>>-			pnv_ioda_free_pe(phb, pe_num);
>>-		pe->pbus = NULL;
>>+		pnv_ioda_pe_put(pe);
>>  		return NULL;
>>  	}
>>
>>  	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
>>-			GFP_KERNEL, hose->node);
>>+				       GFP_KERNEL, hose->node);
>
>Seems like spaces change only - if you really want this change (which I hate
>- makes code look inaccurate to my taste but it seems I am in minority here
>:) ), please put it to the separate patch.
>

Ok. Just to confirm: you prefer the original format? I don't know
why I prefer the latter one. Maybe my eyes are quite broken :-)

>
>>  	pe->tce32_table->data = pe;
>>
>>  	/* Associate it with all child devices */
>>@@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>  		list_del(&pe->list);
>>  		mutex_unlock(&phb->ioda.pe_list_mutex);
>>
>>-		pnv_ioda_deconfigure_pe(phb, pe);
>>+		pnv_ioda_deconfigure_pe(pe);
>
>
>Is this change necessary to get "Release PEs dynamically" working? Move it to
>mechanical changes patch may be?
>

Ok. I'll try to do that.

>
>>
>>-		pnv_ioda_free_pe(phb, pe->pe_number);
>>+		pnv_ioda_pe_put(pe);
>>  	}
>>  }
>>
>>@@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>
>>  		if (pnv_ioda_configure_pe(phb, pe)) {
>>  			/* XXX What do we do here ? */
>>-			if (pe_num)
>>-				pnv_ioda_free_pe(phb, pe_num);
>>-			pe->pdev = NULL;
>>+			pnv_ioda_pe_put(pe);
>>  			continue;
>>  		}
>>
>>@@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
>>  	struct pnv_ioda_pe *pe;
>>  	int rc;
>>
>>-	pe = pnv_ioda_get_pe(dev);
>>+	pe = pnv_ioda_pci_dev_to_pe(dev);
>
>
>And this change could to separately. Not clear how this helps to "Release PEs
>dynamically".
>
>

It's not related to "Release PEs dynamically". The change comes from a
function rename: the original pnv_ioda_get_pe() is renamed to pnv_ioda_pci_dev_to_pe().

>>  	if (!pe)
>>  		return -ENODEV;
>>
>>@@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>>  	struct pnv_ioda_pe *pe;
>>  	int rc;
>>
>>-	if (!(pe = pnv_ioda_get_pe(dev)))
>>+	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
>>  		return -ENODEV;
>>
>>  	/* Assign XIVE to PE */
>>@@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
>>  				  unsigned int hwirq, unsigned int virq,
>>  				  unsigned int is_64, struct msi_msg *msg)
>>  {
>>-	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
>>+	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
>>  	unsigned int xive_num = hwirq - phb->msi_base;
>>  	__be32 data;
>>  	int rc;
>>@@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>  	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
>>  	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
>>  	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
>>+	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
>>  	hose->controller_ops = pnv_pci_controller_ops;
>>
>>  #ifdef CONFIG_PCI_IOV
>>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>index 1bea3a8..8b10f01 100644
>>--- a/arch/powerpc/platforms/powernv/pci.h
>>+++ b/arch/powerpc/platforms/powernv/pci.h
>>@@ -28,6 +28,7 @@ enum pnv_phb_model {
>>  /* Data associated with a PE, including IOMMU tracking etc.. */
>>  struct pnv_phb;
>>  struct pnv_ioda_pe {
>>+	struct kref		kref;
>>  	unsigned long		flags;
>>  	struct pnv_phb		*phb;
>>
>>@@ -120,7 +121,8 @@ struct pnv_phb {
>>  	void (*shutdown)(struct pnv_phb *phb);
>>  	int (*init_m64)(struct pnv_phb *phb);
>>  	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
>>-	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
>>+	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
>>+					   struct pci_bus *bus, int all);
>>  	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>>  	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
>>  	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
>>

Thanks,
Gavin


* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
@ 2015-05-11  6:25       ` Gavin Shan
From: Gavin Shan @ 2015-05-11  6:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: bhelgaas, linux-pci, linuxppc-dev, Gavin Shan

On Sat, May 09, 2015 at 10:43:23PM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>The original code doesn't support releasing PEs dynamically, meaning
>>that PE and the associated resources (IO, M32, M64 and DMA) can't
>>be released when unplugging a PCI adapter from one hotpluggable slot.
>>
>>The patch takes object oriented methodology, introducs reference
>>count to PE, which is initialized to 1 and increased with 1 when a
>>new PCI device joins the PE. Once the last PCI device leaves the
>>PE, the PE is going to be release together with its associated
>>(IO, M32, M64, DMA) resources.
>
>
>Too little commit log for non-trivial non-cut-n-paste 30KB patch...
>

Ok. I'll add more details in the next revision.

>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/pci-bridge.h     |   3 +
>>  arch/powerpc/kernel/pci-hotplug.c         |   5 +
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>>  arch/powerpc/platforms/powernv/pci.h      |   4 +-
>>  4 files changed, 432 insertions(+), 238 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>index 5367eb3..a6ad4b1 100644
>>--- a/arch/powerpc/include/asm/pci-bridge.h
>>+++ b/arch/powerpc/include/asm/pci-bridge.h
>>@@ -31,6 +31,9 @@ struct pci_controller_ops {
>>  	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>>  	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>>  	void		(*reset_secondary_bus)(struct pci_dev *dev);
>>+
>>+	/* Called when PCI device is released */
>>+	void		(*release_device)(struct pci_dev *);
>>  };
>>
>>  /*
>>diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>>index 7ed85a6..0040343 100644
>>--- a/arch/powerpc/kernel/pci-hotplug.c
>>+++ b/arch/powerpc/kernel/pci-hotplug.c
>>@@ -29,6 +29,11 @@
>>   */
>>  void pcibios_release_device(struct pci_dev *dev)
>>  {
>>+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>+
>>+	if (hose->controller_ops.release_device)
>>+		hose->controller_ops.release_device(dev);
>>+
>>  	eeh_remove_device(dev);
>>  }
>>
>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>index 910fb67..ef8c216 100644
>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>@@ -12,6 +12,8 @@
>>  #undef DEBUG
>>
>>  #include <linux/kernel.h>
>>+#include <linux/atomic.h>
>>+#include <linux/kref.h>
>>  #include <linux/pci.h>
>>  #include <linux/crash_dump.h>
>>  #include <linux/debugfs.h>
>>@@ -47,6 +49,8 @@
>>  /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>>  #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>>
>>+static void pnv_ioda_release_pe(struct kref *kref);
>>+
>>  static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>>  			    const char *fmt, ...)
>>  {
>>@@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>>  		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>>  }
>>
>>-static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
>>+static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>>  {
>>-	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>-		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>-			__func__, pe_no, phb->hose->global_number);
>>+	if (!pe)
>>+		return;
>>+
>>+	kref_get(&pe->kref);
>>+}
>>+
>>+static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
>>+{
>>+	unsigned int count;
>>+
>>+	if (!pe)
>>  		return;
>>+
>>+	/*
>>+	 * The count is initialized to 1 and increased with 1 when
>>+	 * a new PCI device is bound with the PE. Once the last PCI
>>+	 * device is leaving from the PE, the PE is going to be
>>+	 * released.
>>+	 */
>>+	count = atomic_read(&pe->kref.refcount);
>>+	if (count == 2)
>>+		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
>>+	else
>>+		kref_put(&pe->kref, pnv_ioda_release_pe);
>
>
>What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
>

Yeah, that would be a problem. But it shouldn't happen because PCI devices
join the parent PE# in a strictly serialized manner. The same applies when
detaching PCI devices from their parent PE.
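
To make the counting rule concrete, here is a minimal user-space sketch. It is
not the kernel code: struct pe_model and the pe_model_*() helpers are
hypothetical stand-ins, and C11 atomics replace kref. It shows one way the
"drop the base reference together with the last device" step could be done in
a single compare-and-exchange, so there is no window between reading the count
and subtracting from it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct pnv_ioda_pe. */
struct pe_model {
	atomic_uint refcount;	/* 1 (base) + number of attached devices */
	bool released;		/* set once by the release step */
};

static void pe_model_init(struct pe_model *pe)
{
	atomic_init(&pe->refcount, 1);
	pe->released = false;
}

/* A device joins the PE. */
static void pe_model_get(struct pe_model *pe)
{
	atomic_fetch_add(&pe->refcount, 1);
}

/*
 * Drop one reference. When the count is 2 (base + last device), both
 * references are dropped in the same atomic step. Returns true if this
 * call released the PE.
 */
static bool pe_model_put(struct pe_model *pe)
{
	unsigned int old = atomic_load(&pe->refcount);
	unsigned int next;

	do {
		assert(old > 0);	/* put on a released PE is a caller bug */
		next = (old == 2) ? 0 : old - 1;
	} while (!atomic_compare_exchange_weak(&pe->refcount, &old, next));

	if (next == 0) {
		pe->released = true;	/* stands in for the release callback */
		return true;
	}
	return false;
}
```

This only closes the read-then-subtract window. A get that arrives after the
count has reached zero is still a bug, so the serialization mentioned above
remains necessary in any case.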

>>+}
>>+
>>+static void pnv_pci_release_device(struct pci_dev *pdev)
>>+{
>>+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>+	struct pnv_phb *phb = hose->private_data;
>>+	struct pci_dn *pdn = pci_get_pdn(pdev);
>>+	struct pnv_ioda_pe *pe;
>>+
>>+	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>+		pe = &phb->ioda.pe_array[pdn->pe_number];
>>+		pnv_ioda_pe_put(pe);
>>+		pdn->pe_number = IODA_INVALID_PE;
>>  	}
>>+}
>>
>>-	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
>>-		pr_warn("%s: PE %d was assigned on PHB#%x\n",
>>-			__func__, pe_no, phb->hose->global_number);
>>+static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
>>+{
>>+	struct pnv_phb *phb = pe->phb;
>>+	int index, count;
>>+	unsigned long tbl_addr, tbl_size;
>>+
>>+	/* No DMA capability for slave PEs */
>>+	if (pe->flags & PNV_IODA_PE_SLAVE)
>>+		return;
>>+
>>+	/* Bypass DMA window */
>>+	if (phb->type == PNV_PHB_IODA2 &&
>>+	    pe->tce_bypass_enabled &&
>>+	    pe->tce32_table &&
>>+	    pe->tce32_table->set_bypass)
>>+		pe->tce32_table->set_bypass(pe->tce32_table, false);
>>+
>>+	/* 32-bits DMA window */
>>+	count = pe->tce32_seg_end - pe->tce32_seg_start;
>>+	tbl_addr = pe->tce32_table->it_base;
>>+	if (!count)
>>  		return;
>>+
>>+	/* Free IOMMU table */
>>+	iommu_free_table(pe->tce32_table,
>>+			 of_node_full_name(phb->hose->dn));
>>+
>>+	/* Deconfigure TCE table */
>>+	switch (phb->type) {
>>+	case PNV_PHB_IODA1:
>>+		for (index = 0; index < count; index++)
>>+			opal_pci_map_pe_dma_window(phb->opal_id,
>>+						   pe->pe_number,
>>+						   pe->tce32_seg_start + index,
>>+						   1,
>>+						   __pa(tbl_addr) +
>>+						   index * TCE32_TABLE_SIZE,
>>+						   0,
>>+						   0x1000);
>>+		bitmap_clear(phb->ioda.tce32_segmap,
>>+			     pe->tce32_seg_start,
>>+			     count);
>>+		tbl_size = TCE32_TABLE_SIZE * count;
>>+		break;
>>+	case PNV_PHB_IODA2:
>>+		opal_pci_map_pe_dma_window(phb->opal_id,
>>+					   pe->pe_number,
>>+					   pe->pe_number << 1,
>>+					   1,
>>+					   __pa(tbl_addr),
>>+					   0,
>>+					   0x1000);
>>+		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
>>+		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
>>+		break;
>>+	default:
>>+		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
>>+		return;
>>+	}
>>+
>>+	/* Free memory of IOMMU table */
>>+	free_pages(tbl_addr, get_order(tbl_size));
>
>
>You just programmed the table address to TVT and then you are releasing the
>pages. It does not seem right, it will leave garbage in TVT. Also, I am
>adding helpers to alloc/free TCE pages in DDW patchset, you could reuse bits
>from there (I'll post v10 soon, you'll be in copy and you'll have to review
>that ;) ).
>

I assume you're talking about the TVE. I don't understand how garbage would be
left in the TVE. opal_pci_map_pe_dma_window(), which is handled by skiboot, clears
the TVE when "tce_table_size" is zero. The pages previously allocated for the TCE
table are released to the buddy system, where they can be allocated by somebody
else (from buddy or slab).
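
The ordering can be sketched in user space as follows. This is a model under
stated assumptions, not the skiboot or kernel implementation: struct tve_model
and the tve_model_*() helpers are hypothetical, and the "zero size invalidates
the entry" behaviour is taken from the description above, not verified against
the firmware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of one TVE: valid only while the size is non-zero. */
struct tve_model {
	void *table;
	size_t table_size;
};

/* Stand-in for the firmware call: a zero size invalidates the entry. */
static void tve_model_map(struct tve_model *tve, void *table, size_t size)
{
	if (size == 0) {
		tve->table = NULL;
		tve->table_size = 0;
		return;
	}
	tve->table = table;
	tve->table_size = size;
}

static bool tve_model_valid(const struct tve_model *tve)
{
	return tve->table_size != 0;
}

/* Teardown: invalidate the window first, free the table memory afterwards. */
static void tve_model_release(struct tve_model *tve, void **table)
{
	tve_model_map(tve, *table, 0);
	assert(!tve_model_valid(tve));	/* nothing may point at the table now */
	free(*table);
	*table = NULL;
}
```

The point of the model is only the order of the two steps: the entry stops
referring to the table before the memory goes back to the allocator.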

Ok. Please put me on the cc list. I guess this whole series is better rebased
on your DDW patchset, which I believe is to be merged first.

>
>>+	pe->tce32_table = NULL;
>>+	pe->tce32_seg_start = 0;
>>+	pe->tce32_seg_end = 0;
>>+}
>>+
>>+static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
>>+{
>>+	struct pnv_phb *phb = pe->phb;
>>+	unsigned long *segmap = NULL, *pe_segmap = NULL;
>>+	int i;
>>+	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
>>+				     OPAL_M32_WINDOW_TYPE,
>>+				     OPAL_M64_WINDOW_TYPE };
>>+
>>+	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
>>+		switch (win_type[win]) {
>>+		case OPAL_IO_WINDOW_TYPE:
>>+			segmap = phb->ioda.io_segmap;
>>+			pe_segmap = pe->io_segmap;
>>+			break;
>>+		case OPAL_M32_WINDOW_TYPE:
>>+			segmap = phb->ioda.m32_segmap;
>>+			pe_segmap = pe->m32_segmap;
>>+			break;
>>+		case OPAL_M64_WINDOW_TYPE:
>>+			segmap = phb->ioda.m64_segmap;
>>+			pe_segmap = pe->m64_segmap;
>>+			break;
>>+		}
>>+		i = -1;
>>+		while ((i = find_next_bit(pe_segmap,
>>+			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
>>+			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
>>+			    win_type[win] == OPAL_M32_WINDOW_TYPE)
>>+				opal_pci_map_pe_mmio_window(phb->opal_id,
>>+						phb->ioda.reserved_pe,
>>+						win_type[win], 0, i);
>>+			else if (phb->type == PNV_PHB_IODA1)
>>+				opal_pci_map_pe_mmio_window(phb->opal_id,
>>+						phb->ioda.reserved_pe,
>>+						win_type[win],
>>+						i / 8, i % 8);
>
>The function is called ""release" but it programs something what looks like
>reasonable values, is it correct?
>

That's not a problem. When a segment is deallocated, it is mapped back to the
reserved PE#.
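
As a rough illustration, the release step can be modelled in user space like
this. The names (struct phb_model, struct pe_seg_model, assign_segment(),
release_segments()) and the 64-segment limit are hypothetical and exist only
for this sketch; the real code goes through opal_pci_map_pe_mmio_window().

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_SEGS 64	/* hypothetical segment count */

struct phb_model {
	uint64_t segmap;		/* segments in use on the PHB */
	int seg_owner[MODEL_NR_SEGS];	/* PE# each segment is mapped to */
	int reserved_pe;
};

struct pe_seg_model {
	int pe_number;
	uint64_t segmap;		/* segments owned by this PE */
};

/* Give one segment to a PE and record it in both bitmaps. */
static void assign_segment(struct phb_model *phb, struct pe_seg_model *pe,
			   int seg)
{
	uint64_t bit = UINT64_C(1) << seg;

	phb->seg_owner[seg] = pe->pe_number;
	pe->segmap |= bit;
	phb->segmap |= bit;
}

/*
 * Release every segment owned by the PE: point it back at the reserved
 * PE#, then clear it from the PE and PHB bitmaps. Returns the number of
 * segments released.
 */
static int release_segments(struct phb_model *phb, struct pe_seg_model *pe)
{
	int i, released = 0;

	for (i = 0; i < MODEL_NR_SEGS; i++) {
		uint64_t bit = UINT64_C(1) << i;

		if (!(pe->segmap & bit))
			continue;

		assert(phb->seg_owner[i] == pe->pe_number);
		phb->seg_owner[i] = phb->reserved_pe;
		pe->segmap &= ~bit;
		phb->segmap &= ~bit;
		released++;
	}

	return released;
}
```

So "release" still programs a mapping, but the target is always the reserved
PE#, never a live one.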

>
>
>>+
>>+			clear_bit(i, pe_segmap);
>>+			clear_bit(i, segmap);
>>+		}
>>+	}
>>+}
>>+
>>+static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>+				  struct pnv_ioda_pe *parent,
>>+				  struct pnv_ioda_pe *child,
>>+				  bool is_add)
>>+{
>>+	const char *desc = is_add ? "adding" : "removing";
>>+	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>+			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>+	struct pnv_ioda_pe *slave;
>>+	long rc;
>>+
>>+	/* Parent PE affects child PE */
>>+	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>+				child->pe_number, op);
>>+	if (rc != OPAL_SUCCESS) {
>>+		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>+			rc, desc);
>>+		return -ENXIO;
>>+	}
>>+
>>+	if (!(child->flags & PNV_IODA_PE_MASTER))
>>+		return 0;
>>+
>>+	/* Compound case: parent PE affects slave PEs */
>>+	list_for_each_entry(slave, &child->slaves, list) {
>>+		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>+					slave->pe_number, op);
>>+		if (rc != OPAL_SUCCESS) {
>>+			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>+				rc, desc);
>>+			return -ENXIO;
>>+		}
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
>>+{
>>+	struct pnv_phb *phb = pe->phb;
>>+	struct pnv_ioda_pe *slave;
>>+	struct pci_dev *pdev = NULL;
>>+	int ret;
>>+
>>+	/*
>>+	 * Clear PE frozen state. If it's master PE, we need
>>+	 * clear slave PE frozen state as well.
>>+	 */
>>+	opal_pci_eeh_freeze_clear(phb->opal_id,
>>+				  pe->pe_number,
>>+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>+		list_for_each_entry(slave, &pe->slaves, list) {
>>+			opal_pci_eeh_freeze_clear(phb->opal_id,
>>+						  slave->pe_number,
>>+						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>+		}
>>+	}
>>+
>>+	/*
>>+	 * Associate PE in PELT. We need add the PE into the
>>+	 * corresponding PELT-V as well. Otherwise, the error
>>+	 * originated from the PE might contribute to other
>>+	 * PEs.
>>+	 */
>>+	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>+	if (ret)
>>+		return ret;
>>+
>>+	/* For compound PEs, any one affects all of them */
>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>+		list_for_each_entry(slave, &pe->slaves, list) {
>>+			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>+			if (ret)
>>+				return ret;
>>+		}
>>+	}
>>+
>>+	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>+		pdev = pe->pbus->self;
>>+	else if (pe->flags & PNV_IODA_PE_DEV)
>>+		pdev = pe->pdev->bus->self;
>>+#ifdef CONFIG_PCI_IOV
>>+	else if (pe->flags & PNV_IODA_PE_VF)
>>+		pdev = pe->parent_dev->bus->self;
>>+#endif /* CONFIG_PCI_IOV */
>>+
>>+	while (pdev) {
>>+		struct pci_dn *pdn = pci_get_pdn(pdev);
>>+		struct pnv_ioda_pe *parent;
>>+
>>+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>+			parent = &phb->ioda.pe_array[pdn->pe_number];
>>+			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>+			if (ret)
>>+				return ret;
>>+		}
>>+
>>+		pdev = pdev->bus->self;
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)
>
>
>It used to be under #ifdef CONFIG_PCI_IOV, now it is not. Looks like just
>moving of this function to a different place deserves a separate patch with a
>comment why ("it is going to be used now for non-SRIOV case too" may be?).
>

Yeah, it makes sense to me. Will fix it up.

>
>>+{
>>+	struct pnv_phb *phb = pe->phb;
>>+	struct pci_dev *parent;
>>+	uint8_t bcomp, dcomp, fcomp;
>>+	long rid_end, rid;
>>+	int64_t rc;
>>+
>>+	/* Tear down MVE */
>>+	if (phb->type == PNV_PHB_IODA1 &&
>>+	    pe->mve_number != -1) {
>>+		rc = opal_pci_set_mve(phb->opal_id,
>>+				      pe->mve_number,
>>+				      phb->ioda.reserved_pe);
>>+		if (rc != OPAL_SUCCESS)
>>+			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
>>+				rc, pe->mve_number);
>>+		rc = opal_pci_set_mve_enable(phb->opal_id,
>>+					     pe->mve_number,
>>+					     OPAL_DISABLE_MVE);
>>+		if (rc != OPAL_SUCCESS)
>>+			pe_warn(pe, "Error %lld disabling MVE#%d\n",
>>+				rc, pe->mve_number);
>>+		pe->mve_number = -1;
>>+	}
>>+
>>+	/* Unmapping PELTV */
>>+	pnv_ioda_set_peltv(pe, false);
>>+
>>+	/* To unmap PELTM */
>>+	if (pe->pbus) {
>>+		int count;
>>+
>>+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>+		parent = pe->pbus->self;
>>+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>+			count = pe->pbus->busn_res.end -
>>+				pe->pbus->busn_res.start + 1;
>>+		else
>>+			count = 1;
>>+
>>+		switch(count) {
>>+		case  1: bcomp = OpalPciBusAll;   break;
>>+		case  2: bcomp = OpalPciBus7Bits; break;
>>+		case  4: bcomp = OpalPciBus6Bits; break;
>>+		case  8: bcomp = OpalPciBus5Bits; break;
>>+		case 16: bcomp = OpalPciBus4Bits; break;
>>+		case 32: bcomp = OpalPciBus3Bits; break;
>>+		default:
>>+			/* Fail back to case of one bus */
>>+			pe_warn(pe, "Cannot support %d buses\n", count);
>>+			bcomp = OpalPciBusAll;
>>+		}
>>+		rid_end = pe->rid + (count << 8);
>>+	} else {
>>+#ifdef CONFIG_PCI_IOV
>>+		if (pe->flags & PNV_IODA_PE_VF)
>>+			parent = pe->parent_dev;
>>+		else
>>+#endif
>>+			parent = pe->pdev->bus->self;
>>+		bcomp = OpalPciBusAll;
>>+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>+		rid_end = pe->rid + 1;
>>+	}
>>+
>>+	/* Clear RID mapping */
>>+	for (rid = pe->rid; rid < rid_end; rid++)
>>+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>+
>>+	/* Unmapping PELTM */
>>+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>+	if (rc)
>>+		pe_warn(pe, "Error %ld unmapping PELTM\n", rc);
>>+}
>>+
>>+static void pnv_ioda_release_pe(struct kref *kref)
>>+{
>>+	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
>>+	struct pnv_ioda_pe *tmp, *slave;
>>+	struct pnv_phb *phb = pe->phb;
>>+
>>+	pnv_ioda_release_pe_dma(pe);
>>+	pnv_ioda_release_pe_seg(pe);
>>+	pnv_ioda_deconfigure_pe(pe);
>>+
>>+	/* Release slave PEs for compound PE */
>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>+		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
>>+			pnv_ioda_pe_put(slave);
>>+	}
>>+
>>+	/* Remove the PE from various list. We need remove slave
>>+	 * PE from master's list.
>>+	 */
>>+	list_del(&pe->dma_link);
>>+	list_del(&pe->list);
>>+
>>+	/* Free PE number */
>>+	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
>>+}
>>+
>>+static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
>>+					    int pe_no)
>>+{
>>+	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
>>+
>>+	kref_init(&pe->kref);
>>+	pe->phb = phb;
>>+	pe->pe_number = pe_no;
>>+	INIT_LIST_HEAD(&pe->dma_link);
>>+	INIT_LIST_HEAD(&pe->list);
>>+
>>+	return pe;
>>+}
>>+
>>+static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
>>+					       int pe_no)
>>+{
>>+	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>+		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>+			__func__, pe_no, phb->hose->global_number);
>>+		return NULL;
>>  	}
>>
>>-	phb->ioda.pe_array[pe_no].phb = phb;
>>-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>+	/*
>>+	 * Same PE might be reserved for multiple times, which
>>+	 * is out of problem actually.
>>+	 */
>>+	set_bit(pe_no, phb->ioda.pe_alloc);
>>+	return pnv_ioda_init_pe(phb, pe_no);
>>  }
>>
>>-static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>+static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>  {
>>  	unsigned long pe_no;
>>  	unsigned long limit = phb->ioda.total_pe - 1;
>>@@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>  			break;
>>
>>  		if (--limit >= phb->ioda.total_pe)
>>-			return IODA_INVALID_PE;
>>+			return NULL;
>>  	} while(1);
>>
>>-	phb->ioda.pe_array[pe_no].phb = phb;
>>-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>-	return pe_no;
>>-}
>>-
>>-static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
>>-{
>>-	WARN_ON(phb->ioda.pe_array[pe].pdev);
>>-
>>-	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
>>-	clear_bit(pe, phb->ioda.pe_alloc);
>>+	return pnv_ioda_init_pe(phb, pe_no);
>>  }
>>
>>  static int pnv_ioda1_init_m64(struct pnv_phb *phb)
>>@@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
>>  	}
>>  }
>>
>>-static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>-				struct pci_bus *bus, int all)
>>+static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>+						struct pci_bus *bus,
>>+						int all)
>
>
>Mechanical changes like this could easily go to a separate patch.
>

Indeed. I'll see how I can split the patches up in the next revision.
Thanks for the suggestion.

>>  {
>>  	resource_size_t segsz = phb->ioda.m64_segsize;
>>  	struct pci_dev *pdev;
>>@@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>  	int i;
>>
>>  	if (!pnv_ioda_need_m64_pe(phb, bus))
>>-		return IODA_INVALID_PE;
>>+		return NULL;
>>
>>          /* Allocate bitmap */
>>  	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>>  	pe_bitsmap = kzalloc(size, GFP_KERNEL);
>>  	if (!pe_bitsmap) {
>>  		pr_warn("%s: Out of memory !\n", __func__);
>>-		return IODA_INVALID_PE;
>>+		return NULL;
>>  	}
>>
>>  	/* The bridge's M64 window might be extended to PHB's M64
>>@@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>  	/* No M64 window found ? */
>>  	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
>>  		kfree(pe_bitsmap);
>>-		return IODA_INVALID_PE;
>>+		return NULL;
>>  	}
>>
>>  	/* Figure out the master PE and put all slave PEs
>>@@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>  	}
>>
>>  	kfree(pe_bitsmap);
>>-	return master_pe->pe_number;
>>+	return master_pe;
>>  }
>>
>>  static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>>@@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>>   * but in the meantime, we need to protect them to avoid warnings
>>   */
>>  #ifdef CONFIG_PCI_MSI
>>-static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>+static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
>>  {
>>  	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>  	struct pnv_phb *phb = hose->private_data;
>>@@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>  }
>>  #endif /* CONFIG_PCI_MSI */
>>
>>-static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>-				  struct pnv_ioda_pe *parent,
>>-				  struct pnv_ioda_pe *child,
>>-				  bool is_add)
>>-{
>>-	const char *desc = is_add ? "adding" : "removing";
>>-	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>-			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>-	struct pnv_ioda_pe *slave;
>>-	long rc;
>>-
>>-	/* Parent PE affects child PE */
>>-	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>-				child->pe_number, op);
>>-	if (rc != OPAL_SUCCESS) {
>>-		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>-			rc, desc);
>>-		return -ENXIO;
>>-	}
>>-
>>-	if (!(child->flags & PNV_IODA_PE_MASTER))
>>-		return 0;
>>-
>>-	/* Compound case: parent PE affects slave PEs */
>>-	list_for_each_entry(slave, &child->slaves, list) {
>>-		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>-					slave->pe_number, op);
>>-		if (rc != OPAL_SUCCESS) {
>>-			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>-				rc, desc);
>>-			return -ENXIO;
>>-		}
>>-	}
>>-
>>-	return 0;
>>-}
>>-
>>-static int pnv_ioda_set_peltv(struct pnv_phb *phb,
>>-			      struct pnv_ioda_pe *pe,
>>-			      bool is_add)
>>-{
>>-	struct pnv_ioda_pe *slave;
>>-	struct pci_dev *pdev = NULL;
>>-	int ret;
>>-
>>-	/*
>>-	 * Clear PE frozen state. If it's master PE, we need
>>-	 * clear slave PE frozen state as well.
>>-	 */
>>-	if (is_add) {
>>-		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
>>-					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>-		if (pe->flags & PNV_IODA_PE_MASTER) {
>>-			list_for_each_entry(slave, &pe->slaves, list)
>>-				opal_pci_eeh_freeze_clear(phb->opal_id,
>>-							  slave->pe_number,
>>-							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>-		}
>>-	}
>>-
>>-	/*
>>-	 * Associate PE in PELT. We need add the PE into the
>>-	 * corresponding PELT-V as well. Otherwise, the error
>>-	 * originated from the PE might contribute to other
>>-	 * PEs.
>>-	 */
>>-	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>-	if (ret)
>>-		return ret;
>>-
>>-	/* For compound PEs, any one affects all of them */
>>-	if (pe->flags & PNV_IODA_PE_MASTER) {
>>-		list_for_each_entry(slave, &pe->slaves, list) {
>>-			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>-			if (ret)
>>-				return ret;
>>-		}
>>-	}
>>-
>>-	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>-		pdev = pe->pbus->self;
>>-	else if (pe->flags & PNV_IODA_PE_DEV)
>>-		pdev = pe->pdev->bus->self;
>>-#ifdef CONFIG_PCI_IOV
>>-	else if (pe->flags & PNV_IODA_PE_VF)
>>-		pdev = pe->parent_dev->bus->self;
>>-#endif /* CONFIG_PCI_IOV */
>>-	while (pdev) {
>>-		struct pci_dn *pdn = pci_get_pdn(pdev);
>>-		struct pnv_ioda_pe *parent;
>>-
>>-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>-			parent = &phb->ioda.pe_array[pdn->pe_number];
>>-			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>-			if (ret)
>>-				return ret;
>>-		}
>>-
>>-		pdev = pdev->bus->self;
>>-	}
>>-
>>-	return 0;
>>-}
>>-
>>-#ifdef CONFIG_PCI_IOV
>>-static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>-{
>>-	struct pci_dev *parent;
>>-	uint8_t bcomp, dcomp, fcomp;
>>-	int64_t rc;
>>-	long rid_end, rid;
>>-
>>-	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
>>-	if (pe->pbus) {
>>-		int count;
>>-
>>-		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>-		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>-		parent = pe->pbus->self;
>>-		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>-			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
>>-		else
>>-			count = 1;
>>-
>>-		switch(count) {
>>-		case  1: bcomp = OpalPciBusAll;         break;
>>-		case  2: bcomp = OpalPciBus7Bits;       break;
>>-		case  4: bcomp = OpalPciBus6Bits;       break;
>>-		case  8: bcomp = OpalPciBus5Bits;       break;
>>-		case 16: bcomp = OpalPciBus4Bits;       break;
>>-		case 32: bcomp = OpalPciBus3Bits;       break;
>>-		default:
>>-			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
>>-			        count);
>>-			/* Do an exact match only */
>>-			bcomp = OpalPciBusAll;
>>-		}
>>-		rid_end = pe->rid + (count << 8);
>>-	} else {
>>-		if (pe->flags & PNV_IODA_PE_VF)
>>-			parent = pe->parent_dev;
>>-		else
>>-			parent = pe->pdev->bus->self;
>>-		bcomp = OpalPciBusAll;
>>-		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>-		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>-		rid_end = pe->rid + 1;
>>-	}
>>-
>>-	/* Clear the reverse map */
>>-	for (rid = pe->rid; rid < rid_end; rid++)
>>-		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>-
>>-	/* Release from all parents PELT-V */
>>-	while (parent) {
>>-		struct pci_dn *pdn = pci_get_pdn(parent);
>>-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>-			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
>>-						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>-			/* XXX What to do in case of error ? */
>
>
>Not much :) Free associated memory and mark it "dead" so it won't be used
>again till reboot. In what circumstance can this opal_pci_set_peltv() fail at
>all?
>

Yeah, maybe. I haven't seen this failure so far, and the code has been
there for almost 4 years, since the day Ben wrote it.

>
>>-		}
>>-		parent = parent->bus->self;
>>-	}
>>-
>>-	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
>>-				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>-
>>-	/* Disassociate PE in PELT */
>>-	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
>>-				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>-	if (rc)
>>-		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
>>-	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>-			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>-	if (rc)
>>-		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
>>-
>>-	pe->pbus = NULL;
>>-	pe->pdev = NULL;
>>-	pe->parent_dev = NULL;
>>-
>>-	return 0;
>>-}
>>-#endif /* CONFIG_PCI_IOV */
>>-
>>  static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>  {
>>  	struct pci_dev *parent;
>>@@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>  	}
>>
>>  	/* Configure PELTV */
>>-	pnv_ioda_set_peltv(phb, pe, true);
>>+	pnv_ioda_set_peltv(pe, true);
>>
>>  	/* Setup reverse map */
>>  	for (rid = pe->rid; rid < rid_end; rid++)
>>@@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>>  		if (pdn->pe_number != IODA_INVALID_PE)
>>  			continue;
>>
>>+		/* Increase reference count of the parent PE */
>
>When you comment like this, I read it as the comment belongs to the whole
>next chunk till the first empty line, i.e. to all 5 lines below, which is not
>the case. I'd remove the comment as 1) "pe_get" in pnv_ioda_pe_get() name
>suggests incrementing the reference counter 2) "pe" is always parent in this
>function. I do not insist though.
>

I agree with your explanation. I'll remove this useless comment.
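
For reference, the counting that pnv_ioda_pe_get()/pnv_ioda_pe_put() are
meant to provide (number of child devices + 1, as described in the cover
letter) can be sketched in plain C. This is only a userspace illustration:
struct pe and the pe_*() helpers are hypothetical names, and a plain
integer stands in for the kref.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pnv_ioda_pe; not the kernel structure. */
struct pe {
	int refcount;	/* plays the role of the kref */
	bool released;	/* set once the last reference is dropped */
};

/* Like kref_init(): the creator holds the first reference. */
static void pe_init(struct pe *pe)
{
	pe->refcount = 1;
	pe->released = false;
}

/* Like pnv_ioda_pe_get(): one extra reference per child device. */
static void pe_get(struct pe *pe)
{
	assert(pe->refcount > 0);
	pe->refcount++;
}

/* Like pnv_ioda_pe_put(): dropping the last reference releases the PE. */
static void pe_put(struct pe *pe)
{
	assert(pe->refcount > 0);
	if (--pe->refcount == 0)
		pe->released = true;
}
```

With two child devices the count goes 1 -> 3, and the PE is only released
by the third put, which matches the "+ 1" in the cover letter.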

>
>>+		pnv_ioda_pe_get(pe);
>>  		pdn->pe_number = pe->pe_number;
>>  		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
>>  		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
>>@@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>  {
>>  	struct pci_controller *hose = pci_bus_to_host(bus);
>>  	struct pnv_phb *phb = hose->private_data;
>>-	struct pnv_ioda_pe *pe;
>>+	struct pnv_ioda_pe *pe = NULL;
>>  	int pe_num = IODA_INVALID_PE;
>>
>>  	/* For partial hotplug case, the PE instance hasn't been destroyed
>>@@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>  	}
>>
>>  	/* PE number for root bus should have been reserved */
>>-	if (pci_is_root_bus(bus))
>>-		pe_num = phb->ioda.root_pe_no;
>>+	if (pci_is_root_bus(bus) &&
>>+	    phb->ioda.root_pe_no != IODA_INVALID_PE)
>>+		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
>>
>>  	/* Check if PE is determined by M64 */
>>-	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
>>-		pe_num = phb->pick_m64_pe(phb, bus, all);
>>+	if (!pe && phb->pick_m64_pe)
>>+		pe = phb->pick_m64_pe(phb, bus, all);
>>
>>  	/* The PE number isn't pinned by M64 */
>>-	if (pe_num == IODA_INVALID_PE)
>>-		pe_num = pnv_ioda_alloc_pe(phb);
>>+	if (!pe)
>>+		pe = pnv_ioda_alloc_pe(phb);
>>
>>-	if (pe_num == IODA_INVALID_PE) {
>>-		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
>>+	if (!pe) {
>>+		pr_warn("%s: No enough PE# available for PCI bus %04x:%02x\n",
>>  			__func__, pci_domain_nr(bus), bus->number);
>>  		return NULL;
>>  	}
>>
>>-	pe = &phb->ioda.pe_array[pe_num];
>>  	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
>>  	pe->pbus = bus;
>>  	pe->pdev = NULL;
>>@@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>
>>  	if (pnv_ioda_configure_pe(phb, pe)) {
>>  		/* XXX What do we do here ? */
>>-		if (pe_num)
>>-			pnv_ioda_free_pe(phb, pe_num);
>>-		pe->pbus = NULL;
>>+		pnv_ioda_pe_put(pe);
>>  		return NULL;
>>  	}
>>
>>  	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
>>-			GFP_KERNEL, hose->node);
>>+				       GFP_KERNEL, hose->node);
>
>Seems like spaces change only - if you really want this change (which I hate
>- makes code look inaccurate to my taste but it seems I am in minority here
>:) ), please put it to the separate patch.
>

Ok. Just to confirm: you prefer the original format? I don't know
why I prefer the latter one. Maybe my eyes are quite broken :-)

>
>>  	pe->tce32_table->data = pe;
>>
>>  	/* Associate it with all child devices */
>>@@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>  		list_del(&pe->list);
>>  		mutex_unlock(&phb->ioda.pe_list_mutex);
>>
>>-		pnv_ioda_deconfigure_pe(phb, pe);
>>+		pnv_ioda_deconfigure_pe(pe);
>
>
>Is this change necessary to get "Release PEs dynamically" working? Maybe
>move it to the mechanical changes patch?
>

Ok. I'll try to do that.

>
>>
>>-		pnv_ioda_free_pe(phb, pe->pe_number);
>>+		pnv_ioda_pe_put(pe);
>>  	}
>>  }
>>
>>@@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>
>>  		if (pnv_ioda_configure_pe(phb, pe)) {
>>  			/* XXX What do we do here ? */
>>-			if (pe_num)
>>-				pnv_ioda_free_pe(phb, pe_num);
>>-			pe->pdev = NULL;
>>+			pnv_ioda_pe_put(pe);
>>  			continue;
>>  		}
>>
>>@@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
>>  	struct pnv_ioda_pe *pe;
>>  	int rc;
>>
>>-	pe = pnv_ioda_get_pe(dev);
>>+	pe = pnv_ioda_pci_dev_to_pe(dev);
>
>
>And this change could go separately. Not clear how this helps to "Release PEs
>dynamically".
>
>

It's not related to "Release PEs dynamically". The change comes from the
function rename: the original pnv_ioda_get_pe() is renamed to
pnv_ioda_pci_dev_to_pe().

>>  	if (!pe)
>>  		return -ENODEV;
>>
>>@@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>>  	struct pnv_ioda_pe *pe;
>>  	int rc;
>>
>>-	if (!(pe = pnv_ioda_get_pe(dev)))
>>+	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
>>  		return -ENODEV;
>>
>>  	/* Assign XIVE to PE */
>>@@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
>>  				  unsigned int hwirq, unsigned int virq,
>>  				  unsigned int is_64, struct msi_msg *msg)
>>  {
>>-	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
>>+	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
>>  	unsigned int xive_num = hwirq - phb->msi_base;
>>  	__be32 data;
>>  	int rc;
>>@@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>  	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
>>  	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
>>  	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
>>+	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
>>  	hose->controller_ops = pnv_pci_controller_ops;
>>
>>  #ifdef CONFIG_PCI_IOV
>>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>index 1bea3a8..8b10f01 100644
>>--- a/arch/powerpc/platforms/powernv/pci.h
>>+++ b/arch/powerpc/platforms/powernv/pci.h
>>@@ -28,6 +28,7 @@ enum pnv_phb_model {
>>  /* Data associated with a PE, including IOMMU tracking etc.. */
>>  struct pnv_phb;
>>  struct pnv_ioda_pe {
>>+	struct kref		kref;
>>  	unsigned long		flags;
>>  	struct pnv_phb		*phb;
>>
>>@@ -120,7 +121,8 @@ struct pnv_phb {
>>  	void (*shutdown)(struct pnv_phb *phb);
>>  	int (*init_m64)(struct pnv_phb *phb);
>>  	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
>>-	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
>>+	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
>>+					   struct pci_bus *bus, int all);
>>  	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>>  	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
>>  	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
>>

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure
  2015-05-09 13:41     ` Alexey Kardashevskiy
@ 2015-05-11  6:45       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  6:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sat, May 09, 2015 at 11:41:05PM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>For PowerNV platform, running on top of skiboot, all PE level reset
>>should be routed to firmware if the bridge of the PE primary bus has
>>device-node property "ibm,reset-by-firmware". Otherwise, the kernel
>>has to issue hot reset on PE's primary bus despite the requested reset
>>types, which is the behaviour before the firmware supports PCI slot
>>reset. So the changes don't depend on the PCI slot reset capability
>>exposed from the firmware.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/eeh.h               |   1 +
>>  arch/powerpc/include/asm/opal.h              |   4 +-
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 206 +++++++++++++--------------
>>  3 files changed, 102 insertions(+), 109 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index c5eb86f..2793d24 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -190,6 +190,7 @@ enum {
>>  #define EEH_RESET_DEACTIVATE	0	/* Deactivate the PE reset	*/
>>  #define EEH_RESET_HOT		1	/* Hot reset			*/
>>  #define EEH_RESET_FUNDAMENTAL	3	/* Fundamental reset		*/
>>+#define EEH_RESET_COMPLETE	4	/* PHB complete reset           */
>>  #define EEH_LOG_TEMP		1	/* EEH temporary error log	*/
>>  #define EEH_LOG_PERM		2	/* EEH permanent error log	*/
>>
>>diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
>>index 042af1a..6d467df 100644
>>--- a/arch/powerpc/include/asm/opal.h
>>+++ b/arch/powerpc/include/asm/opal.h
>>@@ -129,7 +129,7 @@ int64_t opal_pci_map_pe_dma_window(uint64_t phb_id, uint16_t pe_number, uint16_t
>>  int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id, uint16_t pe_number,
>>  					uint16_t dma_window_number, uint64_t pci_start_addr,
>>  					uint64_t pci_mem_size);
>>-int64_t opal_pci_reset(uint64_t phb_id, uint8_t reset_scope, uint8_t assert_state);
>>+int64_t opal_pci_reset(uint64_t id, uint8_t reset_scope, uint8_t assert_state);
>>
>>  int64_t opal_pci_get_hub_diag_data(uint64_t hub_id, void *diag_buffer,
>>  				   uint64_t diag_buffer_len);
>>@@ -145,7 +145,7 @@ int64_t opal_get_epow_status(__be64 *status);
>>  int64_t opal_set_system_attention_led(uint8_t led_action);
>>  int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe,
>>  			    __be16 *pci_error_type, __be16 *severity);
>>-int64_t opal_pci_poll(uint64_t phb_id);
>>+int64_t opal_pci_poll(uint64_t id, uint8_t *val);
>>  int64_t opal_return_cpu(void);
>>  int64_t opal_check_token(uint64_t token);
>>  int64_t opal_reinit_cpus(uint64_t flags);
>>diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>index ce738ab..3c01095 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>@@ -742,12 +742,12 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
>>  	return ret;
>>  }
>>
>>-static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
>>+static s64 pnv_eeh_poll(uint64_t id)
>>  {
>>  	s64 rc = OPAL_HARDWARE;
>>
>>  	while (1) {
>>-		rc = opal_pci_poll(phb->opal_id);
>>+		rc = opal_pci_poll(id, NULL);
>>  		if (rc <= 0)
>>  			break;
>>
>>@@ -763,84 +763,38 @@ static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
>>  int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
>>  {
>>  	struct pnv_phb *phb = hose->private_data;
>>+	uint8_t scope;
>>  	s64 rc = OPAL_HARDWARE;
>>
>>  	pr_debug("%s: Reset PHB#%x, option=%d\n",
>>  		 __func__, hose->global_number, option);
>>-
>>-	/* Issue PHB complete reset request */
>>-	if (option == EEH_RESET_FUNDAMENTAL ||
>>-	    option == EEH_RESET_HOT)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PHB_COMPLETE,
>>-				    OPAL_ASSERT_RESET);
>>-	else if (option == EEH_RESET_DEACTIVATE)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PHB_COMPLETE,
>>-				    OPAL_DEASSERT_RESET);
>>-	if (rc < 0)
>>-		goto out;
>>-
>>-	/*
>>-	 * Poll state of the PHB until the request is done
>>-	 * successfully. The PHB reset is usually PHB complete
>>-	 * reset followed by hot reset on root bus. So we also
>>-	 * need the PCI bus settlement delay.
>>-	 */
>>-	rc = pnv_eeh_phb_poll(phb);
>>-	if (option == EEH_RESET_DEACTIVATE) {
>>-		if (system_state < SYSTEM_RUNNING)
>>-			udelay(1000 * EEH_PE_RST_SETTLE_TIME);
>>-		else
>>-			msleep(EEH_PE_RST_SETTLE_TIME);
>
>
>These udelay() and msleep() are gone. How come they are not needed anymore?
>Worth commenting in the commit log or remove those in a separate patch.
>
>I just remember you mentioning some missing delays somewhere which caused
>NVIDIA device to issue EEH and I do not want those to disappear :)
>

Yeah, I think you're correct that it's not safe to remove this yet because
the old firmware (without the corresponding PCI hotplug changes) doesn't have
the required delays in opal_pci_poll() yet. I'll add it back in the next
revision.
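
A rough sketch of the behaviour to keep, assuming that a positive return
from the poll call means "retry after that many milliseconds" (the part of
the loop that handles this is not visible in the hunk above). Everything
here is simulated: struct fake_fw, fake_poll() and fake_delay() are
made-up stand-ins, and SETTLE_MS is a placeholder for
EEH_PE_RST_SETTLE_TIME.

```c
#include <stdint.h>

#define SETTLE_MS 1800u	/* placeholder for EEH_PE_RST_SETTLE_TIME */

/* Simulated firmware state; not an OPAL interface. */
struct fake_fw {
	int busy_polls;		/* polls that still report "busy" */
	unsigned int waited_ms;	/* total delay the caller has applied */
};

/* Stand-in for opal_pci_poll(): >0 is taken as "retry after N ms". */
static int64_t fake_poll(struct fake_fw *fw)
{
	return (fw->busy_polls-- > 0) ? 10 : 0;
}

/* Stand-in for msleep()/udelay(): records the delay instead of sleeping. */
static void fake_delay(struct fake_fw *fw, unsigned int ms)
{
	fw->waited_ms += ms;
}

static int64_t poll_then_settle(struct fake_fw *fw, int deassert)
{
	int64_t rc;

	/* Poll until the firmware reports completion (0) or an error (<0). */
	while ((rc = fake_poll(fw)) > 0)
		fake_delay(fw, (unsigned int)rc);

	/* Kernel-side settle delay, kept for firmware that does not wait. */
	if (deassert)
		fake_delay(fw, SETTLE_MS);

	return rc;
}
```

The point is only that the settle delay stays on the deassert path in the
kernel until all supported firmware is known to wait on its own.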

>
>>+	switch (option) {
>>+	case EEH_RESET_HOT:
>>+		scope = OPAL_RESET_PCI_HOT;
>>+		break;
>>+	case EEH_RESET_FUNDAMENTAL:
>>+		scope = OPAL_RESET_PCI_FUNDAMENTAL;
>>+		break;
>>+	case EEH_RESET_COMPLETE:
>>+		scope = OPAL_RESET_PHB_COMPLETE;
>>+		break;
>>+	case EEH_RESET_DEACTIVATE:
>>+		return 0;
>>+	default:
>>+		pr_warn("%s: Unsupported option %d\n",
>>+			__func__, option);
>>+		return -EINVAL;
>>  	}
>>-out:
>>-	if (rc != OPAL_SUCCESS)
>>-		return -EIO;
>>
>>-	return 0;
>>-}
>>-
>>-static int pnv_eeh_root_reset(struct pci_controller *hose, int option)
>>-{
>>-	struct pnv_phb *phb = hose->private_data;
>>-	s64 rc = OPAL_HARDWARE;
>>+	/* Issue reset and poll until it's completed */
>>+	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
>>+	if (rc > 0)
>>+		rc = pnv_eeh_poll(phb->opal_id);
>>
>>-	pr_debug("%s: Reset PHB#%x, option=%d\n",
>>-		 __func__, hose->global_number, option);
>>-
>>-	/*
>>-	 * During the reset deassert time, we needn't care
>>-	 * the reset scope because the firmware does nothing
>>-	 * for fundamental or hot reset during deassert phase.
>>-	 */
>>-	if (option == EEH_RESET_FUNDAMENTAL)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PCI_FUNDAMENTAL,
>>-				    OPAL_ASSERT_RESET);
>>-	else if (option == EEH_RESET_HOT)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PCI_HOT,
>>-				    OPAL_ASSERT_RESET);
>>-	else if (option == EEH_RESET_DEACTIVATE)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PCI_HOT,
>>-				    OPAL_DEASSERT_RESET);
>>-	if (rc < 0)
>>-		goto out;
>>-
>>-	/* Poll state of the PHB until the request is done */
>>-	rc = pnv_eeh_phb_poll(phb);
>>-	if (option == EEH_RESET_DEACTIVATE)
>>-		msleep(EEH_PE_RST_SETTLE_TIME);
>>-out:
>>-	if (rc != OPAL_SUCCESS)
>>-		return -EIO;
>>-
>>-	return 0;
>>+	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>  }
>>
>>-static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>+static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>  {
>>  	struct pci_dn *pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
>>  	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
>>@@ -891,14 +845,57 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>  	return 0;
>>  }
>>
>>+static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>+{
>>+	struct pci_controller *hose;
>>+	struct pnv_phb *phb;
>>+	struct device_node *dn = dev ? pci_device_to_OF_node(dev) : NULL;
>>+	uint64_t id = (0x1ul << 60);
>>+	uint8_t scope;
>>+	s64 rc;
>
>
>int64_t for @rc?
>
>

Yes.

>>+
>>+	/*
>>+	 * If the firmware can't handle it, we will issue hot reset
>>+	 * on the secondary bus despite the requested reset type
>>+	 */
>>+	if (!dn || !of_get_property(dn, "ibm,reset-by-firmware", NULL))
>>+		return __pnv_eeh_bridge_reset(dev, option);
>>+
>>+	/* The firmware can handle the request */
>>+	switch (option) {
>>+	case EEH_RESET_HOT:
>>+		scope = OPAL_RESET_PCI_HOT;
>>+		break;
>>+	case EEH_RESET_FUNDAMENTAL:
>>+		scope = OPAL_RESET_PCI_FUNDAMENTAL;
>>+		break;
>>+	case EEH_RESET_DEACTIVATE:
>>+		return 0;
>>+	case EEH_RESET_COMPLETE:
>>+	default:
>>+		pr_warn("%s: Unsupported option %d on device %s\n",
>>+			__func__, option, pci_name(dev));
>>+		return -EINVAL;
>>+	}
>
>
>This is the same switch as earlier in this patch (slightly different order).
>Move it and opal_pci_reset() into a helper and call it pnv_opal_pci_reset()?
>
>

It sounds like a good idea. I'll do that.
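
Something along these lines could serve as the shared part of such a
helper. It is a standalone sketch, not the final code: the EEH_RESET_*
values are the ones from the eeh.h hunk in this patch, while
enum sketch_scope, eeh_option_to_scope() and the allow_complete flag are
hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Values copied from the eeh.h hunk quoted in this thread. */
#define EEH_RESET_DEACTIVATE	0
#define EEH_RESET_HOT		1
#define EEH_RESET_FUNDAMENTAL	3
#define EEH_RESET_COMPLETE	4

/* Placeholder scopes; the real OPAL_RESET_* values live in the OPAL headers. */
enum sketch_scope {
	SCOPE_NONE,		/* nothing to assert (deactivate) */
	SCOPE_HOT,
	SCOPE_FUNDAMENTAL,
	SCOPE_PHB_COMPLETE,
};

/*
 * Translate an EEH reset option into a firmware reset scope.
 * Returns 0 with *scope set, or -EINVAL for an unsupported option.
 * allow_complete is true for the PHB path and false for the bridge path.
 */
static int eeh_option_to_scope(int option, bool allow_complete,
			       enum sketch_scope *scope)
{
	switch (option) {
	case EEH_RESET_HOT:
		*scope = SCOPE_HOT;
		return 0;
	case EEH_RESET_FUNDAMENTAL:
		*scope = SCOPE_FUNDAMENTAL;
		return 0;
	case EEH_RESET_COMPLETE:
		if (!allow_complete)
			return -EINVAL;
		*scope = SCOPE_PHB_COMPLETE;
		return 0;
	case EEH_RESET_DEACTIVATE:
		*scope = SCOPE_NONE;
		return 0;
	default:
		return -EINVAL;
	}
}
```

Both callers would then differ only in the id they pass to
opal_pci_reset() and in whether a complete reset is allowed.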

>>+
>>+	hose = pci_bus_to_host(dev->bus);
>>+	phb = hose->private_data;
>
>Previously you would initialize @hose and @phb where you declared those but
>not here. If you did the same thing as before, the patch could have been
>smaller and easier to read.
>

Sure.
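
For the identifier assembled just below, a defensive way to write the
packing is to widen each field to 64 bits before shifting, so the result
does not depend on the types of the individual fields. The layout (bit 60
set, bus number shifted by 24, devfn shifted by 16, PHB id in the low
bits) is read straight from the patch; slot_id() itself is a hypothetical
helper name, and the meaning of bit 60 is defined by the firmware
interface rather than by this sketch.

```c
#include <stdint.h>

/* Pack a slot identifier using the field layout shown in the patch. */
static uint64_t slot_id(uint64_t phb_opal_id, uint8_t bus, uint8_t devfn)
{
	return (UINT64_C(1) << 60) |
	       ((uint64_t)bus << 24) |
	       ((uint64_t)devfn << 16) |
	       phb_opal_id;
}
```

For example, PHB id 0x2 with bus 0x80 and devfn 0x08 packs to
0x1000000080080002.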

>>+	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
>>+	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
>>+	if (rc > 0)
>>+		rc = pnv_eeh_poll(id);
>>+
>>+	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>+}
>>+
>>  void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>  {
>>  	struct pci_controller *hose;
>>
>>  	if (pci_is_root_bus(dev->bus)) {
>>  		hose = pci_bus_to_host(dev->bus);
>>-		pnv_eeh_root_reset(hose, EEH_RESET_HOT);
>>-		pnv_eeh_root_reset(hose, EEH_RESET_DEACTIVATE);
>>+		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
>>+		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
>>  	} else {
>>  		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>>  		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
>>@@ -920,8 +917,9 @@ void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>  static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>>  {
>>  	struct pci_controller *hose = pe->phb;
>>+	struct pnv_phb *phb;
>>  	struct pci_bus *bus;
>>-	int ret;
>>+	s64 rc;
>>
>>  	/*
>>  	 * For PHB reset, we always have complete reset. For those PEs whose
>>@@ -937,43 +935,37 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>>  	 * reset. The side effect is that EEH core has to clear the frozen
>>  	 * state explicitly after BAR restore.
>>  	 */
>>-	if (pe->type & EEH_PE_PHB) {
>>-		ret = pnv_eeh_phb_reset(hose, option);
>>-	} else {
>>-		struct pnv_phb *phb;
>>-		s64 rc;
>>+	if (pe->type & EEH_PE_PHB)
>
>I would keep "{" in the line above ....
>
>>+		return pnv_eeh_phb_reset(hose, EEH_RESET_COMPLETE);
>
>...put "} else {" here...
>
>and the chunk below would become 1) very small 2) very trivial... And then
>you could make a trivial patch which would do scope removal but without
>functional changes. Or vice versa.
>

I intended to remove the nested if(). If you really want me to change the
code according to your comments, I will. Otherwise, I prefer to keep it as
it is.
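
To show what I mean by removing the nesting, the routing after this patch
reduces to a flat chain of early returns. The sketch below is standalone
and uses hypothetical names (enum reset_route, pick_reset_route()); the
two booleans stand for (pe->type & EEH_PE_PHB) and pci_is_root_bus(bus).

```c
#include <stdbool.h>

/* Hypothetical labels for the three paths taken by pnv_eeh_reset(). */
enum reset_route {
	ROUTE_PHB_COMPLETE,	/* PHB PE: always a complete reset */
	ROUTE_PHB,		/* PE on the root bus: reset through the PHB */
	ROUTE_BRIDGE,		/* anything else: reset via upstream bridge */
};

static enum reset_route pick_reset_route(bool pe_is_phb, bool bus_is_root)
{
	if (pe_is_phb)
		return ROUTE_PHB_COMPLETE;

	if (bus_is_root)
		return ROUTE_PHB;

	return ROUTE_BRIDGE;
}
```

Each case returns as soon as it is decided, so no else branch or extra
scope is needed for the local variables.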

>>
>>-		/*
>>-		 * The frozen PE might be caused by PAPR error injection
>>-		 * registers, which are expected to be cleared after hitting
>>-		 * frozen PE as stated in the hardware spec. Unfortunately,
>>-		 * that's not true on P7IOC. So we have to clear it manually
>>-		 * to avoid recursive EEH errors during recovery.
>>-		 */
>>-		phb = hose->private_data;
>>-		if (phb->model == PNV_PHB_MODEL_P7IOC &&
>>-		    (option == EEH_RESET_HOT ||
>>-		    option == EEH_RESET_FUNDAMENTAL)) {
>>-			rc = opal_pci_reset(phb->opal_id,
>>-					    OPAL_RESET_PHB_ERROR,
>>-					    OPAL_ASSERT_RESET);
>>-			if (rc != OPAL_SUCCESS) {
>>-				pr_warn("%s: Failure %lld clearing "
>>-					"error injection registers\n",
>>-					__func__, rc);
>>-				return -EIO;
>>-			}
>>+	/*
>>+	 * The frozen PE might be caused by PAPR error injection
>>+	 * registers, which are expected to be cleared after hitting
>>+	 * frozen PE as stated in the hardware spec. Unfortunately,
>>+	 * that's not true on P7IOC. So we have to clear it manually
>>+	 * to avoid recursive EEH errors during recovery.
>>+	 */
>>+	phb = hose->private_data;
>>+	if (phb->model == PNV_PHB_MODEL_P7IOC &&
>>+	    (option == EEH_RESET_HOT ||
>>+	    option == EEH_RESET_FUNDAMENTAL)) {
>>+		rc = opal_pci_reset(phb->opal_id,
>>+				    OPAL_RESET_PHB_ERROR,
>>+				    OPAL_ASSERT_RESET);
>>+		if (rc != OPAL_SUCCESS) {
>>+			pr_warn("%s: Failure %lld clearing error "
>>+				"injection registers on PHB#%d\n",
>>+				__func__, rc, hose->global_number);
>>+			return -EIO;
>>  		}
>>-
>>-		bus = eeh_pe_bus_get(pe);
>>-		if (pci_is_root_bus(bus) ||
>>-			pci_is_root_bus(bus->parent))
>>-			ret = pnv_eeh_root_reset(hose, option);
>>-		else
>>-			ret = pnv_eeh_bridge_reset(bus->self, option);
>>  	}
>>
>>-	return ret;
>>+	/* Route the reset request to PHB or upstream bridge */
>>+	bus = eeh_pe_bus_get(pe);
>>+	if (pci_is_root_bus(bus))
>>+		return pnv_eeh_phb_reset(hose, option);
>>+
>>+	return pnv_eeh_bridge_reset(bus->self, option);
>>  }
>>
>>  /**
>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure
@ 2015-05-11  6:45       ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  6:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: bhelgaas, linux-pci, linuxppc-dev, Gavin Shan

On Sat, May 09, 2015 at 11:41:05PM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>For PowerNV platform, running on top of skiboot, all PE level reset
>>should be routed to firmware if the bridge of the PE primary bus has
>>device-node property "ibm,reset-by-firmware". Otherwise, the kernel
>>has to issue hot reset on PE's primary bus despite the requested reset
>>types, which is the behaviour before the firmware supports PCI slot
>>reset. So the changes don't depend on the PCI slot reset capability
>>exposed from the firmware.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/eeh.h               |   1 +
>>  arch/powerpc/include/asm/opal.h              |   4 +-
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 206 +++++++++++++--------------
>>  3 files changed, 102 insertions(+), 109 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index c5eb86f..2793d24 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -190,6 +190,7 @@ enum {
>>  #define EEH_RESET_DEACTIVATE	0	/* Deactivate the PE reset	*/
>>  #define EEH_RESET_HOT		1	/* Hot reset			*/
>>  #define EEH_RESET_FUNDAMENTAL	3	/* Fundamental reset		*/
>>+#define EEH_RESET_COMPLETE	4	/* PHB complete reset           */
>>  #define EEH_LOG_TEMP		1	/* EEH temporary error log	*/
>>  #define EEH_LOG_PERM		2	/* EEH permanent error log	*/
>>
>>diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
>>index 042af1a..6d467df 100644
>>--- a/arch/powerpc/include/asm/opal.h
>>+++ b/arch/powerpc/include/asm/opal.h
>>@@ -129,7 +129,7 @@ int64_t opal_pci_map_pe_dma_window(uint64_t phb_id, uint16_t pe_number, uint16_t
>>  int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id, uint16_t pe_number,
>>  					uint16_t dma_window_number, uint64_t pci_start_addr,
>>  					uint64_t pci_mem_size);
>>-int64_t opal_pci_reset(uint64_t phb_id, uint8_t reset_scope, uint8_t assert_state);
>>+int64_t opal_pci_reset(uint64_t id, uint8_t reset_scope, uint8_t assert_state);
>>
>>  int64_t opal_pci_get_hub_diag_data(uint64_t hub_id, void *diag_buffer,
>>  				   uint64_t diag_buffer_len);
>>@@ -145,7 +145,7 @@ int64_t opal_get_epow_status(__be64 *status);
>>  int64_t opal_set_system_attention_led(uint8_t led_action);
>>  int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe,
>>  			    __be16 *pci_error_type, __be16 *severity);
>>-int64_t opal_pci_poll(uint64_t phb_id);
>>+int64_t opal_pci_poll(uint64_t id, uint8_t *val);
>>  int64_t opal_return_cpu(void);
>>  int64_t opal_check_token(uint64_t token);
>>  int64_t opal_reinit_cpus(uint64_t flags);
>>diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>index ce738ab..3c01095 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>@@ -742,12 +742,12 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
>>  	return ret;
>>  }
>>
>>-static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
>>+static s64 pnv_eeh_poll(uint64_t id)
>>  {
>>  	s64 rc = OPAL_HARDWARE;
>>
>>  	while (1) {
>>-		rc = opal_pci_poll(phb->opal_id);
>>+		rc = opal_pci_poll(id, NULL);
>>  		if (rc <= 0)
>>  			break;
>>
>>@@ -763,84 +763,38 @@ static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
>>  int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
>>  {
>>  	struct pnv_phb *phb = hose->private_data;
>>+	uint8_t scope;
>>  	s64 rc = OPAL_HARDWARE;
>>
>>  	pr_debug("%s: Reset PHB#%x, option=%d\n",
>>  		 __func__, hose->global_number, option);
>>-
>>-	/* Issue PHB complete reset request */
>>-	if (option == EEH_RESET_FUNDAMENTAL ||
>>-	    option == EEH_RESET_HOT)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PHB_COMPLETE,
>>-				    OPAL_ASSERT_RESET);
>>-	else if (option == EEH_RESET_DEACTIVATE)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PHB_COMPLETE,
>>-				    OPAL_DEASSERT_RESET);
>>-	if (rc < 0)
>>-		goto out;
>>-
>>-	/*
>>-	 * Poll state of the PHB until the request is done
>>-	 * successfully. The PHB reset is usually PHB complete
>>-	 * reset followed by hot reset on root bus. So we also
>>-	 * need the PCI bus settlement delay.
>>-	 */
>>-	rc = pnv_eeh_phb_poll(phb);
>>-	if (option == EEH_RESET_DEACTIVATE) {
>>-		if (system_state < SYSTEM_RUNNING)
>>-			udelay(1000 * EEH_PE_RST_SETTLE_TIME);
>>-		else
>>-			msleep(EEH_PE_RST_SETTLE_TIME);
>
>
>These udelay() and msleep() are gone. How come they are not needed anymore?
>Worth commenting in the commit log or remove those in a separate patch.
>
>I just remember you mentioning some missing delays somewhere which caused
>NVIDIA device to issue EEH and I do not want those to disappear :)
>

Yeah, I think you're correct that it's not safe to remove this yet because
the old firmware (without the corresponding PCI hotplug changes) doesn't have
the required delays in opal_pci_poll() yet. I'll restore the delays in the
next revision.
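
A minimal userspace sketch of the ordering at stake, not the kernel code:
the caller polls until the firmware reports completion and then still waits
for the bus to settle when the firmware is not known to include that delay.
fake_opal_pci_poll(), fake_msleep(), SETTLE_MS and the firmware_has_delay
flag are hypothetical stand-ins introduced for illustration, and treating a
positive poll result as a wait time in milliseconds is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define OPAL_SUCCESS	0
#define SETTLE_MS	100	/* placeholder, not the kernel's settle time */

static int polls_left = 3;	/* how many times the stub reports "busy" */
static unsigned int slept_ms;	/* total time the stubbed sleep was asked for */

/* Stub firmware poll: > 0 means "busy, retry after this many ms" (assumed). */
static int64_t fake_opal_pci_poll(uint64_t id)
{
	(void)id;
	if (polls_left > 0) {
		polls_left--;
		return 10;
	}
	return OPAL_SUCCESS;
}

/* Stub sleep: only records the requested time. */
static void fake_msleep(unsigned int ms)
{
	slept_ms += ms;
}

/* Finish a reset deassert: poll to completion, then settle if needed. */
static int reset_deassert(uint64_t id, int firmware_has_delay)
{
	int64_t rc;

	while ((rc = fake_opal_pci_poll(id)) > 0)
		fake_msleep((unsigned int)rc);

	/* Leaving the loop means rc is a final status, not a wait time. */
	assert(rc <= 0);

	/* Old firmware: the caller still owns the settle delay. */
	if (rc == OPAL_SUCCESS && !firmware_has_delay)
		fake_msleep(SETTLE_MS);

	return rc == OPAL_SUCCESS ? 0 : -EIO;
}
```

With the stub reporting "busy" three times, reset_deassert(id, 0) sleeps for
the three poll intervals plus SETTLE_MS, while reset_deassert(id, 1) sleeps
for the poll intervals only.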

>
>>+	switch (option) {
>>+	case EEH_RESET_HOT:
>>+		scope = OPAL_RESET_PCI_HOT;
>>+		break;
>>+	case EEH_RESET_FUNDAMENTAL:
>>+		scope = OPAL_RESET_PCI_FUNDAMENTAL;
>>+		break;
>>+	case EEH_RESET_COMPLETE:
>>+		scope = OPAL_RESET_PHB_COMPLETE;
>>+		break;
>>+	case EEH_RESET_DEACTIVATE:
>>+		return 0;
>>+	default:
>>+		pr_warn("%s: Unsupported option %d\n",
>>+			__func__, option);
>>+		return -EINVAL;
>>  	}
>>-out:
>>-	if (rc != OPAL_SUCCESS)
>>-		return -EIO;
>>
>>-	return 0;
>>-}
>>-
>>-static int pnv_eeh_root_reset(struct pci_controller *hose, int option)
>>-{
>>-	struct pnv_phb *phb = hose->private_data;
>>-	s64 rc = OPAL_HARDWARE;
>>+	/* Issue reset and poll until it's completed */
>>+	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
>>+	if (rc > 0)
>>+		rc = pnv_eeh_poll(phb->opal_id);
>>
>>-	pr_debug("%s: Reset PHB#%x, option=%d\n",
>>-		 __func__, hose->global_number, option);
>>-
>>-	/*
>>-	 * During the reset deassert time, we needn't care
>>-	 * the reset scope because the firmware does nothing
>>-	 * for fundamental or hot reset during deassert phase.
>>-	 */
>>-	if (option == EEH_RESET_FUNDAMENTAL)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PCI_FUNDAMENTAL,
>>-				    OPAL_ASSERT_RESET);
>>-	else if (option == EEH_RESET_HOT)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PCI_HOT,
>>-				    OPAL_ASSERT_RESET);
>>-	else if (option == EEH_RESET_DEACTIVATE)
>>-		rc = opal_pci_reset(phb->opal_id,
>>-				    OPAL_RESET_PCI_HOT,
>>-				    OPAL_DEASSERT_RESET);
>>-	if (rc < 0)
>>-		goto out;
>>-
>>-	/* Poll state of the PHB until the request is done */
>>-	rc = pnv_eeh_phb_poll(phb);
>>-	if (option == EEH_RESET_DEACTIVATE)
>>-		msleep(EEH_PE_RST_SETTLE_TIME);
>>-out:
>>-	if (rc != OPAL_SUCCESS)
>>-		return -EIO;
>>-
>>-	return 0;
>>+	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>  }
>>
>>-static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>+static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>  {
>>  	struct pci_dn *pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
>>  	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
>>@@ -891,14 +845,57 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>  	return 0;
>>  }
>>
>>+static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>+{
>>+	struct pci_controller *hose;
>>+	struct pnv_phb *phb;
>>+	struct device_node *dn = dev ? pci_device_to_OF_node(dev) : NULL;
>>+	uint64_t id = (0x1ul << 60);
>>+	uint8_t scope;
>>+	s64 rc;
>
>
>int64_t for @rc?
>
>

Yes.

>>+
>>+	/*
>>+	 * If the firmware can't handle it, we will issue hot reset
>>+	 * on the secondary bus despite the requested reset type
>>+	 */
>>+	if (!dn || !of_get_property(dn, "ibm,reset-by-firmware", NULL))
>>+		return __pnv_eeh_bridge_reset(dev, option);
>>+
>>+	/* The firmware can handle the request */
>>+	switch (option) {
>>+	case EEH_RESET_HOT:
>>+		scope = OPAL_RESET_PCI_HOT;
>>+		break;
>>+	case EEH_RESET_FUNDAMENTAL:
>>+		scope = OPAL_RESET_PCI_FUNDAMENTAL;
>>+		break;
>>+	case EEH_RESET_DEACTIVATE:
>>+		return 0;
>>+	case EEH_RESET_COMPLETE:
>>+	default:
>>+		pr_warn("%s: Unsupported option %d on device %s\n",
>>+			__func__, option, pci_name(dev));
>>+		return -EINVAL;
>>+	}
>
>
>This is the same switch as earlier in this patch (slightly different order).
>Move it and opal_pci_reset() into a helper and call it pnv_opal_pci_reset()?
>
>

That sounds like a good idea. I'll do that.
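
A rough sketch of what the shared part of such a helper might look like,
assuming the only difference between the two switches is whether a complete
reset is acceptable. eeh_option_to_scope() and its allow_complete parameter
are hypothetical names, and the enum values below are placeholders rather
than the values from the kernel headers.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder values; the real constants come from the kernel headers. */
enum {
	EEH_RESET_DEACTIVATE,
	EEH_RESET_HOT,
	EEH_RESET_FUNDAMENTAL,
	EEH_RESET_COMPLETE,
};

enum {
	OPAL_RESET_PHB_COMPLETE = 1,
	OPAL_RESET_PCI_HOT,
	OPAL_RESET_PCI_FUNDAMENTAL,
};

/*
 * Translate an EEH reset option into an OPAL reset scope.
 * Returns the scope (> 0), 0 when there is nothing to issue
 * (deactivate), or -EINVAL for an unsupported option.
 */
static int eeh_option_to_scope(int option, int allow_complete)
{
	switch (option) {
	case EEH_RESET_HOT:
		return OPAL_RESET_PCI_HOT;
	case EEH_RESET_FUNDAMENTAL:
		return OPAL_RESET_PCI_FUNDAMENTAL;
	case EEH_RESET_COMPLETE:
		return allow_complete ? OPAL_RESET_PHB_COMPLETE : -EINVAL;
	case EEH_RESET_DEACTIVATE:
		return 0;
	default:
		return -EINVAL;
	}
}

/* Usage: the PHB path accepts a complete reset, the bridge path does not. */
static void eeh_option_to_scope_example(void)
{
	assert(eeh_option_to_scope(EEH_RESET_COMPLETE, 1) == OPAL_RESET_PHB_COMPLETE);
	assert(eeh_option_to_scope(EEH_RESET_COMPLETE, 0) == -EINVAL);
}
```

Both callers would then issue opal_pci_reset() and poll only when the
returned scope is positive, which keeps the two switches from drifting apart.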

>>+
>>+	hose = pci_bus_to_host(dev->bus);
>>+	phb = hose->private_data;
>
>Previously you would initialize @hose and @phb where you declared those but
>not here. If you did the same thing as before, the patch could have been
>smaller and easier to read.
>

Sure.

>>+	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
>>+	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
>>+	if (rc > 0)
>>+		rc = pnv_eeh_poll(id);
>>+
>>+	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>+}
>>+
>>  void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>  {
>>  	struct pci_controller *hose;
>>
>>  	if (pci_is_root_bus(dev->bus)) {
>>  		hose = pci_bus_to_host(dev->bus);
>>-		pnv_eeh_root_reset(hose, EEH_RESET_HOT);
>>-		pnv_eeh_root_reset(hose, EEH_RESET_DEACTIVATE);
>>+		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
>>+		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
>>  	} else {
>>  		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>>  		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
>>@@ -920,8 +917,9 @@ void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>  static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>>  {
>>  	struct pci_controller *hose = pe->phb;
>>+	struct pnv_phb *phb;
>>  	struct pci_bus *bus;
>>-	int ret;
>>+	s64 rc;
>>
>>  	/*
>>  	 * For PHB reset, we always have complete reset. For those PEs whose
>>@@ -937,43 +935,37 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>>  	 * reset. The side effect is that EEH core has to clear the frozen
>>  	 * state explicitly after BAR restore.
>>  	 */
>>-	if (pe->type & EEH_PE_PHB) {
>>-		ret = pnv_eeh_phb_reset(hose, option);
>>-	} else {
>>-		struct pnv_phb *phb;
>>-		s64 rc;
>>+	if (pe->type & EEH_PE_PHB)
>
>I would keep "{" in the line above ....
>
>>+		return pnv_eeh_phb_reset(hose, EEH_RESET_COMPLETE);
>
>...put "} else {" here...
>
>and the chunk below would become 1) very small 2) very trivial... And then
>you could make a trivial patch which would do scope removal but without
>functional changes. Or vice versa.
>

I intended to remove the nested if(). If you really want me to change the
code according to your comments, I will. Otherwise, I'd prefer to keep it
as is.

>>
>>-		/*
>>-		 * The frozen PE might be caused by PAPR error injection
>>-		 * registers, which are expected to be cleared after hitting
>>-		 * frozen PE as stated in the hardware spec. Unfortunately,
>>-		 * that's not true on P7IOC. So we have to clear it manually
>>-		 * to avoid recursive EEH errors during recovery.
>>-		 */
>>-		phb = hose->private_data;
>>-		if (phb->model == PNV_PHB_MODEL_P7IOC &&
>>-		    (option == EEH_RESET_HOT ||
>>-		    option == EEH_RESET_FUNDAMENTAL)) {
>>-			rc = opal_pci_reset(phb->opal_id,
>>-					    OPAL_RESET_PHB_ERROR,
>>-					    OPAL_ASSERT_RESET);
>>-			if (rc != OPAL_SUCCESS) {
>>-				pr_warn("%s: Failure %lld clearing "
>>-					"error injection registers\n",
>>-					__func__, rc);
>>-				return -EIO;
>>-			}
>>+	/*
>>+	 * The frozen PE might be caused by PAPR error injection
>>+	 * registers, which are expected to be cleared after hitting
>>+	 * frozen PE as stated in the hardware spec. Unfortunately,
>>+	 * that's not true on P7IOC. So we have to clear it manually
>>+	 * to avoid recursive EEH errors during recovery.
>>+	 */
>>+	phb = hose->private_data;
>>+	if (phb->model == PNV_PHB_MODEL_P7IOC &&
>>+	    (option == EEH_RESET_HOT ||
>>+	    option == EEH_RESET_FUNDAMENTAL)) {
>>+		rc = opal_pci_reset(phb->opal_id,
>>+				    OPAL_RESET_PHB_ERROR,
>>+				    OPAL_ASSERT_RESET);
>>+		if (rc != OPAL_SUCCESS) {
>>+			pr_warn("%s: Failure %lld clearing error "
>>+				"injection registers on PHB#%d\n",
>>+				__func__, rc, hose->global_number);
>>+			return -EIO;
>>  		}
>>-
>>-		bus = eeh_pe_bus_get(pe);
>>-		if (pci_is_root_bus(bus) ||
>>-			pci_is_root_bus(bus->parent))
>>-			ret = pnv_eeh_root_reset(hose, option);
>>-		else
>>-			ret = pnv_eeh_bridge_reset(bus->self, option);
>>  	}
>>
>>-	return ret;
>>+	/* Route the reset request to PHB or upstream bridge */
>>+	bus = eeh_pe_bus_get(pe);
>>+	if (pci_is_root_bus(bus))
>>+		return pnv_eeh_phb_reset(hose, option);
>>+
>>+	return pnv_eeh_bridge_reset(bus->self, option);
>>  }
>>
>>  /**
>>

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 10/21] powerpc/powernv: Fundamental reset for PCI bus reset
  2015-05-09 14:12     ` Alexey Kardashevskiy
@ 2015-05-11  6:47       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  6:47 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sun, May 10, 2015 at 12:12:18AM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>Function pnv_pci_reset_secondary_bus() is used to reset the specified
>>PCI bus, which is led by a root complex or PCI bridge. That means
>>the function shouldn't be called on the PCI root bus, and the patch
>>removes the logic for that case.
>>
>>Also, some adapters beneath the indicated PCI bus may require a
>>fundamental reset in order to successfully reload their firmware
>>after the reset. The patch translates hot reset to fundamental reset
>>for that case.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 35 +++++++++++++++++++++-------
>>  1 file changed, 26 insertions(+), 9 deletions(-)
>>
>>diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>index 3c01095..58e4dcf 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>@@ -888,18 +888,35 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>  	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>  }
>>
>>-void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>
>
>Why changing dev to pdev? Keeping "dev" could make the patch simpler.
>

In the early stage when I wrote the EEH code, I used "dev" to refer to a PCI
device, which isn't precise enough. Actually, "dev" means "struct device"
while "pdev" stands for "struct pci_dev". That's why I changed it.

>>+static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
>>  {
>>-	struct pci_controller *hose;
>>+	int *freset = data;
>>
>>-	if (pci_is_root_bus(dev->bus)) {
>>-		hose = pci_bus_to_host(dev->bus);
>>-		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
>>-		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
>>-	} else {
>>-		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>>-		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
>>+	/*
>>+	 * Stop the iteration immediately if there is any
>>+	 * one PCI device requesting fundamental reset
>>+	 */
>>+	*freset |= pdev->needs_freset;
>>+	return *freset;
>>+}
>>+
>>+void pnv_pci_reset_secondary_bus(struct pci_dev *pdev)
>>+{
>>+	int option = EEH_RESET_HOT;
>>+	int freset = 0;
>>+
>>+	/* Check if there're any PCI devices asking for fundamental reset */
>>+	if (pdev->subordinate) {
>>+		pci_walk_bus(pdev->subordinate,
>>+			     pnv_pci_dev_reset_type,
>>+			     &freset);
>>+		if (freset)
>>+			option = EEH_RESET_FUNDAMENTAL;
>>  	}
>>+
>>+	/* Issue the requested type of reset */
>>+	pnv_eeh_bridge_reset(pdev, option);
>>+	pnv_eeh_bridge_reset(pdev, EEH_RESET_DEACTIVATE);
>>  }
>>
>>  /**
>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
  2015-05-11  6:25       ` Gavin Shan
@ 2015-05-11  7:02         ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-11  7:02 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linuxppc-dev, linux-pci, benh, bhelgaas

On 05/11/2015 04:25 PM, Gavin Shan wrote:
> On Sat, May 09, 2015 at 10:43:23PM +1000, Alexey Kardashevskiy wrote:
>> On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>> The original code doesn't support releasing PEs dynamically, meaning
>>> that PE and the associated resources (IO, M32, M64 and DMA) can't
>>> be released when unplugging a PCI adapter from one hotpluggable slot.
>>>
>>> The patch takes an object-oriented approach and introduces a reference
>>> count to the PE, which is initialized to 1 and increased by 1 when a
>>> new PCI device joins the PE. Once the last PCI device leaves the
>>> PE, the PE is released together with its associated
>>> (IO, M32, M64, DMA) resources.
>>
>>
>> Too little commit log for non-trivial non-cut-n-paste 30KB patch...
>>
>
> Ok. I'll add more details in next revision.
>
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>>   arch/powerpc/include/asm/pci-bridge.h     |   3 +
>>>   arch/powerpc/kernel/pci-hotplug.c         |   5 +
>>>   arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>>>   arch/powerpc/platforms/powernv/pci.h      |   4 +-
>>>   4 files changed, 432 insertions(+), 238 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>> index 5367eb3..a6ad4b1 100644
>>> --- a/arch/powerpc/include/asm/pci-bridge.h
>>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>>> @@ -31,6 +31,9 @@ struct pci_controller_ops {
>>>   	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>>>   	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>>>   	void		(*reset_secondary_bus)(struct pci_dev *dev);
>>> +
>>> +	/* Called when PCI device is released */
>>> +	void		(*release_device)(struct pci_dev *);
>>>   };
>>>
>>>   /*
>>> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>>> index 7ed85a6..0040343 100644
>>> --- a/arch/powerpc/kernel/pci-hotplug.c
>>> +++ b/arch/powerpc/kernel/pci-hotplug.c
>>> @@ -29,6 +29,11 @@
>>>    */
>>>   void pcibios_release_device(struct pci_dev *dev)
>>>   {
>>> +	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>> +
>>> +	if (hose->controller_ops.release_device)
>>> +		hose->controller_ops.release_device(dev);
>>> +
>>>   	eeh_remove_device(dev);
>>>   }
>>>
>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> index 910fb67..ef8c216 100644
>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> @@ -12,6 +12,8 @@
>>>   #undef DEBUG
>>>
>>>   #include <linux/kernel.h>
>>> +#include <linux/atomic.h>
>>> +#include <linux/kref.h>
>>>   #include <linux/pci.h>
>>>   #include <linux/crash_dump.h>
>>>   #include <linux/debugfs.h>
>>> @@ -47,6 +49,8 @@
>>>   /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>>>   #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>>>
>>> +static void pnv_ioda_release_pe(struct kref *kref);
>>> +
>>>   static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>>>   			    const char *fmt, ...)
>>>   {
>>> @@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>>>   		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>>>   }
>>>
>>> -static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
>>> +static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>>>   {
>>> -	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>> -		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>> -			__func__, pe_no, phb->hose->global_number);
>>> +	if (!pe)
>>> +		return;
>>> +
>>> +	kref_get(&pe->kref);
>>> +}
>>> +
>>> +static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
>>> +{
>>> +	unsigned int count;
>>> +
>>> +	if (!pe)
>>>   		return;
>>> +
>>> +	/*
>>> +	 * The count is initialized to 1 and increased with 1 when
>>> +	 * a new PCI device is bound with the PE. Once the last PCI
>>> +	 * device is leaving from the PE, the PE is going to be
>>> +	 * released.
>>> +	 */
>>> +	count = atomic_read(&pe->kref.refcount);
>>> +	if (count == 2)
>>> +		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
>>> +	else
>>> +		kref_put(&pe->kref, pnv_ioda_release_pe);
>>
>>
>> What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
>>
>
> Yeah, that would be a problem. But it shouldn't happen because PCI
> devices join the parent PE# in strictly serialized mode. The same
> applies when detaching PCI devices from their parent PE.


oookay. Another thing then - why is this kref counter initialized to 1?
It would make sense if you did something special when the counter becomes 1 
after decrement but you do not.

Also, this kref thing makes sense if you do kref_put() in multiple places 
and do not know which one will be the last one so you pass the callback to 
all of them. Here you do kref_put/sub in one place and you read the counter 
- so you can call pnv_ioda_release_pe() directly. And it feels like a 
simple atomic_t would do the job just fine. If you still feel that the 
counter should start from 1, there are atomic_dec_if_positive() and 
atomic_inc_not_zero() and others.
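
A minimal userspace sketch of the plain-counter idea, using C11 atomics in
place of the kernel's atomic_t. It assumes the counter starts at 0 and counts
attached devices only; struct pe, pe_init(), pe_get(), pe_put() and
pe_release() are hypothetical names, not the functions from the patch. The
decrement and the "was that the last one?" test are a single atomic step, so
there is no window between reading the counter and modifying it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pe {
	atomic_int devices;	/* number of PCI devices attached to the PE */
	bool released;		/* set once the release path has run */
};

static void pe_init(struct pe *pe)
{
	atomic_init(&pe->devices, 0);
	pe->released = false;
}

/* Stand-in for releasing DMA windows, segments and the PE number. */
static void pe_release(struct pe *pe)
{
	pe->released = true;
}

static void pe_get(struct pe *pe)
{
	atomic_fetch_add(&pe->devices, 1);
}

static void pe_put(struct pe *pe)
{
	/*
	 * atomic_fetch_sub() returns the value before the subtraction,
	 * so exactly one caller observes 1 and runs the release path.
	 */
	if (atomic_fetch_sub(&pe->devices, 1) == 1)
		pe_release(pe);
}

/* Usage: two devices join, the PE is released when the second one leaves. */
static void pe_example(void)
{
	struct pe pe;

	pe_init(&pe);
	pe_get(&pe);
	pe_get(&pe);
	pe_put(&pe);
	assert(!pe.released);
	pe_put(&pe);
	assert(pe.released);
}
```

This only removes the read-then-subtract window. A pe_get() that could race
with the final pe_put() would still need an increment-unless-zero style
operation or external serialization.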




>>> +}
>>> +
>>> +static void pnv_pci_release_device(struct pci_dev *pdev)
>>> +{
>>> +	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>> +	struct pnv_phb *phb = hose->private_data;
>>> +	struct pci_dn *pdn = pci_get_pdn(pdev);
>>> +	struct pnv_ioda_pe *pe;
>>> +
>>> +	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>> +		pe = &phb->ioda.pe_array[pdn->pe_number];
>>> +		pnv_ioda_pe_put(pe);
>>> +		pdn->pe_number = IODA_INVALID_PE;
>>>   	}
>>> +}
>>>
>>> -	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
>>> -		pr_warn("%s: PE %d was assigned on PHB#%x\n",
>>> -			__func__, pe_no, phb->hose->global_number);
>>> +static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
>>> +{
>>> +	struct pnv_phb *phb = pe->phb;
>>> +	int index, count;
>>> +	unsigned long tbl_addr, tbl_size;
>>> +
>>> +	/* No DMA capability for slave PEs */
>>> +	if (pe->flags & PNV_IODA_PE_SLAVE)
>>> +		return;
>>> +
>>> +	/* Bypass DMA window */
>>> +	if (phb->type == PNV_PHB_IODA2 &&
>>> +	    pe->tce_bypass_enabled &&
>>> +	    pe->tce32_table &&
>>> +	    pe->tce32_table->set_bypass)
>>> +		pe->tce32_table->set_bypass(pe->tce32_table, false);
>>> +
>>> +	/* 32-bits DMA window */
>>> +	count = pe->tce32_seg_end - pe->tce32_seg_start;
>>> +	tbl_addr = pe->tce32_table->it_base;
>>> +	if (!count)
>>>   		return;
>>> +
>>> +	/* Free IOMMU table */
>>> +	iommu_free_table(pe->tce32_table,
>>> +			 of_node_full_name(phb->hose->dn));
>>> +
>>> +	/* Deconfigure TCE table */
>>> +	switch (phb->type) {
>>> +	case PNV_PHB_IODA1:
>>> +		for (index = 0; index < count; index++)
>>> +			opal_pci_map_pe_dma_window(phb->opal_id,
>>> +						   pe->pe_number,
>>> +						   pe->tce32_seg_start + index,
>>> +						   1,
>>> +						   __pa(tbl_addr) +
>>> +						   index * TCE32_TABLE_SIZE,
>>> +						   0,
>>> +						   0x1000);
>>> +		bitmap_clear(phb->ioda.tce32_segmap,
>>> +			     pe->tce32_seg_start,
>>> +			     count);
>>> +		tbl_size = TCE32_TABLE_SIZE * count;
>>> +		break;
>>> +	case PNV_PHB_IODA2:
>>> +		opal_pci_map_pe_dma_window(phb->opal_id,
>>> +					   pe->pe_number,
>>> +					   pe->pe_number << 1,
>>> +					   1,
>>> +					   __pa(tbl_addr),
>>> +					   0,
>>> +					   0x1000);
>>> +		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
>>> +		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
>>> +		break;
>>> +	default:
>>> +		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
>>> +		return;
>>> +	}
>>> +
>>> +	/* Free memory of IOMMU table */
>>> +	free_pages(tbl_addr, get_order(tbl_size));
>>
>>
>> You just programmed the table address to TVT and then you are releasing the
>> pages. It does not seem right, it will leave garbage in TVT. Also, I am
>> adding helpers to alloc/free TCE pages in DDW patchset, you could reuse bits
>>from there (I'll post v10 soon, you'll be in copy and you'll have to review
>> that ;) ).
>>
>
> I assume you're talking about the TVE. I don't understand how garbage will
> be left in the TVE. opal_pci_map_pe_dma_window(), which is handled by
> skiboot, clears the TVE when "tce_table_size" is zero. The pages previously
> allocated for the TCE table are released to the buddy system, from which
> they can be allocated by somebody else (from buddy or slab).

opal_pci_map_pe_dma_window() takes __pa(tbl_addr) which points to some 
memory which is still allocated. This value goes to a table (which has 2 
entries per PE, one for 32bit DMA window and one for bypass/hugewindow) 
which PHB uses to get the actual TCE table address. What is the name of 
this table? :) Anyway, you write an address there and then you call 
free_pages() so after free_pages(), the value in that TVE/TVT/whatever 
table is garbage.
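
A small userspace model of the ordering being argued about, under the
assumption stated earlier in the thread that mapping a window with a zero
table size clears the entry. struct tve, map_dma_window() and
release_dma_window() are hypothetical names for illustration; this is not the
OPAL interface and says nothing about what skiboot really does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of one translation entry pointing at a TCE table. */
struct tve {
	uint64_t table_addr;
	uint64_t table_size;
};

/* Model of the map call: a zero size is assumed to invalidate the entry. */
static void map_dma_window(struct tve *tve, void *table, uint64_t size)
{
	if (size == 0) {
		tve->table_addr = 0;
		tve->table_size = 0;
		return;
	}
	tve->table_addr = (uint64_t)(uintptr_t)table;
	tve->table_size = size;
}

/* Invalidate the entry first, and only then free the backing memory. */
static void release_dma_window(struct tve *tve, void *table)
{
	map_dma_window(tve, table, 0);

	/* The entry must not reference the table once it is freed. */
	assert(tve->table_addr == 0 && tve->table_size == 0);

	free(table);
}
```

If the zero-size call left the address in place instead, the assertion would
fire, which is the stale-entry case raised above.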


>
> Ok. Please put me on the cc list. I guess the whole series of patches is
> better rebased on your DDW patchset, which is to be merged first, I believe.
>
>>
>>> +	pe->tce32_table = NULL;
>>> +	pe->tce32_seg_start = 0;
>>> +	pe->tce32_seg_end = 0;
>>> +}
>>> +
>>> +static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
>>> +{
>>> +	struct pnv_phb *phb = pe->phb;
>>> +	unsigned long *segmap = NULL, *pe_segmap = NULL;
>>> +	int i;
>>> +	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
>>> +				     OPAL_M32_WINDOW_TYPE,
>>> +				     OPAL_M64_WINDOW_TYPE };
>>> +
>>> +	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
>>> +		switch (win_type[win]) {
>>> +		case OPAL_IO_WINDOW_TYPE:
>>> +			segmap = phb->ioda.io_segmap;
>>> +			pe_segmap = pe->io_segmap;
>>> +			break;
>>> +		case OPAL_M32_WINDOW_TYPE:
>>> +			segmap = phb->ioda.m32_segmap;
>>> +			pe_segmap = pe->m32_segmap;
>>> +			break;
>>> +		case OPAL_M64_WINDOW_TYPE:
>>> +			segmap = phb->ioda.m64_segmap;
>>> +			pe_segmap = pe->m64_segmap;
>>> +			break;
>>> +		}
>>> +		i = -1;
>>> +		while ((i = find_next_bit(pe_segmap,
>>> +			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
>>> +			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
>>> +			    win_type[win] == OPAL_M32_WINDOW_TYPE)
>>> +				opal_pci_map_pe_mmio_window(phb->opal_id,
>>> +						phb->ioda.reserved_pe,
>>> +						win_type[win], 0, i);
>>> +			else if (phb->type == PNV_PHB_IODA1)
>>> +				opal_pci_map_pe_mmio_window(phb->opal_id,
>>> +						phb->ioda.reserved_pe,
>>> +						win_type[win],
>>> +						i / 8, i % 8);
>>
>> The function is called "release" but it programs something that looks like
>> reasonable values, is it correct?
>>
>
> That's not a problem. When a segment is deallocated, it's mapped to the
> reserved PE#.
>
>>
>>
>>> +
>>> +			clear_bit(i, pe_segmap);
>>> +			clear_bit(i, segmap);
>>> +		}
>>> +	}
>>> +}
>>> +
>>> +static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>> +				  struct pnv_ioda_pe *parent,
>>> +				  struct pnv_ioda_pe *child,
>>> +				  bool is_add)
>>> +{
>>> +	const char *desc = is_add ? "adding" : "removing";
>>> +	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>> +			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>> +	struct pnv_ioda_pe *slave;
>>> +	long rc;
>>> +
>>> +	/* Parent PE affects child PE */
>>> +	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>> +				child->pe_number, op);
>>> +	if (rc != OPAL_SUCCESS) {
>>> +		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>> +			rc, desc);
>>> +		return -ENXIO;
>>> +	}
>>> +
>>> +	if (!(child->flags & PNV_IODA_PE_MASTER))
>>> +		return 0;
>>> +
>>> +	/* Compound case: parent PE affects slave PEs */
>>> +	list_for_each_entry(slave, &child->slaves, list) {
>>> +		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>> +					slave->pe_number, op);
>>> +		if (rc != OPAL_SUCCESS) {
>>> +			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>> +				rc, desc);
>>> +			return -ENXIO;
>>> +		}
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
>>> +{
>>> +	struct pnv_phb *phb = pe->phb;
>>> +	struct pnv_ioda_pe *slave;
>>> +	struct pci_dev *pdev = NULL;
>>> +	int ret;
>>> +
>>> +	/*
>>> +	 * Clear PE frozen state. If it's master PE, we need
>>> +	 * clear slave PE frozen state as well.
>>> +	 */
>>> +	opal_pci_eeh_freeze_clear(phb->opal_id,
>>> +				  pe->pe_number,
>>> +				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> +	if (pe->flags & PNV_IODA_PE_MASTER) {
>>> +		list_for_each_entry(slave, &pe->slaves, list) {
>>> +			opal_pci_eeh_freeze_clear(phb->opal_id,
>>> +						  slave->pe_number,
>>> +						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> +		}
>>> +	}
>>> +
>>> +	/*
>>> +	 * Associate PE in PELT. We need add the PE into the
>>> +	 * corresponding PELT-V as well. Otherwise, the error
>>> +	 * originated from the PE might contribute to other
>>> +	 * PEs.
>>> +	 */
>>> +	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	/* For compound PEs, any one affects all of them */
>>> +	if (pe->flags & PNV_IODA_PE_MASTER) {
>>> +		list_for_each_entry(slave, &pe->slaves, list) {
>>> +			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>> +			if (ret)
>>> +				return ret;
>>> +		}
>>> +	}
>>> +
>>> +	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>> +		pdev = pe->pbus->self;
>>> +	else if (pe->flags & PNV_IODA_PE_DEV)
>>> +		pdev = pe->pdev->bus->self;
>>> +#ifdef CONFIG_PCI_IOV
>>> +	else if (pe->flags & PNV_IODA_PE_VF)
>>> +		pdev = pe->parent_dev->bus->self;
>>> +#endif /* CONFIG_PCI_IOV */
>>> +
>>> +	while (pdev) {
>>> +		struct pci_dn *pdn = pci_get_pdn(pdev);
>>> +		struct pnv_ioda_pe *parent;
>>> +
>>> +		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>> +			parent = &phb->ioda.pe_array[pdn->pe_number];
>>> +			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>> +			if (ret)
>>> +				return ret;
>>> +		}
>>> +
>>> +		pdev = pdev->bus->self;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)
>>
>>
>> It used to be under #ifdef CONFIG_PCI_IOV, now it is not. Looks like just
>> moving this function to a different place deserves a separate patch with a
>> comment explaining why ("it is going to be used now for non-SRIOV case too" maybe?).
>>
>
> Yeah, it makes sense to me. Will fix it up.
>
>>
>>> +{
>>> +	struct pnv_phb *phb = pe->phb;
>>> +	struct pci_dev *parent;
>>> +	uint8_t bcomp, dcomp, fcomp;
>>> +	long rid_end, rid;
>>> +	int64_t rc;
>>> +
>>> +	/* Tear down MVE */
>>> +	if (phb->type == PNV_PHB_IODA1 &&
>>> +	    pe->mve_number != -1) {
>>> +		rc = opal_pci_set_mve(phb->opal_id,
>>> +				      pe->mve_number,
>>> +				      phb->ioda.reserved_pe);
>>> +		if (rc != OPAL_SUCCESS)
>>> +			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
>>> +				rc, pe->mve_number);
>>> +		rc = opal_pci_set_mve_enable(phb->opal_id,
>>> +					     pe->mve_number,
>>> +					     OPAL_DISABLE_MVE);
>>> +		if (rc != OPAL_SUCCESS)
>>> +			pe_warn(pe, "Error %lld disabling MVE#%d\n",
>>> +				rc, pe->mve_number);
>>> +		pe->mve_number = -1;
>>> +	}
>>> +
>>> +	/* Unmapping PELTV */
>>> +	pnv_ioda_set_peltv(pe, false);
>>> +
>>> +	/* To unmap PELTM */
>>> +	if (pe->pbus) {
>>> +		int count;
>>> +
>>> +		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>> +		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>> +		parent = pe->pbus->self;
>>> +		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>> +			count = pe->pbus->busn_res.end -
>>> +				pe->pbus->busn_res.start + 1;
>>> +		else
>>> +			count = 1;
>>> +
>>> +		switch(count) {
>>> +		case  1: bcomp = OpalPciBusAll;   break;
>>> +		case  2: bcomp = OpalPciBus7Bits; break;
>>> +		case  4: bcomp = OpalPciBus6Bits; break;
>>> +		case  8: bcomp = OpalPciBus5Bits; break;
>>> +		case 16: bcomp = OpalPciBus4Bits; break;
>>> +		case 32: bcomp = OpalPciBus3Bits; break;
>>> +		default:
>>> +			/* Fail back to case of one bus */
>>> +			pe_warn(pe, "Cannot support %d buses\n", count);
>>> +			bcomp = OpalPciBusAll;
>>> +		}
>>> +		rid_end = pe->rid + (count << 8);
>>> +	} else {
>>> +#ifdef CONFIG_PCI_IOV
>>> +		if (pe->flags & PNV_IODA_PE_VF)
>>> +			parent = pe->parent_dev;
>>> +		else
>>> +#endif
>>> +			parent = pe->pdev->bus->self;
>>> +		bcomp = OpalPciBusAll;
>>> +		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>> +		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>> +		rid_end = pe->rid + 1;
>>> +	}
>>> +
>>> +	/* Clear RID mapping */
>>> +	for (rid = pe->rid; rid < rid_end; rid++)
>>> +		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>> +
>>> +	/* Unmapping PELTM */
>>> +	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>> +			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>> +	if (rc)
>>> +		pe_warn(pe, "Error %ld unmapping PELTM\n", rc);
>>> +}
>>> +
>>> +static void pnv_ioda_release_pe(struct kref *kref)
>>> +{
>>> +	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
>>> +	struct pnv_ioda_pe *tmp, *slave;
>>> +	struct pnv_phb *phb = pe->phb;
>>> +
>>> +	pnv_ioda_release_pe_dma(pe);
>>> +	pnv_ioda_release_pe_seg(pe);
>>> +	pnv_ioda_deconfigure_pe(pe);
>>> +
>>> +	/* Release slave PEs for compound PE */
>>> +	if (pe->flags & PNV_IODA_PE_MASTER) {
>>> +		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
>>> +			pnv_ioda_pe_put(slave);
>>> +	}
>>> +
>>> +	/* Remove the PE from various list. We need remove slave
>>> +	 * PE from master's list.
>>> +	 */
>>> +	list_del(&pe->dma_link);
>>> +	list_del(&pe->list);
>>> +
>>> +	/* Free PE number */
>>> +	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
>>> +}
>>> +
>>> +static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
>>> +					    int pe_no)
>>> +{
>>> +	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
>>> +
>>> +	kref_init(&pe->kref);
>>> +	pe->phb = phb;
>>> +	pe->pe_number = pe_no;
>>> +	INIT_LIST_HEAD(&pe->dma_link);
>>> +	INIT_LIST_HEAD(&pe->list);
>>> +
>>> +	return pe;
>>> +}
>>> +
>>> +static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
>>> +					       int pe_no)
>>> +{
>>> +	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>> +		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>> +			__func__, pe_no, phb->hose->global_number);
>>> +		return NULL;
>>>   	}
>>>
>>> -	phb->ioda.pe_array[pe_no].phb = phb;
>>> -	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>> +	/*
>>> +	 * Same PE might be reserved for multiple times, which
>>> +	 * is out of problem actually.
>>> +	 */
>>> +	set_bit(pe_no, phb->ioda.pe_alloc);
>>> +	return pnv_ioda_init_pe(phb, pe_no);
>>>   }
>>>
>>> -static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>> +static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>   {
>>>   	unsigned long pe_no;
>>>   	unsigned long limit = phb->ioda.total_pe - 1;
>>> @@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>   			break;
>>>
>>>   		if (--limit >= phb->ioda.total_pe)
>>> -			return IODA_INVALID_PE;
>>> +			return NULL;
>>>   	} while(1);
>>>
>>> -	phb->ioda.pe_array[pe_no].phb = phb;
>>> -	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>> -	return pe_no;
>>> -}
>>> -
>>> -static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
>>> -{
>>> -	WARN_ON(phb->ioda.pe_array[pe].pdev);
>>> -
>>> -	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
>>> -	clear_bit(pe, phb->ioda.pe_alloc);
>>> +	return pnv_ioda_init_pe(phb, pe_no);
>>>   }
>>>
>>>   static int pnv_ioda1_init_m64(struct pnv_phb *phb)
>>> @@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
>>>   	}
>>>   }
>>>
>>> -static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>> -				struct pci_bus *bus, int all)
>>> +static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>> +						struct pci_bus *bus,
>>> +						int all)
>>
>>
>> Mechanical changes like this could easily go to a separate patch.
>>
>
> Indeed. I'll see how I can split the patches up in next revision.
> Thanks for the suggestion.
>
>>>   {
>>>   	resource_size_t segsz = phb->ioda.m64_segsize;
>>>   	struct pci_dev *pdev;
>>> @@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>   	int i;
>>>
>>>   	if (!pnv_ioda_need_m64_pe(phb, bus))
>>> -		return IODA_INVALID_PE;
>>> +		return NULL;
>>>
>>>           /* Allocate bitmap */
>>>   	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>>>   	pe_bitsmap = kzalloc(size, GFP_KERNEL);
>>>   	if (!pe_bitsmap) {
>>>   		pr_warn("%s: Out of memory !\n", __func__);
>>> -		return IODA_INVALID_PE;
>>> +		return NULL;
>>>   	}
>>>
>>>   	/* The bridge's M64 window might be extended to PHB's M64
>>> @@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>   	/* No M64 window found ? */
>>>   	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
>>>   		kfree(pe_bitsmap);
>>> -		return IODA_INVALID_PE;
>>> +		return NULL;
>>>   	}
>>>
>>>   	/* Figure out the master PE and put all slave PEs
>>> @@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>   	}
>>>
>>>   	kfree(pe_bitsmap);
>>> -	return master_pe->pe_number;
>>> +	return master_pe;
>>>   }
>>>
>>>   static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>>> @@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>>>    * but in the meantime, we need to protect them to avoid warnings
>>>    */
>>>   #ifdef CONFIG_PCI_MSI
>>> -static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>> +static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
>>>   {
>>>   	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>>   	struct pnv_phb *phb = hose->private_data;
>>> @@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>>   }
>>>   #endif /* CONFIG_PCI_MSI */
>>>
>>> -static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>> -				  struct pnv_ioda_pe *parent,
>>> -				  struct pnv_ioda_pe *child,
>>> -				  bool is_add)
>>> -{
>>> -	const char *desc = is_add ? "adding" : "removing";
>>> -	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>> -			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>> -	struct pnv_ioda_pe *slave;
>>> -	long rc;
>>> -
>>> -	/* Parent PE affects child PE */
>>> -	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>> -				child->pe_number, op);
>>> -	if (rc != OPAL_SUCCESS) {
>>> -		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>> -			rc, desc);
>>> -		return -ENXIO;
>>> -	}
>>> -
>>> -	if (!(child->flags & PNV_IODA_PE_MASTER))
>>> -		return 0;
>>> -
>>> -	/* Compound case: parent PE affects slave PEs */
>>> -	list_for_each_entry(slave, &child->slaves, list) {
>>> -		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>> -					slave->pe_number, op);
>>> -		if (rc != OPAL_SUCCESS) {
>>> -			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>> -				rc, desc);
>>> -			return -ENXIO;
>>> -		}
>>> -	}
>>> -
>>> -	return 0;
>>> -}
>>> -
>>> -static int pnv_ioda_set_peltv(struct pnv_phb *phb,
>>> -			      struct pnv_ioda_pe *pe,
>>> -			      bool is_add)
>>> -{
>>> -	struct pnv_ioda_pe *slave;
>>> -	struct pci_dev *pdev = NULL;
>>> -	int ret;
>>> -
>>> -	/*
>>> -	 * Clear PE frozen state. If it's master PE, we need
>>> -	 * clear slave PE frozen state as well.
>>> -	 */
>>> -	if (is_add) {
>>> -		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
>>> -					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> -		if (pe->flags & PNV_IODA_PE_MASTER) {
>>> -			list_for_each_entry(slave, &pe->slaves, list)
>>> -				opal_pci_eeh_freeze_clear(phb->opal_id,
>>> -							  slave->pe_number,
>>> -							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> -		}
>>> -	}
>>> -
>>> -	/*
>>> -	 * Associate PE in PELT. We need add the PE into the
>>> -	 * corresponding PELT-V as well. Otherwise, the error
>>> -	 * originated from the PE might contribute to other
>>> -	 * PEs.
>>> -	 */
>>> -	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>> -	if (ret)
>>> -		return ret;
>>> -
>>> -	/* For compound PEs, any one affects all of them */
>>> -	if (pe->flags & PNV_IODA_PE_MASTER) {
>>> -		list_for_each_entry(slave, &pe->slaves, list) {
>>> -			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>> -			if (ret)
>>> -				return ret;
>>> -		}
>>> -	}
>>> -
>>> -	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>> -		pdev = pe->pbus->self;
>>> -	else if (pe->flags & PNV_IODA_PE_DEV)
>>> -		pdev = pe->pdev->bus->self;
>>> -#ifdef CONFIG_PCI_IOV
>>> -	else if (pe->flags & PNV_IODA_PE_VF)
>>> -		pdev = pe->parent_dev->bus->self;
>>> -#endif /* CONFIG_PCI_IOV */
>>> -	while (pdev) {
>>> -		struct pci_dn *pdn = pci_get_pdn(pdev);
>>> -		struct pnv_ioda_pe *parent;
>>> -
>>> -		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>> -			parent = &phb->ioda.pe_array[pdn->pe_number];
>>> -			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>> -			if (ret)
>>> -				return ret;
>>> -		}
>>> -
>>> -		pdev = pdev->bus->self;
>>> -	}
>>> -
>>> -	return 0;
>>> -}
>>> -
>>> -#ifdef CONFIG_PCI_IOV
>>> -static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>> -{
>>> -	struct pci_dev *parent;
>>> -	uint8_t bcomp, dcomp, fcomp;
>>> -	int64_t rc;
>>> -	long rid_end, rid;
>>> -
>>> -	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
>>> -	if (pe->pbus) {
>>> -		int count;
>>> -
>>> -		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>> -		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>> -		parent = pe->pbus->self;
>>> -		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>> -			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
>>> -		else
>>> -			count = 1;
>>> -
>>> -		switch(count) {
>>> -		case  1: bcomp = OpalPciBusAll;         break;
>>> -		case  2: bcomp = OpalPciBus7Bits;       break;
>>> -		case  4: bcomp = OpalPciBus6Bits;       break;
>>> -		case  8: bcomp = OpalPciBus5Bits;       break;
>>> -		case 16: bcomp = OpalPciBus4Bits;       break;
>>> -		case 32: bcomp = OpalPciBus3Bits;       break;
>>> -		default:
>>> -			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
>>> -			        count);
>>> -			/* Do an exact match only */
>>> -			bcomp = OpalPciBusAll;
>>> -		}
>>> -		rid_end = pe->rid + (count << 8);
>>> -	} else {
>>> -		if (pe->flags & PNV_IODA_PE_VF)
>>> -			parent = pe->parent_dev;
>>> -		else
>>> -			parent = pe->pdev->bus->self;
>>> -		bcomp = OpalPciBusAll;
>>> -		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>> -		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>> -		rid_end = pe->rid + 1;
>>> -	}
>>> -
>>> -	/* Clear the reverse map */
>>> -	for (rid = pe->rid; rid < rid_end; rid++)
>>> -		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>> -
>>> -	/* Release from all parents PELT-V */
>>> -	while (parent) {
>>> -		struct pci_dn *pdn = pci_get_pdn(parent);
>>> -		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>> -			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
>>> -						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>> -			/* XXX What to do in case of error ? */
>>
>>
>> Not much :) Free associated memory and mark it "dead" so it won't be used
>> again till reboot. In what circumstance can this opal_pci_set_peltv() fail at
>> all?
>>
>
> Yeah, maybe. So far I haven't seen this failure, and the code has been
> there for almost 4 years, since the day Ben wrote it.


Sure. But if it starts failing, we won't even notice: there is not even a
pr_err() or WARN_ON() there.
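
To illustrate the point, here is a minimal user-space sketch (not kernel
code) of checking the return value and logging every failure instead of
dropping it. fake_set_peltv(), FAKE_OPAL_SUCCESS and the rule for when the
call fails are made up for the example; they only stand in for the real
opal_pci_set_peltv() call.

```c
#include <assert.h>
#include <stdio.h>

#define FAKE_OPAL_SUCCESS 0

/* Hypothetical stand-in for the firmware call: fails for child PE# >= 4. */
static long fake_set_peltv(int parent_pe, int child_pe)
{
	(void)parent_pe;
	return child_pe < 4 ? FAKE_OPAL_SUCCESS : -1;
}

/*
 * Remove child_pe from the PELT-V of every parent PE in the array.
 * Each failure is reported; the number of failed calls is returned
 * so that the caller can decide what to do about it.
 */
static int remove_from_parents(const int *parents, int nparents, int child_pe)
{
	int i, failures = 0;

	assert(nparents >= 0);

	for (i = 0; i < nparents; i++) {
		long rc = fake_set_peltv(parents[i], child_pe);

		if (rc != FAKE_OPAL_SUCCESS) {
			fprintf(stderr,
				"PE#%d: error %ld removing from PELTV of PE#%d\n",
				child_pe, rc, parents[i]);
			failures++;
		}
	}

	return failures;
}
```

In the kernel the fprintf() would be a pe_warn() or similar; the point is
only that the return code is looked at.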


>
>>
>>> -		}
>>> -		parent = parent->bus->self;
>>> -	}
>>> -
>>> -	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
>>> -				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> -
>>> -	/* Disassociate PE in PELT */
>>> -	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
>>> -				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>> -	if (rc)
>>> -		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
>>> -	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>> -			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>> -	if (rc)
>>> -		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
>>> -
>>> -	pe->pbus = NULL;
>>> -	pe->pdev = NULL;
>>> -	pe->parent_dev = NULL;
>>> -
>>> -	return 0;
>>> -}
>>> -#endif /* CONFIG_PCI_IOV */
>>> -
>>>   static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>   {
>>>   	struct pci_dev *parent;
>>> @@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>   	}
>>>
>>>   	/* Configure PELTV */
>>> -	pnv_ioda_set_peltv(phb, pe, true);
>>> +	pnv_ioda_set_peltv(pe, true);
>>>
>>>   	/* Setup reverse map */
>>>   	for (rid = pe->rid; rid < rid_end; rid++)
>>> @@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>>>   		if (pdn->pe_number != IODA_INVALID_PE)
>>>   			continue;
>>>
>>> +		/* Increase reference count of the parent PE */
>>
>> When you comment like this, I read it as the comment belongs to the whole
>> next chunk till the first empty line, i.e. to all 5 lines below, which is not
>> the case. I'd remove the comment as 1) "pe_get" in pnv_ioda_pe_get() name
>> suggests incrementing the reference counter 2) "pe" is always parent in this
>> function. I do not insist though.
>>
>
> I agree with your explanation. I'll remove this unhelpful comment.
>
>>
>>> +		pnv_ioda_pe_get(pe);
>>>   		pdn->pe_number = pe->pe_number;
>>>   		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
>>>   		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
>>> @@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>   {
>>>   	struct pci_controller *hose = pci_bus_to_host(bus);
>>>   	struct pnv_phb *phb = hose->private_data;
>>> -	struct pnv_ioda_pe *pe;
>>> +	struct pnv_ioda_pe *pe = NULL;
>>>   	int pe_num = IODA_INVALID_PE;
>>>
>>>   	/* For partial hotplug case, the PE instance hasn't been destroyed
>>> @@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>   	}
>>>
>>>   	/* PE number for root bus should have been reserved */
>>> -	if (pci_is_root_bus(bus))
>>> -		pe_num = phb->ioda.root_pe_no;
>>> +	if (pci_is_root_bus(bus) &&
>>> +	    phb->ioda.root_pe_no != IODA_INVALID_PE)
>>> +		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
>>>
>>>   	/* Check if PE is determined by M64 */
>>> -	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
>>> -		pe_num = phb->pick_m64_pe(phb, bus, all);
>>> +	if (!pe && phb->pick_m64_pe)
>>> +		pe = phb->pick_m64_pe(phb, bus, all);
>>>
>>>   	/* The PE number isn't pinned by M64 */
>>> -	if (pe_num == IODA_INVALID_PE)
>>> -		pe_num = pnv_ioda_alloc_pe(phb);
>>> +	if (!pe)
>>> +		pe = pnv_ioda_alloc_pe(phb);
>>>
>>> -	if (pe_num == IODA_INVALID_PE) {
>>> -		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
>>> +	if (!pe) {
>>> +		pr_warn("%s: No enough PE# available for PCI bus %04x:%02x\n",
>>>   			__func__, pci_domain_nr(bus), bus->number);
>>>   		return NULL;
>>>   	}
>>>
>>> -	pe = &phb->ioda.pe_array[pe_num];
>>>   	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
>>>   	pe->pbus = bus;
>>>   	pe->pdev = NULL;
>>> @@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>
>>>   	if (pnv_ioda_configure_pe(phb, pe)) {
>>>   		/* XXX What do we do here ? */
>>> -		if (pe_num)
>>> -			pnv_ioda_free_pe(phb, pe_num);
>>> -		pe->pbus = NULL;
>>> +		pnv_ioda_pe_put(pe);
>>>   		return NULL;
>>>   	}
>>>
>>>   	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
>>> -			GFP_KERNEL, hose->node);
>>> +				       GFP_KERNEL, hose->node);
>>
>> Seems like spaces change only - if you really want this change (which I hate
>> - makes code look inaccurate to my taste but it seems I am in minority here
>> :) ), please put it to the separate patch.
>>
>
> Ok. To confirm: you prefer the original format? I don't know
> why I prefer the latter one. Maybe my eyes are quite broken :-)


I prefer not to change existing whitespace unless it is done once, for
the entire file :) Just remove this change from the patch.



>>
>>>   	pe->tce32_table->data = pe;
>>>
>>>   	/* Associate it with all child devices */
>>> @@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>>   		list_del(&pe->list);
>>>   		mutex_unlock(&phb->ioda.pe_list_mutex);
>>>
>>> -		pnv_ioda_deconfigure_pe(phb, pe);
>>> +		pnv_ioda_deconfigure_pe(pe);
>>
>>
>> Is this change necessary to get "Release PEs dynamically" working? Maybe
>> move it to the mechanical changes patch?
>>
>
> Ok. I'll try to do that.
>
>>
>>>
>>> -		pnv_ioda_free_pe(phb, pe->pe_number);
>>> +		pnv_ioda_pe_put(pe);
>>>   	}
>>>   }
>>>
>>> @@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>>
>>>   		if (pnv_ioda_configure_pe(phb, pe)) {
>>>   			/* XXX What do we do here ? */
>>> -			if (pe_num)
>>> -				pnv_ioda_free_pe(phb, pe_num);
>>> -			pe->pdev = NULL;
>>> +			pnv_ioda_pe_put(pe);
>>>   			continue;
>>>   		}
>>>
>>> @@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
>>>   	struct pnv_ioda_pe *pe;
>>>   	int rc;
>>>
>>> -	pe = pnv_ioda_get_pe(dev);
>>> +	pe = pnv_ioda_pci_dev_to_pe(dev);
>>
>>
>> And this change could go separately. It is not clear how it helps to
>> "Release PEs dynamically".
>>
>>
>
> It's not related to "Release PEs dynamically". The change is introduced by
> the function rename: Original pnv_ioda_get_pe() is renamed to pnv_ioda_pci_dev_to_pe().


But the rename happened in this patch, and the patch's subject is "Release
PEs dynamically", so either it is related somehow or it should move to a
simple separate patch "let's give the lalala function a better name to
reflect what it actually does" (though in this case the new name does not
make any more sense than the old one).



>>>   	if (!pe)
>>>   		return -ENODEV;
>>>
>>> @@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>>>   	struct pnv_ioda_pe *pe;
>>>   	int rc;
>>>
>>> -	if (!(pe = pnv_ioda_get_pe(dev)))
>>> +	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
>>>   		return -ENODEV;
>>>
>>>   	/* Assign XIVE to PE */
>>> @@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
>>>   				  unsigned int hwirq, unsigned int virq,
>>>   				  unsigned int is_64, struct msi_msg *msg)
>>>   {
>>> -	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
>>> +	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
>>>   	unsigned int xive_num = hwirq - phb->msi_base;
>>>   	__be32 data;
>>>   	int rc;
>>> @@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>>   	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
>>>   	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
>>>   	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
>>> +	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
>>>   	hose->controller_ops = pnv_pci_controller_ops;
>>>
>>>   #ifdef CONFIG_PCI_IOV
>>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>> index 1bea3a8..8b10f01 100644
>>> --- a/arch/powerpc/platforms/powernv/pci.h
>>> +++ b/arch/powerpc/platforms/powernv/pci.h
>>> @@ -28,6 +28,7 @@ enum pnv_phb_model {
>>>   /* Data associated with a PE, including IOMMU tracking etc.. */
>>>   struct pnv_phb;
>>>   struct pnv_ioda_pe {
>>> +	struct kref		kref;
>>>   	unsigned long		flags;
>>>   	struct pnv_phb		*phb;
>>>
>>> @@ -120,7 +121,8 @@ struct pnv_phb {
>>>   	void (*shutdown)(struct pnv_phb *phb);
>>>   	int (*init_m64)(struct pnv_phb *phb);
>>>   	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
>>> -	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
>>> +	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
>>> +					   struct pci_bus *bus, int all);
>>>   	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>>>   	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
>>>   	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
>>>
>
> Thanks,
> Gavin
>


-- 
Alexey

* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
@ 2015-05-11  7:02         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-11  7:02 UTC (permalink / raw)
  To: Gavin Shan; +Cc: bhelgaas, linux-pci, linuxppc-dev

On 05/11/2015 04:25 PM, Gavin Shan wrote:
> On Sat, May 09, 2015 at 10:43:23PM +1000, Alexey Kardashevskiy wrote:
>> On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>> The original code doesn't support releasing PEs dynamically, meaning
>>> that PE and the associated resources (IO, M32, M64 and DMA) can't
>>> be released when unplugging a PCI adapter from one hotpluggable slot.
>>>
>>> The patch takes an object-oriented approach and introduces a reference
>>> count for each PE, which is initialized to 1 and incremented by 1 when a
>>> new PCI device joins the PE. Once the last PCI device leaves the
>>> PE, the PE is released together with its associated
>>> (IO, M32, M64, DMA) resources.
>>
>>
>> Too little commit log for non-trivial non-cut-n-paste 30KB patch...
>>
>
> Ok. I'll add more details in next revision.
>
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>>   arch/powerpc/include/asm/pci-bridge.h     |   3 +
>>>   arch/powerpc/kernel/pci-hotplug.c         |   5 +
>>>   arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>>>   arch/powerpc/platforms/powernv/pci.h      |   4 +-
>>>   4 files changed, 432 insertions(+), 238 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>> index 5367eb3..a6ad4b1 100644
>>> --- a/arch/powerpc/include/asm/pci-bridge.h
>>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>>> @@ -31,6 +31,9 @@ struct pci_controller_ops {
>>>   	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>>>   	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>>>   	void		(*reset_secondary_bus)(struct pci_dev *dev);
>>> +
>>> +	/* Called when PCI device is released */
>>> +	void		(*release_device)(struct pci_dev *);
>>>   };
>>>
>>>   /*
>>> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>>> index 7ed85a6..0040343 100644
>>> --- a/arch/powerpc/kernel/pci-hotplug.c
>>> +++ b/arch/powerpc/kernel/pci-hotplug.c
>>> @@ -29,6 +29,11 @@
>>>    */
>>>   void pcibios_release_device(struct pci_dev *dev)
>>>   {
>>> +	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>> +
>>> +	if (hose->controller_ops.release_device)
>>> +		hose->controller_ops.release_device(dev);
>>> +
>>>   	eeh_remove_device(dev);
>>>   }
>>>
>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> index 910fb67..ef8c216 100644
>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> @@ -12,6 +12,8 @@
>>>   #undef DEBUG
>>>
>>>   #include <linux/kernel.h>
>>> +#include <linux/atomic.h>
>>> +#include <linux/kref.h>
>>>   #include <linux/pci.h>
>>>   #include <linux/crash_dump.h>
>>>   #include <linux/debugfs.h>
>>> @@ -47,6 +49,8 @@
>>>   /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>>>   #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>>>
>>> +static void pnv_ioda_release_pe(struct kref *kref);
>>> +
>>>   static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>>>   			    const char *fmt, ...)
>>>   {
>>> @@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>>>   		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>>>   }
>>>
>>> -static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
>>> +static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>>>   {
>>> -	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>> -		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>> -			__func__, pe_no, phb->hose->global_number);
>>> +	if (!pe)
>>> +		return;
>>> +
>>> +	kref_get(&pe->kref);
>>> +}
>>> +
>>> +static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
>>> +{
>>> +	unsigned int count;
>>> +
>>> +	if (!pe)
>>>   		return;
>>> +
>>> +	/*
>>> +	 * The count is initialized to 1 and increased with 1 when
>>> +	 * a new PCI device is bound with the PE. Once the last PCI
>>> +	 * device is leaving from the PE, the PE is going to be
>>> +	 * released.
>>> +	 */
>>> +	count = atomic_read(&pe->kref.refcount);
>>> +	if (count == 2)
>>> +		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
>>> +	else
>>> +		kref_put(&pe->kref, pnv_ioda_release_pe);
>>
>>
>> What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
>>
>
> Yeah, that would be a problem. But it shouldn't happen, because PCI
> devices join the parent PE# in a strictly serialized manner. The same
> holds when detaching PCI devices from their parent PE.


oookay. Another thing then - why is this kref counter initialized to 1?
It would make sense if you did something special when the counter becomes 1
after a decrement, but you do not.

Also, this kref thing makes sense if you do kref_put() in multiple places
and do not know which one will be the last, so you pass the callback to
all of them. Here you do kref_put/sub in one place and you read the counter
yourself, so you can call pnv_ioda_release_pe() directly. And it feels like a
simple atomic_t would do the job just fine. If you still feel that the
counter should start from 1, there are atomic_dec_if_positive(),
atomic_inc_not_zero() and others.
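
A rough sketch of the plain-counter idea, written as user-space C11 with
<stdatomic.h> rather than the kernel's atomic_t API. All names here
(struct fake_pe, fake_pe_get(), fake_pe_put(), fake_pe_release()) are
hypothetical and only model the suggestion; the counter starts at 0 and
counts attached devices. The decrement and the "was this the last one?"
test are a single atomic operation, so there is no window between reading
the counter and subtracting from it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of a PE: a device counter plus a flag set on release. */
struct fake_pe {
	atomic_int device_count;
	bool released;
};

static void fake_pe_init(struct fake_pe *pe)
{
	atomic_init(&pe->device_count, 0);
	pe->released = false;
}

/* Stand-in for the real release work (DMA, segments, PELT, ...). */
static void fake_pe_release(struct fake_pe *pe)
{
	pe->released = true;
}

/* A device joins the PE. */
static void fake_pe_get(struct fake_pe *pe)
{
	atomic_fetch_add(&pe->device_count, 1);
}

/* A device leaves the PE; the last one out triggers the release. */
static void fake_pe_put(struct fake_pe *pe)
{
	/* atomic_fetch_sub() returns the value held before the decrement */
	int old = atomic_fetch_sub(&pe->device_count, 1);

	assert(old > 0);
	if (old == 1)
		fake_pe_release(pe);
}
```

With two devices attached, the first fake_pe_put() leaves the PE alive and
the second one releases it.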




>>> +}
>>> +
>>> +static void pnv_pci_release_device(struct pci_dev *pdev)
>>> +{
>>> +	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>> +	struct pnv_phb *phb = hose->private_data;
>>> +	struct pci_dn *pdn = pci_get_pdn(pdev);
>>> +	struct pnv_ioda_pe *pe;
>>> +
>>> +	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>> +		pe = &phb->ioda.pe_array[pdn->pe_number];
>>> +		pnv_ioda_pe_put(pe);
>>> +		pdn->pe_number = IODA_INVALID_PE;
>>>   	}
>>> +}
>>>
>>> -	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
>>> -		pr_warn("%s: PE %d was assigned on PHB#%x\n",
>>> -			__func__, pe_no, phb->hose->global_number);
>>> +static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
>>> +{
>>> +	struct pnv_phb *phb = pe->phb;
>>> +	int index, count;
>>> +	unsigned long tbl_addr, tbl_size;
>>> +
>>> +	/* No DMA capability for slave PEs */
>>> +	if (pe->flags & PNV_IODA_PE_SLAVE)
>>> +		return;
>>> +
>>> +	/* Bypass DMA window */
>>> +	if (phb->type == PNV_PHB_IODA2 &&
>>> +	    pe->tce_bypass_enabled &&
>>> +	    pe->tce32_table &&
>>> +	    pe->tce32_table->set_bypass)
>>> +		pe->tce32_table->set_bypass(pe->tce32_table, false);
>>> +
>>> +	/* 32-bits DMA window */
>>> +	count = pe->tce32_seg_end - pe->tce32_seg_start;
>>> +	tbl_addr = pe->tce32_table->it_base;
>>> +	if (!count)
>>>   		return;
>>> +
>>> +	/* Free IOMMU table */
>>> +	iommu_free_table(pe->tce32_table,
>>> +			 of_node_full_name(phb->hose->dn));
>>> +
>>> +	/* Deconfigure TCE table */
>>> +	switch (phb->type) {
>>> +	case PNV_PHB_IODA1:
>>> +		for (index = 0; index < count; index++)
>>> +			opal_pci_map_pe_dma_window(phb->opal_id,
>>> +						   pe->pe_number,
>>> +						   pe->tce32_seg_start + index,
>>> +						   1,
>>> +						   __pa(tbl_addr) +
>>> +						   index * TCE32_TABLE_SIZE,
>>> +						   0,
>>> +						   0x1000);
>>> +		bitmap_clear(phb->ioda.tce32_segmap,
>>> +			     pe->tce32_seg_start,
>>> +			     count);
>>> +		tbl_size = TCE32_TABLE_SIZE * count;
>>> +		break;
>>> +	case PNV_PHB_IODA2:
>>> +		opal_pci_map_pe_dma_window(phb->opal_id,
>>> +					   pe->pe_number,
>>> +					   pe->pe_number << 1,
>>> +					   1,
>>> +					   __pa(tbl_addr),
>>> +					   0,
>>> +					   0x1000);
>>> +		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
>>> +		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
>>> +		break;
>>> +	default:
>>> +		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
>>> +		return;
>>> +	}
>>> +
>>> +	/* Free memory of IOMMU table */
>>> +	free_pages(tbl_addr, get_order(tbl_size));
>>
>>
>> You just programmed the table address to TVT and then you are releasing the
>> pages. It does not seem right, it will leave garbage in TVT. Also, I am
>> adding helpers to alloc/free TCE pages in DDW patchset, you could reuse bits
>> from there (I'll post v10 soon, you'll be in copy and you'll have to review
>> that ;) ).
>>
>
> I assume you're talking about the TVE. I don't understand how garbage would
> be left in the TVE: opal_pci_map_pe_dma_window(), which is handled by skiboot,
> clears the TVE when passed a zeroed "tce_table_size". The pages previously
> allocated for the TCE table are released to the buddy system, where they can
> be allocated by somebody else (from buddy or slab).

opal_pci_map_pe_dma_window() takes __pa(tbl_addr), which points to some
memory which is still allocated. This value goes to a table (which has 2
entries per PE, one for the 32-bit DMA window and one for bypass/hugewindow)
which the PHB uses to get the actual TCE table address. What is the name of
this table? :) Anyway, you write an address there and then you call
free_pages(), so after free_pages() the value in that TVE/TVT/whatever
table is garbage.
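
The ordering concern can be shown with a toy user-space model. This is a
sketch under stated assumptions, not the PHB or skiboot behaviour:
struct fake_tve stands in for the hardware table entry and fake_teardown()
is a hypothetical helper. Whether the real opal_pci_map_pe_dma_window()
call already clears the entry when passed a zero size is exactly what is
being discussed here; the sketch only shows the order that is safe either
way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model of one table entry that the "hardware" dereferences. */
struct fake_tve {
	uintptr_t table_addr;
	size_t table_size;
};

/*
 * Tear down a DMA window: first make sure the entry no longer
 * references the table, and only then give the memory back.
 */
static void fake_teardown(struct fake_tve *tve, void *table)
{
	assert(tve != NULL);

	/* Step 1: invalidate the entry so nothing points at the table */
	tve->table_addr = 0;
	tve->table_size = 0;

	/* Step 2: the memory is now unreferenced and can be freed */
	free(table);
}
```

After fake_teardown() returns, the entry holds no stale address, so a later
reuse of the freed memory cannot be reached through it.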


>
> Ok. Please put me on the cc list. I guess the whole series of patches had
> better be rebased on your DDW patchset, which is to be merged first, I believe.
>
>>
>>> +	pe->tce32_table = NULL;
>>> +	pe->tce32_seg_start = 0;
>>> +	pe->tce32_seg_end = 0;
>>> +}
>>> +
>>> +static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
>>> +{
>>> +	struct pnv_phb *phb = pe->phb;
>>> +	unsigned long *segmap = NULL, *pe_segmap = NULL;
>>> +	int i;
>>> +	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
>>> +				     OPAL_M32_WINDOW_TYPE,
>>> +				     OPAL_M64_WINDOW_TYPE };
>>> +
>>> +	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
>>> +		switch (win_type[win]) {
>>> +		case OPAL_IO_WINDOW_TYPE:
>>> +			segmap = phb->ioda.io_segmap;
>>> +			pe_segmap = pe->io_segmap;
>>> +			break;
>>> +		case OPAL_M32_WINDOW_TYPE:
>>> +			segmap = phb->ioda.m32_segmap;
>>> +			pe_segmap = pe->m32_segmap;
>>> +			break;
>>> +		case OPAL_M64_WINDOW_TYPE:
>>> +			segmap = phb->ioda.m64_segmap;
>>> +			pe_segmap = pe->m64_segmap;
>>> +			break;
>>> +		}
>>> +		i = -1;
>>> +		while ((i = find_next_bit(pe_segmap,
>>> +			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
>>> +			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
>>> +			    win_type[win] == OPAL_M32_WINDOW_TYPE)
>>> +				opal_pci_map_pe_mmio_window(phb->opal_id,
>>> +						phb->ioda.reserved_pe,
>>> +						win_type[win], 0, i);
>>> +			else if (phb->type == PNV_PHB_IODA1)
>>> +				opal_pci_map_pe_mmio_window(phb->opal_id,
>>> +						phb->ioda.reserved_pe,
>>> +						win_type[win],
>>> +						i / 8, i % 8);
>>
>> The function is called "release" but it programs what look like
>> reasonable values, is that correct?
>>
>
> That's not a problem. When the segment is deallocated, it's mapped to the
> reserved PE#.
>
>>
>>
>>> +
>>> +			clear_bit(i, pe_segmap);
>>> +			clear_bit(i, segmap);
>>> +		}
>>> +	}
>>> +}
>>> +
>>> +static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>> +				  struct pnv_ioda_pe *parent,
>>> +				  struct pnv_ioda_pe *child,
>>> +				  bool is_add)
>>> +{
>>> +	const char *desc = is_add ? "adding" : "removing";
>>> +	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>> +			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>> +	struct pnv_ioda_pe *slave;
>>> +	long rc;
>>> +
>>> +	/* Parent PE affects child PE */
>>> +	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>> +				child->pe_number, op);
>>> +	if (rc != OPAL_SUCCESS) {
>>> +		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>> +			rc, desc);
>>> +		return -ENXIO;
>>> +	}
>>> +
>>> +	if (!(child->flags & PNV_IODA_PE_MASTER))
>>> +		return 0;
>>> +
>>> +	/* Compound case: parent PE affects slave PEs */
>>> +	list_for_each_entry(slave, &child->slaves, list) {
>>> +		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>> +					slave->pe_number, op);
>>> +		if (rc != OPAL_SUCCESS) {
>>> +			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>> +				rc, desc);
>>> +			return -ENXIO;
>>> +		}
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
>>> +{
>>> +	struct pnv_phb *phb = pe->phb;
>>> +	struct pnv_ioda_pe *slave;
>>> +	struct pci_dev *pdev = NULL;
>>> +	int ret;
>>> +
>>> +	/*
>>> +	 * Clear PE frozen state. If it's master PE, we need
>>> +	 * clear slave PE frozen state as well.
>>> +	 */
>>> +	opal_pci_eeh_freeze_clear(phb->opal_id,
>>> +				  pe->pe_number,
>>> +				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> +	if (pe->flags & PNV_IODA_PE_MASTER) {
>>> +		list_for_each_entry(slave, &pe->slaves, list) {
>>> +			opal_pci_eeh_freeze_clear(phb->opal_id,
>>> +						  slave->pe_number,
>>> +						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> +		}
>>> +	}
>>> +
>>> +	/*
>>> +	 * Associate PE in PELT. We need add the PE into the
>>> +	 * corresponding PELT-V as well. Otherwise, the error
>>> +	 * originated from the PE might contribute to other
>>> +	 * PEs.
>>> +	 */
>>> +	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	/* For compound PEs, any one affects all of them */
>>> +	if (pe->flags & PNV_IODA_PE_MASTER) {
>>> +		list_for_each_entry(slave, &pe->slaves, list) {
>>> +			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>> +			if (ret)
>>> +				return ret;
>>> +		}
>>> +	}
>>> +
>>> +	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>> +		pdev = pe->pbus->self;
>>> +	else if (pe->flags & PNV_IODA_PE_DEV)
>>> +		pdev = pe->pdev->bus->self;
>>> +#ifdef CONFIG_PCI_IOV
>>> +	else if (pe->flags & PNV_IODA_PE_VF)
>>> +		pdev = pe->parent_dev->bus->self;
>>> +#endif /* CONFIG_PCI_IOV */
>>> +
>>> +	while (pdev) {
>>> +		struct pci_dn *pdn = pci_get_pdn(pdev);
>>> +		struct pnv_ioda_pe *parent;
>>> +
>>> +		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>> +			parent = &phb->ioda.pe_array[pdn->pe_number];
>>> +			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>> +			if (ret)
>>> +				return ret;
>>> +		}
>>> +
>>> +		pdev = pdev->bus->self;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)
>>
>>
>> It used to be under #ifdef CONFIG_PCI_IOV, now it is not. Just moving this
>> function to a different place deserves a separate patch with a comment
>> explaining why ("it is now going to be used for the non-SRIOV case too", maybe?).
>>
>
> Yeah, it makes sense to me. Will fix it up.
>
>>
>>> +{
>>> +	struct pnv_phb *phb = pe->phb;
>>> +	struct pci_dev *parent;
>>> +	uint8_t bcomp, dcomp, fcomp;
>>> +	long rid_end, rid;
>>> +	int64_t rc;
>>> +
>>> +	/* Tear down MVE */
>>> +	if (phb->type == PNV_PHB_IODA1 &&
>>> +	    pe->mve_number != -1) {
>>> +		rc = opal_pci_set_mve(phb->opal_id,
>>> +				      pe->mve_number,
>>> +				      phb->ioda.reserved_pe);
>>> +		if (rc != OPAL_SUCCESS)
>>> +			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
>>> +				rc, pe->mve_number);
>>> +		rc = opal_pci_set_mve_enable(phb->opal_id,
>>> +					     pe->mve_number,
>>> +					     OPAL_DISABLE_MVE);
>>> +		if (rc != OPAL_SUCCESS)
>>> +			pe_warn(pe, "Error %lld disabling MVE#%d\n",
>>> +				rc, pe->mve_number);
>>> +		pe->mve_number = -1;
>>> +	}
>>> +
>>> +	/* Unmapping PELTV */
>>> +	pnv_ioda_set_peltv(pe, false);
>>> +
>>> +	/* To unmap PELTM */
>>> +	if (pe->pbus) {
>>> +		int count;
>>> +
>>> +		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>> +		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>> +		parent = pe->pbus->self;
>>> +		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>> +			count = pe->pbus->busn_res.end -
>>> +				pe->pbus->busn_res.start + 1;
>>> +		else
>>> +			count = 1;
>>> +
>>> +		switch(count) {
>>> +		case  1: bcomp = OpalPciBusAll;   break;
>>> +		case  2: bcomp = OpalPciBus7Bits; break;
>>> +		case  4: bcomp = OpalPciBus6Bits; break;
>>> +		case  8: bcomp = OpalPciBus5Bits; break;
>>> +		case 16: bcomp = OpalPciBus4Bits; break;
>>> +		case 32: bcomp = OpalPciBus3Bits; break;
>>> +		default:
>>> +			/* Fail back to case of one bus */
>>> +			pe_warn(pe, "Cannot support %d buses\n", count);
>>> +			bcomp = OpalPciBusAll;
>>> +		}
>>> +		rid_end = pe->rid + (count << 8);
>>> +	} else {
>>> +#ifdef CONFIG_PCI_IOV
>>> +		if (pe->flags & PNV_IODA_PE_VF)
>>> +			parent = pe->parent_dev;
>>> +		else
>>> +#endif
>>> +			parent = pe->pdev->bus->self;
>>> +		bcomp = OpalPciBusAll;
>>> +		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>> +		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>> +		rid_end = pe->rid + 1;
>>> +	}
>>> +
>>> +	/* Clear RID mapping */
>>> +	for (rid = pe->rid; rid < rid_end; rid++)
>>> +		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>> +
>>> +	/* Unmapping PELTM */
>>> +	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>> +			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>> +	if (rc)
>>> +		pe_warn(pe, "Error %ld unmapping PELTM\n", rc);
>>> +}
>>> +
>>> +static void pnv_ioda_release_pe(struct kref *kref)
>>> +{
>>> +	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
>>> +	struct pnv_ioda_pe *tmp, *slave;
>>> +	struct pnv_phb *phb = pe->phb;
>>> +
>>> +	pnv_ioda_release_pe_dma(pe);
>>> +	pnv_ioda_release_pe_seg(pe);
>>> +	pnv_ioda_deconfigure_pe(pe);
>>> +
>>> +	/* Release slave PEs for compound PE */
>>> +	if (pe->flags & PNV_IODA_PE_MASTER) {
>>> +		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
>>> +			pnv_ioda_pe_put(slave);
>>> +	}
>>> +
>>> +	/* Remove the PE from various list. We need remove slave
>>> +	 * PE from master's list.
>>> +	 */
>>> +	list_del(&pe->dma_link);
>>> +	list_del(&pe->list);
>>> +
>>> +	/* Free PE number */
>>> +	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
>>> +}
>>> +
>>> +static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
>>> +					    int pe_no)
>>> +{
>>> +	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
>>> +
>>> +	kref_init(&pe->kref);
>>> +	pe->phb = phb;
>>> +	pe->pe_number = pe_no;
>>> +	INIT_LIST_HEAD(&pe->dma_link);
>>> +	INIT_LIST_HEAD(&pe->list);
>>> +
>>> +	return pe;
>>> +}
>>> +
>>> +static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
>>> +					       int pe_no)
>>> +{
>>> +	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>> +		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>> +			__func__, pe_no, phb->hose->global_number);
>>> +		return NULL;
>>>   	}
>>>
>>> -	phb->ioda.pe_array[pe_no].phb = phb;
>>> -	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>> +	/*
>>> +	 * Same PE might be reserved for multiple times, which
>>> +	 * is out of problem actually.
>>> +	 */
>>> +	set_bit(pe_no, phb->ioda.pe_alloc);
>>> +	return pnv_ioda_init_pe(phb, pe_no);
>>>   }
>>>
>>> -static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>> +static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>   {
>>>   	unsigned long pe_no;
>>>   	unsigned long limit = phb->ioda.total_pe - 1;
>>> @@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>   			break;
>>>
>>>   		if (--limit >= phb->ioda.total_pe)
>>> -			return IODA_INVALID_PE;
>>> +			return NULL;
>>>   	} while(1);
>>>
>>> -	phb->ioda.pe_array[pe_no].phb = phb;
>>> -	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>> -	return pe_no;
>>> -}
>>> -
>>> -static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
>>> -{
>>> -	WARN_ON(phb->ioda.pe_array[pe].pdev);
>>> -
>>> -	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
>>> -	clear_bit(pe, phb->ioda.pe_alloc);
>>> +	return pnv_ioda_init_pe(phb, pe_no);
>>>   }
>>>
>>>   static int pnv_ioda1_init_m64(struct pnv_phb *phb)
>>> @@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
>>>   	}
>>>   }
>>>
>>> -static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>> -				struct pci_bus *bus, int all)
>>> +static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>> +						struct pci_bus *bus,
>>> +						int all)
>>
>>
>> Mechanical changes like this could easily go into a separate patch.
>>
>
> Indeed. I'll see how I can split the patches up in the next revision.
> Thanks for the suggestion.
>
>>>   {
>>>   	resource_size_t segsz = phb->ioda.m64_segsize;
>>>   	struct pci_dev *pdev;
>>> @@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>   	int i;
>>>
>>>   	if (!pnv_ioda_need_m64_pe(phb, bus))
>>> -		return IODA_INVALID_PE;
>>> +		return NULL;
>>>
>>>           /* Allocate bitmap */
>>>   	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>>>   	pe_bitsmap = kzalloc(size, GFP_KERNEL);
>>>   	if (!pe_bitsmap) {
>>>   		pr_warn("%s: Out of memory !\n", __func__);
>>> -		return IODA_INVALID_PE;
>>> +		return NULL;
>>>   	}
>>>
>>>   	/* The bridge's M64 window might be extended to PHB's M64
>>> @@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>   	/* No M64 window found ? */
>>>   	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
>>>   		kfree(pe_bitsmap);
>>> -		return IODA_INVALID_PE;
>>> +		return NULL;
>>>   	}
>>>
>>>   	/* Figure out the master PE and put all slave PEs
>>> @@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>   	}
>>>
>>>   	kfree(pe_bitsmap);
>>> -	return master_pe->pe_number;
>>> +	return master_pe;
>>>   }
>>>
>>>   static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>>> @@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>>>    * but in the meantime, we need to protect them to avoid warnings
>>>    */
>>>   #ifdef CONFIG_PCI_MSI
>>> -static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>> +static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
>>>   {
>>>   	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>>   	struct pnv_phb *phb = hose->private_data;
>>> @@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>>   }
>>>   #endif /* CONFIG_PCI_MSI */
>>>
>>> -static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>> -				  struct pnv_ioda_pe *parent,
>>> -				  struct pnv_ioda_pe *child,
>>> -				  bool is_add)
>>> -{
>>> -	const char *desc = is_add ? "adding" : "removing";
>>> -	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>> -			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>> -	struct pnv_ioda_pe *slave;
>>> -	long rc;
>>> -
>>> -	/* Parent PE affects child PE */
>>> -	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>> -				child->pe_number, op);
>>> -	if (rc != OPAL_SUCCESS) {
>>> -		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>> -			rc, desc);
>>> -		return -ENXIO;
>>> -	}
>>> -
>>> -	if (!(child->flags & PNV_IODA_PE_MASTER))
>>> -		return 0;
>>> -
>>> -	/* Compound case: parent PE affects slave PEs */
>>> -	list_for_each_entry(slave, &child->slaves, list) {
>>> -		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>> -					slave->pe_number, op);
>>> -		if (rc != OPAL_SUCCESS) {
>>> -			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>> -				rc, desc);
>>> -			return -ENXIO;
>>> -		}
>>> -	}
>>> -
>>> -	return 0;
>>> -}
>>> -
>>> -static int pnv_ioda_set_peltv(struct pnv_phb *phb,
>>> -			      struct pnv_ioda_pe *pe,
>>> -			      bool is_add)
>>> -{
>>> -	struct pnv_ioda_pe *slave;
>>> -	struct pci_dev *pdev = NULL;
>>> -	int ret;
>>> -
>>> -	/*
>>> -	 * Clear PE frozen state. If it's master PE, we need
>>> -	 * clear slave PE frozen state as well.
>>> -	 */
>>> -	if (is_add) {
>>> -		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
>>> -					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> -		if (pe->flags & PNV_IODA_PE_MASTER) {
>>> -			list_for_each_entry(slave, &pe->slaves, list)
>>> -				opal_pci_eeh_freeze_clear(phb->opal_id,
>>> -							  slave->pe_number,
>>> -							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> -		}
>>> -	}
>>> -
>>> -	/*
>>> -	 * Associate PE in PELT. We need add the PE into the
>>> -	 * corresponding PELT-V as well. Otherwise, the error
>>> -	 * originated from the PE might contribute to other
>>> -	 * PEs.
>>> -	 */
>>> -	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>> -	if (ret)
>>> -		return ret;
>>> -
>>> -	/* For compound PEs, any one affects all of them */
>>> -	if (pe->flags & PNV_IODA_PE_MASTER) {
>>> -		list_for_each_entry(slave, &pe->slaves, list) {
>>> -			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>> -			if (ret)
>>> -				return ret;
>>> -		}
>>> -	}
>>> -
>>> -	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>> -		pdev = pe->pbus->self;
>>> -	else if (pe->flags & PNV_IODA_PE_DEV)
>>> -		pdev = pe->pdev->bus->self;
>>> -#ifdef CONFIG_PCI_IOV
>>> -	else if (pe->flags & PNV_IODA_PE_VF)
>>> -		pdev = pe->parent_dev->bus->self;
>>> -#endif /* CONFIG_PCI_IOV */
>>> -	while (pdev) {
>>> -		struct pci_dn *pdn = pci_get_pdn(pdev);
>>> -		struct pnv_ioda_pe *parent;
>>> -
>>> -		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>> -			parent = &phb->ioda.pe_array[pdn->pe_number];
>>> -			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>> -			if (ret)
>>> -				return ret;
>>> -		}
>>> -
>>> -		pdev = pdev->bus->self;
>>> -	}
>>> -
>>> -	return 0;
>>> -}
>>> -
>>> -#ifdef CONFIG_PCI_IOV
>>> -static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>> -{
>>> -	struct pci_dev *parent;
>>> -	uint8_t bcomp, dcomp, fcomp;
>>> -	int64_t rc;
>>> -	long rid_end, rid;
>>> -
>>> -	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
>>> -	if (pe->pbus) {
>>> -		int count;
>>> -
>>> -		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>> -		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>> -		parent = pe->pbus->self;
>>> -		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>> -			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
>>> -		else
>>> -			count = 1;
>>> -
>>> -		switch(count) {
>>> -		case  1: bcomp = OpalPciBusAll;         break;
>>> -		case  2: bcomp = OpalPciBus7Bits;       break;
>>> -		case  4: bcomp = OpalPciBus6Bits;       break;
>>> -		case  8: bcomp = OpalPciBus5Bits;       break;
>>> -		case 16: bcomp = OpalPciBus4Bits;       break;
>>> -		case 32: bcomp = OpalPciBus3Bits;       break;
>>> -		default:
>>> -			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
>>> -			        count);
>>> -			/* Do an exact match only */
>>> -			bcomp = OpalPciBusAll;
>>> -		}
>>> -		rid_end = pe->rid + (count << 8);
>>> -	} else {
>>> -		if (pe->flags & PNV_IODA_PE_VF)
>>> -			parent = pe->parent_dev;
>>> -		else
>>> -			parent = pe->pdev->bus->self;
>>> -		bcomp = OpalPciBusAll;
>>> -		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>> -		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>> -		rid_end = pe->rid + 1;
>>> -	}
>>> -
>>> -	/* Clear the reverse map */
>>> -	for (rid = pe->rid; rid < rid_end; rid++)
>>> -		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>> -
>>> -	/* Release from all parents PELT-V */
>>> -	while (parent) {
>>> -		struct pci_dn *pdn = pci_get_pdn(parent);
>>> -		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>> -			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
>>> -						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>> -			/* XXX What to do in case of error ? */
>>
>>
>> Not much :) Free associated memory and mark it "dead" so it won't be used
>> again till reboot. Under what circumstances can this opal_pci_set_peltv()
>> fail at all?
>>
>
> Yeah, maybe. I haven't seen this failure since the code went in. Note that
> it has been there for almost 4 years, since the day Ben wrote it.


Sure. But if it starts failing, we won't even notice - there is not even a
pr_err() or WARN_ON().
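
To illustrate the point, here is a minimal user-space sketch, not the kernel
code. fake_set_peltv() stands in for opal_pci_set_peltv(), and struct fake_pe
with its "dead" flag, the FAKE_* codes and fake_remove_from_parents() are all
hypothetical names invented for this example. It only shows the pattern of
logging each failure and fencing the PE off instead of silently carrying on.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
#define FAKE_SUCCESS	0
#define FAKE_HARDWARE	(-6)

struct fake_pe {
	int  pe_number;
	bool dead;	/* set once firmware state can no longer be trusted */
};

/* Pretends to be the firmware call; fails for odd parent PE numbers. */
static long fake_set_peltv(int parent_pe, int child_pe)
{
	(void)child_pe;
	return (parent_pe & 1) ? FAKE_HARDWARE : FAKE_SUCCESS;
}

/*
 * Remove @pe from the PELT-V of each parent in @parents. Every failure
 * is logged and the PE is marked dead, but the walk continues so the
 * remaining parents are still cleaned up. Returns the number of failures.
 */
static int fake_remove_from_parents(struct fake_pe *pe,
				    const int *parents, int nr_parents)
{
	int i, failures = 0;

	for (i = 0; i < nr_parents; i++) {
		long rc = fake_set_peltv(parents[i], pe->pe_number);

		if (rc != FAKE_SUCCESS) {
			fprintf(stderr,
				"PE#%d: error %ld removing from PELT-V of PE#%d\n",
				pe->pe_number, rc, parents[i]);
			pe->dead = true;
			failures++;
		}
	}

	return failures;
}
```

Whether a "dead" PE should also be kept out of the free pool until reboot is
a separate policy decision; the sketch only makes the failure visible.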


>
>>
>>> -		}
>>> -		parent = parent->bus->self;
>>> -	}
>>> -
>>> -	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
>>> -				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>> -
>>> -	/* Disassociate PE in PELT */
>>> -	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
>>> -				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>> -	if (rc)
>>> -		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
>>> -	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>> -			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>> -	if (rc)
>>> -		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
>>> -
>>> -	pe->pbus = NULL;
>>> -	pe->pdev = NULL;
>>> -	pe->parent_dev = NULL;
>>> -
>>> -	return 0;
>>> -}
>>> -#endif /* CONFIG_PCI_IOV */
>>> -
>>>   static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>   {
>>>   	struct pci_dev *parent;
>>> @@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>   	}
>>>
>>>   	/* Configure PELTV */
>>> -	pnv_ioda_set_peltv(phb, pe, true);
>>> +	pnv_ioda_set_peltv(pe, true);
>>>
>>>   	/* Setup reverse map */
>>>   	for (rid = pe->rid; rid < rid_end; rid++)
>>> @@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>>>   		if (pdn->pe_number != IODA_INVALID_PE)
>>>   			continue;
>>>
>>> +		/* Increase reference count of the parent PE */
>>
>> When you comment like this, I read it as the comment belongs to the whole
>> next chunk till the first empty line, i.e. to all 5 lines below, which is not
>> the case. I'd remove the comment as 1) "pe_get" in pnv_ioda_pe_get() name
>> suggests incrementing the reference counter 2) "pe" is always parent in this
>> function. I do not insist though.
>>
>
> I agree with your explanation. I'll remove this unhelpful comment.
>
>>
>>> +		pnv_ioda_pe_get(pe);
>>>   		pdn->pe_number = pe->pe_number;
>>>   		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
>>>   		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
>>> @@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>   {
>>>   	struct pci_controller *hose = pci_bus_to_host(bus);
>>>   	struct pnv_phb *phb = hose->private_data;
>>> -	struct pnv_ioda_pe *pe;
>>> +	struct pnv_ioda_pe *pe = NULL;
>>>   	int pe_num = IODA_INVALID_PE;
>>>
>>>   	/* For partial hotplug case, the PE instance hasn't been destroyed
>>> @@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>   	}
>>>
>>>   	/* PE number for root bus should have been reserved */
>>> -	if (pci_is_root_bus(bus))
>>> -		pe_num = phb->ioda.root_pe_no;
>>> +	if (pci_is_root_bus(bus) &&
>>> +	    phb->ioda.root_pe_no != IODA_INVALID_PE)
>>> +		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
>>>
>>>   	/* Check if PE is determined by M64 */
>>> -	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
>>> -		pe_num = phb->pick_m64_pe(phb, bus, all);
>>> +	if (!pe && phb->pick_m64_pe)
>>> +		pe = phb->pick_m64_pe(phb, bus, all);
>>>
>>>   	/* The PE number isn't pinned by M64 */
>>> -	if (pe_num == IODA_INVALID_PE)
>>> -		pe_num = pnv_ioda_alloc_pe(phb);
>>> +	if (!pe)
>>> +		pe = pnv_ioda_alloc_pe(phb);
>>>
>>> -	if (pe_num == IODA_INVALID_PE) {
>>> -		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
>>> +	if (!pe) {
>>> +		pr_warn("%s: No enough PE# available for PCI bus %04x:%02x\n",
>>>   			__func__, pci_domain_nr(bus), bus->number);
>>>   		return NULL;
>>>   	}
>>>
>>> -	pe = &phb->ioda.pe_array[pe_num];
>>>   	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
>>>   	pe->pbus = bus;
>>>   	pe->pdev = NULL;
>>> @@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>
>>>   	if (pnv_ioda_configure_pe(phb, pe)) {
>>>   		/* XXX What do we do here ? */
>>> -		if (pe_num)
>>> -			pnv_ioda_free_pe(phb, pe_num);
>>> -		pe->pbus = NULL;
>>> +		pnv_ioda_pe_put(pe);
>>>   		return NULL;
>>>   	}
>>>
>>>   	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
>>> -			GFP_KERNEL, hose->node);
>>> +				       GFP_KERNEL, hose->node);
>>
>> Seems like a whitespace-only change - if you really want this change (which
>> I hate - it makes the code look inaccurate to my taste, but it seems I am in
>> the minority here :) ), please put it in a separate patch.
>>
>
> Ok. To confirm: you prefer the original format? I don't know
> why I prefer the latter one. Maybe my eyes are quite broken :-)


I prefer not to change existing whitespace unless it is done once and for
the entire file :) Just remove this change from the patch.



>>
>>>   	pe->tce32_table->data = pe;
>>>
>>>   	/* Associate it with all child devices */
>>> @@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>>   		list_del(&pe->list);
>>>   		mutex_unlock(&phb->ioda.pe_list_mutex);
>>>
>>> -		pnv_ioda_deconfigure_pe(phb, pe);
>>> +		pnv_ioda_deconfigure_pe(pe);
>>
>>
>> Is this change necessary to get "Release PEs dynamically" working? Move it to
>> the mechanical-changes patch, maybe?
>>
>
> Ok. I'll try to do that.
>
>>
>>>
>>> -		pnv_ioda_free_pe(phb, pe->pe_number);
>>> +		pnv_ioda_pe_put(pe);
>>>   	}
>>>   }
>>>
>>> @@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>>
>>>   		if (pnv_ioda_configure_pe(phb, pe)) {
>>>   			/* XXX What do we do here ? */
>>> -			if (pe_num)
>>> -				pnv_ioda_free_pe(phb, pe_num);
>>> -			pe->pdev = NULL;
>>> +			pnv_ioda_pe_put(pe);
>>>   			continue;
>>>   		}
>>>
>>> @@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
>>>   	struct pnv_ioda_pe *pe;
>>>   	int rc;
>>>
>>> -	pe = pnv_ioda_get_pe(dev);
>>> +	pe = pnv_ioda_pci_dev_to_pe(dev);
>>
>>
>> And this change could go separately. It is not clear how this helps to
>> "Release PEs dynamically".
>>
>>
>
> It's not related to "Release PEs dynamically". The change is introduced by
> the function rename: the original pnv_ioda_get_pe() is renamed to pnv_ioda_pci_dev_to_pe().


But the rename happened in this patch, and the patch's subject is "Release PEs
dynamically", so either it is related somehow or it should move to a simple
separate patch "let's give the lalala function a better name to reflect
what it actually does" (but in this case the new name does not make any
more sense than the old one).



>>>   	if (!pe)
>>>   		return -ENODEV;
>>>
>>> @@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>>>   	struct pnv_ioda_pe *pe;
>>>   	int rc;
>>>
>>> -	if (!(pe = pnv_ioda_get_pe(dev)))
>>> +	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
>>>   		return -ENODEV;
>>>
>>>   	/* Assign XIVE to PE */
>>> @@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
>>>   				  unsigned int hwirq, unsigned int virq,
>>>   				  unsigned int is_64, struct msi_msg *msg)
>>>   {
>>> -	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
>>> +	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
>>>   	unsigned int xive_num = hwirq - phb->msi_base;
>>>   	__be32 data;
>>>   	int rc;
>>> @@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>>   	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
>>>   	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
>>>   	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
>>> +	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
>>>   	hose->controller_ops = pnv_pci_controller_ops;
>>>
>>>   #ifdef CONFIG_PCI_IOV
>>> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>> index 1bea3a8..8b10f01 100644
>>> --- a/arch/powerpc/platforms/powernv/pci.h
>>> +++ b/arch/powerpc/platforms/powernv/pci.h
>>> @@ -28,6 +28,7 @@ enum pnv_phb_model {
>>>   /* Data associated with a PE, including IOMMU tracking etc.. */
>>>   struct pnv_phb;
>>>   struct pnv_ioda_pe {
>>> +	struct kref		kref;
>>>   	unsigned long		flags;
>>>   	struct pnv_phb		*phb;
>>>
>>> @@ -120,7 +121,8 @@ struct pnv_phb {
>>>   	void (*shutdown)(struct pnv_phb *phb);
>>>   	int (*init_m64)(struct pnv_phb *phb);
>>>   	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
>>> -	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
>>> +	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
>>> +					   struct pci_bus *bus, int all);
>>>   	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>>>   	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
>>>   	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
>>>
>
> Thanks,
> Gavin
>


-- 
Alexey


* Re: [PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure
  2015-05-11  6:45       ` Gavin Shan
@ 2015-05-11  7:16         ` Alexey Kardashevskiy
From: Alexey Kardashevskiy @ 2015-05-11  7:16 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linuxppc-dev, linux-pci, benh, bhelgaas

On 05/11/2015 04:45 PM, Gavin Shan wrote:
> On Sat, May 09, 2015 at 11:41:05PM +1000, Alexey Kardashevskiy wrote:
>> On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>> For PowerNV platform, running on top of skiboot, all PE level reset
>>> should be routed to firmware if the bridge of the PE primary bus has
>>> device-node property "ibm,reset-by-firmware". Otherwise, the kernel
>>> has to issue hot reset on PE's primary bus despite the requested reset
>>> types, which is the behaviour before the firmware supports PCI slot
>>> reset. So the changes don't depend on the PCI slot reset capability
>>> exposed from the firmware.
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>>   arch/powerpc/include/asm/eeh.h               |   1 +
>>>   arch/powerpc/include/asm/opal.h              |   4 +-
>>>   arch/powerpc/platforms/powernv/eeh-powernv.c | 206 +++++++++++++--------------
>>>   3 files changed, 102 insertions(+), 109 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>> index c5eb86f..2793d24 100644
>>> --- a/arch/powerpc/include/asm/eeh.h
>>> +++ b/arch/powerpc/include/asm/eeh.h
>>> @@ -190,6 +190,7 @@ enum {
>>>   #define EEH_RESET_DEACTIVATE	0	/* Deactivate the PE reset	*/
>>>   #define EEH_RESET_HOT		1	/* Hot reset			*/
>>>   #define EEH_RESET_FUNDAMENTAL	3	/* Fundamental reset		*/
>>> +#define EEH_RESET_COMPLETE	4	/* PHB complete reset           */
>>>   #define EEH_LOG_TEMP		1	/* EEH temporary error log	*/
>>>   #define EEH_LOG_PERM		2	/* EEH permanent error log	*/
>>>
>>> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
>>> index 042af1a..6d467df 100644
>>> --- a/arch/powerpc/include/asm/opal.h
>>> +++ b/arch/powerpc/include/asm/opal.h
>>> @@ -129,7 +129,7 @@ int64_t opal_pci_map_pe_dma_window(uint64_t phb_id, uint16_t pe_number, uint16_t
>>>   int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id, uint16_t pe_number,
>>>   					uint16_t dma_window_number, uint64_t pci_start_addr,
>>>   					uint64_t pci_mem_size);
>>> -int64_t opal_pci_reset(uint64_t phb_id, uint8_t reset_scope, uint8_t assert_state);
>>> +int64_t opal_pci_reset(uint64_t id, uint8_t reset_scope, uint8_t assert_state);
>>>
>>>   int64_t opal_pci_get_hub_diag_data(uint64_t hub_id, void *diag_buffer,
>>>   				   uint64_t diag_buffer_len);
>>> @@ -145,7 +145,7 @@ int64_t opal_get_epow_status(__be64 *status);
>>>   int64_t opal_set_system_attention_led(uint8_t led_action);
>>>   int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe,
>>>   			    __be16 *pci_error_type, __be16 *severity);
>>> -int64_t opal_pci_poll(uint64_t phb_id);
>>> +int64_t opal_pci_poll(uint64_t id, uint8_t *val);
>>>   int64_t opal_return_cpu(void);
>>>   int64_t opal_check_token(uint64_t token);
>>>   int64_t opal_reinit_cpus(uint64_t flags);
>>> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> index ce738ab..3c01095 100644
>>> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> @@ -742,12 +742,12 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
>>>   	return ret;
>>>   }
>>>
>>> -static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
>>> +static s64 pnv_eeh_poll(uint64_t id)
>>>   {
>>>   	s64 rc = OPAL_HARDWARE;
>>>
>>>   	while (1) {
>>> -		rc = opal_pci_poll(phb->opal_id);
>>> +		rc = opal_pci_poll(id, NULL);
>>>   		if (rc <= 0)
>>>   			break;
>>>
>>> @@ -763,84 +763,38 @@ static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
>>>   int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
>>>   {
>>>   	struct pnv_phb *phb = hose->private_data;
>>> +	uint8_t scope;
>>>   	s64 rc = OPAL_HARDWARE;
>>>
>>>   	pr_debug("%s: Reset PHB#%x, option=%d\n",
>>>   		 __func__, hose->global_number, option);
>>> -
>>> -	/* Issue PHB complete reset request */
>>> -	if (option == EEH_RESET_FUNDAMENTAL ||
>>> -	    option == EEH_RESET_HOT)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PHB_COMPLETE,
>>> -				    OPAL_ASSERT_RESET);
>>> -	else if (option == EEH_RESET_DEACTIVATE)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PHB_COMPLETE,
>>> -				    OPAL_DEASSERT_RESET);
>>> -	if (rc < 0)
>>> -		goto out;
>>> -
>>> -	/*
>>> -	 * Poll state of the PHB until the request is done
>>> -	 * successfully. The PHB reset is usually PHB complete
>>> -	 * reset followed by hot reset on root bus. So we also
>>> -	 * need the PCI bus settlement delay.
>>> -	 */
>>> -	rc = pnv_eeh_phb_poll(phb);
>>> -	if (option == EEH_RESET_DEACTIVATE) {
>>> -		if (system_state < SYSTEM_RUNNING)
>>> -			udelay(1000 * EEH_PE_RST_SETTLE_TIME);
>>> -		else
>>> -			msleep(EEH_PE_RST_SETTLE_TIME);
>>
>>
>> These udelay() and msleep() are gone. How come they are not needed anymore?
>> This is worth explaining in the commit log, or removing them in a separate patch.
>>
>> I just remember you mentioning some missing delays somewhere which caused
>> an NVIDIA device to trigger EEH, and I do not want those delays to disappear :)
>>
>
> Yeah, I think you're correct that it's not safe to remove this yet because
> the old firmware (without the corresponding PCI hotplug changes) doesn't have
> the required delays in opal_pci_poll() yet. I'll add this back in the next revision.


And in a later patch you add some delays. If they are the same delays but
in a different place, they should go into the same patch.
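
For reference, the shape being discussed is roughly the following user-space
sketch. It is not the patch's code: fake_pci_poll(), fake_msleep(),
fake_reset_and_settle() and FAKE_SETTLE_MS are hypothetical stand-ins for
opal_pci_poll(), msleep() and EEH_PE_RST_SETTLE_TIME. It also assumes that a
positive poll result is a number of milliseconds to wait before polling again,
which the quoted hunk does not show. The point is only that the poll loop and
the settle delay after deassert sit in one place.

```c
#include <stdint.h>

#define FAKE_SETTLE_MS	100	/* hypothetical settle time, not the kernel value */

static int  fake_polls_left;	/* polls remaining before "firmware" is done */
static long fake_slept_ms;	/* total time the caller would have slept */

/* Returns >0 while busy (assumed: ms to wait), 0 once the reset is done. */
static int64_t fake_pci_poll(void)
{
	return (fake_polls_left-- > 0) ? 10 : 0;
}

/* Records the delay instead of sleeping, so the sketch runs instantly. */
static void fake_msleep(long ms)
{
	fake_slept_ms += ms;
}

/*
 * Poll until the firmware reports completion, then apply the bus settle
 * delay when the reset is being deasserted. Returns 0 on success, -1 on
 * a firmware error.
 */
static int fake_reset_and_settle(int deassert)
{
	int64_t rc;

	while ((rc = fake_pci_poll()) > 0)
		fake_msleep((long)rc);
	if (rc < 0)
		return -1;

	if (deassert)
		fake_msleep(FAKE_SETTLE_MS);

	return 0;
}
```

Keeping both delays in one helper like this would also make it obvious in the
diff whether a delay was moved or dropped.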


>
>>
>>> +	switch (option) {
>>> +	case EEH_RESET_HOT:
>>> +		scope = OPAL_RESET_PCI_HOT;
>>> +		break;
>>> +	case EEH_RESET_FUNDAMENTAL:
>>> +		scope = OPAL_RESET_PCI_FUNDAMENTAL;
>>> +		break;
>>> +	case EEH_RESET_COMPLETE:
>>> +		scope = OPAL_RESET_PHB_COMPLETE;
>>> +		break;
>>> +	case EEH_RESET_DEACTIVATE:
>>> +		return 0;
>>> +	default:
>>> +		pr_warn("%s: Unsupported option %d\n",
>>> +			__func__, option);
>>> +		return -EINVAL;
>>>   	}
>>> -out:
>>> -	if (rc != OPAL_SUCCESS)
>>> -		return -EIO;
>>>
>>> -	return 0;
>>> -}
>>> -
>>> -static int pnv_eeh_root_reset(struct pci_controller *hose, int option)
>>> -{
>>> -	struct pnv_phb *phb = hose->private_data;
>>> -	s64 rc = OPAL_HARDWARE;
>>> +	/* Issue reset and poll until it's completed */
>>> +	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
>>> +	if (rc > 0)
>>> +		rc = pnv_eeh_poll(phb->opal_id);
>>>
>>> -	pr_debug("%s: Reset PHB#%x, option=%d\n",
>>> -		 __func__, hose->global_number, option);
>>> -
>>> -	/*
>>> -	 * During the reset deassert time, we needn't care
>>> -	 * the reset scope because the firmware does nothing
>>> -	 * for fundamental or hot reset during deassert phase.
>>> -	 */
>>> -	if (option == EEH_RESET_FUNDAMENTAL)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PCI_FUNDAMENTAL,
>>> -				    OPAL_ASSERT_RESET);
>>> -	else if (option == EEH_RESET_HOT)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PCI_HOT,
>>> -				    OPAL_ASSERT_RESET);
>>> -	else if (option == EEH_RESET_DEACTIVATE)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PCI_HOT,
>>> -				    OPAL_DEASSERT_RESET);
>>> -	if (rc < 0)
>>> -		goto out;
>>> -
>>> -	/* Poll state of the PHB until the request is done */
>>> -	rc = pnv_eeh_phb_poll(phb);
>>> -	if (option == EEH_RESET_DEACTIVATE)
>>> -		msleep(EEH_PE_RST_SETTLE_TIME);
>>> -out:
>>> -	if (rc != OPAL_SUCCESS)
>>> -		return -EIO;
>>> -
>>> -	return 0;
>>> +	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>>   }
>>>
>>> -static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>> +static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>>   {
>>>   	struct pci_dn *pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
>>>   	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
>>> @@ -891,14 +845,57 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>>   	return 0;
>>>   }
>>>
>>> +static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>> +{
>>> +	struct pci_controller *hose;
>>> +	struct pnv_phb *phb;
>>> +	struct device_node *dn = dev ? pci_device_to_OF_node(dev) : NULL;
>>> +	uint64_t id = (0x1ul << 60);
>>> +	uint8_t scope;
>>> +	s64 rc;
>>
>>
>> int64_t for @rc?
>>
>>
>
> Yes.
>
>>> +
>>> +	/*
>>> +	 * If the firmware can't handle it, we will issue hot reset
>>> +	 * on the secondary bus despite the requested reset type
>>> +	 */
>>> +	if (!dn || !of_get_property(dn, "ibm,reset-by-firmware", NULL))
>>> +		return __pnv_eeh_bridge_reset(dev, option);
>>> +
>>> +	/* The firmware can handle the request */
>>> +	switch (option) {
>>> +	case EEH_RESET_HOT:
>>> +		scope = OPAL_RESET_PCI_HOT;
>>> +		break;
>>> +	case EEH_RESET_FUNDAMENTAL:
>>> +		scope = OPAL_RESET_PCI_FUNDAMENTAL;
>>> +		break;
>>> +	case EEH_RESET_DEACTIVATE:
>>> +		return 0;
>>> +	case EEH_RESET_COMPLETE:
>>> +	default:
>>> +		pr_warn("%s: Unsupported option %d on device %s\n",
>>> +			__func__, option, pci_name(dev));
>>> +		return -EINVAL;
>>> +	}
>>
>>
>> This is the same switch as earlier in this patch (slightly different order).
>> Move it and opal_pci_reset() into a helper and call it pnv_opal_pci_reset()?
>>
>>
>
> It sounds a good idea. I'll do accordingly.
>
>>> +
>>> +	hose = pci_bus_to_host(dev->bus);
>>> +	phb = hose->private_data;
>>
>> Previously you would initialize @hose and @phb where you declared those but
>> not here. If you did the same thing as before, the patch could have been
>> smaller and easier to read.
>>
>
> Sure.
>
>>> +	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
>>> +	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
>>> +	if (rc > 0)
>>> +		rc = pnv_eeh_poll(id);
>>> +
>>> +	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>> +}
>>> +
>>>   void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>>   {
>>>   	struct pci_controller *hose;
>>>
>>>   	if (pci_is_root_bus(dev->bus)) {
>>>   		hose = pci_bus_to_host(dev->bus);
>>> -		pnv_eeh_root_reset(hose, EEH_RESET_HOT);
>>> -		pnv_eeh_root_reset(hose, EEH_RESET_DEACTIVATE);
>>> +		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
>>> +		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
>>>   	} else {
>>>   		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>>>   		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
>>> @@ -920,8 +917,9 @@ void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>>   static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>>>   {
>>>   	struct pci_controller *hose = pe->phb;
>>> +	struct pnv_phb *phb;
>>>   	struct pci_bus *bus;
>>> -	int ret;
>>> +	s64 rc;
>>>
>>>   	/*
>>>   	 * For PHB reset, we always have complete reset. For those PEs whose
>>> @@ -937,43 +935,37 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>>>   	 * reset. The side effect is that EEH core has to clear the frozen
>>>   	 * state explicitly after BAR restore.
>>>   	 */
>>> -	if (pe->type & EEH_PE_PHB) {
>>> -		ret = pnv_eeh_phb_reset(hose, option);
>>> -	} else {
>>> -		struct pnv_phb *phb;
>>> -		s64 rc;
>>> +	if (pe->type & EEH_PE_PHB)
>>
>> I would keep "{" in the line above ....
>>
>>> +		return pnv_eeh_phb_reset(hose, EEH_RESET_COMPLETE);
>>
>> ...put "} else {" here...
>>
>> and the chunk below would become 1) very small 2) very trivial... And then
>> you could make a trivial patch which would do scope removal but without
>> functional changes. Or vice versa.
>>
>
> I intended to remove nested if(). If you really want me to change the code
> according to your comments, I'll do. Otherwise, I prefer to keep it as
> of being.


Use your best judgement :) If you do shift the whole block, just make sure that
all you are doing is moving it and nothing is lost/added during this move.


>>>
>>> -		/*
>>> -		 * The frozen PE might be caused by PAPR error injection
>>> -		 * registers, which are expected to be cleared after hitting
>>> -		 * frozen PE as stated in the hardware spec. Unfortunately,
>>> -		 * that's not true on P7IOC. So we have to clear it manually
>>> -		 * to avoid recursive EEH errors during recovery.
>>> -		 */
>>> -		phb = hose->private_data;
>>> -		if (phb->model == PNV_PHB_MODEL_P7IOC &&
>>> -		    (option == EEH_RESET_HOT ||
>>> -		    option == EEH_RESET_FUNDAMENTAL)) {
>>> -			rc = opal_pci_reset(phb->opal_id,
>>> -					    OPAL_RESET_PHB_ERROR,
>>> -					    OPAL_ASSERT_RESET);
>>> -			if (rc != OPAL_SUCCESS) {
>>> -				pr_warn("%s: Failure %lld clearing "
>>> -					"error injection registers\n",
>>> -					__func__, rc);
>>> -				return -EIO;
>>> -			}
>>> +	/*
>>> +	 * The frozen PE might be caused by PAPR error injection
>>> +	 * registers, which are expected to be cleared after hitting
>>> +	 * frozen PE as stated in the hardware spec. Unfortunately,
>>> +	 * that's not true on P7IOC. So we have to clear it manually
>>> +	 * to avoid recursive EEH errors during recovery.
>>> +	 */
>>> +	phb = hose->private_data;
>>> +	if (phb->model == PNV_PHB_MODEL_P7IOC &&
>>> +	    (option == EEH_RESET_HOT ||
>>> +	    option == EEH_RESET_FUNDAMENTAL)) {
>>> +		rc = opal_pci_reset(phb->opal_id,
>>> +				    OPAL_RESET_PHB_ERROR,
>>> +				    OPAL_ASSERT_RESET);
>>> +		if (rc != OPAL_SUCCESS) {
>>> +			pr_warn("%s: Failure %lld clearing error "
>>> +				"injection registers on PHB#%d\n",
>>> +				__func__, rc, hose->global_number);
>>> +			return -EIO;
>>>   		}
>>> -
>>> -		bus = eeh_pe_bus_get(pe);
>>> -		if (pci_is_root_bus(bus) ||
>>> -			pci_is_root_bus(bus->parent))
>>> -			ret = pnv_eeh_root_reset(hose, option);
>>> -		else
>>> -			ret = pnv_eeh_bridge_reset(bus->self, option);
>>>   	}
>>>
>>> -	return ret;
>>> +	/* Route the reset request to PHB or upstream bridge */
>>> +	bus = eeh_pe_bus_get(pe);
>>> +	if (pci_is_root_bus(bus))
>>> +		return pnv_eeh_phb_reset(hose, option);
>>> +
>>> +	return pnv_eeh_bridge_reset(bus->self, option);
>>>   }
>>>
>>>   /**
>>>
>
> Thanks,
> Gavin
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure
@ 2015-05-11  7:16         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-11  7:16 UTC (permalink / raw)
  To: Gavin Shan; +Cc: bhelgaas, linux-pci, linuxppc-dev

On 05/11/2015 04:45 PM, Gavin Shan wrote:
> On Sat, May 09, 2015 at 11:41:05PM +1000, Alexey Kardashevskiy wrote:
>> On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>> For PowerNV platform, running on top of skiboot, all PE level reset
>>> should be routed to firmware if the bridge of the PE primary bus has
>>> device-node property "ibm,reset-by-firmware". Otherwise, the kernel
>>> has to issue hot reset on PE's primary bus despite the requested reset
>>> types, which is the behaviour before the firmware supports PCI slot
>>> reset. So the changes don't depend on the PCI slot reset capability
>>> exposed from the firmware.
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>>   arch/powerpc/include/asm/eeh.h               |   1 +
>>>   arch/powerpc/include/asm/opal.h              |   4 +-
>>>   arch/powerpc/platforms/powernv/eeh-powernv.c | 206 +++++++++++++--------------
>>>   3 files changed, 102 insertions(+), 109 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>> index c5eb86f..2793d24 100644
>>> --- a/arch/powerpc/include/asm/eeh.h
>>> +++ b/arch/powerpc/include/asm/eeh.h
>>> @@ -190,6 +190,7 @@ enum {
>>>   #define EEH_RESET_DEACTIVATE	0	/* Deactivate the PE reset	*/
>>>   #define EEH_RESET_HOT		1	/* Hot reset			*/
>>>   #define EEH_RESET_FUNDAMENTAL	3	/* Fundamental reset		*/
>>> +#define EEH_RESET_COMPLETE	4	/* PHB complete reset           */
>>>   #define EEH_LOG_TEMP		1	/* EEH temporary error log	*/
>>>   #define EEH_LOG_PERM		2	/* EEH permanent error log	*/
>>>
>>> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
>>> index 042af1a..6d467df 100644
>>> --- a/arch/powerpc/include/asm/opal.h
>>> +++ b/arch/powerpc/include/asm/opal.h
>>> @@ -129,7 +129,7 @@ int64_t opal_pci_map_pe_dma_window(uint64_t phb_id, uint16_t pe_number, uint16_t
>>>   int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id, uint16_t pe_number,
>>>   					uint16_t dma_window_number, uint64_t pci_start_addr,
>>>   					uint64_t pci_mem_size);
>>> -int64_t opal_pci_reset(uint64_t phb_id, uint8_t reset_scope, uint8_t assert_state);
>>> +int64_t opal_pci_reset(uint64_t id, uint8_t reset_scope, uint8_t assert_state);
>>>
>>>   int64_t opal_pci_get_hub_diag_data(uint64_t hub_id, void *diag_buffer,
>>>   				   uint64_t diag_buffer_len);
>>> @@ -145,7 +145,7 @@ int64_t opal_get_epow_status(__be64 *status);
>>>   int64_t opal_set_system_attention_led(uint8_t led_action);
>>>   int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe,
>>>   			    __be16 *pci_error_type, __be16 *severity);
>>> -int64_t opal_pci_poll(uint64_t phb_id);
>>> +int64_t opal_pci_poll(uint64_t id, uint8_t *val);
>>>   int64_t opal_return_cpu(void);
>>>   int64_t opal_check_token(uint64_t token);
>>>   int64_t opal_reinit_cpus(uint64_t flags);
>>> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> index ce738ab..3c01095 100644
>>> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> @@ -742,12 +742,12 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
>>>   	return ret;
>>>   }
>>>
>>> -static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
>>> +static s64 pnv_eeh_poll(uint64_t id)
>>>   {
>>>   	s64 rc = OPAL_HARDWARE;
>>>
>>>   	while (1) {
>>> -		rc = opal_pci_poll(phb->opal_id);
>>> +		rc = opal_pci_poll(id, NULL);
>>>   		if (rc <= 0)
>>>   			break;
>>>
>>> @@ -763,84 +763,38 @@ static s64 pnv_eeh_phb_poll(struct pnv_phb *phb)
>>>   int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
>>>   {
>>>   	struct pnv_phb *phb = hose->private_data;
>>> +	uint8_t scope;
>>>   	s64 rc = OPAL_HARDWARE;
>>>
>>>   	pr_debug("%s: Reset PHB#%x, option=%d\n",
>>>   		 __func__, hose->global_number, option);
>>> -
>>> -	/* Issue PHB complete reset request */
>>> -	if (option == EEH_RESET_FUNDAMENTAL ||
>>> -	    option == EEH_RESET_HOT)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PHB_COMPLETE,
>>> -				    OPAL_ASSERT_RESET);
>>> -	else if (option == EEH_RESET_DEACTIVATE)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PHB_COMPLETE,
>>> -				    OPAL_DEASSERT_RESET);
>>> -	if (rc < 0)
>>> -		goto out;
>>> -
>>> -	/*
>>> -	 * Poll state of the PHB until the request is done
>>> -	 * successfully. The PHB reset is usually PHB complete
>>> -	 * reset followed by hot reset on root bus. So we also
>>> -	 * need the PCI bus settlement delay.
>>> -	 */
>>> -	rc = pnv_eeh_phb_poll(phb);
>>> -	if (option == EEH_RESET_DEACTIVATE) {
>>> -		if (system_state < SYSTEM_RUNNING)
>>> -			udelay(1000 * EEH_PE_RST_SETTLE_TIME);
>>> -		else
>>> -			msleep(EEH_PE_RST_SETTLE_TIME);
>>
>>
>> These udelay() and msleep() are gone. How come they are not needed anymore?
>> Worth commenting in the commit log or remove those in a separate patch.
>>
>> I just remember you mentioning some missing delays somewhere which caused
>> NVIDIA device to issue EEH and I do not want those to disappear :)
>>
>
> Yeah, I think you're correct that it's not safe to remove this yet because
> the old firmware (without corresponding PCI hotplug changes) doesn't have
> the required delays from opal_pci_poll() yet. I'll add this in next revision.


And in a later patch you add some delays. If they are the same delays but 
in a different place, they should go to the same patch.


>
>>
>>> +	switch (option) {
>>> +	case EEH_RESET_HOT:
>>> +		scope = OPAL_RESET_PCI_HOT;
>>> +		break;
>>> +	case EEH_RESET_FUNDAMENTAL:
>>> +		scope = OPAL_RESET_PCI_FUNDAMENTAL;
>>> +		break;
>>> +	case EEH_RESET_COMPLETE:
>>> +		scope = OPAL_RESET_PHB_COMPLETE;
>>> +		break;
>>> +	case EEH_RESET_DEACTIVATE:
>>> +		return 0;
>>> +	default:
>>> +		pr_warn("%s: Unsupported option %d\n",
>>> +			__func__, option);
>>> +		return -EINVAL;
>>>   	}
>>> -out:
>>> -	if (rc != OPAL_SUCCESS)
>>> -		return -EIO;
>>>
>>> -	return 0;
>>> -}
>>> -
>>> -static int pnv_eeh_root_reset(struct pci_controller *hose, int option)
>>> -{
>>> -	struct pnv_phb *phb = hose->private_data;
>>> -	s64 rc = OPAL_HARDWARE;
>>> +	/* Issue reset and poll until it's completed */
>>> +	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
>>> +	if (rc > 0)
>>> +		rc = pnv_eeh_poll(phb->opal_id);
>>>
>>> -	pr_debug("%s: Reset PHB#%x, option=%d\n",
>>> -		 __func__, hose->global_number, option);
>>> -
>>> -	/*
>>> -	 * During the reset deassert time, we needn't care
>>> -	 * the reset scope because the firmware does nothing
>>> -	 * for fundamental or hot reset during deassert phase.
>>> -	 */
>>> -	if (option == EEH_RESET_FUNDAMENTAL)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PCI_FUNDAMENTAL,
>>> -				    OPAL_ASSERT_RESET);
>>> -	else if (option == EEH_RESET_HOT)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PCI_HOT,
>>> -				    OPAL_ASSERT_RESET);
>>> -	else if (option == EEH_RESET_DEACTIVATE)
>>> -		rc = opal_pci_reset(phb->opal_id,
>>> -				    OPAL_RESET_PCI_HOT,
>>> -				    OPAL_DEASSERT_RESET);
>>> -	if (rc < 0)
>>> -		goto out;
>>> -
>>> -	/* Poll state of the PHB until the request is done */
>>> -	rc = pnv_eeh_phb_poll(phb);
>>> -	if (option == EEH_RESET_DEACTIVATE)
>>> -		msleep(EEH_PE_RST_SETTLE_TIME);
>>> -out:
>>> -	if (rc != OPAL_SUCCESS)
>>> -		return -EIO;
>>> -
>>> -	return 0;
>>> +	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>>   }
>>>
>>> -static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>> +static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>>   {
>>>   	struct pci_dn *pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
>>>   	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
>>> @@ -891,14 +845,57 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>>   	return 0;
>>>   }
>>>
>>> +static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>> +{
>>> +	struct pci_controller *hose;
>>> +	struct pnv_phb *phb;
>>> +	struct device_node *dn = dev ? pci_device_to_OF_node(dev) : NULL;
>>> +	uint64_t id = (0x1ul << 60);
>>> +	uint8_t scope;
>>> +	s64 rc;
>>
>>
>> int64_t for @rc?
>>
>>
>
> Yes.
>
>>> +
>>> +	/*
>>> +	 * If the firmware can't handle it, we will issue hot reset
>>> +	 * on the secondary bus despite the requested reset type
>>> +	 */
>>> +	if (!dn || !of_get_property(dn, "ibm,reset-by-firmware", NULL))
>>> +		return __pnv_eeh_bridge_reset(dev, option);
>>> +
>>> +	/* The firmware can handle the request */
>>> +	switch (option) {
>>> +	case EEH_RESET_HOT:
>>> +		scope = OPAL_RESET_PCI_HOT;
>>> +		break;
>>> +	case EEH_RESET_FUNDAMENTAL:
>>> +		scope = OPAL_RESET_PCI_FUNDAMENTAL;
>>> +		break;
>>> +	case EEH_RESET_DEACTIVATE:
>>> +		return 0;
>>> +	case EEH_RESET_COMPLETE:
>>> +	default:
>>> +		pr_warn("%s: Unsupported option %d on device %s\n",
>>> +			__func__, option, pci_name(dev));
>>> +		return -EINVAL;
>>> +	}
>>
>>
>> This is the same switch as earlier in this patch (slightly different order).
>> Move it and opal_pci_reset() into a helper and call it pnv_opal_pci_reset()?
>>
>>
>
> It sounds a good idea. I'll do accordingly.
>
>>> +
>>> +	hose = pci_bus_to_host(dev->bus);
>>> +	phb = hose->private_data;
>>
>> Previously you would initialize @hose and @phb where you declared those but
>> not here. If you did the same thing as before, the patch could have been
>> smaller and easier to read.
>>
>
> Sure.
>
>>> +	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
>>> +	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
>>> +	if (rc > 0)
>>> +		rc = pnv_eeh_poll(id);
>>> +
>>> +	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>> +}
>>> +
>>>   void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>>   {
>>>   	struct pci_controller *hose;
>>>
>>>   	if (pci_is_root_bus(dev->bus)) {
>>>   		hose = pci_bus_to_host(dev->bus);
>>> -		pnv_eeh_root_reset(hose, EEH_RESET_HOT);
>>> -		pnv_eeh_root_reset(hose, EEH_RESET_DEACTIVATE);
>>> +		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
>>> +		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
>>>   	} else {
>>>   		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>>>   		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
>>> @@ -920,8 +917,9 @@ void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>>   static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>>>   {
>>>   	struct pci_controller *hose = pe->phb;
>>> +	struct pnv_phb *phb;
>>>   	struct pci_bus *bus;
>>> -	int ret;
>>> +	s64 rc;
>>>
>>>   	/*
>>>   	 * For PHB reset, we always have complete reset. For those PEs whose
>>> @@ -937,43 +935,37 @@ static int pnv_eeh_reset(struct eeh_pe *pe, int option)
>>>   	 * reset. The side effect is that EEH core has to clear the frozen
>>>   	 * state explicitly after BAR restore.
>>>   	 */
>>> -	if (pe->type & EEH_PE_PHB) {
>>> -		ret = pnv_eeh_phb_reset(hose, option);
>>> -	} else {
>>> -		struct pnv_phb *phb;
>>> -		s64 rc;
>>> +	if (pe->type & EEH_PE_PHB)
>>
>> I would keep "{" in the line above ....
>>
>>> +		return pnv_eeh_phb_reset(hose, EEH_RESET_COMPLETE);
>>
>> ...put "} else {" here...
>>
>> and the chunk below would become 1) very small 2) very trivial... And then
>> you could make a trivial patch which would do scope removal but without
>> functional changes. Or vice versa.
>>
>
> I intended to remove nested if(). If you really want me to change the code
> according to your comments, I'll do. Otherwise, I prefer to keep it as
> of being.


Use your best judgement :) If you do shift the whole block, just make sure that
all you are doing is moving it and nothing is lost/added during this move.


>>>
>>> -		/*
>>> -		 * The frozen PE might be caused by PAPR error injection
>>> -		 * registers, which are expected to be cleared after hitting
>>> -		 * frozen PE as stated in the hardware spec. Unfortunately,
>>> -		 * that's not true on P7IOC. So we have to clear it manually
>>> -		 * to avoid recursive EEH errors during recovery.
>>> -		 */
>>> -		phb = hose->private_data;
>>> -		if (phb->model == PNV_PHB_MODEL_P7IOC &&
>>> -		    (option == EEH_RESET_HOT ||
>>> -		    option == EEH_RESET_FUNDAMENTAL)) {
>>> -			rc = opal_pci_reset(phb->opal_id,
>>> -					    OPAL_RESET_PHB_ERROR,
>>> -					    OPAL_ASSERT_RESET);
>>> -			if (rc != OPAL_SUCCESS) {
>>> -				pr_warn("%s: Failure %lld clearing "
>>> -					"error injection registers\n",
>>> -					__func__, rc);
>>> -				return -EIO;
>>> -			}
>>> +	/*
>>> +	 * The frozen PE might be caused by PAPR error injection
>>> +	 * registers, which are expected to be cleared after hitting
>>> +	 * frozen PE as stated in the hardware spec. Unfortunately,
>>> +	 * that's not true on P7IOC. So we have to clear it manually
>>> +	 * to avoid recursive EEH errors during recovery.
>>> +	 */
>>> +	phb = hose->private_data;
>>> +	if (phb->model == PNV_PHB_MODEL_P7IOC &&
>>> +	    (option == EEH_RESET_HOT ||
>>> +	    option == EEH_RESET_FUNDAMENTAL)) {
>>> +		rc = opal_pci_reset(phb->opal_id,
>>> +				    OPAL_RESET_PHB_ERROR,
>>> +				    OPAL_ASSERT_RESET);
>>> +		if (rc != OPAL_SUCCESS) {
>>> +			pr_warn("%s: Failure %lld clearing error "
>>> +				"injection registers on PHB#%d\n",
>>> +				__func__, rc, hose->global_number);
>>> +			return -EIO;
>>>   		}
>>> -
>>> -		bus = eeh_pe_bus_get(pe);
>>> -		if (pci_is_root_bus(bus) ||
>>> -			pci_is_root_bus(bus->parent))
>>> -			ret = pnv_eeh_root_reset(hose, option);
>>> -		else
>>> -			ret = pnv_eeh_bridge_reset(bus->self, option);
>>>   	}
>>>
>>> -	return ret;
>>> +	/* Route the reset request to PHB or upstream bridge */
>>> +	bus = eeh_pe_bus_get(pe);
>>> +	if (pci_is_root_bus(bus))
>>> +		return pnv_eeh_phb_reset(hose, option);
>>> +
>>> +	return pnv_eeh_bridge_reset(bus->self, option);
>>>   }
>>>
>>>   /**
>>>
>
> Thanks,
> Gavin
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 10/21] powerpc/powernv: Fundamental reset for PCI bus reset
  2015-05-11  6:47       ` Gavin Shan
@ 2015-05-11  7:17         ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-11  7:17 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linuxppc-dev, linux-pci, benh, bhelgaas

On 05/11/2015 04:47 PM, Gavin Shan wrote:
> On Sun, May 10, 2015 at 12:12:18AM +1000, Alexey Kardashevskiy wrote:
>> On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>> Function pnv_pci_reset_secondary_bus() is used to reset specified
>>> PCI bus, which is leaded by root complex or PCI bridge. That means
>>> the function shouldn't be called on PCI root bus and the patch
>>> removes the logic for that case.
>>>
>>> Also, some adapters beneath the indicated PCI bus may require
>>> fundamental reset in order to successfully reload their firmwares
>>> after the reset. The patch translates hot reset to fundamental reset
>>> for that case.
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>>   arch/powerpc/platforms/powernv/eeh-powernv.c | 35 +++++++++++++++++++++-------
>>>   1 file changed, 26 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> index 3c01095..58e4dcf 100644
>>> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> @@ -888,18 +888,35 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>>   	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>>   }
>>>
>>> -void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>
>>
>> Why changing dev to pdev? Keeping "dev" could make the patch simpler.
>>
>
> In the early stage when I wrote the EEH code, I had "dev" to refer PCI
> device, which isn't precisely enough. Actually, "dev" means "struct device"
> while "pdev" stands for "struct pci_dev". That's why I changed it.


The rest of the file and the kernel overall use "dev" for pci_dev just 
fine. I would not bother.


>>> +static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
>>>   {
>>> -	struct pci_controller *hose;
>>> +	int *freset = data;
>>>
>>> -	if (pci_is_root_bus(dev->bus)) {
>>> -		hose = pci_bus_to_host(dev->bus);
>>> -		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
>>> -		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
>>> -	} else {
>>> -		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>>> -		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
>>> +	/*
>>> +	 * Stop the iteration immediately if there is any
>>> +	 * one PCI device requesting fundamental reset
>>> +	 */
>>> +	*freset |= pdev->needs_freset;
>>> +	return *freset;
>>> +}
>>> +
>>> +void pnv_pci_reset_secondary_bus(struct pci_dev *pdev)
>>> +{
>>> +	int option = EEH_RESET_HOT;
>>> +	int freset = 0;
>>> +
>>> +	/* Check if there're any PCI devices asking for fundamental reset */
>>> +	if (pdev->subordinate) {
>>> +		pci_walk_bus(pdev->subordinate,
>>> +			     pnv_pci_dev_reset_type,
>>> +			     &freset);
>>> +		if (freset)
>>> +			option = EEH_RESET_FUNDAMENTAL;
>>>   	}
>>> +
>>> +	/* Issue the requested type of reset */
>>> +	pnv_eeh_bridge_reset(pdev, option);
>>> +	pnv_eeh_bridge_reset(pdev, EEH_RESET_DEACTIVATE);
>>>   }
>>>
>>>   /**
>>>
>
> Thanks,
> Gavin
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 13/21] powerpc/powernv: Introduce pnv_pci_poll()
  2015-05-09 14:30     ` Alexey Kardashevskiy
@ 2015-05-11  7:19       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  7:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sun, May 10, 2015 at 12:30:07AM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:03 PM, Gavin Shan wrote:
>>We might not get some PCI slot information (e.g. power status)
>>immediately by OPAL API. Instead, opal_pci_poll() need to be called
>>for the required information.
>>
>>The patch introduces pnv_pci_poll(), which bases on original
>>pnv_eeh_poll(), to cover the above case
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 28 ++--------------------------
>>  arch/powerpc/platforms/powernv/pci.c         | 16 ++++++++++++++++
>>  arch/powerpc/platforms/powernv/pci.h         |  1 +
>>  3 files changed, 19 insertions(+), 26 deletions(-)
>>
>>diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>index 58e4dcf..9253b9e 100644
>>--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>@@ -742,24 +742,6 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
>>  	return ret;
>>  }
>>
>>-static s64 pnv_eeh_poll(uint64_t id)
>>-{
>>-	s64 rc = OPAL_HARDWARE;
>>-
>>-	while (1) {
>>-		rc = opal_pci_poll(id, NULL);
>>-		if (rc <= 0)
>>-			break;
>>-
>>-		if (system_state < SYSTEM_RUNNING)
>>-			udelay(1000 * rc);
>>-		else
>>-			msleep(rc);
>>-	}
>>-
>>-	return rc;
>>-}
>>-
>>  int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
>>  {
>>  	struct pnv_phb *phb = hose->private_data;
>>@@ -788,10 +770,7 @@ int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
>>
>>  	/* Issue reset and poll until it's completed */
>>  	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
>>-	if (rc > 0)
>>-		rc = pnv_eeh_poll(phb->opal_id);
>>-
>>-	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>+	return pnv_pci_poll(phb->opal_id, rc, NULL);
>
>
>You are carrying a negative value to the new helper too? Looks complicated.
>

Yes, a bit complicated :-)

>Also, before you only cared if opal_pci_reset() returned negative value, now
>you treat it as a timeout, is it new change to OPAL or it has always been
>there?
>

No, I didn't change the behaviour on the skiboot side. The function should
have the following return values:

0 - success; <0 - Error; >0 - Delay in ms;
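
A minimal userspace sketch of how a caller might act on that convention.
Everything here is an assumption for illustration: mock_fw, mock_poll() and
sketch_wait_reset() are hypothetical names, the mock stands in for the real
opal_pci_poll(), and the delay is only accumulated instead of slept.

```c
#include <stdint.h>
#include <errno.h>

/* Stand-in for OPAL_SUCCESS; the real value comes from the OPAL headers. */
#define SKETCH_OPAL_SUCCESS 0

/*
 * Hypothetical mock of the firmware: it answers "wait 10 ms" a fixed
 * number of times and then reports success.
 */
struct mock_fw {
	int pending;      /* remaining "not done yet" answers */
	int64_t total_ms; /* total delay the caller was asked to wait */
};

static int64_t mock_poll(struct mock_fw *fw)
{
	if (fw->pending > 0) {
		fw->pending--;
		return 10;
	}
	return SKETCH_OPAL_SUCCESS;
}

/*
 * Interpret the return value of a reset request:
 *    0 - finished successfully
 *   <0 - error
 *   >0 - wait that many milliseconds, then poll again
 */
static int sketch_wait_reset(struct mock_fw *fw, int64_t rc)
{
	while (rc > 0) {
		fw->total_ms += rc; /* the kernel would msleep()/udelay() here */
		rc = mock_poll(fw);
	}

	return rc == SKETCH_OPAL_SUCCESS ? 0 : -EIO;
}
```

With this shape a negative value falls straight through the loop and ends
up as -EIO, which is why the error case can be handed to the same helper.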

>
>>  }
>>
>>  static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>@@ -882,10 +861,7 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>  	phb = hose->private_data;
>>  	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
>>  	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
>>-	if (rc > 0)
>>-		rc = pnv_eeh_poll(id);
>>-
>>-	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>+	return pnv_pci_poll(id, rc, NULL);
>>  }
>>
>>  static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
>>diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>>index bca2aeb..a2da9a3 100644
>>--- a/arch/powerpc/platforms/powernv/pci.c
>>+++ b/arch/powerpc/platforms/powernv/pci.c
>>@@ -44,6 +44,22 @@
>>  #define cfg_dbg(fmt...)	do { } while(0)
>>  //#define cfg_dbg(fmt...)	printk(fmt)
>>
>>+int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval)
>>+{
>>+	while (rval > 0) {
>>+		if (system_state < SYSTEM_RUNNING)
>>+			udelay(1000 * rval);
>>+		else
>>+			msleep(rval);
>
>Are these delays the once removed by "PATCH v4 09/21] powerpc/powernv: Use
>PCI slot reset infrastructure"? If so, I would merge this patch into 09/24 or
>move this one before that one, for bisect'ability.
>

No, they are different. The delay here is the one requested by the skiboot firmware,
while the one in "PATCH v4 09/21" is imposed by the kernel itself.

>
>>+
>>+		rval = opal_pci_poll(id, pval);
>>+		if (rval == OPAL_SUCCESS && pval)
>>+			rval = opal_pci_poll(id, pval);
>
>Why calling it twice?
>

When retrieving a slot's presence or power status, we have to call it twice because
of the way opal_pci_poll() is implemented. The first OPAL_SUCCESS from opal_pci_poll()
indicates that the state machine behind the function has finished. The next call to
the same function will get the result.
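
The two-call pattern can be sketched in userspace as follows. This is only
an illustration under stated assumptions: mock_slot and mock_pci_poll() are
hypothetical and model the behaviour described above (one OPAL_SUCCESS to
signal completion, a further call to hand out the value), and
sketch_pci_poll() mirrors the loop in the patch without sleeping.

```c
#include <stdint.h>
#include <stddef.h>
#include <errno.h>

/* Stand-in for OPAL_SUCCESS; the real value comes from the OPAL headers. */
#define SKETCH_OPAL_SUCCESS 0

/* Hypothetical model of the firmware state machine for one slot. */
struct mock_slot {
	int steps_left; /* state-machine steps still to run */
	int finished;   /* set once completion has been reported */
	uint8_t result; /* value handed out after completion */
};

static int64_t mock_pci_poll(struct mock_slot *s, uint8_t *val)
{
	if (s->steps_left > 0) {
		s->steps_left--;
		return 5; /* ask the caller to wait 5 ms */
	}
	if (!s->finished) {
		s->finished = 1;
		return SKETCH_OPAL_SUCCESS; /* completion only, no value yet */
	}
	if (val)
		*val = s->result; /* the following call delivers the value */
	return SKETCH_OPAL_SUCCESS;
}

static int sketch_pci_poll(struct mock_slot *s, int64_t rval, uint8_t *pval)
{
	while (rval > 0) {
		/* the kernel would sleep for rval milliseconds here */
		rval = mock_pci_poll(s, pval);

		/* first success: state machine done; poll again for the value */
		if (rval == SKETCH_OPAL_SUCCESS && pval)
			rval = mock_pci_poll(s, pval);
	}

	return rval ? -EIO : 0;
}
```

When pval is NULL, as in the reset paths, the second call is skipped and
the first OPAL_SUCCESS ends the loop.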

>>+	}
>>+
>>+	return rval ? -EIO : 0;
>>+}
>>+
>>  #ifdef CONFIG_PCI_MSI
>>  static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
>>  {
>>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>index 8b10f01..82c5539 100644
>>--- a/arch/powerpc/platforms/powernv/pci.h
>>+++ b/arch/powerpc/platforms/powernv/pci.h
>>@@ -202,6 +202,7 @@ struct pnv_phb {
>>
>>  extern struct pci_ops pnv_pci_ops;
>>
>>+int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval);
>>  void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
>>  				unsigned char *log_buff);
>>  int pnv_pci_cfg_read(struct pci_dn *pdn,
>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 15/21] powerpc/pci: Delay creating pci_dn
  2015-05-09 14:55     ` Alexey Kardashevskiy
@ 2015-05-11  7:21       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  7:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sun, May 10, 2015 at 12:55:51AM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:03 PM, Gavin Shan wrote:
>>The pci_dn instances are allocated from memblock or bootmem when
>>creating PCI controllers (hoses) in setup_arch(). PCI hotplug,
>>which will be supported by subsequent patches, will release PCI
>>device nodes and their corresponding pci_dn on unplug events.
>>The pci_dn memory chunks allocated from memblock or bootmem
>>are hard to reuse after being released.
>>
>>The patch delays creating pci_dn so that they can be allocated from
>>slab. In turn, the memory chunks for them can be reused after being
>>released without problem. The creation of eeh_dev instances, which
>>depends on pci_dn, is delayed a bit as well.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/ppc-pci.h     |  1 -
>>  arch/powerpc/kernel/eeh_dev.c          |  2 +-
>>  arch/powerpc/kernel/pci_dn.c           | 40 +++++++++++++++++++---------------
>>  arch/powerpc/platforms/maple/pci.c     | 35 +++++++++++++++++------------
>>  arch/powerpc/platforms/pasemi/pci.c    |  3 ---
>>  arch/powerpc/platforms/powermac/pci.c  | 39 ++++++++++++++++++++-------------
>>  arch/powerpc/platforms/powernv/pci.c   |  3 ---
>>  arch/powerpc/platforms/pseries/setup.c |  1 -
>>  8 files changed, 68 insertions(+), 56 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
>>index 4122a86..7388316 100644
>>--- a/arch/powerpc/include/asm/ppc-pci.h
>>+++ b/arch/powerpc/include/asm/ppc-pci.h
>>@@ -40,7 +40,6 @@ void *traverse_pci_dn(struct pci_dn *root,
>>  		      void *(*fn)(struct pci_dn *, void *),
>>  		      void *data);
>>
>>-extern void pci_devs_phb_init(void);
>>  extern void pci_devs_phb_init_dynamic(struct pci_controller *phb);
>>
>>  /* From rtas_pci.h */
>>diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
>>index aabba94..f33ce5b 100644
>>--- a/arch/powerpc/kernel/eeh_dev.c
>>+++ b/arch/powerpc/kernel/eeh_dev.c
>>@@ -110,4 +110,4 @@ static int __init eeh_dev_phb_init(void)
>>  	return 0;
>>  }
>>
>>-core_initcall(eeh_dev_phb_init);
>>+core_initcall_sync(eeh_dev_phb_init);
>>diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
>>index b3b4df9..d3833af 100644
>>--- a/arch/powerpc/kernel/pci_dn.c
>>+++ b/arch/powerpc/kernel/pci_dn.c
>>@@ -277,7 +277,7 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
>>  	struct device_node *parent;
>>  	struct pci_dn *pdn;
>>
>>-	pdn = zalloc_maybe_bootmem(sizeof(*pdn), GFP_KERNEL);
>>+	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
>>  	if (pdn == NULL)
>>  		return NULL;
>>  	dn->data = pdn;
>>@@ -442,33 +442,37 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
>>  	traverse_pci_devices(dn, update_dn_pci_info, phb);
>>  }
>>
>>-/**
>>+static void pci_dev_pdn_setup(struct pci_dev *pdev)
>>+{
>>+	struct pci_dn *pdn;
>>+
>>+	if (pdev->dev.archdata.pci_data)
>>+		return;
>>+
>>+	/* Setup the fast path */
>>+	pdn = pci_get_pdn(pdev);
>>+	pdev->dev.archdata.pci_data = pdn;
>>+}
>>+DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
>
>
>How does moving of the chunk above help to "Delay creating pci_dn"?
>

It's already in "Delay creating pci_dn", isn't it? :-)

>
>>+
>>+/*
>>   * pci_devs_phb_init - Initialize phbs and pci devs under them.
>>- *
>>- * This routine walks over all phb's (pci-host bridges) on the
>>- * system, and sets up assorted pci-related structures
>>+ *
>>+ * This routine walks over all phb's (pci-host bridges) on
>>+ * the system, and sets up assorted pci-related structures
>>   * (including pci info in the device node structs) for each
>>   * pci device found underneath.  This routine runs once,
>>   * early in the boot sequence.
>>   */
>>-void __init pci_devs_phb_init(void)
>>+static int __init pci_devs_phb_init(void)
>>  {
>>  	struct pci_controller *phb, *tmp;
>>
>>  	/* This must be done first so the device nodes have valid pci info! */
>>  	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
>>  		pci_devs_phb_init_dynamic(phb);
>>-}
>>-
>>-static void pci_dev_pdn_setup(struct pci_dev *pdev)
>>-{
>>-	struct pci_dn *pdn;
>>
>>-	if (pdev->dev.archdata.pci_data)
>>-		return;
>>-
>>-	/* Setup the fast path */
>>-	pdn = pci_get_pdn(pdev);
>>-	pdev->dev.archdata.pci_data = pdn;
>>+	return 0;
>>  }
>>-DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
>>+
>>+core_initcall(pci_devs_phb_init);
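
For readers following the ordering here: core_initcall() entries run before
core_initcall_sync() entries, so pci_devs_phb_init() creates the pci_dn
instances before eeh_dev_phb_init() runs. The user-space sketch below only
models that ordering by sorting on a level string. struct model_initcall,
model_log and the model_*() functions are made up for this illustration and
are not how the kernel actually walks its initcall sections.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified model: level "1" (core_initcall) sorts before
 * level "1s" (core_initcall_sync).
 */
struct model_initcall {
	const char *level;
	int (*fn)(void);
};

static char model_log[64];

static int model_pci_devs_phb_init(void)
{
	strcat(model_log, "pci_dn;");	/* pci_dn instances get created */
	return 0;
}

static int model_eeh_dev_phb_init(void)
{
	strcat(model_log, "eeh_dev;");	/* depends on pci_dn being there */
	return 0;
}

static int model_cmp(const void *a, const void *b)
{
	const struct model_initcall *x = a, *y = b;

	return strcmp(x->level, y->level);
}

/* Run all registered calls in level order, whatever the registration order */
static void model_do_initcalls(struct model_initcall *calls, size_t n)
{
	size_t i;

	qsort(calls, n, sizeof(*calls), model_cmp);
	for (i = 0; i < n; i++)
		calls[i].fn();
}
```

Registering the EEH call first and the pci_dn call second still runs them in
the pci_dn, eeh_dev order in this model.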
>>diff --git a/arch/powerpc/platforms/maple/pci.c b/arch/powerpc/platforms/maple/pci.c
>>index a923230..04a69a8 100644
>>--- a/arch/powerpc/platforms/maple/pci.c
>>+++ b/arch/powerpc/platforms/maple/pci.c
>>@@ -568,6 +568,26 @@ void maple_pci_irq_fixup(struct pci_dev *dev)
>>  	DBG(" <- maple_pci_irq_fixup\n");
>>  }
>>
>>+static int maple_pci_root_bridge_prepare(struct pci_host_bridge *bridge)
>>+{
>>+	struct pci_controller *hose = pci_bus_to_host(bridge->bus);
>>+	struct device_node *np, *child;
>>+
>>+	if (hose != u3_agp)
>>+		return 0;
>>+
>>+	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
>>+	 * assume there is no P2P bridge on the AGP bus, which should be a
>>+	 * safe assumptions hopefully.
>>+	 */
>>+	np = hose->dn;
>>+	PCI_DN(np)->busno = 0xf0;
>>+	for_each_child_of_node(np, child)
>>+		PCI_DN(child)->busno = 0xf0;
>>+
>>+	return 0;
>>+}
>>+
>>  void __init maple_pci_init(void)
>>  {
>>  	struct device_node *np, *root;
>>@@ -605,20 +625,7 @@ void __init maple_pci_init(void)
>>  	if (ht && maple_add_bridge(ht) != 0)
>>  		of_node_put(ht);
>>
>>-	/* Setup the linkage between OF nodes and PHBs */
>>-	pci_devs_phb_init();
>>-
>>-	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
>>-	 * assume there is no P2P bridge on the AGP bus, which should be a
>>-	 * safe assumptions hopefully.
>>-	 */
>>-	if (u3_agp) {
>>-		struct device_node *np = u3_agp->dn;
>>-		PCI_DN(np)->busno = 0xf0;
>>-		for (np = np->child; np; np = np->sibling)
>>-			PCI_DN(np)->busno = 0xf0;
>>-	}
>>-
>>+	ppc_md.pcibios_root_bridge_prepare = maple_pci_root_bridge_prepare;
>>  	/* Tell pci.c to not change any resource allocations.  */
>>  	pci_add_flags(PCI_PROBE_ONLY);
>>  }
>>diff --git a/arch/powerpc/platforms/pasemi/pci.c b/arch/powerpc/platforms/pasemi/pci.c
>>index f3a68a0..10c4e8f 100644
>>--- a/arch/powerpc/platforms/pasemi/pci.c
>>+++ b/arch/powerpc/platforms/pasemi/pci.c
>>@@ -229,9 +229,6 @@ void __init pas_pci_init(void)
>>  			of_node_get(np);
>>
>>  	of_node_put(root);
>>-
>>-	/* Setup the linkage between OF nodes and PHBs */
>>-	pci_devs_phb_init();
>>  }
>>
>>  void __iomem *pasemi_pci_getcfgaddr(struct pci_dev *dev, int offset)
>>diff --git a/arch/powerpc/platforms/powermac/pci.c b/arch/powerpc/platforms/powermac/pci.c
>>index 59ab16f..368716f 100644
>>--- a/arch/powerpc/platforms/powermac/pci.c
>>+++ b/arch/powerpc/platforms/powermac/pci.c
>>@@ -878,6 +878,29 @@ void pmac_pci_irq_fixup(struct pci_dev *dev)
>>  #endif /* CONFIG_PPC32 */
>>  }
>>
>>+#ifdef CONFIG_PPC64
>>+static int pmac_pci_root_bridge_prepare(struct pci_host_bridge *bridge)
>>+{
>>+	struct pci_controller *hose = pci_bus_to_host(bridge->bus);
>>+	struct device_node *np, *child;
>>+
>>+	if (hose != u3_agp)
>>+		return 0;
>>+
>>+	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
>>+	 * assume there is no P2P bridge on the AGP bus, which should be a
>>+	 * safe assumptions for now. We should do something better in the
>>+	 * future though
>>+	 */
>>+	np = hose->dn;
>>+	PCI_DN(np)->busno = 0xf0;
>>+	for_each_child_of_node(np, child)
>>+		PCI_DN(child)->busno = 0xf0;
>>+
>>+	return 0;
>>+}
>>+#endif /* CONFIG_PPC64 */
>>+
>>  void __init pmac_pci_init(void)
>>  {
>>  	struct device_node *np, *root;
>>@@ -914,22 +937,8 @@ void __init pmac_pci_init(void)
>>  	if (ht && pmac_add_bridge(ht) != 0)
>>  		of_node_put(ht);
>>
>>-	/* Setup the linkage between OF nodes and PHBs */
>>-	pci_devs_phb_init();
>>-
>>-	/* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We
>>-	 * assume there is no P2P bridge on the AGP bus, which should be a
>>-	 * safe assumptions for now. We should do something better in the
>>-	 * future though
>>-	 */
>>-	if (u3_agp) {
>>-		struct device_node *np = u3_agp->dn;
>>-		PCI_DN(np)->busno = 0xf0;
>>-		for (np = np->child; np; np = np->sibling)
>>-			PCI_DN(np)->busno = 0xf0;
>>-	}
>>  	/* pmac_check_ht_link(); */
>>-
>>+	ppc_md.pcibios_root_bridge_prepare = pmac_pci_root_bridge_prepare;
>>  #else /* CONFIG_PPC64 */
>>  	init_p2pbridge();
>>  	init_second_ohare();
>>diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>>index 60e6d65..21a4eb3 100644
>>--- a/arch/powerpc/platforms/powernv/pci.c
>>+++ b/arch/powerpc/platforms/powernv/pci.c
>>@@ -819,9 +819,6 @@ void __init pnv_pci_init(void)
>>  	for_each_compatible_node(np, NULL, "ibm,ioda2-phb")
>>  		pnv_pci_init_ioda2_phb(np);
>>
>>-	/* Setup the linkage between OF nodes and PHBs */
>>-	pci_devs_phb_init();
>>-
>>  	/* Configure IOMMU DMA hooks */
>>  	ppc_md.tce_build = pnv_tce_build_vm;
>>  	ppc_md.tce_free = pnv_tce_free_vm;
>>diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>>index df6a704..5f80758 100644
>>--- a/arch/powerpc/platforms/pseries/setup.c
>>+++ b/arch/powerpc/platforms/pseries/setup.c
>>@@ -482,7 +482,6 @@ static void __init find_and_init_phbs(void)
>>  	}
>>
>>  	of_node_put(root);
>>-	pci_devs_phb_init();
>>
>>  	/*
>>  	 * PCI_PROBE_ONLY and PCI_REASSIGN_ALL_BUS can be set via properties
>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 16/21] powerpc/pci: Create eeh_dev while creating pci_dn
  2015-05-09 15:08     ` Alexey Kardashevskiy
@ 2015-05-11  7:24       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  7:24 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sun, May 10, 2015 at 01:08:28AM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:03 PM, Gavin Shan wrote:
>>The eeh_dev is always created based on pci_dn, but from a separate
>>initcall registered with core_initcall_sync(). The patch creates eeh_dev
>>when pci_dn is created, so that they share the same life cycle.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/eeh.h         |  6 ++++--
>>  arch/powerpc/kernel/eeh_dev.c          | 18 ++++--------------
>>  arch/powerpc/kernel/pci_dn.c           | 12 ++++++++++++
>>  arch/powerpc/platforms/pseries/setup.c |  6 +-----
>>  4 files changed, 21 insertions(+), 21 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index 2793d24..4ed88f6 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -269,7 +269,8 @@ void eeh_pe_restore_bars(struct eeh_pe *pe);
>>  const char *eeh_pe_loc_get(struct eeh_pe *pe);
>>  struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe);
>>
>>-void *eeh_dev_init(struct pci_dn *pdn, void *data);
>>+struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
>>+			     struct pci_controller *phb);
>
>
>Everywhere else (?) you name these pci_controller pointer variables "hose"
>but not in this patch.
>

Yeah, it's better to use "struct pci_controller *hose" here. In the PCI-related
code under platforms/powernv/*.c, we use "struct pci_controller *hose" alongside
"struct pnv_phb *phb".

>>  void eeh_dev_phb_init_dynamic(struct pci_controller *phb);
>>  int eeh_init(void);
>>  int __init eeh_ops_register(struct eeh_ops *ops);
>>@@ -322,7 +323,8 @@ static inline int eeh_init(void)
>>  	return 0;
>>  }
>>
>>-static inline void *eeh_dev_init(struct pci_dn *pdn, void *data)
>>+static inline struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
>>+					   struct pci_controller *phb)
>>  {
>>  	return NULL;
>>  }
>>diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
>>index f33ce5b..7486932 100644
>>--- a/arch/powerpc/kernel/eeh_dev.c
>>+++ b/arch/powerpc/kernel/eeh_dev.c
>>@@ -44,14 +44,14 @@
>>  /**
>>   * eeh_dev_init - Create EEH device according to OF node
>>   * @pdn: PCI device node
>>- * @data: PHB
>>+ * @phb: PCI controller
>>   *
>>   * It will create EEH device according to the given OF node. The function
>>   * might be called by PCI enumeration, DR, PHB hotplug.
>>   */
>>-void *eeh_dev_init(struct pci_dn *pdn, void *data)
>>+struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
>>+			     struct pci_controller *phb)
>>  {
>>-	struct pci_controller *phb = data;
>>  	struct eeh_dev *edev;
>>
>>  	/* Allocate EEH device */
>>@@ -68,7 +68,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
>>  	edev->phb = phb;
>>  	INIT_LIST_HEAD(&edev->list);
>>
>>-	return NULL;
>>+	return edev;
>>  }
>>
>>  /**
>>@@ -80,16 +80,8 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
>>   */
>>  void eeh_dev_phb_init_dynamic(struct pci_controller *phb)
>>  {
>>-	struct pci_dn *root = phb->pci_data;
>>-
>>  	/* EEH PE for PHB */
>>  	eeh_phb_pe_create(phb);
>>-
>>-	/* EEH device for PHB */
>>-	eeh_dev_init(root, phb);
>>-
>>-	/* EEH devices for children OF nodes */
>>-	traverse_pci_dn(root, eeh_dev_init, phb);
>>  }
>>
>>  /**
>>@@ -105,8 +97,6 @@ static int __init eeh_dev_phb_init(void)
>>  	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
>>  		eeh_dev_phb_init_dynamic(phb);
>>
>>-	pr_info("EEH: devices created\n");
>>-
>>  	return 0;
>>  }
>>
>>diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
>>index d3833af..abc81fa 100644
>>--- a/arch/powerpc/kernel/pci_dn.c
>>+++ b/arch/powerpc/kernel/pci_dn.c
>>@@ -276,6 +276,9 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
>>  	const __be32 *regs;
>>  	struct device_node *parent;
>>  	struct pci_dn *pdn;
>>+#ifdef CONFIG_EEH
>>+	struct eeh_dev *edev;
>>+#endif
>>
>>  	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
>>  	if (pdn == NULL)
>>@@ -306,6 +309,15 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
>>  	/* Extended config space */
>>  	pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
>>
>>+	/* Initialize EEH device */
>>+#ifdef CONFIG_EEH
>
>You do not need this #ifdef - you have a stub for eeh_dev_init() in
>arch/powerpc/include/asm/eeh.h
>
>
>>+	edev = eeh_dev_init(pdn, phb);
>>+	if (!edev) {
>
>
>s/!edev/!eeh_dev_init(pdn, phb)/ and get rid of the @edev local variable
>altogether - you do not use it anyway?
>
>

Yep, you're correct and I'll fix it up.
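
One thing worth checking when dropping the #ifdef: the !CONFIG_EEH stub of
eeh_dev_init() shown earlier in this patch returns NULL, so an unconditional
"if (!eeh_dev_init(pdn, phb))" would treat every call as a failure on kernels
built without EEH. The user-space sketch below models that case.
MODEL_CONFIG_EEH, model_eeh_dev_init() and the two model_update_*() helpers
are made-up names, the structs are cut down to the minimum, and the guarded
check is only one possible way to handle it, not necessarily what the next
revision will do.

```c
#include <stdlib.h>

/* Hypothetical knob: 0 models a kernel built without CONFIG_EEH */
#ifndef MODEL_CONFIG_EEH
#define MODEL_CONFIG_EEH 0
#endif

struct eeh_dev { int dummy; };
struct pci_dn { struct eeh_dev *edev; };

#if MODEL_CONFIG_EEH
static struct eeh_dev *model_eeh_dev_init(struct pci_dn *pdn)
{
	pdn->edev = calloc(1, sizeof(*pdn->edev));
	return pdn->edev;
}
#else
/* Mirrors the stub in the patch: always NULL when EEH is compiled out */
static struct eeh_dev *model_eeh_dev_init(struct pci_dn *pdn)
{
	(void)pdn;
	return NULL;
}
#endif

/* Naive form: no #ifdef, NULL always means failure */
static struct pci_dn *model_update_naive(void)
{
	struct pci_dn *pdn = calloc(1, sizeof(*pdn));

	if (!pdn)
		return NULL;

	if (!model_eeh_dev_init(pdn)) {
		free(pdn);
		return NULL;
	}

	return pdn;
}

/* Guarded form: NULL is only an error when EEH is built in */
static struct pci_dn *model_update_guarded(void)
{
	struct pci_dn *pdn = calloc(1, sizeof(*pdn));

	if (!pdn)
		return NULL;

	if (!model_eeh_dev_init(pdn) && MODEL_CONFIG_EEH) {
		free(pdn);
		return NULL;
	}

	return pdn;
}
```

With MODEL_CONFIG_EEH left at 0, the naive form never returns a pci_dn, while
the guarded form returns one without an eeh_dev attached.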

>>+		kfree(pdn);
>>+		return NULL;
>>+	}
>>+#endif
>>+
>>  	/* Attach to parent node */
>>  	INIT_LIST_HEAD(&pdn->child_list);
>>  	INIT_LIST_HEAD(&pdn->list);
>>diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>>index 5f80758..92974aa 100644
>>--- a/arch/powerpc/platforms/pseries/setup.c
>>+++ b/arch/powerpc/platforms/pseries/setup.c
>>@@ -261,12 +261,8 @@ static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long act
>>  	switch (action) {
>>  	case OF_RECONFIG_ATTACH_NODE:
>>  		pci = np->parent->data;
>>-		if (pci) {
>>+		if (pci)
>>  			update_dn_pci_info(np, pci->phb);
>>-
>>-			/* Create EEH device for the OF node */
>>-			eeh_dev_init(PCI_DN(np), pci->phb);
>>-		}
>>  		break;
>>  	default:
>>  		err = NOTIFY_DONE;
>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 16/21] powerpc/pci: Create eeh_dev while creating pci_dn
@ 2015-05-11  7:24       ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  7:24 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: bhelgaas, linux-pci, linuxppc-dev, Gavin Shan

On Sun, May 10, 2015 at 01:08:28AM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:03 PM, Gavin Shan wrote:
>>The eeh_dev is always created based on pci_dn, but from a separate
>>initcall registered with core_initcall_sync(). The patch creates eeh_dev
>>when pci_dn is created, so that they share the same life cycle.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  arch/powerpc/include/asm/eeh.h         |  6 ++++--
>>  arch/powerpc/kernel/eeh_dev.c          | 18 ++++--------------
>>  arch/powerpc/kernel/pci_dn.c           | 12 ++++++++++++
>>  arch/powerpc/platforms/pseries/setup.c |  6 +-----
>>  4 files changed, 21 insertions(+), 21 deletions(-)
>>
>>diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>>index 2793d24..4ed88f6 100644
>>--- a/arch/powerpc/include/asm/eeh.h
>>+++ b/arch/powerpc/include/asm/eeh.h
>>@@ -269,7 +269,8 @@ void eeh_pe_restore_bars(struct eeh_pe *pe);
>>  const char *eeh_pe_loc_get(struct eeh_pe *pe);
>>  struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe);
>>
>>-void *eeh_dev_init(struct pci_dn *pdn, void *data);
>>+struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
>>+			     struct pci_controller *phb);
>
>
>Everywhere else (?) you name these pci_controller pointer variables "hose"
>but not in this patch.
>

Yeah, better to have "struct pci_controller *hose" actually. For PCI related
code in platforms/powernv/*.c, we have "struct pci_controller *hose" and
"struct pnv_phb *phb".

>>  void eeh_dev_phb_init_dynamic(struct pci_controller *phb);
>>  int eeh_init(void);
>>  int __init eeh_ops_register(struct eeh_ops *ops);
>>@@ -322,7 +323,8 @@ static inline int eeh_init(void)
>>  	return 0;
>>  }
>>
>>-static inline void *eeh_dev_init(struct pci_dn *pdn, void *data)
>>+static inline struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
>>+					   struct pci_controller *phb)
>>  {
>>  	return NULL;
>>  }
>>diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
>>index f33ce5b..7486932 100644
>>--- a/arch/powerpc/kernel/eeh_dev.c
>>+++ b/arch/powerpc/kernel/eeh_dev.c
>>@@ -44,14 +44,14 @@
>>  /**
>>   * eeh_dev_init - Create EEH device according to OF node
>>   * @pdn: PCI device node
>>- * @data: PHB
>>+ * @phb: PCI controller
>>   *
>>   * It will create EEH device according to the given OF node. The function
>>   * might be called by PCI emunation, DR, PHB hotplug.
>>   */
>>-void *eeh_dev_init(struct pci_dn *pdn, void *data)
>>+struct eeh_dev *eeh_dev_init(struct pci_dn *pdn,
>>+			     struct pci_controller *phb)
>>  {
>>-	struct pci_controller *phb = data;
>>  	struct eeh_dev *edev;
>>
>>  	/* Allocate EEH device */
>>@@ -68,7 +68,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
>>  	edev->phb = phb;
>>  	INIT_LIST_HEAD(&edev->list);
>>
>>-	return NULL;
>>+	return edev;
>>  }
>>
>>  /**
>>@@ -80,16 +80,8 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
>>   */
>>  void eeh_dev_phb_init_dynamic(struct pci_controller *phb)
>>  {
>>-	struct pci_dn *root = phb->pci_data;
>>-
>>  	/* EEH PE for PHB */
>>  	eeh_phb_pe_create(phb);
>>-
>>-	/* EEH device for PHB */
>>-	eeh_dev_init(root, phb);
>>-
>>-	/* EEH devices for children OF nodes */
>>-	traverse_pci_dn(root, eeh_dev_init, phb);
>>  }
>>
>>  /**
>>@@ -105,8 +97,6 @@ static int __init eeh_dev_phb_init(void)
>>  	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
>>  		eeh_dev_phb_init_dynamic(phb);
>>
>>-	pr_info("EEH: devices created\n");
>>-
>>  	return 0;
>>  }
>>
>>diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
>>index d3833af..abc81fa 100644
>>--- a/arch/powerpc/kernel/pci_dn.c
>>+++ b/arch/powerpc/kernel/pci_dn.c
>>@@ -276,6 +276,9 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
>>  	const __be32 *regs;
>>  	struct device_node *parent;
>>  	struct pci_dn *pdn;
>>+#ifdef CONFIG_EEH
>>+	struct eeh_dev *edev;
>>+#endif
>>
>>  	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
>>  	if (pdn == NULL)
>>@@ -306,6 +309,15 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
>>  	/* Extended config space */
>>  	pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
>>
>>+	/* Initialize EEH device */
>>+#ifdef CONFIG_EEH
>
>You do not need this #ifdef - you have a stub for eeh_dev_init() in
>arch/powerpc/include/asm/eeh.h
>
>
>>+	edev = eeh_dev_init(pdn, phb);
>>+	if (!edev) {
>
>
>s/!edev/eeh_dev_init(pdn, phb)/ and get rid of @edev local variable at all -
>you do not use it anyway?
>
>

Yep, you're correct and I'll fix it up.
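
One caveat if the #ifdef goes away: the !CONFIG_EEH stub quoted earlier in
this patch returns NULL, so an unconditional "NULL means failure" check would
make every pci_dn allocation fail on kernels built without EEH. Below is a
minimal user-space sketch of one way to keep the check meaningful. The struct
layouts, pci_dn_create() and the EEH_ENABLED macro are hypothetical stand-ins
for illustration only; in the kernel, IS_ENABLED(CONFIG_EEH) could play the
role of EEH_ENABLED.

```c
#include <stdlib.h>

struct eeh_dev;

/* Hypothetical, heavily reduced stand-ins for the kernel structures */
struct pci_dn {
	int devfn;
	struct eeh_dev *edev;
};

struct eeh_dev {
	struct pci_dn *pdn;
};

#ifdef CONFIG_EEH
#define EEH_ENABLED 1
static struct eeh_dev *eeh_dev_init(struct pci_dn *pdn)
{
	struct eeh_dev *edev = calloc(1, sizeof(*edev));

	if (edev)
		edev->pdn = pdn;
	return edev;
}
#else
#define EEH_ENABLED 0
/* Stub: same shape as the one in the quoted header, always NULL */
static inline struct eeh_dev *eeh_dev_init(struct pci_dn *pdn)
{
	(void)pdn;
	return NULL;
}
#endif

static struct pci_dn *pci_dn_create(int devfn)
{
	struct pci_dn *pdn = calloc(1, sizeof(*pdn));

	if (!pdn)
		return NULL;
	pdn->devfn = devfn;

	/* NULL is only an error when EEH is actually built in */
	pdn->edev = eeh_dev_init(pdn);
	if (EEH_ENABLED && !pdn->edev) {
		free(pdn);
		return NULL;
	}

	return pdn;
}
```

With this shape the caller needs neither the #ifdef nor the local variable,
and a build without EEH still gets a valid pci_dn.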

>>+		kfree(pdn);
>>+		return NULL;
>>+	}
>>+#endif
>>+
>>  	/* Attach to parent node */
>>  	INIT_LIST_HEAD(&pdn->child_list);
>>  	INIT_LIST_HEAD(&pdn->list);
>>diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>>index 5f80758..92974aa 100644
>>--- a/arch/powerpc/platforms/pseries/setup.c
>>+++ b/arch/powerpc/platforms/pseries/setup.c
>>@@ -261,12 +261,8 @@ static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long act
>>  	switch (action) {
>>  	case OF_RECONFIG_ATTACH_NODE:
>>  		pci = np->parent->data;
>>-		if (pci) {
>>+		if (pci)
>>  			update_dn_pci_info(np, pci->phb);
>>-
>>-			/* Create EEH device for the OF node */
>>-			eeh_dev_init(PCI_DN(np), pci->phb);
>>-		}
>>  		break;
>>  	default:
>>  		err = NOTIFY_DONE;
>>

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 21/21] pci/hotplug: PowerPC PowerNV PCI hotplug driver
  2015-05-09 15:54     ` Alexey Kardashevskiy
@ 2015-05-11  7:38       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  7:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sun, May 10, 2015 at 01:54:31AM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:03 PM, Gavin Shan wrote:
>>The patch intends to add standalone driver to support PCI hotplug
>>for PowerPC PowerNV platform, which runs on top of skiboot firmware.
>>The firmware identified hotpluggable slots and marked their device
>>tree node with proper "ibm,slot-pluggable" and "ibm,reset-by-firmware".
>>The driver simply scans device-tree to create/register PCI hotplug slot
>>accordingly.
>>
>>If the skiboot firmware doesn't support slot status retrieval, the PCI
>>slot device node shouldn't have property "ibm,reset-by-firmware". In
>>that case, none of valid PCI slots will be detected from device tree.
>>The skiboot firmware doesn't export the capability to access attention
>>LEDs yet and it's something for TBD.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  drivers/pci/hotplug/Kconfig            |  12 +
>>  drivers/pci/hotplug/Makefile           |   4 +
>>  drivers/pci/hotplug/powernv_php.c      | 146 ++++++++
>>  drivers/pci/hotplug/powernv_php.h      |  78 ++++
>>  drivers/pci/hotplug/powernv_php_slot.c | 643 +++++++++++++++++++++++++++++++++
>>  5 files changed, 883 insertions(+)
>>  create mode 100644 drivers/pci/hotplug/powernv_php.c
>>  create mode 100644 drivers/pci/hotplug/powernv_php.h
>>  create mode 100644 drivers/pci/hotplug/powernv_php_slot.c
>>
>>diff --git a/drivers/pci/hotplug/Kconfig b/drivers/pci/hotplug/Kconfig
>>index df8caec..ef55dae 100644
>>--- a/drivers/pci/hotplug/Kconfig
>>+++ b/drivers/pci/hotplug/Kconfig
>>@@ -113,6 +113,18 @@ config HOTPLUG_PCI_SHPC
>>
>>  	  When in doubt, say N.
>>
>>+config HOTPLUG_PCI_POWERNV
>>+	tristate "PowerPC PowerNV PCI Hotplug driver"
>>+	depends on PPC_POWERNV && EEH
>>+	help
>>+	  Say Y here if you run PowerPC PowerNV platform that supports
>>+          PCI Hotplug
>>+
>>+	  To compile this driver as a module, choose M here: the
>>+	  module will be called powernv-php.
>>+
>>+	  When in doubt, say N.
>>+
>>  config HOTPLUG_PCI_RPA
>>  	tristate "RPA PCI Hotplug driver"
>>  	depends on PPC_PSERIES && EEH
>>diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
>>index 4a9aa08..a69665e 100644
>>--- a/drivers/pci/hotplug/Makefile
>>+++ b/drivers/pci/hotplug/Makefile
>>@@ -14,6 +14,7 @@ obj-$(CONFIG_HOTPLUG_PCI_PCIE)		+= pciehp.o
>>  obj-$(CONFIG_HOTPLUG_PCI_CPCI_ZT5550)	+= cpcihp_zt5550.o
>>  obj-$(CONFIG_HOTPLUG_PCI_CPCI_GENERIC)	+= cpcihp_generic.o
>>  obj-$(CONFIG_HOTPLUG_PCI_SHPC)		+= shpchp.o
>>+obj-$(CONFIG_HOTPLUG_PCI_POWERNV)	+= powernv-php.o
>>  obj-$(CONFIG_HOTPLUG_PCI_RPA)		+= rpaphp.o
>>  obj-$(CONFIG_HOTPLUG_PCI_RPA_DLPAR)	+= rpadlpar_io.o
>>  obj-$(CONFIG_HOTPLUG_PCI_SGI)		+= sgi_hotplug.o
>>@@ -50,6 +51,9 @@ ibmphp-objs		:=	ibmphp_core.o	\
>>  acpiphp-objs		:=	acpiphp_core.o	\
>>  				acpiphp_glue.o
>>
>>+powernv-php-objs	:=	powernv_php.o	\
>>+				powernv_php_slot.o
>>+
>>  rpaphp-objs		:=	rpaphp_core.o	\
>>  				rpaphp_pci.o	\
>>  				rpaphp_slot.o
>>diff --git a/drivers/pci/hotplug/powernv_php.c b/drivers/pci/hotplug/powernv_php.c
>>new file mode 100644
>>index 0000000..5cf9e717
>>--- /dev/null
>>+++ b/drivers/pci/hotplug/powernv_php.c
>>@@ -0,0 +1,146 @@
>>+/*
>>+ * PCI Hotplug Driver for PowerPC PowerNV platform.
>>+ *
>>+ * Copyright Gavin Shan, IBM Corporation 2015.
>>+ *
>>+ * This program is free software; you can redistribute it and/or modify
>>+ * it under the terms of the GNU General Public License as published by
>>+ * the Free Software Foundation; either version 2 of the License, or
>>+ * (at your option) any later version.
>>+ */
>>+
>>+#include <linux/kernel.h>
>>+#include <linux/module.h>
>>+#include <linux/sysfs.h>
>>+#include <linux/pci.h>
>>+#include <linux/pci_hotplug.h>
>>+#include <linux/string.h>
>>+#include <linux/slab.h>
>>+#include <asm/opal.h>
>>+#include <asm/pnv-pci.h>
>>+
>>+#include "powernv_php.h"
>
>Compiles without linux/kernel.h, linux/sysfs.h, linux/string.h, linux/slab.h.
>Sure you need all of these?
>

Thanks, I'll check them one by one.

>>+
>>+#define DRIVER_VERSION	"0.1"
>>+#define DRIVER_AUTHOR	"Gavin Shan, IBM Corporation"
>>+#define DRIVER_DESC	"PowerPC PowerNV PCI Hotplug Driver"
>>+
>>+static struct notifier_block php_msg_nb = {
>>+	.notifier_call	= powernv_php_msg_handler,
>>+	.next		= NULL,
>>+	.priority	= 0,
>>+};
>>+
>>+static int powernv_php_register_one(struct device_node *dn)
>>+{
>>+	struct powernv_php_slot *slot;
>>+	const __be32 *prop32;
>>+	int ret;
>>+
>>+	/* Check if it's hotpluggable slot */
>>+	prop32 = of_get_property(dn, "ibm,slot-pluggable", NULL);
>>+	if (!prop32 || !of_read_number(prop32, 1))
>>+		return 0;
>
>Although nobody checks the return code, this should be -ENXIO or something
>other than zero. And the check below too.
>

Yeah, it makes sense to me.
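
A point to keep in mind with that change: powernv_php_register() breaks out
of its loop on any non-zero return, so once "not a hotpluggable slot" becomes
-ENXIO the caller has to treat that value as "skip this node" rather than as
a failure. A rough user-space sketch of that shape follows; struct node and
both helpers are hypothetical stand-ins, not the driver's code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device-tree node */
struct node {
	bool slot_pluggable;	/* models "ibm,slot-pluggable" */
	bool reset_by_firmware;	/* models "ibm,reset-by-firmware" */
	struct node *child;	/* first child */
	struct node *sibling;	/* next sibling */
	bool registered;
};

static int register_one(struct node *n)
{
	/* Not a hotpluggable slot: report it instead of returning 0 */
	if (!n->slot_pluggable || !n->reset_by_firmware)
		return -ENXIO;

	n->registered = true;
	return 0;
}

static int register_tree(struct node *parent)
{
	struct node *child;
	int ret;

	for (child = parent->child; child; child = child->sibling) {
		ret = register_one(child);
		/* -ENXIO only means "skip"; anything else is a real error */
		if (ret && ret != -ENXIO)
			return ret;

		/* Parents are handled before their children */
		ret = register_tree(child);
		if (ret)
			return ret;
	}

	return 0;
}
```

Unlike the posted patch, this sketch also propagates errors from the
recursive call; that is an assumption about the desired behaviour.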

>>+
>>+	prop32 = of_get_property(dn, "ibm,reset-by-firmware", NULL);
>>+	if (!prop32 || !of_read_number(prop32, 1))
>>+		return 0;
>>+
>>+	/* Allocate slot */
>>+	slot = powernv_php_slot_alloc(dn);
>>+	if (!slot)
>>+		return -ENODEV;
>>+
>>+	/* Register it */
>>+	ret = powernv_php_slot_register(slot);
>>+	if (ret) {
>>+		powernv_php_slot_put(slot);
>>+		return ret;
>>+	}
>>+
>>+	return powernv_php_slot_enable(slot->php_slot, false, false);
>>+}
>>+
>>+int powernv_php_register(struct device_node *dn)
>>+{
>>+	struct device_node *child;
>>+	int ret = 0;
>>+
>>+	/*
>>+	 * The parent slots should be registered before their
>>+	 * child slots.
>>+	 */
>>+	for_each_child_of_node(dn, child) {
>>+		ret = powernv_php_register_one(child);
>>+		if (ret)
>>+			break;
>>+
>>+		powernv_php_register(child);
>>+	}
>>+
>>+	return ret;
>>+}
>>+
>>+static void powernv_php_unregister_one(struct device_node *dn)
>>+{
>>+	struct powernv_php_slot *slot;
>>+
>>+	slot = powernv_php_slot_find(dn);
>>+	if (!slot)
>>+		return;
>>+
>>+	pci_hp_deregister(slot->php_slot);
>>+}
>>+
>>+void powernv_php_unregister(struct device_node *dn)
>>+{
>>+	struct device_node *child;
>>+
>>+	/* The child slots should go before their parent slots */
>>+	for_each_child_of_node(dn, child) {
>>+		powernv_php_unregister(child);
>>+		powernv_php_unregister_one(child);
>>+	}
>>+}
>>+
>>+static int __init powernv_php_init(void)
>>+{
>>+	struct device_node *dn;
>>+
>>+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
>>+
>>+	/* Register hotplug message handler */
>>+	if (pnv_pci_hotplug_notifier(&php_msg_nb, true)) {
>
>If you called the function "pnv_pci_hotplug_notifier_register", you would not
>need the comment above.
>

pnv_pci_hotplug_notifier() takes a second argument that indicates whether the
notifier is being registered or unregistered. So you're expecting something
like this?

	pnv_pci_hotplug_notifier_register(notifier, true or false);
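
Another possible reading of the suggestion is a pair of thin wrappers, so
that the call site names the operation and the boolean disappears. The
user-space sketch below models that with a simple singly linked chain;
struct notifier, hotplug_chain and all three functions are hypothetical and
not the kernel's notifier API.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified notifier: a callback plus a chain link */
struct notifier {
	int (*call)(unsigned long type, void *msg);
	struct notifier *next;
};

static struct notifier *hotplug_chain;

/* Existing style: one entry point, a flag selects the operation */
static int hotplug_notifier(struct notifier *nb, bool enable)
{
	struct notifier **pp;

	for (pp = &hotplug_chain; *pp; pp = &(*pp)->next) {
		if (*pp != nb)
			continue;
		if (enable)
			return -EEXIST;
		/* Unlink the notifier from the chain */
		*pp = nb->next;
		nb->next = NULL;
		return 0;
	}

	if (!enable)
		return -ENOENT;

	/* pp now points at the terminating NULL link: append there */
	nb->next = NULL;
	*pp = nb;
	return 0;
}

/* Suggested style: the name says what happens */
static inline int hotplug_notifier_register(struct notifier *nb)
{
	return hotplug_notifier(nb, true);
}

static inline int hotplug_notifier_unregister(struct notifier *nb)
{
	return hotplug_notifier(nb, false);
}
```

With wrappers like these, the "Register hotplug message handler" comment in
powernv_php_init() would become redundant.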

>
>>+		pr_warn("%s: Cannot register hotplug message notifier\n",
>>+			__func__);
>>+		return -EIO;
>>+	}
>>+
>>+	/* Scan PHB nodes and their children */
>>+	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
>>+		powernv_php_register(dn);
>>+	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
>>+		powernv_php_register(dn);
>>+
>>+	return 0;
>>+}
>>+
>>+static void __exit powernv_php_exit(void)
>>+{
>>+	struct device_node *dn;
>>+
>>+	pnv_pci_hotplug_notifier(&php_msg_nb, false);
>>+
>>+	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
>>+		powernv_php_unregister(dn);
>>+	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
>>+		powernv_php_unregister(dn);
>>+}
>>+
>>+module_init(powernv_php_init);
>>+module_exit(powernv_php_exit);
>>+
>>+MODULE_VERSION(DRIVER_VERSION);
>>+MODULE_LICENSE("GPL v2");
>>+MODULE_AUTHOR(DRIVER_AUTHOR);
>>+MODULE_DESCRIPTION(DRIVER_DESC);
>>diff --git a/drivers/pci/hotplug/powernv_php.h b/drivers/pci/hotplug/powernv_php.h
>>new file mode 100644
>>index 0000000..87ba0d0
>>--- /dev/null
>>+++ b/drivers/pci/hotplug/powernv_php.h
>>@@ -0,0 +1,78 @@
>>+/*
>>+ * PCI Hotplug Driver for PowerPC PowerNV platform.
>>+ *
>>+ * Copyright Gavin Shan, IBM Corporation 2015.
>>+ *
>>+ * This program is free software; you can redistribute it and/or modify
>>+ * it under the terms of the GNU General Public License as published by
>>+ * the Free Software Foundation; either version 2 of the License, or
>>+ * (at your option) any later version.
>>+ */
>>+
>>+#ifndef _POWERNV_PHP_H
>>+#define _POWERNV_PHP_H
>
>I would put these (and dependencies if any) here:
>
>#include <linux/kref.h>
>#include <linux/pci.h>
>#include <linux/pci_hotplug.h>
>
>and remove them from .c files.
>

Yeah, I'll check and do that.

>>+
>>+/* Slot power status */
>>+#define POWERNV_PHP_SLOT_POWER_OFF	0
>>+#define POWERNV_PHP_SLOT_POWER_ON	1
>>+
>>+/* Slot presence status */
>>+#define POWERNV_PHP_SLOT_EMPTY		0
>>+#define POWERNV_PHP_SLOT_PRESENT	1
>>+
>>+/* Slot attention status */
>>+#define POWERNV_PHP_SLOT_ATTEN_OFF	0
>>+#define POWERNV_PHP_SLOT_ATTEN_ON	1
>>+#define POWERNV_PHP_SLOT_ATTEN_IND	2
>>+#define POWERNV_PHP_SLOT_ATTEN_ACT	3
>>+
>>+struct powernv_php_slot {
>>+	struct kref		kref;
>>+	int			state;
>>+#define POWERNV_PHP_SLOT_STATE_INIT		0x0
>>+#define POWERNV_PHP_SLOT_STATE_REGISTER		0x1
>>+#define POWERNV_PHP_SLOT_STATE_POPULATED	0x2
>
>I believe these are not bitmasks but bit numbers, right? Decimal values are
>normally used for them.
>

Ok. Will change accordingly.

>>+	char			*name;
>>+	struct device_node	*dn;
>>+	struct pci_bus		*bus;
>>+	uint64_t		id;
>>+	int			slot_no;
>>+	int			check_power_status;
>>+	int			status_confirmed;
>>+	struct opal_msg		*msg;
>>+	struct work_struct	work;
>>+	wait_queue_head_t	queue;
>>+	struct hotplug_slot	*php_slot;
>>+	struct powernv_php_slot	*parent;
>>+	void (*release)(struct kref *kref);
>
>What is the point in this? Just use php_slot_free() directly in
>powernv_php_slot_put, no?
>

Yes, I can if you insist. But the current code isn't wrong :-)
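
For comparison, here is a small user-space sketch of the variant without the
per-slot function pointer, where the put helper always passes the same
release function. struct ref and its helpers are a hypothetical, non-atomic
model of a reference count and not the kernel's kref implementation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, non-atomic stand-in for a reference count */
struct ref {
	int count;
};

static void ref_init(struct ref *r)
{
	r->count = 1;
}

static void ref_get(struct ref *r)
{
	r->count++;
}

/* Returns 1 if the object was released, 0 otherwise */
static int ref_put(struct ref *r, void (*release)(struct ref *))
{
	if (--r->count == 0) {
		release(r);
		return 1;
	}
	return 0;
}

struct slot {
	struct ref ref;
	char *name;
};

static void slot_free(struct ref *r)
{
	/* Recover the containing slot from its embedded ref */
	struct slot *s = (struct slot *)((char *)r - offsetof(struct slot, ref));

	free(s->name);
	free(s);
}

/* No ->release member needed: the release function is fixed here */
static int slot_put(struct slot *s)
{
	return s ? ref_put(&s->ref, slot_free) : 0;
}
```

The trade-off is that the free routine must then be visible wherever the put
helper is defined, instead of staying private to one .c file.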

>
>>+	struct list_head	children;
>>+	struct list_head	link;
>>+};
>>+
>>+#define to_powernv_php_slot(kref) container_of(kref, struct powernv_php_slot, kref)
>>+
>>+static inline void powernv_php_slot_get(struct powernv_php_slot *slot)
>>+{
>>+	if (slot)
>>+		kref_get(&slot->kref);
>>+}
>>+
>>+static inline int powernv_php_slot_put(struct powernv_php_slot *slot)
>>+{
>>+	if (slot)
>>+		return kref_put(&slot->kref, slot->release);
>>+
>>+	return 0;
>>+}
>>+
>>+int powernv_php_msg_handler(struct notifier_block *nb,
>>+			    unsigned long type, void *message);
>>+struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn);
>>+struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn);
>>+int powernv_php_slot_register(struct powernv_php_slot *slot);
>>+int powernv_php_slot_enable(struct hotplug_slot *php_slot,
>>+			    bool rescan_bus, bool rescan_slot);
>
>Just an observation - rescan_bus and rescan_slot are both true or both false
>and never different. And the only caller requesting rescan is in the same
>file as powernv_php_slot_enable() and it could do this rescan if
>powernv_php_slot_enable() could signal that rescan is needed (return 1?).
>
>And no "goto" in powernv_php_slot_enable would be needed. Do not insist though.
>

Thanks for your careful review. I'll change the code accordingly.
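
A rough user-space sketch of the suggested shape: the enable routine returns
1 when a rescan is needed, 0 when there is nothing more to do and a negative
errno on failure, and the only caller that wants a rescan performs it itself.
The power handling is left out, and struct slot, the constants and the
rescans counter are hypothetical stand-ins for the real bus and slot rescan.

```c
#include <errno.h>

enum { SLOT_EMPTY = 0, SLOT_PRESENT = 1 };
enum { STATE_INIT, STATE_REGISTER, STATE_POPULATED };

/* Hypothetical, reduced slot: just enough state for the control flow */
struct slot {
	int state;
	int presence;
	int rescans;	/* counts rescans instead of touching a PCI bus */
};

/* Returns 1 if the caller should rescan, 0 if not, <0 on error */
static int slot_enable(struct slot *s)
{
	/* Only a registered, not yet populated slot is enabled */
	if (s->state != STATE_REGISTER)
		return 0;

	if (s->presence != SLOT_EMPTY && s->presence != SLOT_PRESENT)
		return -EINVAL;

	s->state = STATE_POPULATED;
	return s->presence == SLOT_PRESENT;
}

/* The hotplug "enable" callback is the only user that rescans */
static int enable_slot(struct slot *s)
{
	int ret = slot_enable(s);

	if (ret > 0) {
		s->rescans++;	/* stands in for bus and child-slot rescan */
		ret = 0;
	}

	return ret;
}
```

The registration path would call slot_enable() directly and ignore a return
value of 1, which matches the current (false, false) call.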

>>+int powernv_php_register(struct device_node *dn);
>>+void powernv_php_unregister(struct device_node *dn);
>>+
>>+#endif /* !_POWERNV_PHP_H */
>>diff --git a/drivers/pci/hotplug/powernv_php_slot.c b/drivers/pci/hotplug/powernv_php_slot.c
>>new file mode 100644
>>index 0000000..fc82355
>>--- /dev/null
>>+++ b/drivers/pci/hotplug/powernv_php_slot.c
>>@@ -0,0 +1,643 @@
>>+/*
>>+ * PCI Hotplug Driver for PowerPC PowerNV platform.
>>+ *
>>+ * Copyright Gavin Shan, IBM Corporation 2015.
>>+ *
>>+ * This program is free software; you can redistribute it and/or modify
>>+ * it under the terms of the GNU General Public License as published by
>>+ * the Free Software Foundation; either version 2 of the License, or
>>+ * (at your option) any later version.
>>+ */
>>+
>>+#include <linux/kernel.h>
>>+#include <linux/module.h>
>>+#include <linux/sysfs.h>
>>+#include <linux/pci.h>
>>+#include <linux/pci_hotplug.h>
>>+#include <linux/string.h>
>>+#include <linux/slab.h>
>>+#include <linux/spinlock.h>
>>+#include <linux/wait.h>
>>+#include <linux/workqueue.h>
>>+
>>+#include <asm/opal.h>
>>+#include <asm/pnv-pci.h>
>>+#include <asm/ppc-pci.h>
>>+
>>+#include "powernv_php.h"
>
>I have a suspicion you won't need all these headers here too ;)
>

Most likely. I'll check.

>>+
>>+static LIST_HEAD(php_slot_list);
>>+static DEFINE_SPINLOCK(php_slot_lock);
>>+
>>+/*
>>+ * Release firmware data for all child device nodes of the
>>+ * indicated one.
>>+ */
>>+static void release_device_nodes_info(struct device_node *np)
>>+{
>>+	struct device_node *child;
>>+
>>+	for_each_child_of_node(np, child) {
>>+		/* In depth first */
>>+		release_device_nodes_info(child);
>
>Why is this "release", not "remove" (as this is what it does - calling
>remove_lalala in a loop)?
>

The device node is expected to be freed. "release" is equivalent to "free",
while "remove" doesn't necessarily mean "free". "remove" possibly means we're
removing something from the system or a list while the memory it occupies
is still there.

>>+
>>+		remove_pci_device_node_info(child);
>>+	}
>>+}
>>+
>>+/*
>>+ * Release all subordinate device nodes of the indicated one.
>>+ * Those device nodes in deepest path should be released firstly.
>>+ */
>>+static int release_device_nodes(struct device_node *parent)
>>+{
>>+	struct device_node *np, *child;
>>+	int ret = 0;
>>+
>>+	/* If the device node has children, remove them firstly */
>>+	for_each_child_of_node(parent, np) {
>>+		ret = release_device_nodes(np);
>>+		if (ret)
>>+			return ret;
>>+
>>+		/* The device shouldn't have alive children */
>>+		child = of_get_next_child(np, NULL);
>>+		if (child) {
>>+			of_node_put(child);
>>+			of_node_put(np);
>>+			pr_err("%s: Alive children of node <%s>\n",
>>+			       __func__, of_node_full_name(np));
>>+			return -EBUSY;
>>+		}
>>+
>>+		/* Detach the device node */
>>+		of_detach_node(np);
>>+		of_node_put(np);
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+/*
>>+ * The function processes the message sent by firmware
>>+ * to remove all device tree nodes beneath the slot's
>>+ * nodes, and the associated auxillary data.
>>+ */
>>+static void slot_power_off_handler(struct powernv_php_slot *slot)
>>+{
>>+	int ret;
>>+
>>+	/* Release the firmware data for the child device nodes */
>>+	release_device_nodes_info(slot->dn);
>>+
>>+	/* Release the child device nodes */
>>+	ret = release_device_nodes(slot->dn);
>>+	if (ret)
>>+		pr_warn("%s: Error %d releasing children of <%s>\n",
>>+			__func__, ret, of_node_full_name(slot->dn));
>>+
>>+	/* Confirm status change */
>>+	slot->status_confirmed = 1;
>>+	wake_up_interruptible(&slot->queue);
>>+}
>>+
>>+static void slot_power_on_handler(struct powernv_php_slot *slot)
>>+{
>>+	struct opal_msg *msg = slot->msg;
>>+	unsigned long phys = be64_to_cpu(msg->params[2]);
>>+	unsigned long len = be64_to_cpu(msg->params[3]);
>>+	void *blob = (phys && len > 0) ? __va(phys) : NULL;
>>+
>>+	/* There might have nothing behind the slot yet */
>>+	if (!blob || !len)
>
>"!len" is redundant here - blob will be NULL if len<=0.
>

Sure, I'll remove it, but it's not harmful.

>>+		goto out;
>>+
>>+	/* Copy the FDT blob and parse it */
>>+	of_fdt_add_subtree(slot->dn, blob);
>>+
>>+	/* Add device node firmware data */
>>+	traverse_pci_device_nodes(slot->dn,
>>+				  add_pci_device_node_info,
>>+				  pci_bus_to_host(slot->bus));
>>+
>>+out:
>>+	/* Confirm status change */
>>+	slot->status_confirmed = 1;
>>+	wake_up_interruptible(&slot->queue);
>>+}
>>+
>>+static void powernv_php_slot_work(struct work_struct *data)
>>+{
>>+	struct powernv_php_slot *slot = container_of(data,
>>+						     struct powernv_php_slot,
>>+						     work);
>>+	uint64_t php_event = be64_to_cpu(slot->msg->params[0]);
>>+
>>+	switch (php_event) {
>>+	case 0: /* Slot power off */
>>+		slot_power_off_handler(slot);
>>+		break;
>>+	case 1: /* Slot power on */
>>+		slot_power_on_handler(slot);
>>+		break;
>>+	default:
>>+		pr_warn("%s: Unsupported hotplug event %lld\n",
>>+			__func__, php_event);
>>+	}
>>+
>>+	of_node_put(slot->dn);
>>+}
>>+
>>+int powernv_php_msg_handler(struct notifier_block *nb,
>>+			    unsigned long type, void *message)
>>+{
>>+	phandle h;
>>+	struct device_node *np;
>>+	struct powernv_php_slot *slot;
>>+	struct opal_msg *msg = message;
>>+
>>+	/* Check the message type */
>>+	if (type != OPAL_MSG_PCI_HOTPLUG) {
>>+		pr_warn("%s: Wrong message type %ld received!\n",
>>+			__func__, type);
>>+		return 0;
>>+	}
>>+
>>+	/* Find the device node */
>>+	h = (phandle)be64_to_cpu(msg->params[1]);
>>+	np = of_find_node_by_phandle(h);
>>+	if (!np) {
>>+		pr_warn("%s: No device node for phandle 0x%08x\n",
>>+			__func__, h);
>>+		return 0;
>>+	}
>>+
>>+	/* Find the slot */
>>+	slot = powernv_php_slot_find(np);
>>+	if (!slot) {
>>+		pr_warn("%s: No slot found for node <%s>\n",
>>+			__func__, of_node_full_name(np));
>>+		of_node_put(np);
>>+		return 0;
>>+	}
>>+
>>+	/* Schedule the work */
>>+	slot->msg = msg;
>>+	schedule_work(&slot->work);
>>+	return 0;
>>+}
>>+
>>+static int set_power_status(struct hotplug_slot *php_slot, u8 val)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	int ret;
>>+
>>+	/* Set power status */
>>+	slot->status_confirmed = 0;
>>+	ret = pnv_pci_set_power_status(slot->id, val);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d powering %s slot %016llx\n",
>>+			__func__, ret, val ? "on" : "off", slot->id);
>>+		return ret;
>>+	}
>>+
>>+	/* Waiting until the device tree is updated */
>>+	ret = wait_event_timeout(slot->queue,
>>+				 !slot->status_confirmed,
>>+				 10 * HZ);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d completing power-%s slot %016llx\n",
>>+			__func__, ret, val ? "on" : "off", slot->id);
>>+		return ret;
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static int get_power_status(struct hotplug_slot *php_slot, u8 *val)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	uint8_t state;
>>+	int ret;
>>+
>>+	/*
>>+	 * Retrieve power status from firmware. If we fail
>>+	 * getting that, the power status fails back to
>>+	 * be on.
>>+	 */
>>+	ret = pnv_pci_get_power_status(slot->id, &state);
>>+	if (ret) {
>>+		*val = POWERNV_PHP_SLOT_POWER_ON;
>>+		pr_warn("%s: Error %d getting power status of slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+	} else {
>>+		*val = state ? POWERNV_PHP_SLOT_POWER_ON :
>>+			       POWERNV_PHP_SLOT_POWER_OFF;
>>+		php_slot->info->power_status = *val;
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static int get_adapter_status(struct hotplug_slot *php_slot, u8 *val)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	uint8_t state;
>>+	int ret;
>>+
>>+	/*
>>+	 * Retrieve presence status from firmware. If we can't
>>+	 * get that, it will fail back to be empty.
>>+	 */
>>+	ret = pnv_pci_get_presence_status(slot->id, &state);
>>+	if (ret >= 0) {
>>+                *val = state ? POWERNV_PHP_SLOT_PRESENT :
>>+                               POWERNV_PHP_SLOT_EMPTY;
>>+                php_slot->info->adapter_status = *val;
>
>ret = 0;
>
>
>>+	} else {
>>+		*val = POWERNV_PHP_SLOT_EMPTY;
>>+		pr_warn("%s: Error %d getting presence of slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+	}
>>+
>>+	return ret < 0 ? ret : 0;
>
>
>return ret;
>

Ok. I'll fix it up.

>>+}
>>+
>>+static int set_attention_status(struct hotplug_slot *php_slot, u8 val)
>>+{
>>+	/* The default operation would to turn on the attention */
>>+	switch (val) {
>>+	case POWERNV_PHP_SLOT_ATTEN_OFF:
>>+	case POWERNV_PHP_SLOT_ATTEN_ON:
>>+	case POWERNV_PHP_SLOT_ATTEN_IND:
>>+	case POWERNV_PHP_SLOT_ATTEN_ACT:
>>+		break;
>>+	default:
>>+		val = POWERNV_PHP_SLOT_ATTEN_ON;
>
>Is not @val a garbage in this case?
>

No, the kernel accepts every value that's valid. Anything else is changed
to POWERNV_PHP_SLOT_ATTEN_ON.

>>+	}
>>+
>>+	/* FIXME: Make it real once firmware supports it */
>>+	php_slot->info->attention_status = val;
>>+
>>+	return 0;
>>+}
>>+
>>+int powernv_php_slot_enable(struct hotplug_slot *php_slot,
>>+			    bool rescan_bus, bool rescan_slot)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	uint8_t presence, power_status;
>>+	int ret;
>>+
>>+	/* Check if the slot has been configured */
>>+	if (slot->state != POWERNV_PHP_SLOT_STATE_REGISTER)
>>+		return 0;
>>+
>>+	/* Retrieve slot presence status */
>>+	ret = php_slot->ops->get_adapter_status(php_slot, &presence);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d getting presence of slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+		return ret;
>>+	}
>>+
>>+	/* Proceed if there have nothing behind the slot */
>>+	if (presence == POWERNV_PHP_SLOT_EMPTY)
>>+		goto scan;
>>+
>>+	/*
>>+	 * If we don't detect something behind the slot, we need
>>+	 * make sure the power suply to the slot is on. Otherwise,
>>+	 * the slot downstream PCIe linkturn should be down.
>>+	 *
>>+	 * On the first time, we don't change the power status to
>>+	 * boost system boot with assumption that the firmware
>>+	 * supplies consistent slot power status: empty slot always
>>+	 * has its power off and non-empty slot has its power on.
>>+	 */
>>+	if (!slot->check_power_status) {
>>+		slot->check_power_status = 1;
>>+		goto scan;
>>+	}
>>+
>>+	/* Check the power status. Scan the slot if that's already on */
>>+	ret = php_slot->ops->get_power_status(php_slot, &power_status);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d getting power status of slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+		return ret;
>>+	}
>>+	if (power_status == POWERNV_PHP_SLOT_POWER_ON)
>>+		goto scan;
>>+
>>+	/* Power is off, turn it on and then scan the slot */
>>+	ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_ON);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d powering on slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+		return ret;
>>+	}
>>+
>>+scan:
>>+	switch (presence) {
>>+	case POWERNV_PHP_SLOT_PRESENT:
>>+		if (rescan_bus) {
>>+			pci_lock_rescan_remove();
>>+			pcibios_add_pci_devices(slot->bus);
>>+			pci_unlock_rescan_remove();
>>+		}
>>+
>>+		/* Rescan for child hotpluggable slots */
>>+		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
>>+		if (rescan_slot)
>>+			powernv_php_register(slot->dn);
>>+		break;
>>+	case POWERNV_PHP_SLOT_EMPTY:
>>+		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
>>+		break;
>>+	default:
>>+		pr_warn("%s: Invalid presence status %d of slot %016llx\n",
>>+			__func__, presence, slot->id);
>>+		return -EINVAL;
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static int enable_slot(struct hotplug_slot *php_slot)
>>+{
>>+	return powernv_php_slot_enable(php_slot, true, true);
>>+}
>>+
>>+static int disable_slot(struct hotplug_slot *php_slot)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	uint8_t power_status;
>>+	int ret;
>>+
>>+	if (slot->state != POWERNV_PHP_SLOT_STATE_POPULATED)
>>+		return 0;
>>+
>>+	/* Remove all devices behind the slot */
>>+	pci_lock_rescan_remove();
>>+	pcibios_remove_pci_devices(slot->bus);
>>+	pci_unlock_rescan_remove();
>>+
>>+	/* Detach the child hotpluggable slots */
>>+	powernv_php_unregister(slot->dn);
>>+
>>+	/*
>>+	 * Check the power status and turn it off if necessary. If we
>>+	 * fail to get the power status, the power will be forced to
>>+	 * be off.
>>+	 */
>>+	ret = php_slot->ops->get_power_status(php_slot, &power_status);
>>+	if (ret || power_status == POWERNV_PHP_SLOT_POWER_ON) {
>>+		ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_OFF);
>>+		if (ret)
>>+			pr_warn("%s: Error %d powering off slot %016llx\n",
>>+				__func__, ret, slot->id);
>>+	}
>>+
>>+	/* Update slot state */
>>+	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
>>+	return 0;
>>+}
>>+
>>+static struct hotplug_slot_ops php_slot_ops = {
>>+	.get_power_status	= get_power_status,
>>+	.get_adapter_status	= get_adapter_status,
>>+	.set_attention_status	= set_attention_status,
>>+	.enable_slot		= enable_slot,
>>+	.disable_slot		= disable_slot,
>>+};
>>+
>>+static struct powernv_php_slot *php_slot_match(struct device_node *dn,
>>+					       struct powernv_php_slot *slot)
>>+{
>>+	struct powernv_php_slot *target, *tmp;
>>+
>>+	if (slot->dn == dn)
>>+		return slot;
>>+
>>+	list_for_each_entry(tmp, &slot->children, link) {
>>+		target = php_slot_match(dn, tmp);
>>+		if (target)
>>+			return target;
>>+	}
>>+
>>+	return NULL;
>>+}
>>+
>>+struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn)
>>+{
>>+	struct powernv_php_slot *slot, *tmp;
>>+	unsigned long flags;
>>+
>>+	spin_lock_irqsave(&php_slot_lock, flags);
>>+	list_for_each_entry(tmp, &php_slot_list, link) {
>>+		slot = php_slot_match(dn, tmp);
>>+		if (slot) {
>>+			spin_unlock_irqrestore(&php_slot_lock, flags);
>>+			return slot;
>>+		}
>>+	}
>>+	spin_unlock_irqrestore(&php_slot_lock, flags);
>>+
>>+	return NULL;
>>+}
>>+
>>+static void php_slot_free(struct kref *kref)
>>+{
>>+	struct powernv_php_slot *slot = to_powernv_php_slot(kref);
>>+
>>+	WARN_ON(!list_empty(&slot->children));
>>+	kfree(slot->name);
>>+	kfree(slot);
>>+}
>>+
>>+static void php_slot_release(struct hotplug_slot *hp_slot)
>>+{
>>+	struct powernv_php_slot *slot = hp_slot->private;
>>+	unsigned long flags;
>>+
>>+	/* Remove from global or child list */
>>+	spin_lock_irqsave(&php_slot_lock, flags);
>>+	list_del(&slot->link);
>>+	spin_unlock_irqrestore(&php_slot_lock, flags);
>>+
>>+	/* Detach from parent */
>>+	powernv_php_slot_put(slot);
>>+	powernv_php_slot_put(slot->parent);
>>+}
>>+
>>+static bool php_slot_get_id(struct device_node *dn,
>>+			    uint64_t *id)
>>+{
>>+	struct device_node *parent = dn;
>>+	const __be64 *prop64;
>>+	const __be32 *prop32;
>>+
>>+	/*
>>+	 * The hotpluggable slot always has a compound Id, which
>>+	 * consists of 16-bits PHB Id, 16 bits bus/slot/function
>>+	 * number, and compound indicator
>>+	 */
>>+	*id = (0x1ul << 63);
>>+
>>+	/* Bus/Slot/Function number */
>>+	prop32 = of_get_property(dn, "reg", NULL);
>>+	if (!prop32)
>>+		return false;
>>+	*id |= ((of_read_number(prop32, 1) & 0x00ffff00) << 8);
>>+
>>+	/* PHB Id */
>>+	while ((parent = of_get_parent(parent))) {
>>+		if (!PCI_DN(parent)) {
>>+			of_node_put(parent);
>>+			break;
>>+		}
>>+
>>+		if (!of_device_is_compatible(parent, "ibm,ioda2-phb") &&
>>+		    !of_device_is_compatible(parent, "ibm,ioda-phb")) {
>>+			of_node_put(parent);
>>+			continue;
>>+		}
>>+
>>+		prop64 = of_get_property(parent, "ibm,opal-phbid", NULL);
>>+		if (!prop64) {
>>+			of_node_put(parent);
>>+			return false;
>>+		}
>>+
>>+		*id |= be64_to_cpup(prop64);
>>+		of_node_put(parent);
>>+		return true;
>>+	}
>>+
>>+        return false;
>>+}
>>+
>>+struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn)
>>+{
>>+	struct pci_bus *bus;
>>+	struct powernv_php_slot *slot;
>>+	const char *label;
>>+	uint64_t id;
>>+	int slot_no;
>>+	size_t size;
>>+	void *pmem;
>>+
>>+	/* Slot name */
>>+	label = of_get_property(dn, "ibm,slot-label", NULL);
>>+	if (!label)
>>+		return NULL;
>>+
>>+	/* Slot indentifier */
>>+	if (!php_slot_get_id(dn, &id))
>>+		return NULL;
>>+
>>+	/* PCI bus */
>>+	bus = pcibios_find_pci_bus(dn);
>>+	if (!bus)
>>+		return NULL;
>>+
>>+	/* Slot number */
>>+	if (dn->child && PCI_DN(dn->child))
>>+		slot_no = PCI_SLOT(PCI_DN(dn->child)->devfn);
>>+	else
>>+		slot_no = -1;
>>+
>>+	/* Allocate slot */
>>+	size = sizeof(struct powernv_php_slot) +
>>+	       sizeof(struct hotplug_slot) +
>>+	       sizeof(struct hotplug_slot_info);
>>+	pmem = kzalloc(size, GFP_KERNEL);
>>+	if (!pmem) {
>>+		pr_warn("%s: Cannot allocate slot for node %s\n",
>>+			__func__, dn->full_name);
>>+		return NULL;
>>+	}
>>+
>>+	/* Assign memory blocks */
>>+	slot = pmem;
>>+	slot->php_slot = pmem + sizeof(struct powernv_php_slot);
>>+	slot->php_slot->info = pmem + sizeof(struct powernv_php_slot) +
>>+			      sizeof(struct hotplug_slot);
>>+	slot->name = kstrdup(label, GFP_KERNEL);
>>+	if (!slot->name) {
>>+		pr_warn("%s: Cannot populate name for node %s\n",
>>+			__func__, dn->full_name);
>>+		kfree(pmem);
>>+		return NULL;
>>+	}
>>+
>>+	/* Initialize slot */
>>+	kref_init(&slot->kref);
>>+	slot->state = POWERNV_PHP_SLOT_STATE_INIT;
>>+	slot->dn = dn;
>>+	slot->bus = bus;
>>+	slot->id = id;
>>+	slot->slot_no = slot_no;
>>+	INIT_WORK(&slot->work, powernv_php_slot_work);
>>+	init_waitqueue_head(&slot->queue);
>>+	slot->check_power_status = 0;
>>+	slot->status_confirmed = 0;
>>+	slot->release = php_slot_free;
>>+	slot->php_slot->ops = &php_slot_ops;
>>+	slot->php_slot->release = php_slot_release;
>>+	slot->php_slot->private = slot;
>>+	INIT_LIST_HEAD(&slot->children);
>>+	INIT_LIST_HEAD(&slot->link);
>>+
>>+	return slot;
>>+}
>>+
>>+int powernv_php_slot_register(struct powernv_php_slot *slot)
>>+{
>>+	struct powernv_php_slot *parent;
>>+	struct device_node *dn = slot->dn;
>>+	unsigned long flags;
>>+	int ret;
>>+
>>+	/* Avoid registering the same slot twice */
>>+	if (powernv_php_slot_find(slot->dn))
>>+		return -EEXIST;
>>+
>>+	/* Register slot */
>>+	ret = pci_hp_register(slot->php_slot, slot->bus,
>>+			      slot->slot_no, slot->name);
>>+	if (ret) {
>>+		pr_warn("%s: Cannot register slot %s (%d)\n",
>>+			__func__, slot->name, ret);
>>+		return ret;
>>+	}
>>+
>>+	/* Put into global or parent list */
>>+	while ((dn = of_get_parent(dn))) {
>>+		if (!PCI_DN(dn)) {
>>+			of_node_put(dn);
>>+			break;
>>+		}
>>+
>>+		parent = powernv_php_slot_find(dn);
>>+		if (parent) {
>>+			of_node_put(dn);
>>+			break;
>>+		}
>>+	}
>>+
>>+	spin_lock_irqsave(&php_slot_lock, flags);
>>+	if (parent) {
>>+		powernv_php_slot_get(parent);
>>+		slot->parent = parent;
>>+		list_add_tail(&slot->link, &parent->children);
>>+	} else {
>>+		list_add_tail(&slot->link, &php_slot_list);
>>+	}
>>+	spin_unlock_irqrestore(&php_slot_lock, flags);
>>+
>>+	/* Update slot state */
>>+	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
>>+	return 0;
>>+}
>>

Thanks,
Gavin


* Re: [PATCH v4 21/21] pci/hotplug: PowerPC PowerNV PCI hotplug driver
@ 2015-05-11  7:38       ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  7:38 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: bhelgaas, linux-pci, linuxppc-dev, Gavin Shan

On Sun, May 10, 2015 at 01:54:31AM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:03 PM, Gavin Shan wrote:
>>The patch intends to add standalone driver to support PCI hotplug
>>for PowerPC PowerNV platform, which runs on top of skiboot firmware.
>>The firmware identified hotpluggable slots and marked their device
>>tree node with proper "ibm,slot-pluggable" and "ibm,reset-by-firmware".
>>The driver simply scans device-tree to create/register PCI hotplug slot
>>accordingly.
>>
>>If the skiboot firmware doesn't support slot status retrieval, the PCI
>>slot device node shouldn't have property "ibm,reset-by-firmware". In
>>that case, none of valid PCI slots will be detected from device tree.
>>The skiboot firmware doesn't export the capability to access attention
>>LEDs yet and it's something for TBD.
>>
>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>---
>>  drivers/pci/hotplug/Kconfig            |  12 +
>>  drivers/pci/hotplug/Makefile           |   4 +
>>  drivers/pci/hotplug/powernv_php.c      | 146 ++++++++
>>  drivers/pci/hotplug/powernv_php.h      |  78 ++++
>>  drivers/pci/hotplug/powernv_php_slot.c | 643 +++++++++++++++++++++++++++++++++
>>  5 files changed, 883 insertions(+)
>>  create mode 100644 drivers/pci/hotplug/powernv_php.c
>>  create mode 100644 drivers/pci/hotplug/powernv_php.h
>>  create mode 100644 drivers/pci/hotplug/powernv_php_slot.c
>>
>>diff --git a/drivers/pci/hotplug/Kconfig b/drivers/pci/hotplug/Kconfig
>>index df8caec..ef55dae 100644
>>--- a/drivers/pci/hotplug/Kconfig
>>+++ b/drivers/pci/hotplug/Kconfig
>>@@ -113,6 +113,18 @@ config HOTPLUG_PCI_SHPC
>>
>>  	  When in doubt, say N.
>>
>>+config HOTPLUG_PCI_POWERNV
>>+	tristate "PowerPC PowerNV PCI Hotplug driver"
>>+	depends on PPC_POWERNV && EEH
>>+	help
>>+	  Say Y here if you run a PowerPC PowerNV platform that supports
>>+	  PCI Hotplug.
>>+
>>+	  To compile this driver as a module, choose M here: the
>>+	  module will be called powernv-php.
>>+
>>+	  When in doubt, say N.
>>+
>>  config HOTPLUG_PCI_RPA
>>  	tristate "RPA PCI Hotplug driver"
>>  	depends on PPC_PSERIES && EEH
>>diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
>>index 4a9aa08..a69665e 100644
>>--- a/drivers/pci/hotplug/Makefile
>>+++ b/drivers/pci/hotplug/Makefile
>>@@ -14,6 +14,7 @@ obj-$(CONFIG_HOTPLUG_PCI_PCIE)		+= pciehp.o
>>  obj-$(CONFIG_HOTPLUG_PCI_CPCI_ZT5550)	+= cpcihp_zt5550.o
>>  obj-$(CONFIG_HOTPLUG_PCI_CPCI_GENERIC)	+= cpcihp_generic.o
>>  obj-$(CONFIG_HOTPLUG_PCI_SHPC)		+= shpchp.o
>>+obj-$(CONFIG_HOTPLUG_PCI_POWERNV)	+= powernv-php.o
>>  obj-$(CONFIG_HOTPLUG_PCI_RPA)		+= rpaphp.o
>>  obj-$(CONFIG_HOTPLUG_PCI_RPA_DLPAR)	+= rpadlpar_io.o
>>  obj-$(CONFIG_HOTPLUG_PCI_SGI)		+= sgi_hotplug.o
>>@@ -50,6 +51,9 @@ ibmphp-objs		:=	ibmphp_core.o	\
>>  acpiphp-objs		:=	acpiphp_core.o	\
>>  				acpiphp_glue.o
>>
>>+powernv-php-objs	:=	powernv_php.o	\
>>+				powernv_php_slot.o
>>+
>>  rpaphp-objs		:=	rpaphp_core.o	\
>>  				rpaphp_pci.o	\
>>  				rpaphp_slot.o
>>diff --git a/drivers/pci/hotplug/powernv_php.c b/drivers/pci/hotplug/powernv_php.c
>>new file mode 100644
>>index 0000000..5cf9e717
>>--- /dev/null
>>+++ b/drivers/pci/hotplug/powernv_php.c
>>@@ -0,0 +1,146 @@
>>+/*
>>+ * PCI Hotplug Driver for PowerPC PowerNV platform.
>>+ *
>>+ * Copyright Gavin Shan, IBM Corporation 2015.
>>+ *
>>+ * This program is free software; you can redistribute it and/or modify
>>+ * it under the terms of the GNU General Public License as published by
>>+ * the Free Software Foundation; either version 2 of the License, or
>>+ * (at your option) any later version.
>>+ */
>>+
>>+#include <linux/kernel.h>
>>+#include <linux/module.h>
>>+#include <linux/sysfs.h>
>>+#include <linux/pci.h>
>>+#include <linux/pci_hotplug.h>
>>+#include <linux/string.h>
>>+#include <linux/slab.h>
>>+#include <asm/opal.h>
>>+#include <asm/pnv-pci.h>
>>+
>>+#include "powernv_php.h"
>
>Compiles without linux/kernel.h, linux/sysfs.h, linux/string.h, linux/slab.h.
>Sure you need all of these?
>

Thanks, I'll check them one by one.

>>+
>>+#define DRIVER_VERSION	"0.1"
>>+#define DRIVER_AUTHOR	"Gavin Shan, IBM Corporation"
>>+#define DRIVER_DESC	"PowerPC PowerNV PCI Hotplug Driver"
>>+
>>+static struct notifier_block php_msg_nb = {
>>+	.notifier_call	= powernv_php_msg_handler,
>>+	.next		= NULL,
>>+	.priority	= 0,
>>+};
>>+
>>+static int powernv_php_register_one(struct device_node *dn)
>>+{
>>+	struct powernv_php_slot *slot;
>>+	const __be32 *prop32;
>>+	int ret;
>>+
>>+	/* Check if it's hotpluggable slot */
>>+	prop32 = of_get_property(dn, "ibm,slot-pluggable", NULL);
>>+	if (!prop32 || !of_read_number(prop32, 1))
>>+		return 0;
>
>Although nobody checks the return code, this should be -ENXIO or something
>other than zero. The same goes for the check below.
>

Yeah, it makes sense to me.
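
One thing to watch with a non-zero return: powernv_php_register() breaks out
of its sibling loop on the first error from powernv_php_register_one(), so the
walker would have to treat "not a hotpluggable slot" as a skip, not a failure.
A minimal userspace sketch of that control flow follows. It assumes
hypothetical stand-ins (struct node, register_one(), register_tree()) in place
of the device-tree types and driver functions, and models only the return-code
handling:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device-tree node; not struct device_node. */
struct node {
	int pluggable;          /* models "ibm,slot-pluggable" */
	int reset_by_firmware;  /* models "ibm,reset-by-firmware" */
	int registered;         /* set once the slot has been registered */
	struct node *child;     /* first child */
	struct node *sibling;   /* next sibling */
};

/* Return -ENXIO for nodes that are not hotpluggable slots, 0 on success. */
static int register_one(struct node *n)
{
	if (!n->pluggable || !n->reset_by_firmware)
		return -ENXIO;

	n->registered = 1;
	return 0;
}

/* Walk the tree parent-first; -ENXIO means "skip this node", not "abort". */
static int register_tree(struct node *parent)
{
	struct node *n;
	int ret;

	for (n = parent->child; n; n = n->sibling) {
		ret = register_one(n);
		if (ret && ret != -ENXIO)
			return ret;

		/* Children are visited even when the node itself was skipped */
		ret = register_tree(n);
		if (ret)
			return ret;
	}

	return 0;
}
```

With this shape a non-pluggable bridge no longer hides the hotpluggable slots
behind it or next to it.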

>>+
>>+	prop32 = of_get_property(dn, "ibm,reset-by-firmware", NULL);
>>+	if (!prop32 || !of_read_number(prop32, 1))
>>+		return 0;
>>+
>>+	/* Allocate slot */
>>+	slot = powernv_php_slot_alloc(dn);
>>+	if (!slot)
>>+		return -ENODEV;
>>+
>>+	/* Register it */
>>+	ret = powernv_php_slot_register(slot);
>>+	if (ret) {
>>+		powernv_php_slot_put(slot);
>>+		return ret;
>>+	}
>>+
>>+	return powernv_php_slot_enable(slot->php_slot, false, false);
>>+}
>>+
>>+int powernv_php_register(struct device_node *dn)
>>+{
>>+	struct device_node *child;
>>+	int ret = 0;
>>+
>>+	/*
>>+	 * The parent slots should be registered before their
>>+	 * child slots.
>>+	 */
>>+	for_each_child_of_node(dn, child) {
>>+		ret = powernv_php_register_one(child);
>>+		if (ret)
>>+			break;
>>+
>>+		powernv_php_register(child);
>>+	}
>>+
>>+	return ret;
>>+}
>>+
>>+static void powernv_php_unregister_one(struct device_node *dn)
>>+{
>>+	struct powernv_php_slot *slot;
>>+
>>+	slot = powernv_php_slot_find(dn);
>>+	if (!slot)
>>+		return;
>>+
>>+	pci_hp_deregister(slot->php_slot);
>>+}
>>+
>>+void powernv_php_unregister(struct device_node *dn)
>>+{
>>+	struct device_node *child;
>>+
>>+	/* The child slots should go before their parent slots */
>>+	for_each_child_of_node(dn, child) {
>>+		powernv_php_unregister(child);
>>+		powernv_php_unregister_one(child);
>>+	}
>>+}
>>+
>>+static int __init powernv_php_init(void)
>>+{
>>+	struct device_node *dn;
>>+
>>+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
>>+
>>+	/* Register hotplug message handler */
>>+	if (pnv_pci_hotplug_notifier(&php_msg_nb, true)) {
>
>If you called the function "pnv_pci_hotplug_notifier_register", you would not
>need the comment above.
>

pnv_pci_hotplug_notifier() takes a second argument that indicates whether the
notifier is being registered or unregistered. So you're expecting something like this?

	pnv_pci_hotplug_notifier_register(notifier, true or false);
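
The other reading of the suggestion is a pair of single-purpose wrappers, so
that the call site names the operation and carries no bool. This is a sketch
only: hotplug_notifier() is a hypothetical userspace stand-in for
pnv_pci_hotplug_notifier(), and the _register/_unregister names are
assumptions, not an existing API:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical notifier type; the kernel uses struct notifier_block. */
struct notifier {
	int (*call)(unsigned long type, void *msg);
};

/* Single registration slot, enough to model the register/unregister pair */
static struct notifier *registered;

/* Stand-in for the bool-parameter function discussed above */
static int hotplug_notifier(struct notifier *nb, bool enable)
{
	if (enable) {
		if (registered)
			return -EBUSY;
		registered = nb;
		return 0;
	}

	if (registered != nb)
		return -ENOENT;
	registered = NULL;
	return 0;
}

/* Self-describing wrappers: no comment needed at the call site */
static int hotplug_notifier_register(struct notifier *nb)
{
	return hotplug_notifier(nb, true);
}

static int hotplug_notifier_unregister(struct notifier *nb)
{
	return hotplug_notifier(nb, false);
}
```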

>
>>+		pr_warn("%s: Cannot register hotplug message notifier\n",
>>+			__func__);
>>+		return -EIO;
>>+	}
>>+
>>+	/* Scan PHB nodes and their children */
>>+	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
>>+		powernv_php_register(dn);
>>+	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
>>+		powernv_php_register(dn);
>>+
>>+	return 0;
>>+}
>>+
>>+static void __exit powernv_php_exit(void)
>>+{
>>+	struct device_node *dn;
>>+
>>+	pnv_pci_hotplug_notifier(&php_msg_nb, false);
>>+
>>+	for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
>>+		powernv_php_unregister(dn);
>>+	for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
>>+		powernv_php_unregister(dn);
>>+}
>>+
>>+module_init(powernv_php_init);
>>+module_exit(powernv_php_exit);
>>+
>>+MODULE_VERSION(DRIVER_VERSION);
>>+MODULE_LICENSE("GPL v2");
>>+MODULE_AUTHOR(DRIVER_AUTHOR);
>>+MODULE_DESCRIPTION(DRIVER_DESC);
>>diff --git a/drivers/pci/hotplug/powernv_php.h b/drivers/pci/hotplug/powernv_php.h
>>new file mode 100644
>>index 0000000..87ba0d0
>>--- /dev/null
>>+++ b/drivers/pci/hotplug/powernv_php.h
>>@@ -0,0 +1,78 @@
>>+/*
>>+ * PCI Hotplug Driver for PowerPC PowerNV platform.
>>+ *
>>+ * Copyright Gavin Shan, IBM Corporation 2015.
>>+ *
>>+ * This program is free software; you can redistribute it and/or modify
>>+ * it under the terms of the GNU General Public License as published by
>>+ * the Free Software Foundation; either version 2 of the License, or
>>+ * (at your option) any later version.
>>+ */
>>+
>>+#ifndef _POWERNV_PHP_H
>>+#define _POWERNV_PHP_H
>
>I would put these (and dependencies if any) here:
>
>#include <linux/kref.h>
>#include <linux/pci.h>
>#include <linux/pci_hotplug.h>
>
>and remove them from .c files.
>

Yeah, will check and do.

>>+
>>+/* Slot power status */
>>+#define POWERNV_PHP_SLOT_POWER_OFF	0
>>+#define POWERNV_PHP_SLOT_POWER_ON	1
>>+
>>+/* Slot presence status */
>>+#define POWERNV_PHP_SLOT_EMPTY		0
>>+#define POWERNV_PHP_SLOT_PRESENT	1
>>+
>>+/* Slot attention status */
>>+#define POWERNV_PHP_SLOT_ATTEN_OFF	0
>>+#define POWERNV_PHP_SLOT_ATTEN_ON	1
>>+#define POWERNV_PHP_SLOT_ATTEN_IND	2
>>+#define POWERNV_PHP_SLOT_ATTEN_ACT	3
>>+
>>+struct powernv_php_slot {
>>+	struct kref		kref;
>>+	int			state;
>>+#define POWERNV_PHP_SLOT_STATE_INIT		0x0
>>+#define POWERNV_PHP_SLOT_STATE_REGISTER		0x1
>>+#define POWERNV_PHP_SLOT_STATE_POPULATED	0x2
>
>I believe these are not bitmasks but bit numbers, right? Decimal values are
>normally used for them.
>

Ok. Will change accordingly.
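
Since the driver only ever compares slot->state with == and !=, the states are
mutually exclusive values and not flags. One way to express that is an enum
with decimal values. The names below are shortened, hypothetical versions of
the POWERNV_PHP_SLOT_STATE_* macros, and php_slot_state_name() is added purely
for illustration:

```c
/* Mutually exclusive slot states; decimal because they are not bitmasks. */
enum php_slot_state {
	PHP_SLOT_STATE_INIT      = 0,
	PHP_SLOT_STATE_REGISTER  = 1,
	PHP_SLOT_STATE_POPULATED = 2,
};

/* Map a state to a printable name, e.g. for pr_warn()-style messages. */
static const char *php_slot_state_name(enum php_slot_state state)
{
	switch (state) {
	case PHP_SLOT_STATE_INIT:
		return "init";
	case PHP_SLOT_STATE_REGISTER:
		return "registered";
	case PHP_SLOT_STATE_POPULATED:
		return "populated";
	}

	return "unknown";
}
```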

>>+	char			*name;
>>+	struct device_node	*dn;
>>+	struct pci_bus		*bus;
>>+	uint64_t		id;
>>+	int			slot_no;
>>+	int			check_power_status;
>>+	int			status_confirmed;
>>+	struct opal_msg		*msg;
>>+	struct work_struct	work;
>>+	wait_queue_head_t	queue;
>>+	struct hotplug_slot	*php_slot;
>>+	struct powernv_php_slot	*parent;
>>+	void (*release)(struct kref *kref);
>
>What is the point in this? Just use php_slot_free() directly in
>powernv_php_slot_put, no?
>

Yes, I can if you insist. But the current code isn't wrong :-)
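
For reference, this is roughly what dropping the per-slot release pointer
looks like: the put helper names the free routine directly. The sketch is
plain userspace C with a non-atomic counter in place of struct kref, and
struct slot, slot_alloc(), slot_get(), slot_put() and slot_free() are all
hypothetical names:

```c
#include <stdlib.h>

/* Hypothetical slot object; refcount stands in for struct kref. */
struct slot {
	int refcount;
	char *name;
};

/* Counts frees so the behaviour can be observed; demonstration only. */
static int slots_freed;

static void slot_free(struct slot *slot)
{
	free(slot->name);
	free(slot);
	slots_freed++;
}

/* A new slot starts with one reference, as kref_init() would set up. */
static struct slot *slot_alloc(void)
{
	struct slot *slot = calloc(1, sizeof(*slot));

	if (slot)
		slot->refcount = 1;
	return slot;
}

static void slot_get(struct slot *slot)
{
	if (slot)
		slot->refcount++;
}

/* Returns 1 if the slot was freed, 0 otherwise (same as kref_put()). */
static int slot_put(struct slot *slot)
{
	if (!slot)
		return 0;

	if (--slot->refcount > 0)
		return 0;

	/* The release routine is called directly; no function pointer stored */
	slot_free(slot);
	return 1;
}
```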

>
>>+	struct list_head	children;
>>+	struct list_head	link;
>>+};
>>+
>>+#define to_powernv_php_slot(kref) container_of(kref, struct powernv_php_slot, kref)
>>+
>>+static inline void powernv_php_slot_get(struct powernv_php_slot *slot)
>>+{
>>+	if (slot)
>>+		kref_get(&slot->kref);
>>+}
>>+
>>+static inline int powernv_php_slot_put(struct powernv_php_slot *slot)
>>+{
>>+	if (slot)
>>+		return kref_put(&slot->kref, slot->release);
>>+
>>+	return 0;
>>+}
>>+
>>+int powernv_php_msg_handler(struct notifier_block *nb,
>>+			    unsigned long type, void *message);
>>+struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn);
>>+struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn);
>>+int powernv_php_slot_register(struct powernv_php_slot *slot);
>>+int powernv_php_slot_enable(struct hotplug_slot *php_slot,
>>+			    bool rescan_bus, bool rescan_slot);
>
>Just an observation - rescan_bus and rescan_slot are both true or both false
>and never different. And the only caller requesting rescan is in the same
>file as powernv_php_slot_enable() and it could do this rescan if
>powernv_php_slot_enable() could signal that rescan is needed (return 1?).
>
>And no "goto" in powernv_php_slot_enable would be needed. Do not insist though.
>

Thanks for your careful review. I'll change the code accordingly.
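
A rough sketch of the suggested shape, where the enable routine reports
whether a rescan is needed and the one caller that cares acts on it. It
assumes hypothetical userspace stand-ins (struct slot, slot_enable(),
enable_slot()); the rescans counter stands in for the bus rescan and the
child-slot registration:

```c
#include <errno.h>

enum { SLOT_EMPTY = 0, SLOT_PRESENT = 1 };

/* Hypothetical slot model; only the fields the sketch needs. */
struct slot {
	int presence;   /* SLOT_EMPTY or SLOT_PRESENT */
	int populated;  /* models POWERNV_PHP_SLOT_STATE_POPULATED */
	int rescans;    /* number of rescans performed by the caller */
};

/* Returns <0 on error, 0 if nothing needs rescanning, 1 if the caller should rescan. */
static int slot_enable(struct slot *slot)
{
	switch (slot->presence) {
	case SLOT_PRESENT:
		slot->populated = 1;
		return 1;
	case SLOT_EMPTY:
		slot->populated = 1;
		return 0;
	default:
		return -EINVAL;
	}
}

/* Hotplug entry point: the only caller that acts on the rescan hint. */
static int enable_slot(struct slot *slot)
{
	int ret = slot_enable(slot);

	if (ret < 0)
		return ret;
	if (ret > 0)
		slot->rescans++;

	return 0;
}
```

The boot-time path would call slot_enable() and ignore a positive return,
which removes the two bool parameters.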

>>+int powernv_php_register(struct device_node *dn);
>>+void powernv_php_unregister(struct device_node *dn);
>>+
>>+#endif /* !_POWERNV_PHP_H */
>>diff --git a/drivers/pci/hotplug/powernv_php_slot.c b/drivers/pci/hotplug/powernv_php_slot.c
>>new file mode 100644
>>index 0000000..fc82355
>>--- /dev/null
>>+++ b/drivers/pci/hotplug/powernv_php_slot.c
>>@@ -0,0 +1,643 @@
>>+/*
>>+ * PCI Hotplug Driver for PowerPC PowerNV platform.
>>+ *
>>+ * Copyright Gavin Shan, IBM Corporation 2015.
>>+ *
>>+ * This program is free software; you can redistribute it and/or modify
>>+ * it under the terms of the GNU General Public License as published by
>>+ * the Free Software Foundation; either version 2 of the License, or
>>+ * (at your option) any later version.
>>+ */
>>+
>>+#include <linux/kernel.h>
>>+#include <linux/module.h>
>>+#include <linux/sysfs.h>
>>+#include <linux/pci.h>
>>+#include <linux/pci_hotplug.h>
>>+#include <linux/string.h>
>>+#include <linux/slab.h>
>>+#include <linux/spinlock.h>
>>+#include <linux/wait.h>
>>+#include <linux/workqueue.h>
>>+
>>+#include <asm/opal.h>
>>+#include <asm/pnv-pci.h>
>>+#include <asm/ppc-pci.h>
>>+
>>+#include "powernv_php.h"
>
>I have a suspicion you won't need all these headers here too ;)
>

Most likely. I'll check.

>>+
>>+static LIST_HEAD(php_slot_list);
>>+static DEFINE_SPINLOCK(php_slot_lock);
>>+
>>+/*
>>+ * Release firmware data for all child device nodes of the
>>+ * indicated one.
>>+ */
>>+static void release_device_nodes_info(struct device_node *np)
>>+{
>>+	struct device_node *child;
>>+
>>+	for_each_child_of_node(np, child) {
>>+		/* In depth first */
>>+		release_device_nodes_info(child);
>
>Why is this "release", not "remove" (as this is what it does - calling
>remove_lalala in a loop)?
>

The device node is expected to be freed. "release" means "free", while
"remove" doesn't necessarily imply that: it may only mean taking something
out of the system or off a list, while the memory it occupies is still
allocated.

>>+
>>+		remove_pci_device_node_info(child);
>>+	}
>>+}
>>+
>>+/*
>>+ * Release all subordinate device nodes of the indicated one.
>>+ * The device nodes on the deepest path should be released first.
>>+ */
>>+static int release_device_nodes(struct device_node *parent)
>>+{
>>+	struct device_node *np, *child;
>>+	int ret = 0;
>>+
>>+	/* If the device node has children, remove them first */
>>+	for_each_child_of_node(parent, np) {
>>+		ret = release_device_nodes(np);
>>+		if (ret)
>>+			return ret;
>>+
>>+		/* The device shouldn't have alive children */
>>+		child = of_get_next_child(np, NULL);
>>+		if (child) {
>>+			of_node_put(child);
>>+			of_node_put(np);
>>+			pr_err("%s: Alive children of node <%s>\n",
>>+			       __func__, of_node_full_name(np));
>>+			return -EBUSY;
>>+		}
>>+
>>+		/* Detach the device node */
>>+		of_detach_node(np);
>>+		of_node_put(np);
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+/*
>>+ * The function processes the message sent by firmware
>>+ * to remove all device tree nodes beneath the slot's
>>+ * nodes, and the associated auxiliary data.
>>+ */
>>+static void slot_power_off_handler(struct powernv_php_slot *slot)
>>+{
>>+	int ret;
>>+
>>+	/* Release the firmware data for the child device nodes */
>>+	release_device_nodes_info(slot->dn);
>>+
>>+	/* Release the child device nodes */
>>+	ret = release_device_nodes(slot->dn);
>>+	if (ret)
>>+		pr_warn("%s: Error %d releasing children of <%s>\n",
>>+			__func__, ret, of_node_full_name(slot->dn));
>>+
>>+	/* Confirm status change */
>>+	slot->status_confirmed = 1;
>>+	wake_up_interruptible(&slot->queue);
>>+}
>>+
>>+static void slot_power_on_handler(struct powernv_php_slot *slot)
>>+{
>>+	struct opal_msg *msg = slot->msg;
>>+	unsigned long phys = be64_to_cpu(msg->params[2]);
>>+	unsigned long len = be64_to_cpu(msg->params[3]);
>>+	void *blob = (phys && len > 0) ? __va(phys) : NULL;
>>+
>>+	/* There might be nothing behind the slot yet */
>>+	if (!blob || !len)
>
>"!len" is redundant here - blob will be NULL if len<=0.
>

Sure, I'll remove it, but it's not harmful.

>>+		goto out;
>>+
>>+	/* Copy the FDT blob and parse it */
>>+	of_fdt_add_subtree(slot->dn, blob);
>>+
>>+	/* Add device node firmware data */
>>+	traverse_pci_device_nodes(slot->dn,
>>+				  add_pci_device_node_info,
>>+				  pci_bus_to_host(slot->bus));
>>+
>>+out:
>>+	/* Confirm status change */
>>+	slot->status_confirmed = 1;
>>+	wake_up_interruptible(&slot->queue);
>>+}
>>+
>>+static void powernv_php_slot_work(struct work_struct *data)
>>+{
>>+	struct powernv_php_slot *slot = container_of(data,
>>+						     struct powernv_php_slot,
>>+						     work);
>>+	uint64_t php_event = be64_to_cpu(slot->msg->params[0]);
>>+
>>+	switch (php_event) {
>>+	case 0: /* Slot power off */
>>+		slot_power_off_handler(slot);
>>+		break;
>>+	case 1: /* Slot power on */
>>+		slot_power_on_handler(slot);
>>+		break;
>>+	default:
>>+		pr_warn("%s: Unsupported hotplug event %lld\n",
>>+			__func__, php_event);
>>+	}
>>+
>>+	of_node_put(slot->dn);
>>+}
>>+
>>+int powernv_php_msg_handler(struct notifier_block *nb,
>>+			    unsigned long type, void *message)
>>+{
>>+	phandle h;
>>+	struct device_node *np;
>>+	struct powernv_php_slot *slot;
>>+	struct opal_msg *msg = message;
>>+
>>+	/* Check the message type */
>>+	if (type != OPAL_MSG_PCI_HOTPLUG) {
>>+		pr_warn("%s: Wrong message type %ld received!\n",
>>+			__func__, type);
>>+		return 0;
>>+	}
>>+
>>+	/* Find the device node */
>>+	h = (phandle)be64_to_cpu(msg->params[1]);
>>+	np = of_find_node_by_phandle(h);
>>+	if (!np) {
>>+		pr_warn("%s: No device node for phandle 0x%08x\n",
>>+			__func__, h);
>>+		return 0;
>>+	}
>>+
>>+	/* Find the slot */
>>+	slot = powernv_php_slot_find(np);
>>+	if (!slot) {
>>+		pr_warn("%s: No slot found for node <%s>\n",
>>+			__func__, of_node_full_name(np));
>>+		of_node_put(np);
>>+		return 0;
>>+	}
>>+
>>+	/* Schedule the work */
>>+	slot->msg = msg;
>>+	schedule_work(&slot->work);
>>+	return 0;
>>+}
>>+
>>+static int set_power_status(struct hotplug_slot *php_slot, u8 val)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	int ret;
>>+
>>+	/* Set power status */
>>+	slot->status_confirmed = 0;
>>+	ret = pnv_pci_set_power_status(slot->id, val);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d powering %s slot %016llx\n",
>>+			__func__, ret, val ? "on" : "off", slot->id);
>>+		return ret;
>>+	}
>>+
>>+	/* Waiting until the device tree is updated */
>>+	ret = wait_event_timeout(slot->queue,
>>+				 !slot->status_confirmed,
>>+				 10 * HZ);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d completing power-%s slot %016llx\n",
>>+			__func__, ret, val ? "on" : "off", slot->id);
>>+		return ret;
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static int get_power_status(struct hotplug_slot *php_slot, u8 *val)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	uint8_t state;
>>+	int ret;
>>+
>>+	/*
>>+	 * Retrieve power status from firmware. If we fail
>>+	 * getting that, the power status falls back to
>>+	 * being on.
>>+	 */
>>+	ret = pnv_pci_get_power_status(slot->id, &state);
>>+	if (ret) {
>>+		*val = POWERNV_PHP_SLOT_POWER_ON;
>>+		pr_warn("%s: Error %d getting power status of slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+	} else {
>>+		*val = state ? POWERNV_PHP_SLOT_POWER_ON :
>>+			       POWERNV_PHP_SLOT_POWER_OFF;
>>+		php_slot->info->power_status = *val;
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static int get_adapter_status(struct hotplug_slot *php_slot, u8 *val)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	uint8_t state;
>>+	int ret;
>>+
>>+	/*
>>+	 * Retrieve presence status from firmware. If we can't
>>+	 * get that, it will fall back to being empty.
>>+	 */
>>+	ret = pnv_pci_get_presence_status(slot->id, &state);
>>+	if (ret >= 0) {
>>+                *val = state ? POWERNV_PHP_SLOT_PRESENT :
>>+                               POWERNV_PHP_SLOT_EMPTY;
>>+                php_slot->info->adapter_status = *val;
>
>ret = 0;
>
>
>>+	} else {
>>+		*val = POWERNV_PHP_SLOT_EMPTY;
>>+		pr_warn("%s: Error %d getting presence of slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+	}
>>+
>>+	return ret < 0 ? ret : 0;
>
>
>return ret;
>

Ok. I'll fix it up.

>>+}
>>+
>>+static int set_attention_status(struct hotplug_slot *php_slot, u8 val)
>>+{
>>+	/* The default operation is to turn on the attention */
>>+	switch (val) {
>>+	case POWERNV_PHP_SLOT_ATTEN_OFF:
>>+	case POWERNV_PHP_SLOT_ATTEN_ON:
>>+	case POWERNV_PHP_SLOT_ATTEN_IND:
>>+	case POWERNV_PHP_SLOT_ATTEN_ACT:
>>+		break;
>>+	default:
>>+		val = POWERNV_PHP_SLOT_ATTEN_ON;
>
>Is not @val a garbage in this case?
>

No, the kernel accepts any value that's valid. Anything else falls back
to POWERNV_PHP_SLOT_ATTEN_ON.
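
In other words, the switch acts as a clamp. A small standalone sketch of that
behaviour, using a hypothetical helper normalize_attention() and shortened
constant names that mirror the POWERNV_PHP_SLOT_ATTEN_* values:

```c
enum {
	ATTEN_OFF = 0,
	ATTEN_ON  = 1,
	ATTEN_IND = 2,
	ATTEN_ACT = 3,
};

/* Pass valid attention values through; map anything else to ATTEN_ON. */
static unsigned char normalize_attention(unsigned char val)
{
	switch (val) {
	case ATTEN_OFF:
	case ATTEN_ON:
	case ATTEN_IND:
	case ATTEN_ACT:
		return val;
	default:
		return ATTEN_ON;
	}
}
```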

>>+	}
>>+
>>+	/* FIXME: Make it real once firmware supports it */
>>+	php_slot->info->attention_status = val;
>>+
>>+	return 0;
>>+}
>>+
>>+int powernv_php_slot_enable(struct hotplug_slot *php_slot,
>>+			    bool rescan_bus, bool rescan_slot)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	uint8_t presence, power_status;
>>+	int ret;
>>+
>>+	/* Check if the slot has been configured */
>>+	if (slot->state != POWERNV_PHP_SLOT_STATE_REGISTER)
>>+		return 0;
>>+
>>+	/* Retrieve slot presence status */
>>+	ret = php_slot->ops->get_adapter_status(php_slot, &presence);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d getting presence of slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+		return ret;
>>+	}
>>+
>>+	/* Proceed if there is nothing behind the slot */
>>+	if (presence == POWERNV_PHP_SLOT_EMPTY)
>>+		goto scan;
>>+
>>+	/*
>>+	 * If we detect something behind the slot, we need to
>>+	 * make sure the power supply to the slot is on. Otherwise,
>>+	 * the slot's downstream PCIe link should be down.
>>+	 *
>>+	 * On the first time, we don't change the power status to
>>+	 * boost system boot with assumption that the firmware
>>+	 * supplies consistent slot power status: empty slot always
>>+	 * has its power off and non-empty slot has its power on.
>>+	 */
>>+	if (!slot->check_power_status) {
>>+		slot->check_power_status = 1;
>>+		goto scan;
>>+	}
>>+
>>+	/* Check the power status. Scan the slot if that's already on */
>>+	ret = php_slot->ops->get_power_status(php_slot, &power_status);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d getting power status of slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+		return ret;
>>+	}
>>+	if (power_status == POWERNV_PHP_SLOT_POWER_ON)
>>+		goto scan;
>>+
>>+	/* Power is off, turn it on and then scan the slot */
>>+	ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_ON);
>>+	if (ret) {
>>+		pr_warn("%s: Error %d powering on slot %016llx\n",
>>+			__func__, ret, slot->id);
>>+		return ret;
>>+	}
>>+
>>+scan:
>>+	switch (presence) {
>>+	case POWERNV_PHP_SLOT_PRESENT:
>>+		if (rescan_bus) {
>>+			pci_lock_rescan_remove();
>>+			pcibios_add_pci_devices(slot->bus);
>>+			pci_unlock_rescan_remove();
>>+		}
>>+
>>+		/* Rescan for child hotpluggable slots */
>>+		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
>>+		if (rescan_slot)
>>+			powernv_php_register(slot->dn);
>>+		break;
>>+	case POWERNV_PHP_SLOT_EMPTY:
>>+		slot->state = POWERNV_PHP_SLOT_STATE_POPULATED;
>>+		break;
>>+	default:
>>+		pr_warn("%s: Invalid presence status %d of slot %016llx\n",
>>+			__func__, presence, slot->id);
>>+		return -EINVAL;
>>+	}
>>+
>>+	return 0;
>>+}
>>+
>>+static int enable_slot(struct hotplug_slot *php_slot)
>>+{
>>+	return powernv_php_slot_enable(php_slot, true, true);
>>+}
>>+
>>+static int disable_slot(struct hotplug_slot *php_slot)
>>+{
>>+	struct powernv_php_slot *slot = php_slot->private;
>>+	uint8_t power_status;
>>+	int ret;
>>+
>>+	if (slot->state != POWERNV_PHP_SLOT_STATE_POPULATED)
>>+		return 0;
>>+
>>+	/* Remove all devices behind the slot */
>>+	pci_lock_rescan_remove();
>>+	pcibios_remove_pci_devices(slot->bus);
>>+	pci_unlock_rescan_remove();
>>+
>>+	/* Detach the child hotpluggable slots */
>>+	powernv_php_unregister(slot->dn);
>>+
>>+	/*
>>+	 * Check the power status and turn it off if necessary. If we
>>+	 * fail to get the power status, the power will be forced to
>>+	 * be off.
>>+	 */
>>+	ret = php_slot->ops->get_power_status(php_slot, &power_status);
>>+	if (ret || power_status == POWERNV_PHP_SLOT_POWER_ON) {
>>+		ret = set_power_status(php_slot, POWERNV_PHP_SLOT_POWER_OFF);
>>+		if (ret)
>>+			pr_warn("%s: Error %d powering off slot %016llx\n",
>>+				__func__, ret, slot->id);
>>+	}
>>+
>>+	/* Update slot state */
>>+	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
>>+	return 0;
>>+}
>>+
>>+static struct hotplug_slot_ops php_slot_ops = {
>>+	.get_power_status	= get_power_status,
>>+	.get_adapter_status	= get_adapter_status,
>>+	.set_attention_status	= set_attention_status,
>>+	.enable_slot		= enable_slot,
>>+	.disable_slot		= disable_slot,
>>+};
>>+
>>+static struct powernv_php_slot *php_slot_match(struct device_node *dn,
>>+					       struct powernv_php_slot *slot)
>>+{
>>+	struct powernv_php_slot *target, *tmp;
>>+
>>+	if (slot->dn == dn)
>>+		return slot;
>>+
>>+	list_for_each_entry(tmp, &slot->children, link) {
>>+		target = php_slot_match(dn, tmp);
>>+		if (target)
>>+			return target;
>>+	}
>>+
>>+	return NULL;
>>+}
>>+
>>+struct powernv_php_slot *powernv_php_slot_find(struct device_node *dn)
>>+{
>>+	struct powernv_php_slot *slot, *tmp;
>>+	unsigned long flags;
>>+
>>+	spin_lock_irqsave(&php_slot_lock, flags);
>>+	list_for_each_entry(tmp, &php_slot_list, link) {
>>+		slot = php_slot_match(dn, tmp);
>>+		if (slot) {
>>+			spin_unlock_irqrestore(&php_slot_lock, flags);
>>+			return slot;
>>+		}
>>+	}
>>+	spin_unlock_irqrestore(&php_slot_lock, flags);
>>+
>>+	return NULL;
>>+}
>>+
>>+static void php_slot_free(struct kref *kref)
>>+{
>>+	struct powernv_php_slot *slot = to_powernv_php_slot(kref);
>>+
>>+	WARN_ON(!list_empty(&slot->children));
>>+	kfree(slot->name);
>>+	kfree(slot);
>>+}
>>+
>>+static void php_slot_release(struct hotplug_slot *hp_slot)
>>+{
>>+	struct powernv_php_slot *slot = hp_slot->private;
>>+	unsigned long flags;
>>+
>>+	/* Remove from global or child list */
>>+	spin_lock_irqsave(&php_slot_lock, flags);
>>+	list_del(&slot->link);
>>+	spin_unlock_irqrestore(&php_slot_lock, flags);
>>+
>>+	/* Detach from parent */
>>+	powernv_php_slot_put(slot);
>>+	powernv_php_slot_put(slot->parent);
>>+}
>>+
>>+static bool php_slot_get_id(struct device_node *dn,
>>+			    uint64_t *id)
>>+{
>>+	struct device_node *parent = dn;
>>+	const __be64 *prop64;
>>+	const __be32 *prop32;
>>+
>>+	/*
>>+	 * The hotpluggable slot always has a compound Id, which
>>+	 * consists of 16-bits PHB Id, 16 bits bus/slot/function
>>+	 * number, and compound indicator
>>+	 */
>>+	*id = (0x1ul << 63);
>>+
>>+	/* Bus/Slot/Function number */
>>+	prop32 = of_get_property(dn, "reg", NULL);
>>+	if (!prop32)
>>+		return false;
>>+	*id |= ((of_read_number(prop32, 1) & 0x00ffff00) << 8);
>>+
>>+	/* PHB Id */
>>+	while ((parent = of_get_parent(parent))) {
>>+		if (!PCI_DN(parent)) {
>>+			of_node_put(parent);
>>+			break;
>>+		}
>>+
>>+		if (!of_device_is_compatible(parent, "ibm,ioda2-phb") &&
>>+		    !of_device_is_compatible(parent, "ibm,ioda-phb")) {
>>+			of_node_put(parent);
>>+			continue;
>>+		}
>>+
>>+		prop64 = of_get_property(parent, "ibm,opal-phbid", NULL);
>>+		if (!prop64) {
>>+			of_node_put(parent);
>>+			return false;
>>+		}
>>+
>>+		*id |= be64_to_cpup(prop64);
>>+		of_node_put(parent);
>>+		return true;
>>+	}
>>+
>>+        return false;
>>+}
>>+
>>+struct powernv_php_slot *powernv_php_slot_alloc(struct device_node *dn)
>>+{
>>+	struct pci_bus *bus;
>>+	struct powernv_php_slot *slot;
>>+	const char *label;
>>+	uint64_t id;
>>+	int slot_no;
>>+	size_t size;
>>+	void *pmem;
>>+
>>+	/* Slot name */
>>+	label = of_get_property(dn, "ibm,slot-label", NULL);
>>+	if (!label)
>>+		return NULL;
>>+
>>+	/* Slot identifier */
>>+	if (!php_slot_get_id(dn, &id))
>>+		return NULL;
>>+
>>+	/* PCI bus */
>>+	bus = pcibios_find_pci_bus(dn);
>>+	if (!bus)
>>+		return NULL;
>>+
>>+	/* Slot number */
>>+	if (dn->child && PCI_DN(dn->child))
>>+		slot_no = PCI_SLOT(PCI_DN(dn->child)->devfn);
>>+	else
>>+		slot_no = -1;
>>+
>>+	/* Allocate slot */
>>+	size = sizeof(struct powernv_php_slot) +
>>+	       sizeof(struct hotplug_slot) +
>>+	       sizeof(struct hotplug_slot_info);
>>+	pmem = kzalloc(size, GFP_KERNEL);
>>+	if (!pmem) {
>>+		pr_warn("%s: Cannot allocate slot for node %s\n",
>>+			__func__, dn->full_name);
>>+		return NULL;
>>+	}
>>+
>>+	/* Assign memory blocks */
>>+	slot = pmem;
>>+	slot->php_slot = pmem + sizeof(struct powernv_php_slot);
>>+	slot->php_slot->info = pmem + sizeof(struct powernv_php_slot) +
>>+			      sizeof(struct hotplug_slot);
>>+	slot->name = kstrdup(label, GFP_KERNEL);
>>+	if (!slot->name) {
>>+		pr_warn("%s: Cannot populate name for node %s\n",
>>+			__func__, dn->full_name);
>>+		kfree(pmem);
>>+		return NULL;
>>+	}
>>+
>>+	/* Initialize slot */
>>+	kref_init(&slot->kref);
>>+	slot->state = POWERNV_PHP_SLOT_STATE_INIT;
>>+	slot->dn = dn;
>>+	slot->bus = bus;
>>+	slot->id = id;
>>+	slot->slot_no = slot_no;
>>+	INIT_WORK(&slot->work, powernv_php_slot_work);
>>+	init_waitqueue_head(&slot->queue);
>>+	slot->check_power_status = 0;
>>+	slot->status_confirmed = 0;
>>+	slot->release = php_slot_free;
>>+	slot->php_slot->ops = &php_slot_ops;
>>+	slot->php_slot->release = php_slot_release;
>>+	slot->php_slot->private = slot;
>>+	INIT_LIST_HEAD(&slot->children);
>>+	INIT_LIST_HEAD(&slot->link);
>>+
>>+	return slot;
>>+}
>>+
>>+int powernv_php_slot_register(struct powernv_php_slot *slot)
>>+{
>>+	struct powernv_php_slot *parent;
>>+	struct device_node *dn = slot->dn;
>>+	unsigned long flags;
>>+	int ret;
>>+
>>+	/* Avoid registering the same slot twice */
>>+	if (powernv_php_slot_find(slot->dn))
>>+		return -EEXIST;
>>+
>>+	/* Register slot */
>>+	ret = pci_hp_register(slot->php_slot, slot->bus,
>>+			      slot->slot_no, slot->name);
>>+	if (ret) {
>>+		pr_warn("%s: Cannot register slot %s (%d)\n",
>>+			__func__, slot->name, ret);
>>+		return ret;
>>+	}
>>+
>>+	/* Put into global or parent list */
>>+	while ((dn = of_get_parent(dn))) {
>>+		if (!PCI_DN(dn)) {
>>+			of_node_put(dn);
>>+			break;
>>+		}
>>+
>>+		parent = powernv_php_slot_find(dn);
>>+		if (parent) {
>>+			of_node_put(dn);
>>+			break;
>>+		}
>>+	}
>>+
>>+	spin_lock_irqsave(&php_slot_lock, flags);
>>+	if (parent) {
>>+		powernv_php_slot_get(parent);
>>+		slot->parent = parent;
>>+		list_add_tail(&slot->link, &parent->children);
>>+	} else {
>>+		list_add_tail(&slot->link, &php_slot_list);
>>+	}
>>+	spin_unlock_irqrestore(&php_slot_lock, flags);
>>+
>>+	/* Update slot state */
>>+	slot->state = POWERNV_PHP_SLOT_STATE_REGISTER;
>>+	return 0;
>>+}
>>

Thanks,
Gavin

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 00/21] PowerPC/PowerNV: PCI Slot Management
  2015-05-08 23:59   ` Alexey Kardashevskiy
@ 2015-05-11  7:40     ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-11  7:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Sat, May 09, 2015 at 09:59:25AM +1000, Alexey Kardashevskiy wrote:
>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>The series of patches intend to support PCI slot for PowerPC PowerNV platform,
>>which is running on top of skiboot firmware. The patchset requires corresponding
>>changes from skiboot firmware, which is sent to skiboot@lists.ozlabs.org
>>for review. The PCI slots are exposed by skiboot with device node properties,
>>and kernel utilizes those properties to populated PCI slots accordingly.
>>
>>The original PCI infrastructure on PowerNV platform can't support hotplug
>>because the PE is assigned during PHB fixup time, which is called for once
>>during system boot time. For this, the PCI infrastructure on PowerNV platform
>>has been reworked for a lot. After that, the PE and its corresponding resources
>>(IODT, M32DT, M64 segments, DMA32 and bypass window) are assigned upon updating
>>PCI bridge's resources, which might decide PE# assigned to the PE (e.g. M64
>>resources, on P8 strictly speaking).
>
>Out of curiosity - does this PCI scan happen when memory subsystem is
>initialized? More precisely, after these changes, won't
>pnv_pci_ioda2_setup_dma_pe() be called too early after boot so I won't be
>able to use kmalloc() to allocate iommu_table's?
>

PCI scan (enumeration) is invoked from subsys_initcall(), which runs quite late
during system boot. By the time it's called, the slab should have been initialized.

>Also, checkpatch.pl failed multiple times on the series. Please fix.
>

Thanks for pointing it out, I'll fix them one by one.

>>Each PE will maintain a reference count,
>>which is (number of child PCI devices + 1). That indicates when last child PCI
>>device leaves the PE, the PE and its included resources will be released and put
>>back into free pool again. With this design, the PE will be released when EEH PE
>>is released. PATCH[1 - 8] are related to this part.
>>
>> From skiboot perspective, PCI slot is providing (hot/fundamental/complete)
>>resets to EEH. The kernel gets to know if skiboot supports various reset on one
>>particular PCI slot through device-tree node. If it does, EEH will utilize the
>>functionality provided by skiboot. Besides, the device-tree nodes have to change
>>in order to support PCI hotplug. For example, when one PCI adapter inserted to
>>one slot, its device-tree node should be added to the system dynamically. Conversely,
>>the device-tree node should be removed from the system when the PCI adapter is going
>>to be offline. Since pci_dn and eeh_dev have same life cycle as PCI device nodes,
>>they should be added/removed accordingly during PCI hotplug. Patch[9 - 20] are
>>doing the related work.
>>
>>The last patch is the standalone PCI hotplug driver for PowerNV platform. When
>>removing PCI adapter from one PCI slot, which is invoked by command in userland,
>>the skiboot will power off the slot to save power and remove all device-tree
>>nodes for all PCI devices behind the slot. Conversely, the Power to the slot
>>is turned on, the PCI devices behind the slot is rescanned, and the device-tree
>>nodes for those newly detected PCI devices will be built in skiboot. For both
>>of cases, one message will be sent to kernel by skiboot so that the kernel
>>can adjust the device-tree accordingly. At the same time, the kernel also have
>>to deallocate or allocate PE# and its related resources (PE# and so on) for the
>>removed/added PCI devices.
>>
>>Changelog
>>=========
>>v4:
>>    * Rebased to 4.1.RC1
>>    * Added API to unflatten FDT blob to device node sub-tree, which is attached
>>      the indicated parent device node. The original mechanism based on formatted
>>      string stream has been dropped.
>>    * The PATCH[v3 09/21] ("powerpc/eeh: Delay probing EEH device during hotplug")
>>      was picked up sent to linux-ppc@ separately for review as Richard's "VF EEH
>>      Support" depends on that.
>>v3:
>>    * Rebased to 4.1.RC0
>>    * PowerNV PCI infrastructure is totally refactored in order to support PCI
>>      hotplug. The PowerNV hotplug driver is also reworked a lot because of
>>      the changes in skiboot in order to support PCI hotplug.
>>
>>Gavin Shan (21):
>>   pci: Add pcibios_setup_bridge()
>>   powerpc/powernv: Enable M64 on P7IOC
>>   powerpc/powernv: M64 support improvement
>>   powerpc/powernv: Improve IO and M32 mapping
>>   powerpc/powernv: Improve DMA32 segment assignment
>>   powerpc/powernv: Create PEs dynamically
>>   powerpc/powernv: Release PEs dynamically
>>   powerpc/powernv: Drop pnv_ioda_setup_dev_PE()
>>   powerpc/powernv: Use PCI slot reset infrastructure
>>   powerpc/powernv: Fundamental reset for PCI bus reset
>>   powerpc/pci: Don't scan empty slot
>>   powerpc/pci: Move pcibios_find_pci_bus() around
>>   powerpc/powernv: Introduce pnv_pci_poll()
>>   powerpc/powernv: Functions to get/reset PCI slot status
>>   powerpc/pci: Delay creating pci_dn
>>   powerpc/pci: Create eeh_dev while creating pci_dn
>>   powerpc/pci: Export traverse_pci_device_nodes()
>>   powerpc/pci: Update bridge windows on PCI plugging
>>   drivers/of: Support adding sub-tree
>>   powerpc/powernv: Select OF_DYNAMIC
>>   pci/hotplug: PowerPC PowerNV PCI hotplug driver
>>
>>  arch/powerpc/include/asm/eeh.h                 |    7 +-
>>  arch/powerpc/include/asm/opal-api.h            |    7 +-
>>  arch/powerpc/include/asm/opal.h                |    7 +-
>>  arch/powerpc/include/asm/pci-bridge.h          |    7 +-
>>  arch/powerpc/include/asm/pnv-pci.h             |    5 +
>>  arch/powerpc/include/asm/ppc-pci.h             |    7 +-
>>  arch/powerpc/kernel/eeh_dev.c                  |   20 +-
>>  arch/powerpc/kernel/pci-common.c               |   18 +-
>>  arch/powerpc/kernel/pci-hotplug.c              |   44 +-
>>  arch/powerpc/kernel/pci_dn.c                   |  119 +-
>>  arch/powerpc/platforms/maple/pci.c             |   35 +-
>>  arch/powerpc/platforms/pasemi/pci.c            |    3 -
>>  arch/powerpc/platforms/powermac/pci.c          |   39 +-
>>  arch/powerpc/platforms/powernv/Kconfig         |    1 +
>>  arch/powerpc/platforms/powernv/eeh-powernv.c   |  245 ++--
>>  arch/powerpc/platforms/powernv/opal-wrappers.S |    3 +
>>  arch/powerpc/platforms/powernv/pci-ioda.c      | 1657 +++++++++++++++---------
>>  arch/powerpc/platforms/powernv/pci.c           |   64 +-
>>  arch/powerpc/platforms/powernv/pci.h           |   52 +-
>>  arch/powerpc/platforms/pseries/msi.c           |    4 +-
>>  arch/powerpc/platforms/pseries/pci_dlpar.c     |   32 -
>>  arch/powerpc/platforms/pseries/setup.c         |    9 +-
>>  drivers/of/dynamic.c                           |   19 +-
>>  drivers/of/fdt.c                               |  133 +-
>>  drivers/pci/hotplug/Kconfig                    |   12 +
>>  drivers/pci/hotplug/Makefile                   |    4 +
>>  drivers/pci/hotplug/powernv_php.c              |  146 +++
>>  drivers/pci/hotplug/powernv_php.h              |   78 ++
>>  drivers/pci/hotplug/powernv_php_slot.c         |  643 +++++++++
>>  drivers/pci/setup-bus.c                        |   12 +-
>>  include/linux/of.h                             |    2 +
>>  include/linux/of_fdt.h                         |    1 +
>>  include/linux/pci.h                            |    1 +
>>  33 files changed, 2473 insertions(+), 963 deletions(-)
>>  create mode 100644 drivers/pci/hotplug/powernv_php.c
>>  create mode 100644 drivers/pci/hotplug/powernv_php.h
>>  create mode 100644 drivers/pci/hotplug/powernv_php_slot.c
>>

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
  2015-05-11  7:02         ` Alexey Kardashevskiy
@ 2015-05-12  0:03           ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-12  0:03 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Mon, May 11, 2015 at 05:02:08PM +1000, Alexey Kardashevskiy wrote:
>On 05/11/2015 04:25 PM, Gavin Shan wrote:
>>On Sat, May 09, 2015 at 10:43:23PM +1000, Alexey Kardashevskiy wrote:
>>>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>>>The original code doesn't support releasing PEs dynamically, meaning
>>>>that PE and the associated resources (IO, M32, M64 and DMA) can't
>>>>be released when unplugging a PCI adapter from one hotpluggable slot.
>>>>
>>>>The patch takes object oriented methodology, introduces reference
>>>>count to PE, which is initialized to 1 and increased with 1 when a
>>>>new PCI device joins the PE. Once the last PCI device leaves the
>>>>PE, the PE is going to be release together with its associated
>>>>(IO, M32, M64, DMA) resources.
>>>
>>>
>>>Too little commit log for non-trivial non-cut-n-paste 30KB patch...
>>>
>>
>>Ok. I'll add more details in next revision.
>>
>>>>
>>>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>---
>>>>  arch/powerpc/include/asm/pci-bridge.h     |   3 +
>>>>  arch/powerpc/kernel/pci-hotplug.c         |   5 +
>>>>  arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>>>>  arch/powerpc/platforms/powernv/pci.h      |   4 +-
>>>>  4 files changed, 432 insertions(+), 238 deletions(-)
>>>>
>>>>diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>>>index 5367eb3..a6ad4b1 100644
>>>>--- a/arch/powerpc/include/asm/pci-bridge.h
>>>>+++ b/arch/powerpc/include/asm/pci-bridge.h
>>>>@@ -31,6 +31,9 @@ struct pci_controller_ops {
>>>>  	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>>>>  	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>>>>  	void		(*reset_secondary_bus)(struct pci_dev *dev);
>>>>+
>>>>+	/* Called when PCI device is released */
>>>>+	void		(*release_device)(struct pci_dev *);
>>>>  };
>>>>
>>>>  /*
>>>>diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>>>>index 7ed85a6..0040343 100644
>>>>--- a/arch/powerpc/kernel/pci-hotplug.c
>>>>+++ b/arch/powerpc/kernel/pci-hotplug.c
>>>>@@ -29,6 +29,11 @@
>>>>   */
>>>>  void pcibios_release_device(struct pci_dev *dev)
>>>>  {
>>>>+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>>>+
>>>>+	if (hose->controller_ops.release_device)
>>>>+		hose->controller_ops.release_device(dev);
>>>>+
>>>>  	eeh_remove_device(dev);
>>>>  }
>>>>
>>>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>index 910fb67..ef8c216 100644
>>>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>@@ -12,6 +12,8 @@
>>>>  #undef DEBUG
>>>>
>>>>  #include <linux/kernel.h>
>>>>+#include <linux/atomic.h>
>>>>+#include <linux/kref.h>
>>>>  #include <linux/pci.h>
>>>>  #include <linux/crash_dump.h>
>>>>  #include <linux/debugfs.h>
>>>>@@ -47,6 +49,8 @@
>>>>  /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>>>>  #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>>>>
>>>>+static void pnv_ioda_release_pe(struct kref *kref);
>>>>+
>>>>  static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>>>>  			    const char *fmt, ...)
>>>>  {
>>>>@@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>>>>  		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>>>>  }
>>>>
>>>>-static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
>>>>+static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>>>>  {
>>>>-	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>>>-		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>>>-			__func__, pe_no, phb->hose->global_number);
>>>>+	if (!pe)
>>>>+		return;
>>>>+
>>>>+	kref_get(&pe->kref);
>>>>+}
>>>>+
>>>>+static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
>>>>+{
>>>>+	unsigned int count;
>>>>+
>>>>+	if (!pe)
>>>>  		return;
>>>>+
>>>>+	/*
>>>>+	 * The count is initialized to 1 and increased with 1 when
>>>>+	 * a new PCI device is bound with the PE. Once the last PCI
>>>>+	 * device is leaving from the PE, the PE is going to be
>>>>+	 * released.
>>>>+	 */
>>>>+	count = atomic_read(&pe->kref.refcount);
>>>>+	if (count == 2)
>>>>+		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
>>>>+	else
>>>>+		kref_put(&pe->kref, pnv_ioda_release_pe);
>>>
>>>
>>>What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
>>>
>>
>>Yeah, that would be a problem. But it shouldn't happen because the
>>PCI devices are joining the parent PE# in strictly serialized mode.
>>Same thing happens when detaching PCI devices from its parent PE.
>
>
>oookay. Another thing then - why is this kref counter initialized to 1?
>It would make sense if you did something special when the counter becomes 1
>after decrement but you do not.
>
>Also, this kref thing makes sense if you do kref_put() in multiple places and
>do not know which one will be the last one so you pass the callback to all of
>them. Here you do kref_put/sub in one place and you read the counter - so you
>can call pnv_ioda_release_pe() directly. And it feels like a simple atomic_t
>would do the job just fine. If you still feel that the counter should start
>from 1, there are atomic_dec_if_positive() and atomic_inc_not_zero() and
>others.
>

It's a good question, actually. The counter is initialized to 1 when the PE
is reserved (because of the M64 requirement) or allocated (for the non-M64
case). If we reserve or allocate a PE#, one thing is certain: the PCI bus has
at least one PCI device (including PCI bridges). After the PE# is reserved
or allocated, the PCI device joins the PE, which increases the counter by 1.
So the counter is 2 when the PE contains one PCI device, and 3 when it
contains two. One reason for this design is that we only need to decrease
the counter if we have to release the PE in the window between PE
reservation/allocation and the first PCI device joining. I think you're
correct that we can call pnv_ioda_release_pe() directly in this window.
That way, the counter always reflects the number of PCI devices
the PE contains.
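
For illustration, the simplified scheme (counter equals the number of attached
devices, PE released when it drops to zero) could be modelled in userspace
roughly as below. This is only a sketch using C11 atomics, not the kernel code:
struct pe_model and the pe_model_*() helpers are made-up names, and the real
code would use the kernel's atomic_t or kref API instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a PE; not struct pnv_ioda_pe. */
struct pe_model {
	atomic_int ndevs;	/* number of PCI devices attached to the PE */
	bool released;		/* set once the PE's resources are given back */
};

static void pe_model_init(struct pe_model *pe)
{
	atomic_init(&pe->ndevs, 0);
	pe->released = false;
}

/* A PCI device joins the PE. */
static void pe_model_attach(struct pe_model *pe)
{
	atomic_fetch_add(&pe->ndevs, 1);
}

/*
 * A PCI device leaves the PE. The decrement and the "was I the last one?"
 * check are a single atomic operation, so there is no window between
 * reading the counter and changing it. Returns true if the PE was released.
 */
static bool pe_model_detach(struct pe_model *pe)
{
	int prev = atomic_fetch_sub(&pe->ndevs, 1);

	assert(prev > 0);	/* detach without a matching attach */
	if (prev == 1) {
		pe->released = true;	/* stands in for pnv_ioda_release_pe() */
		return true;
	}
	return false;
}
```

With two devices attached, the first pe_model_detach() returns false and the
second returns true, so only the last device leaving triggers the release.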

>>>>+}
>>>>+
>>>>+static void pnv_pci_release_device(struct pci_dev *pdev)
>>>>+{
>>>>+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>>>+	struct pnv_phb *phb = hose->private_data;
>>>>+	struct pci_dn *pdn = pci_get_pdn(pdev);
>>>>+	struct pnv_ioda_pe *pe;
>>>>+
>>>>+	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>+		pe = &phb->ioda.pe_array[pdn->pe_number];
>>>>+		pnv_ioda_pe_put(pe);
>>>>+		pdn->pe_number = IODA_INVALID_PE;
>>>>  	}
>>>>+}
>>>>
>>>>-	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
>>>>-		pr_warn("%s: PE %d was assigned on PHB#%x\n",
>>>>-			__func__, pe_no, phb->hose->global_number);
>>>>+static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
>>>>+{
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+	int index, count;
>>>>+	unsigned long tbl_addr, tbl_size;
>>>>+
>>>>+	/* No DMA capability for slave PEs */
>>>>+	if (pe->flags & PNV_IODA_PE_SLAVE)
>>>>+		return;
>>>>+
>>>>+	/* Bypass DMA window */
>>>>+	if (phb->type == PNV_PHB_IODA2 &&
>>>>+	    pe->tce_bypass_enabled &&
>>>>+	    pe->tce32_table &&
>>>>+	    pe->tce32_table->set_bypass)
>>>>+		pe->tce32_table->set_bypass(pe->tce32_table, false);
>>>>+
>>>>+	/* 32-bits DMA window */
>>>>+	count = pe->tce32_seg_end - pe->tce32_seg_start;
>>>>+	tbl_addr = pe->tce32_table->it_base;
>>>>+	if (!count)
>>>>  		return;
>>>>+
>>>>+	/* Free IOMMU table */
>>>>+	iommu_free_table(pe->tce32_table,
>>>>+			 of_node_full_name(phb->hose->dn));
>>>>+
>>>>+	/* Deconfigure TCE table */
>>>>+	switch (phb->type) {
>>>>+	case PNV_PHB_IODA1:
>>>>+		for (index = 0; index < count; index++)
>>>>+			opal_pci_map_pe_dma_window(phb->opal_id,
>>>>+						   pe->pe_number,
>>>>+						   pe->tce32_seg_start + index,
>>>>+						   1,
>>>>+						   __pa(tbl_addr) +
>>>>+						   index * TCE32_TABLE_SIZE,
>>>>+						   0,
>>>>+						   0x1000);
>>>>+		bitmap_clear(phb->ioda.tce32_segmap,
>>>>+			     pe->tce32_seg_start,
>>>>+			     count);
>>>>+		tbl_size = TCE32_TABLE_SIZE * count;
>>>>+		break;
>>>>+	case PNV_PHB_IODA2:
>>>>+		opal_pci_map_pe_dma_window(phb->opal_id,
>>>>+					   pe->pe_number,
>>>>+					   pe->pe_number << 1,
>>>>+					   1,
>>>>+					   __pa(tbl_addr),
>>>>+					   0,
>>>>+					   0x1000);
>>>>+		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
>>>>+		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
>>>>+		break;
>>>>+	default:
>>>>+		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
>>>>+		return;
>>>>+	}
>>>>+
>>>>+	/* Free memory of IOMMU table */
>>>>+	free_pages(tbl_addr, get_order(tbl_size));
>>>
>>>
>>>You just programmed the table address to TVT and then you are releasing the
>>>pages. It does not seem right, it will leave garbage in TVT. Also, I am
>>>adding helpers to alloc/free TCE pages in DDW patchset, you could reuse bits
>>>from there (I'll post v10 soon, you'll be in copy and you'll have to review
>>>that ;) ).
>>>
>>
>>I assume you're talking about TVE. I don't understand how garbage will be left
>>in TVE. opal_pci_map_pe_dma_window(), which is handled by skiboot, clear TVE
>>with zero'ed "tce_table_size". The pages previously allocated for TCE table is
>>released to buddy system, which can be allocated by somebody else (from buddy
>>or slab).
>
>opal_pci_map_pe_dma_window() takes __pa(tbl_addr) which points to some memory
>which is still allocated. This value goes to a table (which has 2 entries per
>PE, one for 32bit DMA window and one for bypass/hugewindow) which PHB uses to
>get the actual TCE table address. What is the name of this table? :) Anyway,
>you write an address there and then you call free_pages() so after
>free_pages(), the value in that TVE/TVT/whatever table is a garbage.
>

I haven't looked into your DDW code yet. Before the DDW patchset, the bypass
TVE (window) isn't supposed to have a corresponding TCE table. I guess you might
change that behaviour in your DDW patchset, and I'll take a close look at it.
For the DMA32 window, which is the name of the table, the TVE is cleared by
skiboot when it is given a zero "tce_table_size" argument.

	opal_pci_map_pe_dma_window(phb->opal_id,
				   pe->pe_number,
				   pe->pe_number << 1,
				   1,
				   __pa(tbl_addr),
				   0,			<<<< "tce_table_size".
				   0x1000);
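
To make the point concrete, here is a tiny userspace model of the behaviour
described above. It is an assumption-laden sketch, not skiboot code: struct
tve_model and tve_model_map() are invented for illustration, and the real TVE
layout is different. The only thing it shows is that a zero table size
invalidates the entry and the table address argument is not stored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one TVE; not the hardware layout. */
struct tve_model {
	bool     valid;
	uint64_t tce_table_addr;
	uint64_t tce_table_size;
};

/*
 * Model of mapping a DMA window: a zero table size clears the entry and
 * the table address is ignored, so no stale pointer is left behind.
 */
static void tve_model_map(struct tve_model *tve, uint64_t table_addr,
			  uint64_t table_size)
{
	assert(tve != NULL);

	if (table_size == 0) {
		tve->valid = false;
		tve->tce_table_addr = 0;
		tve->tce_table_size = 0;
		return;
	}

	tve->valid = true;
	tve->tce_table_addr = table_addr;
	tve->tce_table_size = table_size;
}
```

In this model, freeing the TCE pages after a call with a zero size is safe
because the entry no longer refers to them.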

>
>>
>>Ok. Please put me into the cc list. I guess the whole series of patches is
>>better to rebased on your DDW patchset, which is to be merged first, I believe.
>>
>>>
>>>>+	pe->tce32_table = NULL;
>>>>+	pe->tce32_seg_start = 0;
>>>>+	pe->tce32_seg_end = 0;
>>>>+}
>>>>+
>>>>+static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
>>>>+{
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+	unsigned long *segmap = NULL, *pe_segmap = NULL;
>>>>+	int i;
>>>>+	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
>>>>+				     OPAL_M32_WINDOW_TYPE,
>>>>+				     OPAL_M64_WINDOW_TYPE };
>>>>+
>>>>+	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
>>>>+		switch (win_type[win]) {
>>>>+		case OPAL_IO_WINDOW_TYPE:
>>>>+			segmap = phb->ioda.io_segmap;
>>>>+			pe_segmap = pe->io_segmap;
>>>>+			break;
>>>>+		case OPAL_M32_WINDOW_TYPE:
>>>>+			segmap = phb->ioda.m32_segmap;
>>>>+			pe_segmap = pe->m32_segmap;
>>>>+			break;
>>>>+		case OPAL_M64_WINDOW_TYPE:
>>>>+			segmap = phb->ioda.m64_segmap;
>>>>+			pe_segmap = pe->m64_segmap;
>>>>+			break;
>>>>+		}
>>>>+		i = -1;
>>>>+		while ((i = find_next_bit(pe_segmap,
>>>>+			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
>>>>+			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
>>>>+			    win_type[win] == OPAL_M32_WINDOW_TYPE)
>>>>+				opal_pci_map_pe_mmio_window(phb->opal_id,
>>>>+						phb->ioda.reserved_pe,
>>>>+						win_type[win], 0, i);
>>>>+			else if (phb->type == PNV_PHB_IODA1)
>>>>+				opal_pci_map_pe_mmio_window(phb->opal_id,
>>>>+						phb->ioda.reserved_pe,
>>>>+						win_type[win],
>>>>+						i / 8, i % 8);
>>>
>>>The function is called "release" but it programs something that looks like
>>>reasonable values, is it correct?
>>>
>>
>>It's not a problem. When the segment is deallocated, it's mapped to the
>>reserved PE#.
>>
>>>
>>>
>>>>+
>>>>+			clear_bit(i, pe_segmap);
>>>>+			clear_bit(i, segmap);
>>>>+		}
>>>>+	}
>>>>+}
>>>>+
>>>>+static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>>>+				  struct pnv_ioda_pe *parent,
>>>>+				  struct pnv_ioda_pe *child,
>>>>+				  bool is_add)
>>>>+{
>>>>+	const char *desc = is_add ? "adding" : "removing";
>>>>+	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>>>+			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>>>+	struct pnv_ioda_pe *slave;
>>>>+	long rc;
>>>>+
>>>>+	/* Parent PE affects child PE */
>>>>+	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>>>+				child->pe_number, op);
>>>>+	if (rc != OPAL_SUCCESS) {
>>>>+		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>>>+			rc, desc);
>>>>+		return -ENXIO;
>>>>+	}
>>>>+
>>>>+	if (!(child->flags & PNV_IODA_PE_MASTER))
>>>>+		return 0;
>>>>+
>>>>+	/* Compound case: parent PE affects slave PEs */
>>>>+	list_for_each_entry(slave, &child->slaves, list) {
>>>>+		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>>>+					slave->pe_number, op);
>>>>+		if (rc != OPAL_SUCCESS) {
>>>>+			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>>>+				rc, desc);
>>>>+			return -ENXIO;
>>>>+		}
>>>>+	}
>>>>+
>>>>+	return 0;
>>>>+}
>>>>+
>>>>+static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
>>>>+{
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+	struct pnv_ioda_pe *slave;
>>>>+	struct pci_dev *pdev = NULL;
>>>>+	int ret;
>>>>+
>>>>+	/*
>>>>+	 * Clear PE frozen state. If it's master PE, we need
>>>>+	 * clear slave PE frozen state as well.
>>>>+	 */
>>>>+	opal_pci_eeh_freeze_clear(phb->opal_id,
>>>>+				  pe->pe_number,
>>>>+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>+		list_for_each_entry(slave, &pe->slaves, list) {
>>>>+			opal_pci_eeh_freeze_clear(phb->opal_id,
>>>>+						  slave->pe_number,
>>>>+						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>+		}
>>>>+	}
>>>>+
>>>>+	/*
>>>>+	 * Associate PE in PELT. We need add the PE into the
>>>>+	 * corresponding PELT-V as well. Otherwise, the error
>>>>+	 * originated from the PE might contribute to other
>>>>+	 * PEs.
>>>>+	 */
>>>>+	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>>>+	if (ret)
>>>>+		return ret;
>>>>+
>>>>+	/* For compound PEs, any one affects all of them */
>>>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>+		list_for_each_entry(slave, &pe->slaves, list) {
>>>>+			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>>>+			if (ret)
>>>>+				return ret;
>>>>+		}
>>>>+	}
>>>>+
>>>>+	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>>>+		pdev = pe->pbus->self;
>>>>+	else if (pe->flags & PNV_IODA_PE_DEV)
>>>>+		pdev = pe->pdev->bus->self;
>>>>+#ifdef CONFIG_PCI_IOV
>>>>+	else if (pe->flags & PNV_IODA_PE_VF)
>>>>+		pdev = pe->parent_dev->bus->self;
>>>>+#endif /* CONFIG_PCI_IOV */
>>>>+
>>>>+	while (pdev) {
>>>>+		struct pci_dn *pdn = pci_get_pdn(pdev);
>>>>+		struct pnv_ioda_pe *parent;
>>>>+
>>>>+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>+			parent = &phb->ioda.pe_array[pdn->pe_number];
>>>>+			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>>>+			if (ret)
>>>>+				return ret;
>>>>+		}
>>>>+
>>>>+		pdev = pdev->bus->self;
>>>>+	}
>>>>+
>>>>+	return 0;
>>>>+}
>>>>+
>>>>+static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)
>>>
>>>
>>>It used to be under #ifdef CONFIG_PCI_IOV, now it is not. Looks like just
>>>moving this function to a different place deserves a separate patch with a
>>>comment explaining why ("it is going to be used now for non-SRIOV case too", maybe?).
>>>
>>
>>Yeah, it makes sense to me. Will fix it up.
>>
>>>
>>>>+{
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+	struct pci_dev *parent;
>>>>+	uint8_t bcomp, dcomp, fcomp;
>>>>+	long rid_end, rid;
>>>>+	int64_t rc;
>>>>+
>>>>+	/* Tear down MVE */
>>>>+	if (phb->type == PNV_PHB_IODA1 &&
>>>>+	    pe->mve_number != -1) {
>>>>+		rc = opal_pci_set_mve(phb->opal_id,
>>>>+				      pe->mve_number,
>>>>+				      phb->ioda.reserved_pe);
>>>>+		if (rc != OPAL_SUCCESS)
>>>>+			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
>>>>+				rc, pe->mve_number);
>>>>+		rc = opal_pci_set_mve_enable(phb->opal_id,
>>>>+					     pe->mve_number,
>>>>+					     OPAL_DISABLE_MVE);
>>>>+		if (rc != OPAL_SUCCESS)
>>>>+			pe_warn(pe, "Error %lld disabling MVE#%d\n",
>>>>+				rc, pe->mve_number);
>>>>+		pe->mve_number = -1;
>>>>+	}
>>>>+
>>>>+	/* Unmapping PELTV */
>>>>+	pnv_ioda_set_peltv(pe, false);
>>>>+
>>>>+	/* To unmap PELTM */
>>>>+	if (pe->pbus) {
>>>>+		int count;
>>>>+
>>>>+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>>>+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>>>+		parent = pe->pbus->self;
>>>>+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>>>+			count = pe->pbus->busn_res.end -
>>>>+				pe->pbus->busn_res.start + 1;
>>>>+		else
>>>>+			count = 1;
>>>>+
>>>>+		switch(count) {
>>>>+		case  1: bcomp = OpalPciBusAll;   break;
>>>>+		case  2: bcomp = OpalPciBus7Bits; break;
>>>>+		case  4: bcomp = OpalPciBus6Bits; break;
>>>>+		case  8: bcomp = OpalPciBus5Bits; break;
>>>>+		case 16: bcomp = OpalPciBus4Bits; break;
>>>>+		case 32: bcomp = OpalPciBus3Bits; break;
>>>>+		default:
>>>>+			/* Fail back to case of one bus */
>>>>+			pe_warn(pe, "Cannot support %d buses\n", count);
>>>>+			bcomp = OpalPciBusAll;
>>>>+		}
>>>>+		rid_end = pe->rid + (count << 8);
>>>>+	} else {
>>>>+#ifdef CONFIG_PCI_IOV
>>>>+		if (pe->flags & PNV_IODA_PE_VF)
>>>>+			parent = pe->parent_dev;
>>>>+		else
>>>>+#endif
>>>>+			parent = pe->pdev->bus->self;
>>>>+		bcomp = OpalPciBusAll;
>>>>+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>>>+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>>>+		rid_end = pe->rid + 1;
>>>>+	}
>>>>+
>>>>+	/* Clear RID mapping */
>>>>+	for (rid = pe->rid; rid < rid_end; rid++)
>>>>+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>>>+
>>>>+	/* Unmapping PELTM */
>>>>+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>>>+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>>>+	if (rc)
>>>>+		pe_warn(pe, "Error %ld unmapping PELTM\n", rc);
>>>>+}
>>>>+
>>>>+static void pnv_ioda_release_pe(struct kref *kref)
>>>>+{
>>>>+	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
>>>>+	struct pnv_ioda_pe *tmp, *slave;
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+
>>>>+	pnv_ioda_release_pe_dma(pe);
>>>>+	pnv_ioda_release_pe_seg(pe);
>>>>+	pnv_ioda_deconfigure_pe(pe);
>>>>+
>>>>+	/* Release slave PEs for compound PE */
>>>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>+		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
>>>>+			pnv_ioda_pe_put(slave);
>>>>+	}
>>>>+
>>>>+	/* Remove the PE from various list. We need remove slave
>>>>+	 * PE from master's list.
>>>>+	 */
>>>>+	list_del(&pe->dma_link);
>>>>+	list_del(&pe->list);
>>>>+
>>>>+	/* Free PE number */
>>>>+	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
>>>>+}
>>>>+
>>>>+static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
>>>>+					    int pe_no)
>>>>+{
>>>>+	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
>>>>+
>>>>+	kref_init(&pe->kref);
>>>>+	pe->phb = phb;
>>>>+	pe->pe_number = pe_no;
>>>>+	INIT_LIST_HEAD(&pe->dma_link);
>>>>+	INIT_LIST_HEAD(&pe->list);
>>>>+
>>>>+	return pe;
>>>>+}
>>>>+
>>>>+static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
>>>>+					       int pe_no)
>>>>+{
>>>>+	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>>>+		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>>>+			__func__, pe_no, phb->hose->global_number);
>>>>+		return NULL;
>>>>  	}
>>>>
>>>>-	phb->ioda.pe_array[pe_no].phb = phb;
>>>>-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>>>+	/*
>>>>+	 * Same PE might be reserved for multiple times, which
>>>>+	 * is out of problem actually.
>>>>+	 */
>>>>+	set_bit(pe_no, phb->ioda.pe_alloc);
>>>>+	return pnv_ioda_init_pe(phb, pe_no);
>>>>  }
>>>>
>>>>-static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>>+static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>>  {
>>>>  	unsigned long pe_no;
>>>>  	unsigned long limit = phb->ioda.total_pe - 1;
>>>>@@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>>  			break;
>>>>
>>>>  		if (--limit >= phb->ioda.total_pe)
>>>>-			return IODA_INVALID_PE;
>>>>+			return NULL;
>>>>  	} while(1);
>>>>
>>>>-	phb->ioda.pe_array[pe_no].phb = phb;
>>>>-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>>>-	return pe_no;
>>>>-}
>>>>-
>>>>-static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
>>>>-{
>>>>-	WARN_ON(phb->ioda.pe_array[pe].pdev);
>>>>-
>>>>-	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
>>>>-	clear_bit(pe, phb->ioda.pe_alloc);
>>>>+	return pnv_ioda_init_pe(phb, pe_no);
>>>>  }
>>>>
>>>>  static int pnv_ioda1_init_m64(struct pnv_phb *phb)
>>>>@@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
>>>>  	}
>>>>  }
>>>>
>>>>-static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>-				struct pci_bus *bus, int all)
>>>>+static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>+						struct pci_bus *bus,
>>>>+						int all)
>>>
>>>
>>>Mechanical changes like this could easily go to a separate patch.
>>>
>>
>>Indeed. I'll see how I can split the patches up in next revision.
>>Thanks for the suggestion.
>>
>>>>  {
>>>>  	resource_size_t segsz = phb->ioda.m64_segsize;
>>>>  	struct pci_dev *pdev;
>>>>@@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>  	int i;
>>>>
>>>>  	if (!pnv_ioda_need_m64_pe(phb, bus))
>>>>-		return IODA_INVALID_PE;
>>>>+		return NULL;
>>>>
>>>>          /* Allocate bitmap */
>>>>  	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>>>>  	pe_bitsmap = kzalloc(size, GFP_KERNEL);
>>>>  	if (!pe_bitsmap) {
>>>>  		pr_warn("%s: Out of memory !\n", __func__);
>>>>-		return IODA_INVALID_PE;
>>>>+		return NULL;
>>>>  	}
>>>>
>>>>  	/* The bridge's M64 window might be extended to PHB's M64
>>>>@@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>  	/* No M64 window found ? */
>>>>  	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
>>>>  		kfree(pe_bitsmap);
>>>>-		return IODA_INVALID_PE;
>>>>+		return NULL;
>>>>  	}
>>>>
>>>>  	/* Figure out the master PE and put all slave PEs
>>>>@@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>  	}
>>>>
>>>>  	kfree(pe_bitsmap);
>>>>-	return master_pe->pe_number;
>>>>+	return master_pe;
>>>>  }
>>>>
>>>>  static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>>>>@@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>>>>   * but in the meantime, we need to protect them to avoid warnings
>>>>   */
>>>>  #ifdef CONFIG_PCI_MSI
>>>>-static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>>>+static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
>>>>  {
>>>>  	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>>>  	struct pnv_phb *phb = hose->private_data;
>>>>@@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>>>  }
>>>>  #endif /* CONFIG_PCI_MSI */
>>>>
>>>>-static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>>>-				  struct pnv_ioda_pe *parent,
>>>>-				  struct pnv_ioda_pe *child,
>>>>-				  bool is_add)
>>>>-{
>>>>-	const char *desc = is_add ? "adding" : "removing";
>>>>-	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>>>-			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>>>-	struct pnv_ioda_pe *slave;
>>>>-	long rc;
>>>>-
>>>>-	/* Parent PE affects child PE */
>>>>-	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>>>-				child->pe_number, op);
>>>>-	if (rc != OPAL_SUCCESS) {
>>>>-		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>>>-			rc, desc);
>>>>-		return -ENXIO;
>>>>-	}
>>>>-
>>>>-	if (!(child->flags & PNV_IODA_PE_MASTER))
>>>>-		return 0;
>>>>-
>>>>-	/* Compound case: parent PE affects slave PEs */
>>>>-	list_for_each_entry(slave, &child->slaves, list) {
>>>>-		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>>>-					slave->pe_number, op);
>>>>-		if (rc != OPAL_SUCCESS) {
>>>>-			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>>>-				rc, desc);
>>>>-			return -ENXIO;
>>>>-		}
>>>>-	}
>>>>-
>>>>-	return 0;
>>>>-}
>>>>-
>>>>-static int pnv_ioda_set_peltv(struct pnv_phb *phb,
>>>>-			      struct pnv_ioda_pe *pe,
>>>>-			      bool is_add)
>>>>-{
>>>>-	struct pnv_ioda_pe *slave;
>>>>-	struct pci_dev *pdev = NULL;
>>>>-	int ret;
>>>>-
>>>>-	/*
>>>>-	 * Clear PE frozen state. If it's master PE, we need
>>>>-	 * clear slave PE frozen state as well.
>>>>-	 */
>>>>-	if (is_add) {
>>>>-		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
>>>>-					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>-		if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>-			list_for_each_entry(slave, &pe->slaves, list)
>>>>-				opal_pci_eeh_freeze_clear(phb->opal_id,
>>>>-							  slave->pe_number,
>>>>-							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>-		}
>>>>-	}
>>>>-
>>>>-	/*
>>>>-	 * Associate PE in PELT. We need add the PE into the
>>>>-	 * corresponding PELT-V as well. Otherwise, the error
>>>>-	 * originated from the PE might contribute to other
>>>>-	 * PEs.
>>>>-	 */
>>>>-	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>>>-	if (ret)
>>>>-		return ret;
>>>>-
>>>>-	/* For compound PEs, any one affects all of them */
>>>>-	if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>-		list_for_each_entry(slave, &pe->slaves, list) {
>>>>-			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>>>-			if (ret)
>>>>-				return ret;
>>>>-		}
>>>>-	}
>>>>-
>>>>-	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>>>-		pdev = pe->pbus->self;
>>>>-	else if (pe->flags & PNV_IODA_PE_DEV)
>>>>-		pdev = pe->pdev->bus->self;
>>>>-#ifdef CONFIG_PCI_IOV
>>>>-	else if (pe->flags & PNV_IODA_PE_VF)
>>>>-		pdev = pe->parent_dev->bus->self;
>>>>-#endif /* CONFIG_PCI_IOV */
>>>>-	while (pdev) {
>>>>-		struct pci_dn *pdn = pci_get_pdn(pdev);
>>>>-		struct pnv_ioda_pe *parent;
>>>>-
>>>>-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>-			parent = &phb->ioda.pe_array[pdn->pe_number];
>>>>-			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>>>-			if (ret)
>>>>-				return ret;
>>>>-		}
>>>>-
>>>>-		pdev = pdev->bus->self;
>>>>-	}
>>>>-
>>>>-	return 0;
>>>>-}
>>>>-
>>>>-#ifdef CONFIG_PCI_IOV
>>>>-static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>>-{
>>>>-	struct pci_dev *parent;
>>>>-	uint8_t bcomp, dcomp, fcomp;
>>>>-	int64_t rc;
>>>>-	long rid_end, rid;
>>>>-
>>>>-	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
>>>>-	if (pe->pbus) {
>>>>-		int count;
>>>>-
>>>>-		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>>>-		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>>>-		parent = pe->pbus->self;
>>>>-		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>>>-			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
>>>>-		else
>>>>-			count = 1;
>>>>-
>>>>-		switch(count) {
>>>>-		case  1: bcomp = OpalPciBusAll;         break;
>>>>-		case  2: bcomp = OpalPciBus7Bits;       break;
>>>>-		case  4: bcomp = OpalPciBus6Bits;       break;
>>>>-		case  8: bcomp = OpalPciBus5Bits;       break;
>>>>-		case 16: bcomp = OpalPciBus4Bits;       break;
>>>>-		case 32: bcomp = OpalPciBus3Bits;       break;
>>>>-		default:
>>>>-			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
>>>>-			        count);
>>>>-			/* Do an exact match only */
>>>>-			bcomp = OpalPciBusAll;
>>>>-		}
>>>>-		rid_end = pe->rid + (count << 8);
>>>>-	} else {
>>>>-		if (pe->flags & PNV_IODA_PE_VF)
>>>>-			parent = pe->parent_dev;
>>>>-		else
>>>>-			parent = pe->pdev->bus->self;
>>>>-		bcomp = OpalPciBusAll;
>>>>-		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>>>-		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>>>-		rid_end = pe->rid + 1;
>>>>-	}
>>>>-
>>>>-	/* Clear the reverse map */
>>>>-	for (rid = pe->rid; rid < rid_end; rid++)
>>>>-		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>>>-
>>>>-	/* Release from all parents PELT-V */
>>>>-	while (parent) {
>>>>-		struct pci_dn *pdn = pci_get_pdn(parent);
>>>>-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>-			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
>>>>-						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>>>-			/* XXX What to do in case of error ? */
>>>
>>>
>>>Not much :) Free associated memory and mark it "dead" so it won't be used
>>>again till reboot. In what circumstance can this opal_pci_set_peltv() fail at
>>>all?
>>>
>>
>>Yeah, maybe. So far I haven't seen this failure in all the time the code
>>has been there. Note the code has been there for almost 4 years, since the
>>day Ben wrote it.
>
>
>Sure. But if it starts failing, we won't even notice it - there is not even
>a pr_err() or WARN_ON.
>

Agreed. I'll see what I can do. At the very least, I can add an error message to alert on failure.
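
A minimal userspace sketch of what such an alert could look like. Every name
here (sketch_set_peltv, sketch_remove_from_parent, SKETCH_OPAL_SUCCESS) is a
hypothetical stand-in, not the kernel or OPAL API, and the stub only pretends
that firmware rejects one particular parent PE:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the firmware success code. */
#define SKETCH_OPAL_SUCCESS 0

/* Number of failed removals seen so far, so callers can react. */
static int sketch_peltv_failures;

/* Stub for illustration only: pretends firmware rejects parent PE 7. */
static long sketch_set_peltv(int parent_pe, int child_pe)
{
	(void)child_pe;
	return parent_pe == 7 ? -1 : SKETCH_OPAL_SUCCESS;
}

/* Remove a child PE from a parent's PELT-V and warn loudly on failure. */
static int sketch_remove_from_parent(int parent_pe, int child_pe)
{
	long rc;

	assert(parent_pe >= 0 && child_pe >= 0);

	rc = sketch_set_peltv(parent_pe, child_pe);
	if (rc != SKETCH_OPAL_SUCCESS) {
		fprintf(stderr,
			"PE#%d: error %ld removing from PELTV of PE#%d\n",
			child_pe, rc, parent_pe);
		sketch_peltv_failures++;
		return -1;
	}

	return 0;
}
```

In the real function the message would presumably go through pe_warn(), and
the walk over the parents could continue after a failure so that the remaining
PELT-V entries are still cleaned up.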

>>
>>>
>>>>-		}
>>>>-		parent = parent->bus->self;
>>>>-	}
>>>>-
>>>>-	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
>>>>-				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>-
>>>>-	/* Disassociate PE in PELT */
>>>>-	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
>>>>-				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>>>-	if (rc)
>>>>-		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
>>>>-	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>>>-			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>>>-	if (rc)
>>>>-		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
>>>>-
>>>>-	pe->pbus = NULL;
>>>>-	pe->pdev = NULL;
>>>>-	pe->parent_dev = NULL;
>>>>-
>>>>-	return 0;
>>>>-}
>>>>-#endif /* CONFIG_PCI_IOV */
>>>>-
>>>>  static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>>  {
>>>>  	struct pci_dev *parent;
>>>>@@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>>  	}
>>>>
>>>>  	/* Configure PELTV */
>>>>-	pnv_ioda_set_peltv(phb, pe, true);
>>>>+	pnv_ioda_set_peltv(pe, true);
>>>>
>>>>  	/* Setup reverse map */
>>>>  	for (rid = pe->rid; rid < rid_end; rid++)
>>>>@@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>>>>  		if (pdn->pe_number != IODA_INVALID_PE)
>>>>  			continue;
>>>>
>>>>+		/* Increase reference count of the parent PE */
>>>
>>>When you comment like this, I read it as the comment belongs to the whole
>>>next chunk till the first empty line, i.e. to all 5 lines below, which is not
>>>the case. I'd remove the comment as 1) "pe_get" in pnv_ioda_pe_get() name
>>>suggests incrementing the reference counter 2) "pe" is always parent in this
>>>function. I do not insist though.
>>>
>>
>>I agree with your explanation. I'll remove this unhelpful comment.
>>
>>>
>>>>+		pnv_ioda_pe_get(pe);
>>>>  		pdn->pe_number = pe->pe_number;
>>>>  		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
>>>>  		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
>>>>@@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>>  {
>>>>  	struct pci_controller *hose = pci_bus_to_host(bus);
>>>>  	struct pnv_phb *phb = hose->private_data;
>>>>-	struct pnv_ioda_pe *pe;
>>>>+	struct pnv_ioda_pe *pe = NULL;
>>>>  	int pe_num = IODA_INVALID_PE;
>>>>
>>>>  	/* For partial hotplug case, the PE instance hasn't been destroyed
>>>>@@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>>  	}
>>>>
>>>>  	/* PE number for root bus should have been reserved */
>>>>-	if (pci_is_root_bus(bus))
>>>>-		pe_num = phb->ioda.root_pe_no;
>>>>+	if (pci_is_root_bus(bus) &&
>>>>+	    phb->ioda.root_pe_no != IODA_INVALID_PE)
>>>>+		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
>>>>
>>>>  	/* Check if PE is determined by M64 */
>>>>-	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
>>>>-		pe_num = phb->pick_m64_pe(phb, bus, all);
>>>>+	if (!pe && phb->pick_m64_pe)
>>>>+		pe = phb->pick_m64_pe(phb, bus, all);
>>>>
>>>>  	/* The PE number isn't pinned by M64 */
>>>>-	if (pe_num == IODA_INVALID_PE)
>>>>-		pe_num = pnv_ioda_alloc_pe(phb);
>>>>+	if (!pe)
>>>>+		pe = pnv_ioda_alloc_pe(phb);
>>>>
>>>>-	if (pe_num == IODA_INVALID_PE) {
>>>>-		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
>>>>+	if (!pe) {
>>>>+		pr_warn("%s: No enough PE# available for PCI bus %04x:%02x\n",
>>>>  			__func__, pci_domain_nr(bus), bus->number);
>>>>  		return NULL;
>>>>  	}
>>>>
>>>>-	pe = &phb->ioda.pe_array[pe_num];
>>>>  	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
>>>>  	pe->pbus = bus;
>>>>  	pe->pdev = NULL;
>>>>@@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>>
>>>>  	if (pnv_ioda_configure_pe(phb, pe)) {
>>>>  		/* XXX What do we do here ? */
>>>>-		if (pe_num)
>>>>-			pnv_ioda_free_pe(phb, pe_num);
>>>>-		pe->pbus = NULL;
>>>>+		pnv_ioda_pe_put(pe);
>>>>  		return NULL;
>>>>  	}
>>>>
>>>>  	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
>>>>-			GFP_KERNEL, hose->node);
>>>>+				       GFP_KERNEL, hose->node);
>>>
>>>Seems like spaces change only - if you really want this change (which I hate
>>>- makes code look inaccurate to my taste but it seems I am in minority here
>>>:) ), please put it to the separate patch.
>>>
>>
>>Ok. To confirm: you prefer the original format? I don't know
>>why I prefer the latter one. Maybe my eyes are quite broken :-)
>
>
>I prefer not to change existing whitespaces unless it is done once and for
>the entire file :) Just remove this change from the patch.
>

Sure.

>>>
>>>>  	pe->tce32_table->data = pe;
>>>>
>>>>  	/* Associate it with all child devices */
>>>>@@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>>>  		list_del(&pe->list);
>>>>  		mutex_unlock(&phb->ioda.pe_list_mutex);
>>>>
>>>>-		pnv_ioda_deconfigure_pe(phb, pe);
>>>>+		pnv_ioda_deconfigure_pe(pe);
>>>
>>>
>>>Is this change necessary to get "Release PEs dynamically" working? Move it to
>>>mechanical changes patch may be?
>>>
>>
>>Ok. I'll try to do that.
>>
>>>
>>>>
>>>>-		pnv_ioda_free_pe(phb, pe->pe_number);
>>>>+		pnv_ioda_pe_put(pe);
>>>>  	}
>>>>  }
>>>>
>>>>@@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>>>
>>>>  		if (pnv_ioda_configure_pe(phb, pe)) {
>>>>  			/* XXX What do we do here ? */
>>>>-			if (pe_num)
>>>>-				pnv_ioda_free_pe(phb, pe_num);
>>>>-			pe->pdev = NULL;
>>>>+			pnv_ioda_pe_put(pe);
>>>>  			continue;
>>>>  		}
>>>>
>>>>@@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
>>>>  	struct pnv_ioda_pe *pe;
>>>>  	int rc;
>>>>
>>>>-	pe = pnv_ioda_get_pe(dev);
>>>>+	pe = pnv_ioda_pci_dev_to_pe(dev);
>>>
>>>
>>>And this change could go separately. It is not clear how this helps to "Release PEs
>>>dynamically".
>>>
>>>
>>
>>It's not related to "Release PEs dynamically". The change is introduced by
>>the function rename: Original pnv_ioda_get_pe() is renamed to pnv_ioda_pci_dev_to_pe().
>
>
>But the rename happened in this patch and the patch's subj is "Release PEs
>dynamically" so it should be related somehow or move it to a simple separate
>patch "let's give the lalala function a better name to reflect what it
>actually does" (but in this case the new name does not make any more sense
>than the old one).
>

Yeah, I'll try to split the patches to separate blala and walala :-)

>>>>  	if (!pe)
>>>>  		return -ENODEV;
>>>>
>>>>@@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>>>>  	struct pnv_ioda_pe *pe;
>>>>  	int rc;
>>>>
>>>>-	if (!(pe = pnv_ioda_get_pe(dev)))
>>>>+	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
>>>>  		return -ENODEV;
>>>>
>>>>  	/* Assign XIVE to PE */
>>>>@@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
>>>>  				  unsigned int hwirq, unsigned int virq,
>>>>  				  unsigned int is_64, struct msi_msg *msg)
>>>>  {
>>>>-	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
>>>>+	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
>>>>  	unsigned int xive_num = hwirq - phb->msi_base;
>>>>  	__be32 data;
>>>>  	int rc;
>>>>@@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>>>  	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
>>>>  	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
>>>>  	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
>>>>+	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
>>>>  	hose->controller_ops = pnv_pci_controller_ops;
>>>>
>>>>  #ifdef CONFIG_PCI_IOV
>>>>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>>>index 1bea3a8..8b10f01 100644
>>>>--- a/arch/powerpc/platforms/powernv/pci.h
>>>>+++ b/arch/powerpc/platforms/powernv/pci.h
>>>>@@ -28,6 +28,7 @@ enum pnv_phb_model {
>>>>  /* Data associated with a PE, including IOMMU tracking etc.. */
>>>>  struct pnv_phb;
>>>>  struct pnv_ioda_pe {
>>>>+	struct kref		kref;
>>>>  	unsigned long		flags;
>>>>  	struct pnv_phb		*phb;
>>>>
>>>>@@ -120,7 +121,8 @@ struct pnv_phb {
>>>>  	void (*shutdown)(struct pnv_phb *phb);
>>>>  	int (*init_m64)(struct pnv_phb *phb);
>>>>  	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
>>>>-	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
>>>>+	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
>>>>+					   struct pci_bus *bus, int all);
>>>>  	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>>>>  	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
>>>>  	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
>>>>

Thanks,
Gavin



* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
@ 2015-05-12  0:03           ` Gavin Shan
From: Gavin Shan @ 2015-05-12  0:03 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: bhelgaas, linux-pci, linuxppc-dev, Gavin Shan

On Mon, May 11, 2015 at 05:02:08PM +1000, Alexey Kardashevskiy wrote:
>On 05/11/2015 04:25 PM, Gavin Shan wrote:
>>On Sat, May 09, 2015 at 10:43:23PM +1000, Alexey Kardashevskiy wrote:
>>>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>>>The original code doesn't support releasing PEs dynamically, meaning
>>>>that PE and the associated resources (IO, M32, M64 and DMA) can't
>>>>be released when unplugging a PCI adapter from one hotpluggable slot.
>>>>
>>>>The patch takes an object-oriented approach and introduces a reference
>>>>count to the PE, which is initialized to 1 and increased by 1 when a
>>>>new PCI device joins the PE. Once the last PCI device leaves the
>>>>PE, the PE is released together with its associated
>>>>(IO, M32, M64, DMA) resources.
>>>
>>>
>>>Too little commit log for non-trivial non-cut-n-paste 30KB patch...
>>>
>>
>>Ok. I'll add more details in next revision.
>>
>>>>
>>>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>---
>>>>  arch/powerpc/include/asm/pci-bridge.h     |   3 +
>>>>  arch/powerpc/kernel/pci-hotplug.c         |   5 +
>>>>  arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>>>>  arch/powerpc/platforms/powernv/pci.h      |   4 +-
>>>>  4 files changed, 432 insertions(+), 238 deletions(-)
>>>>
>>>>diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>>>index 5367eb3..a6ad4b1 100644
>>>>--- a/arch/powerpc/include/asm/pci-bridge.h
>>>>+++ b/arch/powerpc/include/asm/pci-bridge.h
>>>>@@ -31,6 +31,9 @@ struct pci_controller_ops {
>>>>  	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>>>>  	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>>>>  	void		(*reset_secondary_bus)(struct pci_dev *dev);
>>>>+
>>>>+	/* Called when PCI device is released */
>>>>+	void		(*release_device)(struct pci_dev *);
>>>>  };
>>>>
>>>>  /*
>>>>diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>>>>index 7ed85a6..0040343 100644
>>>>--- a/arch/powerpc/kernel/pci-hotplug.c
>>>>+++ b/arch/powerpc/kernel/pci-hotplug.c
>>>>@@ -29,6 +29,11 @@
>>>>   */
>>>>  void pcibios_release_device(struct pci_dev *dev)
>>>>  {
>>>>+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>>>+
>>>>+	if (hose->controller_ops.release_device)
>>>>+		hose->controller_ops.release_device(dev);
>>>>+
>>>>  	eeh_remove_device(dev);
>>>>  }
>>>>
>>>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>index 910fb67..ef8c216 100644
>>>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>@@ -12,6 +12,8 @@
>>>>  #undef DEBUG
>>>>
>>>>  #include <linux/kernel.h>
>>>>+#include <linux/atomic.h>
>>>>+#include <linux/kref.h>
>>>>  #include <linux/pci.h>
>>>>  #include <linux/crash_dump.h>
>>>>  #include <linux/debugfs.h>
>>>>@@ -47,6 +49,8 @@
>>>>  /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>>>>  #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>>>>
>>>>+static void pnv_ioda_release_pe(struct kref *kref);
>>>>+
>>>>  static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>>>>  			    const char *fmt, ...)
>>>>  {
>>>>@@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>>>>  		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>>>>  }
>>>>
>>>>-static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
>>>>+static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>>>>  {
>>>>-	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>>>-		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>>>-			__func__, pe_no, phb->hose->global_number);
>>>>+	if (!pe)
>>>>+		return;
>>>>+
>>>>+	kref_get(&pe->kref);
>>>>+}
>>>>+
>>>>+static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
>>>>+{
>>>>+	unsigned int count;
>>>>+
>>>>+	if (!pe)
>>>>  		return;
>>>>+
>>>>+	/*
>>>>+	 * The count is initialized to 1 and increased with 1 when
>>>>+	 * a new PCI device is bound with the PE. Once the last PCI
>>>>+	 * device is leaving from the PE, the PE is going to be
>>>>+	 * released.
>>>>+	 */
>>>>+	count = atomic_read(&pe->kref.refcount);
>>>>+	if (count == 2)
>>>>+		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
>>>>+	else
>>>>+		kref_put(&pe->kref, pnv_ioda_release_pe);
>>>
>>>
>>>What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
>>>
>>
>>Yeah, that would be a problem. But it shouldn't happen because the
>>PCI devices join the parent PE# in a strictly serialized manner.
>>The same applies when detaching PCI devices from their parent PE.
>
>
>oookay. Another thing then - why is this kref counter initialized to 1?
>It would make sense if you did something special when the counter becomes 1
>after decrement but you do not.
>
>Also, this kref thing makes sense if you do kref_put() in multiple places and
>do not know which one will be the last one so you pass the callback to all of
>them. Here you do kref_put/sub in one place and you read the counter - so you
>can call pnv_ioda_release_pe() directly. And it feels like a simple atomic_t
>would do the job just fine. If you still feel that the counter should start
>from 1, there are atomic_dec_if_positive() and atomic_inc_not_zero() and
>others.
>

It's a good question, actually. The counter is initialized to 1 when the PE
is reserved because of an M64 requirement, or allocated in the non-M64 case. If
we reserve or allocate a PE#, one thing is certain: the PCI bus has
at least one PCI device (including a PCI bridge). After the PE# is reserved
or allocated, the PCI device joins the PE, which increases
the counter by 1. That means the counter is 2 when the PE contains one PCI
device, and 3 when there are 2 devices. One reason for this design is that
we only need to decrease the counter if we have to release this PE in the
window between PE reservation/allocation and the first PCI device joining. I
think you're correct that we can call pnv_ioda_release_pe() directly in this
window. That way, the counter always reflects the number of PCI devices
the PE contains.
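
To make the counting concrete, here is a small single-threaded model of the
scheme as it stands in this revision of the patch. It is only a sketch:
struct sketch_pe and the sketch_pe_*() helpers are made-up names, a plain int
replaces the kref, and it deliberately ignores the read-then-subtract race
raised above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PE: 1 reference for the reservation itself + 1 per joined device. */
struct sketch_pe {
	int refcount;
	bool released;
};

/* Reserve/allocate: the counter starts at 1 with no device attached. */
static void sketch_pe_init(struct sketch_pe *pe)
{
	pe->refcount = 1;
	pe->released = false;
}

/* A PCI device joins the PE. */
static void sketch_pe_get(struct sketch_pe *pe)
{
	assert(!pe->released);
	pe->refcount++;
}

/*
 * A PCI device leaves the PE, or the reservation is dropped before any
 * device joined. When the last device leaves (count == 2), the base
 * reference is dropped together with the device reference.
 */
static void sketch_pe_put(struct sketch_pe *pe)
{
	assert(pe->refcount > 0);

	if (pe->refcount == 2)
		pe->refcount -= 2;
	else
		pe->refcount--;

	if (pe->refcount == 0)
		pe->released = true;
}
```

With two devices the counter goes 1, 2, 3 on the way up and 2, 0 on the way
down, so the PE is released as soon as the second device leaves. Starting the
counter at 0 and releasing directly in the reservation window would remove the
special case for 2.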

>>>>+}
>>>>+
>>>>+static void pnv_pci_release_device(struct pci_dev *pdev)
>>>>+{
>>>>+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>>>+	struct pnv_phb *phb = hose->private_data;
>>>>+	struct pci_dn *pdn = pci_get_pdn(pdev);
>>>>+	struct pnv_ioda_pe *pe;
>>>>+
>>>>+	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>+		pe = &phb->ioda.pe_array[pdn->pe_number];
>>>>+		pnv_ioda_pe_put(pe);
>>>>+		pdn->pe_number = IODA_INVALID_PE;
>>>>  	}
>>>>+}
>>>>
>>>>-	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
>>>>-		pr_warn("%s: PE %d was assigned on PHB#%x\n",
>>>>-			__func__, pe_no, phb->hose->global_number);
>>>>+static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
>>>>+{
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+	int index, count;
>>>>+	unsigned long tbl_addr, tbl_size;
>>>>+
>>>>+	/* No DMA capability for slave PEs */
>>>>+	if (pe->flags & PNV_IODA_PE_SLAVE)
>>>>+		return;
>>>>+
>>>>+	/* Bypass DMA window */
>>>>+	if (phb->type == PNV_PHB_IODA2 &&
>>>>+	    pe->tce_bypass_enabled &&
>>>>+	    pe->tce32_table &&
>>>>+	    pe->tce32_table->set_bypass)
>>>>+		pe->tce32_table->set_bypass(pe->tce32_table, false);
>>>>+
>>>>+	/* 32-bits DMA window */
>>>>+	count = pe->tce32_seg_end - pe->tce32_seg_start;
>>>>+	tbl_addr = pe->tce32_table->it_base;
>>>>+	if (!count)
>>>>  		return;
>>>>+
>>>>+	/* Free IOMMU table */
>>>>+	iommu_free_table(pe->tce32_table,
>>>>+			 of_node_full_name(phb->hose->dn));
>>>>+
>>>>+	/* Deconfigure TCE table */
>>>>+	switch (phb->type) {
>>>>+	case PNV_PHB_IODA1:
>>>>+		for (index = 0; index < count; index++)
>>>>+			opal_pci_map_pe_dma_window(phb->opal_id,
>>>>+						   pe->pe_number,
>>>>+						   pe->tce32_seg_start + index,
>>>>+						   1,
>>>>+						   __pa(tbl_addr) +
>>>>+						   index * TCE32_TABLE_SIZE,
>>>>+						   0,
>>>>+						   0x1000);
>>>>+		bitmap_clear(phb->ioda.tce32_segmap,
>>>>+			     pe->tce32_seg_start,
>>>>+			     count);
>>>>+		tbl_size = TCE32_TABLE_SIZE * count;
>>>>+		break;
>>>>+	case PNV_PHB_IODA2:
>>>>+		opal_pci_map_pe_dma_window(phb->opal_id,
>>>>+					   pe->pe_number,
>>>>+					   pe->pe_number << 1,
>>>>+					   1,
>>>>+					   __pa(tbl_addr),
>>>>+					   0,
>>>>+					   0x1000);
>>>>+		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
>>>>+		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
>>>>+		break;
>>>>+	default:
>>>>+		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
>>>>+		return;
>>>>+	}
>>>>+
>>>>+	/* Free memory of IOMMU table */
>>>>+	free_pages(tbl_addr, get_order(tbl_size));
>>>
>>>
>>>You just programmed the table address to TVT and then you are releasing the
>>>pages. It does not seem right, it will leave garbage in TVT. Also, I am
>>>adding helpers to alloc/free TCE pages in DDW patchset, you could reuse bits
>>>from there (I'll post v10 soon, you'll be in copy and you'll have to review
>>>that ;) ).
>>>
>>
>>I assume you're talking about the TVE. I don't understand how garbage will be left
>>in the TVE. opal_pci_map_pe_dma_window(), which is handled by skiboot, clears the TVE
>>when "tce_table_size" is zero. The pages previously allocated for the TCE table are
>>released to the buddy system, and can then be allocated by somebody else (from buddy
>>or slab).
>
>opal_pci_map_pe_dma_window() takes __pa(tbl_addr) which points to some memory
>which is still allocated. This value goes to a table (which has 2 entries per
>PE, one for 32bit DMA window and one for bypass/hugewindow) which PHB uses to
>get the actual TCE table address. What is the name of this table? :) Anyway,
>you write an address there and then you call free_pages() so after
>free_pages(), the value in that TVE/TVT/whatever table is garbage.
>

I haven't looked into your DDW code yet. Until we have the DDW patchset, the
bypass TVE (window) isn't supposed to have a corresponding TCE table. I guess
you might change that behaviour in your DDW patchset, and I'll take a close look
at that. For the DMA32 window, which is the name of the table, the TVE is
cleared by skiboot when the "tce_table_size" argument is zero.

	opal_pci_map_pe_dma_window(phb->opal_id,
				   pe->pe_number,
				   pe->pe_number << 1,
				   1,
				   __pa(tbl_addr),
				   0,			<<<< "tce_table_size".
				   0x1000);
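
To make the ordering concrete, here is a small user-space sketch in C. It is
not the kernel or skiboot code: struct tve, tvt[], map_dma_window() and
release_dma_window() are hypothetical names, and the rule "a zero size clears
the entry" is only a model of the behaviour described above. Under that
assumption the entry no longer holds the table address by the time the pages
are freed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one TVE in the PHB; not the real layout. */
struct tve {
	uint64_t table_addr;	/* address of the TCE table */
	uint64_t table_size;	/* 0 means the window is disabled */
};

#define NR_TVE 8
struct tve tvt[NR_TVE];

/*
 * Toy model of the firmware call: a zero table size clears the
 * entry, whatever address was passed in (assumed behaviour).
 */
int map_dma_window(unsigned int window, uint64_t addr, uint64_t size)
{
	if (window >= NR_TVE)
		return -1;
	if (size == 0) {
		tvt[window].table_addr = 0;
		tvt[window].table_size = 0;
		return 0;
	}
	tvt[window].table_addr = addr;
	tvt[window].table_size = size;
	return 0;
}

/* Tear down in the order used by the patch: unmap first, then free. */
int release_dma_window(unsigned int window, void *table)
{
	int rc = map_dma_window(window, (uint64_t)(uintptr_t)table, 0);

	if (rc)
		return rc;	/* entry may still be live: keep the pages */
	free(table);
	return 0;
}
```

If the real firmware stored the address even for a zero size, the entry would
be left pointing at freed pages, which is the concern raised in the review.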

>
>>
>>Ok. Please put me into the cc list. I guess the whole series of patches is
>>better to rebased on your DDW patchset, which is to be merged first, I believe.
>>
>>>
>>>>+	pe->tce32_table = NULL;
>>>>+	pe->tce32_seg_start = 0;
>>>>+	pe->tce32_seg_end = 0;
>>>>+}
>>>>+
>>>>+static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe *pe)
>>>>+{
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+	unsigned long *segmap = NULL, *pe_segmap = NULL;
>>>>+	int i;
>>>>+	uint16_t win, win_type[] = { OPAL_IO_WINDOW_TYPE,
>>>>+				     OPAL_M32_WINDOW_TYPE,
>>>>+				     OPAL_M64_WINDOW_TYPE };
>>>>+
>>>>+	for (win = 0; win < ARRAY_SIZE(win_type); win++) {
>>>>+		switch (win_type[win]) {
>>>>+		case OPAL_IO_WINDOW_TYPE:
>>>>+			segmap = phb->ioda.io_segmap;
>>>>+			pe_segmap = pe->io_segmap;
>>>>+			break;
>>>>+		case OPAL_M32_WINDOW_TYPE:
>>>>+			segmap = phb->ioda.m32_segmap;
>>>>+			pe_segmap = pe->m32_segmap;
>>>>+			break;
>>>>+		case OPAL_M64_WINDOW_TYPE:
>>>>+			segmap = phb->ioda.m64_segmap;
>>>>+			pe_segmap = pe->m64_segmap;
>>>>+			break;
>>>>+		}
>>>>+		i = -1;
>>>>+		while ((i = find_next_bit(pe_segmap,
>>>>+			phb->ioda.total_pe, i + 1)) < phb->ioda.total_pe) {
>>>>+			if (win_type[win] == OPAL_IO_WINDOW_TYPE ||
>>>>+			    win_type[win] == OPAL_M32_WINDOW_TYPE)
>>>>+				opal_pci_map_pe_mmio_window(phb->opal_id,
>>>>+						phb->ioda.reserved_pe,
>>>>+						win_type[win], 0, i);
>>>>+			else if (phb->type == PNV_PHB_IODA1)
>>>>+				opal_pci_map_pe_mmio_window(phb->opal_id,
>>>>+						phb->ioda.reserved_pe,
>>>>+						win_type[win],
>>>>+						i / 8, i % 8);
>>>
>>>The function is called "release" but it programs something that looks like
>>>reasonable values, is it correct?
>>>
>>
>>It isn't a problem. When a segment is deallocated, it's mapped to the
>>reserved PE#.
>>
>>>
>>>
>>>>+
>>>>+			clear_bit(i, pe_segmap);
>>>>+			clear_bit(i, segmap);
>>>>+		}
>>>>+	}
>>>>+}
>>>>+
>>>>+static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>>>+				  struct pnv_ioda_pe *parent,
>>>>+				  struct pnv_ioda_pe *child,
>>>>+				  bool is_add)
>>>>+{
>>>>+	const char *desc = is_add ? "adding" : "removing";
>>>>+	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>>>+			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>>>+	struct pnv_ioda_pe *slave;
>>>>+	long rc;
>>>>+
>>>>+	/* Parent PE affects child PE */
>>>>+	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>>>+				child->pe_number, op);
>>>>+	if (rc != OPAL_SUCCESS) {
>>>>+		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>>>+			rc, desc);
>>>>+		return -ENXIO;
>>>>+	}
>>>>+
>>>>+	if (!(child->flags & PNV_IODA_PE_MASTER))
>>>>+		return 0;
>>>>+
>>>>+	/* Compound case: parent PE affects slave PEs */
>>>>+	list_for_each_entry(slave, &child->slaves, list) {
>>>>+		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>>>+					slave->pe_number, op);
>>>>+		if (rc != OPAL_SUCCESS) {
>>>>+			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>>>+				rc, desc);
>>>>+			return -ENXIO;
>>>>+		}
>>>>+	}
>>>>+
>>>>+	return 0;
>>>>+}
>>>>+
>>>>+static int pnv_ioda_set_peltv(struct pnv_ioda_pe *pe, bool is_add)
>>>>+{
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+	struct pnv_ioda_pe *slave;
>>>>+	struct pci_dev *pdev = NULL;
>>>>+	int ret;
>>>>+
>>>>+	/*
>>>>+	 * Clear PE frozen state. If it's master PE, we need
>>>>+	 * clear slave PE frozen state as well.
>>>>+	 */
>>>>+	opal_pci_eeh_freeze_clear(phb->opal_id,
>>>>+				  pe->pe_number,
>>>>+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>+		list_for_each_entry(slave, &pe->slaves, list) {
>>>>+			opal_pci_eeh_freeze_clear(phb->opal_id,
>>>>+						  slave->pe_number,
>>>>+						  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>+		}
>>>>+	}
>>>>+
>>>>+	/*
>>>>+	 * Associate PE in PELT. We need add the PE into the
>>>>+	 * corresponding PELT-V as well. Otherwise, the error
>>>>+	 * originated from the PE might contribute to other
>>>>+	 * PEs.
>>>>+	 */
>>>>+	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>>>+	if (ret)
>>>>+		return ret;
>>>>+
>>>>+	/* For compound PEs, any one affects all of them */
>>>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>+		list_for_each_entry(slave, &pe->slaves, list) {
>>>>+			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>>>+			if (ret)
>>>>+				return ret;
>>>>+		}
>>>>+	}
>>>>+
>>>>+	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>>>+		pdev = pe->pbus->self;
>>>>+	else if (pe->flags & PNV_IODA_PE_DEV)
>>>>+		pdev = pe->pdev->bus->self;
>>>>+#ifdef CONFIG_PCI_IOV
>>>>+	else if (pe->flags & PNV_IODA_PE_VF)
>>>>+		pdev = pe->parent_dev->bus->self;
>>>>+#endif /* CONFIG_PCI_IOV */
>>>>+
>>>>+	while (pdev) {
>>>>+		struct pci_dn *pdn = pci_get_pdn(pdev);
>>>>+		struct pnv_ioda_pe *parent;
>>>>+
>>>>+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>+			parent = &phb->ioda.pe_array[pdn->pe_number];
>>>>+			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>>>+			if (ret)
>>>>+				return ret;
>>>>+		}
>>>>+
>>>>+		pdev = pdev->bus->self;
>>>>+	}
>>>>+
>>>>+	return 0;
>>>>+}
>>>>+
>>>>+static void pnv_ioda_deconfigure_pe(struct pnv_ioda_pe *pe)
>>>
>>>
>>>It used to be under #ifdef CONFIG_PCI_IOV, now it is not. Looks like just
>>>moving of this function to a different place deserves a separate patch with a
>>>comment why ("it is going to be used now for non-SRIOV case too" may be?).
>>>
>>
>>Yeah, it makes sense to me. Will fix it up.
>>
>>>
>>>>+{
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+	struct pci_dev *parent;
>>>>+	uint8_t bcomp, dcomp, fcomp;
>>>>+	long rid_end, rid;
>>>>+	int64_t rc;
>>>>+
>>>>+	/* Tear down MVE */
>>>>+	if (phb->type == PNV_PHB_IODA1 &&
>>>>+	    pe->mve_number != -1) {
>>>>+		rc = opal_pci_set_mve(phb->opal_id,
>>>>+				      pe->mve_number,
>>>>+				      phb->ioda.reserved_pe);
>>>>+		if (rc != OPAL_SUCCESS)
>>>>+			pe_warn(pe, "Error %lld unmapping MVE#%d\n",
>>>>+				rc, pe->mve_number);
>>>>+		rc = opal_pci_set_mve_enable(phb->opal_id,
>>>>+					     pe->mve_number,
>>>>+					     OPAL_DISABLE_MVE);
>>>>+		if (rc != OPAL_SUCCESS)
>>>>+			pe_warn(pe, "Error %lld disabling MVE#%d\n",
>>>>+				rc, pe->mve_number);
>>>>+		pe->mve_number = -1;
>>>>+	}
>>>>+
>>>>+	/* Unmapping PELTV */
>>>>+	pnv_ioda_set_peltv(pe, false);
>>>>+
>>>>+	/* To unmap PELTM */
>>>>+	if (pe->pbus) {
>>>>+		int count;
>>>>+
>>>>+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>>>+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>>>+		parent = pe->pbus->self;
>>>>+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>>>+			count = pe->pbus->busn_res.end -
>>>>+				pe->pbus->busn_res.start + 1;
>>>>+		else
>>>>+			count = 1;
>>>>+
>>>>+		switch(count) {
>>>>+		case  1: bcomp = OpalPciBusAll;   break;
>>>>+		case  2: bcomp = OpalPciBus7Bits; break;
>>>>+		case  4: bcomp = OpalPciBus6Bits; break;
>>>>+		case  8: bcomp = OpalPciBus5Bits; break;
>>>>+		case 16: bcomp = OpalPciBus4Bits; break;
>>>>+		case 32: bcomp = OpalPciBus3Bits; break;
>>>>+		default:
>>>>+			/* Fail back to case of one bus */
>>>>+			pe_warn(pe, "Cannot support %d buses\n", count);
>>>>+			bcomp = OpalPciBusAll;
>>>>+		}
>>>>+		rid_end = pe->rid + (count << 8);
>>>>+	} else {
>>>>+#ifdef CONFIG_PCI_IOV
>>>>+		if (pe->flags & PNV_IODA_PE_VF)
>>>>+			parent = pe->parent_dev;
>>>>+		else
>>>>+#endif
>>>>+			parent = pe->pdev->bus->self;
>>>>+		bcomp = OpalPciBusAll;
>>>>+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>>>+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>>>+		rid_end = pe->rid + 1;
>>>>+	}
>>>>+
>>>>+	/* Clear RID mapping */
>>>>+	for (rid = pe->rid; rid < rid_end; rid++)
>>>>+		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>>>+
>>>>+	/* Unmapping PELTM */
>>>>+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>>>+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>>>+	if (rc)
>>>>+		pe_warn(pe, "Error %ld unmapping PELTM\n", rc);
>>>>+}
>>>>+
>>>>+static void pnv_ioda_release_pe(struct kref *kref)
>>>>+{
>>>>+	struct pnv_ioda_pe *pe = container_of(kref, struct pnv_ioda_pe, kref);
>>>>+	struct pnv_ioda_pe *tmp, *slave;
>>>>+	struct pnv_phb *phb = pe->phb;
>>>>+
>>>>+	pnv_ioda_release_pe_dma(pe);
>>>>+	pnv_ioda_release_pe_seg(pe);
>>>>+	pnv_ioda_deconfigure_pe(pe);
>>>>+
>>>>+	/* Release slave PEs for compound PE */
>>>>+	if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>+		list_for_each_entry_safe(slave, tmp, &pe->slaves, list)
>>>>+			pnv_ioda_pe_put(slave);
>>>>+	}
>>>>+
>>>>+	/* Remove the PE from various list. We need remove slave
>>>>+	 * PE from master's list.
>>>>+	 */
>>>>+	list_del(&pe->dma_link);
>>>>+	list_del(&pe->list);
>>>>+
>>>>+	/* Free PE number */
>>>>+	clear_bit(pe->pe_number, phb->ioda.pe_alloc);
>>>>+}
>>>>+
>>>>+static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb,
>>>>+					    int pe_no)
>>>>+{
>>>>+	struct pnv_ioda_pe *pe = &phb->ioda.pe_array[pe_no];
>>>>+
>>>>+	kref_init(&pe->kref);
>>>>+	pe->phb = phb;
>>>>+	pe->pe_number = pe_no;
>>>>+	INIT_LIST_HEAD(&pe->dma_link);
>>>>+	INIT_LIST_HEAD(&pe->list);
>>>>+
>>>>+	return pe;
>>>>+}
>>>>+
>>>>+static struct pnv_ioda_pe *pnv_ioda_reserve_pe(struct pnv_phb *phb,
>>>>+					       int pe_no)
>>>>+{
>>>>+	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>>>+		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>>>+			__func__, pe_no, phb->hose->global_number);
>>>>+		return NULL;
>>>>  	}
>>>>
>>>>-	phb->ioda.pe_array[pe_no].phb = phb;
>>>>-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>>>+	/*
>>>>+	 * Same PE might be reserved for multiple times, which
>>>>+	 * is out of problem actually.
>>>>+	 */
>>>>+	set_bit(pe_no, phb->ioda.pe_alloc);
>>>>+	return pnv_ioda_init_pe(phb, pe_no);
>>>>  }
>>>>
>>>>-static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>>+static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>>  {
>>>>  	unsigned long pe_no;
>>>>  	unsigned long limit = phb->ioda.total_pe - 1;
>>>>@@ -154,20 +533,10 @@ static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
>>>>  			break;
>>>>
>>>>  		if (--limit >= phb->ioda.total_pe)
>>>>-			return IODA_INVALID_PE;
>>>>+			return NULL;
>>>>  	} while(1);
>>>>
>>>>-	phb->ioda.pe_array[pe_no].phb = phb;
>>>>-	phb->ioda.pe_array[pe_no].pe_number = pe_no;
>>>>-	return pe_no;
>>>>-}
>>>>-
>>>>-static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe)
>>>>-{
>>>>-	WARN_ON(phb->ioda.pe_array[pe].pdev);
>>>>-
>>>>-	memset(&phb->ioda.pe_array[pe], 0, sizeof(struct pnv_ioda_pe));
>>>>-	clear_bit(pe, phb->ioda.pe_alloc);
>>>>+	return pnv_ioda_init_pe(phb, pe_no);
>>>>  }
>>>>
>>>>  static int pnv_ioda1_init_m64(struct pnv_phb *phb)
>>>>@@ -382,8 +751,9 @@ static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb,
>>>>  	}
>>>>  }
>>>>
>>>>-static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>-				struct pci_bus *bus, int all)
>>>>+static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>+						struct pci_bus *bus,
>>>>+						int all)
>>>
>>>
>>>Mechanical changes like this could easily go to a separate patch.
>>>
>>
>>Indeed. I'll see how I can split the patches up in next revision.
>>Thanks for the suggestion.
>>
>>>>  {
>>>>  	resource_size_t segsz = phb->ioda.m64_segsize;
>>>>  	struct pci_dev *pdev;
>>>>@@ -394,14 +764,14 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>  	int i;
>>>>
>>>>  	if (!pnv_ioda_need_m64_pe(phb, bus))
>>>>-		return IODA_INVALID_PE;
>>>>+		return NULL;
>>>>
>>>>          /* Allocate bitmap */
>>>>  	size = _ALIGN_UP(phb->ioda.total_pe / 8, sizeof(unsigned long));
>>>>  	pe_bitsmap = kzalloc(size, GFP_KERNEL);
>>>>  	if (!pe_bitsmap) {
>>>>  		pr_warn("%s: Out of memory !\n", __func__);
>>>>-		return IODA_INVALID_PE;
>>>>+		return NULL;
>>>>  	}
>>>>
>>>>  	/* The bridge's M64 window might be extended to PHB's M64
>>>>@@ -438,7 +808,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>  	/* No M64 window found ? */
>>>>  	if (bitmap_empty(pe_bitsmap, phb->ioda.total_pe)) {
>>>>  		kfree(pe_bitsmap);
>>>>-		return IODA_INVALID_PE;
>>>>+		return NULL;
>>>>  	}
>>>>
>>>>  	/* Figure out the master PE and put all slave PEs
>>>>@@ -491,7 +861,7 @@ static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb,
>>>>  	}
>>>>
>>>>  	kfree(pe_bitsmap);
>>>>-	return master_pe->pe_number;
>>>>+	return master_pe;
>>>>  }
>>>>
>>>>  static void __init pnv_ioda_parse_m64_window(struct pnv_phb *phb)
>>>>@@ -695,7 +1065,7 @@ static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>>>>   * but in the meantime, we need to protect them to avoid warnings
>>>>   */
>>>>  #ifdef CONFIG_PCI_MSI
>>>>-static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>>>+static struct pnv_ioda_pe *pnv_ioda_pci_dev_to_pe(struct pci_dev *dev)
>>>>  {
>>>>  	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>>>  	struct pnv_phb *phb = hose->private_data;
>>>>@@ -709,191 +1079,6 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>>>>  }
>>>>  #endif /* CONFIG_PCI_MSI */
>>>>
>>>>-static int pnv_ioda_set_one_peltv(struct pnv_phb *phb,
>>>>-				  struct pnv_ioda_pe *parent,
>>>>-				  struct pnv_ioda_pe *child,
>>>>-				  bool is_add)
>>>>-{
>>>>-	const char *desc = is_add ? "adding" : "removing";
>>>>-	uint8_t op = is_add ? OPAL_ADD_PE_TO_DOMAIN :
>>>>-			      OPAL_REMOVE_PE_FROM_DOMAIN;
>>>>-	struct pnv_ioda_pe *slave;
>>>>-	long rc;
>>>>-
>>>>-	/* Parent PE affects child PE */
>>>>-	rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>>>-				child->pe_number, op);
>>>>-	if (rc != OPAL_SUCCESS) {
>>>>-		pe_warn(child, "OPAL error %ld %s to parent PELTV\n",
>>>>-			rc, desc);
>>>>-		return -ENXIO;
>>>>-	}
>>>>-
>>>>-	if (!(child->flags & PNV_IODA_PE_MASTER))
>>>>-		return 0;
>>>>-
>>>>-	/* Compound case: parent PE affects slave PEs */
>>>>-	list_for_each_entry(slave, &child->slaves, list) {
>>>>-		rc = opal_pci_set_peltv(phb->opal_id, parent->pe_number,
>>>>-					slave->pe_number, op);
>>>>-		if (rc != OPAL_SUCCESS) {
>>>>-			pe_warn(slave, "OPAL error %ld %s to parent PELTV\n",
>>>>-				rc, desc);
>>>>-			return -ENXIO;
>>>>-		}
>>>>-	}
>>>>-
>>>>-	return 0;
>>>>-}
>>>>-
>>>>-static int pnv_ioda_set_peltv(struct pnv_phb *phb,
>>>>-			      struct pnv_ioda_pe *pe,
>>>>-			      bool is_add)
>>>>-{
>>>>-	struct pnv_ioda_pe *slave;
>>>>-	struct pci_dev *pdev = NULL;
>>>>-	int ret;
>>>>-
>>>>-	/*
>>>>-	 * Clear PE frozen state. If it's master PE, we need
>>>>-	 * clear slave PE frozen state as well.
>>>>-	 */
>>>>-	if (is_add) {
>>>>-		opal_pci_eeh_freeze_clear(phb->opal_id, pe->pe_number,
>>>>-					  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>-		if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>-			list_for_each_entry(slave, &pe->slaves, list)
>>>>-				opal_pci_eeh_freeze_clear(phb->opal_id,
>>>>-							  slave->pe_number,
>>>>-							  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>-		}
>>>>-	}
>>>>-
>>>>-	/*
>>>>-	 * Associate PE in PELT. We need add the PE into the
>>>>-	 * corresponding PELT-V as well. Otherwise, the error
>>>>-	 * originated from the PE might contribute to other
>>>>-	 * PEs.
>>>>-	 */
>>>>-	ret = pnv_ioda_set_one_peltv(phb, pe, pe, is_add);
>>>>-	if (ret)
>>>>-		return ret;
>>>>-
>>>>-	/* For compound PEs, any one affects all of them */
>>>>-	if (pe->flags & PNV_IODA_PE_MASTER) {
>>>>-		list_for_each_entry(slave, &pe->slaves, list) {
>>>>-			ret = pnv_ioda_set_one_peltv(phb, slave, pe, is_add);
>>>>-			if (ret)
>>>>-				return ret;
>>>>-		}
>>>>-	}
>>>>-
>>>>-	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
>>>>-		pdev = pe->pbus->self;
>>>>-	else if (pe->flags & PNV_IODA_PE_DEV)
>>>>-		pdev = pe->pdev->bus->self;
>>>>-#ifdef CONFIG_PCI_IOV
>>>>-	else if (pe->flags & PNV_IODA_PE_VF)
>>>>-		pdev = pe->parent_dev->bus->self;
>>>>-#endif /* CONFIG_PCI_IOV */
>>>>-	while (pdev) {
>>>>-		struct pci_dn *pdn = pci_get_pdn(pdev);
>>>>-		struct pnv_ioda_pe *parent;
>>>>-
>>>>-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>-			parent = &phb->ioda.pe_array[pdn->pe_number];
>>>>-			ret = pnv_ioda_set_one_peltv(phb, parent, pe, is_add);
>>>>-			if (ret)
>>>>-				return ret;
>>>>-		}
>>>>-
>>>>-		pdev = pdev->bus->self;
>>>>-	}
>>>>-
>>>>-	return 0;
>>>>-}
>>>>-
>>>>-#ifdef CONFIG_PCI_IOV
>>>>-static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>>-{
>>>>-	struct pci_dev *parent;
>>>>-	uint8_t bcomp, dcomp, fcomp;
>>>>-	int64_t rc;
>>>>-	long rid_end, rid;
>>>>-
>>>>-	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
>>>>-	if (pe->pbus) {
>>>>-		int count;
>>>>-
>>>>-		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
>>>>-		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
>>>>-		parent = pe->pbus->self;
>>>>-		if (pe->flags & PNV_IODA_PE_BUS_ALL)
>>>>-			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
>>>>-		else
>>>>-			count = 1;
>>>>-
>>>>-		switch(count) {
>>>>-		case  1: bcomp = OpalPciBusAll;         break;
>>>>-		case  2: bcomp = OpalPciBus7Bits;       break;
>>>>-		case  4: bcomp = OpalPciBus6Bits;       break;
>>>>-		case  8: bcomp = OpalPciBus5Bits;       break;
>>>>-		case 16: bcomp = OpalPciBus4Bits;       break;
>>>>-		case 32: bcomp = OpalPciBus3Bits;       break;
>>>>-		default:
>>>>-			dev_err(&pe->pbus->dev, "Number of subordinate buses %d unsupported\n",
>>>>-			        count);
>>>>-			/* Do an exact match only */
>>>>-			bcomp = OpalPciBusAll;
>>>>-		}
>>>>-		rid_end = pe->rid + (count << 8);
>>>>-	} else {
>>>>-		if (pe->flags & PNV_IODA_PE_VF)
>>>>-			parent = pe->parent_dev;
>>>>-		else
>>>>-			parent = pe->pdev->bus->self;
>>>>-		bcomp = OpalPciBusAll;
>>>>-		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
>>>>-		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
>>>>-		rid_end = pe->rid + 1;
>>>>-	}
>>>>-
>>>>-	/* Clear the reverse map */
>>>>-	for (rid = pe->rid; rid < rid_end; rid++)
>>>>-		phb->ioda.pe_rmap[rid] = IODA_INVALID_PE;
>>>>-
>>>>-	/* Release from all parents PELT-V */
>>>>-	while (parent) {
>>>>-		struct pci_dn *pdn = pci_get_pdn(parent);
>>>>-		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>-			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
>>>>-						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>>>-			/* XXX What to do in case of error ? */
>>>
>>>
>>>Not much :) Free associated memory and mark it "dead" so it won't be used
>>>again till reboot. In what circumstance can this opal_pci_set_peltv() fail at
>>>all?
>>>
>>
>>Yeah, maybe. So far I haven't seen this failure. Note that the code has been
>>there for almost 4 years, since the day Ben wrote it.
>
>
>Sure. But if it starts failing, we won't even notice it - there is not even a
>pr_err() or WARN_ON.
>

Agreed. I'll see what I can do. At the least, I can add an error message to alert.
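
A minimal sketch of the idea, in user-space C: walk the chain of parents,
report every failed removal, and return how many failed so the caller can
react. Everything here is hypothetical (struct pe_node, fake_set_peltv(),
fake_fail_pe, remove_from_parents()); the firmware call is faked so the
failure path can be exercised.

```c
#include <assert.h>
#include <stdio.h>

/* Toy parent chain; pe_number < 0 stands in for IODA_INVALID_PE. */
struct pe_node {
	int pe_number;
	struct pe_node *parent;
};

/* Fake firmware call: fails for one chosen parent PE (test hook). */
int fake_fail_pe = -1;

long fake_set_peltv(int parent_pe, int child_pe)
{
	(void)child_pe;
	return parent_pe == fake_fail_pe ? -5 : 0;
}

/*
 * Remove the child from every ancestor's PELT-V. Failures are
 * reported and counted instead of being silently dropped.
 */
int remove_from_parents(const struct pe_node *child)
{
	const struct pe_node *p;
	int failures = 0;

	for (p = child->parent; p; p = p->parent) {
		long rc;

		if (p->pe_number < 0)
			continue;	/* no PE behind this bridge */

		rc = fake_set_peltv(p->pe_number, child->pe_number);
		if (rc) {
			fprintf(stderr,
				"PE#%d: error %ld removing from PE#%d PELT-V\n",
				child->pe_number, rc, p->pe_number);
			failures++;
		}
	}

	return failures;
}
```

The loop keeps going after a failure so that the remaining parents are still
cleaned up; whether that is the right policy for the real code is a separate
question.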

>>
>>>
>>>>-		}
>>>>-		parent = parent->bus->self;
>>>>-	}
>>>>-
>>>>-	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
>>>>-				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
>>>>-
>>>>-	/* Disassociate PE in PELT */
>>>>-	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
>>>>-				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
>>>>-	if (rc)
>>>>-		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
>>>>-	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
>>>>-			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
>>>>-	if (rc)
>>>>-		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
>>>>-
>>>>-	pe->pbus = NULL;
>>>>-	pe->pdev = NULL;
>>>>-	pe->parent_dev = NULL;
>>>>-
>>>>-	return 0;
>>>>-}
>>>>-#endif /* CONFIG_PCI_IOV */
>>>>-
>>>>  static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>>  {
>>>>  	struct pci_dev *parent;
>>>>@@ -953,7 +1138,7 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
>>>>  	}
>>>>
>>>>  	/* Configure PELTV */
>>>>-	pnv_ioda_set_peltv(phb, pe, true);
>>>>+	pnv_ioda_set_peltv(pe, true);
>>>>
>>>>  	/* Setup reverse map */
>>>>  	for (rid = pe->rid; rid < rid_end; rid++)
>>>>@@ -1207,6 +1392,8 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>>>>  		if (pdn->pe_number != IODA_INVALID_PE)
>>>>  			continue;
>>>>
>>>>+		/* Increase reference count of the parent PE */
>>>
>>>When you comment like this, I read it as the comment belongs to the whole
>>>next chunk till the first empty line, i.e. to all 5 lines below, which is not
>>>the case. I'd remove the comment as 1) "pe_get" in pnv_ioda_pe_get() name
>>>suggests incrementing the reference counter 2) "pe" is always parent in this
>>>function. I do not insist though.
>>>
>>
>>I agree with your explanation. I'll remove this unhelpful comment.
>>
>>>
>>>>+		pnv_ioda_pe_get(pe);
>>>>  		pdn->pe_number = pe->pe_number;
>>>>  		pe->dma_weight += pnv_ioda_dev_dma_weight(dev);
>>>>  		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
>>>>@@ -1224,7 +1411,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>>  {
>>>>  	struct pci_controller *hose = pci_bus_to_host(bus);
>>>>  	struct pnv_phb *phb = hose->private_data;
>>>>-	struct pnv_ioda_pe *pe;
>>>>+	struct pnv_ioda_pe *pe = NULL;
>>>>  	int pe_num = IODA_INVALID_PE;
>>>>
>>>>  	/* For partial hotplug case, the PE instance hasn't been destroyed
>>>>@@ -1240,24 +1427,24 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>>  	}
>>>>
>>>>  	/* PE number for root bus should have been reserved */
>>>>-	if (pci_is_root_bus(bus))
>>>>-		pe_num = phb->ioda.root_pe_no;
>>>>+	if (pci_is_root_bus(bus) &&
>>>>+	    phb->ioda.root_pe_no != IODA_INVALID_PE)
>>>>+		pe = &phb->ioda.pe_array[phb->ioda.root_pe_no];
>>>>
>>>>  	/* Check if PE is determined by M64 */
>>>>-	if (pe_num == IODA_INVALID_PE && phb->pick_m64_pe)
>>>>-		pe_num = phb->pick_m64_pe(phb, bus, all);
>>>>+	if (!pe && phb->pick_m64_pe)
>>>>+		pe = phb->pick_m64_pe(phb, bus, all);
>>>>
>>>>  	/* The PE number isn't pinned by M64 */
>>>>-	if (pe_num == IODA_INVALID_PE)
>>>>-		pe_num = pnv_ioda_alloc_pe(phb);
>>>>+	if (!pe)
>>>>+		pe = pnv_ioda_alloc_pe(phb);
>>>>
>>>>-	if (pe_num == IODA_INVALID_PE) {
>>>>-		pr_warning("%s: Not enough PE# available for PCI bus %04x:%02x\n",
>>>>+	if (!pe) {
>>>>+		pr_warn("%s: No enough PE# available for PCI bus %04x:%02x\n",
>>>>  			__func__, pci_domain_nr(bus), bus->number);
>>>>  		return NULL;
>>>>  	}
>>>>
>>>>-	pe = &phb->ioda.pe_array[pe_num];
>>>>  	pe->flags |= (all ? PNV_IODA_PE_BUS_ALL : PNV_IODA_PE_BUS);
>>>>  	pe->pbus = bus;
>>>>  	pe->pdev = NULL;
>>>>@@ -1274,14 +1461,12 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
>>>>
>>>>  	if (pnv_ioda_configure_pe(phb, pe)) {
>>>>  		/* XXX What do we do here ? */
>>>>-		if (pe_num)
>>>>-			pnv_ioda_free_pe(phb, pe_num);
>>>>-		pe->pbus = NULL;
>>>>+		pnv_ioda_pe_put(pe);
>>>>  		return NULL;
>>>>  	}
>>>>
>>>>  	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
>>>>-			GFP_KERNEL, hose->node);
>>>>+				       GFP_KERNEL, hose->node);
>>>
>>>Seems like spaces change only - if you really want this change (which I hate
>>>- makes code look inaccurate to my taste but it seems I am in minority here
>>>:) ), please put it to the separate patch.
>>>
>>
>>Ok. To confirm with you: you prefer the original format? I don't know
>>why I prefer the latter one. Maybe my eyes are quite broken :-)
>
>
>I prefer not to change existing whitespaces unless it is done once and for
>the entire file :) Just remove this change from the patch.
>

Sure.

>>>
>>>>  	pe->tce32_table->data = pe;
>>>>
>>>>  	/* Associate it with all child devices */
>>>>@@ -1521,9 +1706,9 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>>>  		list_del(&pe->list);
>>>>  		mutex_unlock(&phb->ioda.pe_list_mutex);
>>>>
>>>>-		pnv_ioda_deconfigure_pe(phb, pe);
>>>>+		pnv_ioda_deconfigure_pe(pe);
>>>
>>>
>>>Is this change necessary to get "Release PEs dynamically" working? Maybe move
>>>it to the mechanical changes patch?
>>>
>>
>>Ok. I'll try to do that.
>>
>>>
>>>>
>>>>-		pnv_ioda_free_pe(phb, pe->pe_number);
>>>>+		pnv_ioda_pe_put(pe);
>>>>  	}
>>>>  }
>>>>
>>>>@@ -1601,9 +1786,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>>>>
>>>>  		if (pnv_ioda_configure_pe(phb, pe)) {
>>>>  			/* XXX What do we do here ? */
>>>>-			if (pe_num)
>>>>-				pnv_ioda_free_pe(phb, pe_num);
>>>>-			pe->pdev = NULL;
>>>>+			pnv_ioda_pe_put(pe);
>>>>  			continue;
>>>>  		}
>>>>
>>>>@@ -2263,7 +2446,7 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
>>>>  	struct pnv_ioda_pe *pe;
>>>>  	int rc;
>>>>
>>>>-	pe = pnv_ioda_get_pe(dev);
>>>>+	pe = pnv_ioda_pci_dev_to_pe(dev);
>>>
>>>
>>>And this change could go separately. It's not clear how this helps to "Release
>>>PEs dynamically".
>>>
>>>
>>
>>It's not related to "Release PEs dynamically". The change is introduced by
>>the function rename: Original pnv_ioda_get_pe() is renamed to pnv_ioda_pci_dev_to_pe().
>
>
>But the rename happened in this patch and the patch's subj is "Release PEs
>dynamically" so it should be related somehow or move it to a simple separate
>patch "let's give the lalala function a better name to reflect what it
>actually does" (but in this case the new name does not make any more sense
>than the old one).
>

Yeah, I'll try to split the patches to separate blala and walala :-)

>>>>  	if (!pe)
>>>>  		return -ENODEV;
>>>>
>>>>@@ -2379,7 +2562,7 @@ int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>>>>  	struct pnv_ioda_pe *pe;
>>>>  	int rc;
>>>>
>>>>-	if (!(pe = pnv_ioda_get_pe(dev)))
>>>>+	if (!(pe = pnv_ioda_pci_dev_to_pe(dev)))
>>>>  		return -ENODEV;
>>>>
>>>>  	/* Assign XIVE to PE */
>>>>@@ -2401,7 +2584,7 @@ static int pnv_pci_ioda_msi_setup(struct pnv_phb *phb, struct pci_dev *dev,
>>>>  				  unsigned int hwirq, unsigned int virq,
>>>>  				  unsigned int is_64, struct msi_msg *msg)
>>>>  {
>>>>-	struct pnv_ioda_pe *pe = pnv_ioda_get_pe(dev);
>>>>+	struct pnv_ioda_pe *pe = pnv_ioda_pci_dev_to_pe(dev);
>>>>  	unsigned int xive_num = hwirq - phb->msi_base;
>>>>  	__be32 data;
>>>>  	int rc;
>>>>@@ -3065,6 +3248,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>>>  	pnv_pci_controller_ops.setup_bridge = pnv_pci_setup_bridge;
>>>>  	pnv_pci_controller_ops.window_alignment = pnv_pci_window_alignment;
>>>>  	pnv_pci_controller_ops.reset_secondary_bus = pnv_pci_reset_secondary_bus;
>>>>+	pnv_pci_controller_ops.release_device = pnv_pci_release_device;
>>>>  	hose->controller_ops = pnv_pci_controller_ops;
>>>>
>>>>  #ifdef CONFIG_PCI_IOV
>>>>diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
>>>>index 1bea3a8..8b10f01 100644
>>>>--- a/arch/powerpc/platforms/powernv/pci.h
>>>>+++ b/arch/powerpc/platforms/powernv/pci.h
>>>>@@ -28,6 +28,7 @@ enum pnv_phb_model {
>>>>  /* Data associated with a PE, including IOMMU tracking etc.. */
>>>>  struct pnv_phb;
>>>>  struct pnv_ioda_pe {
>>>>+	struct kref		kref;
>>>>  	unsigned long		flags;
>>>>  	struct pnv_phb		*phb;
>>>>
>>>>@@ -120,7 +121,8 @@ struct pnv_phb {
>>>>  	void (*shutdown)(struct pnv_phb *phb);
>>>>  	int (*init_m64)(struct pnv_phb *phb);
>>>>  	void (*reserve_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus);
>>>>-	int (*pick_m64_pe)(struct pnv_phb *phb, struct pci_bus *bus, int all);
>>>>+	struct pnv_ioda_pe *(*pick_m64_pe)(struct pnv_phb *phb,
>>>>+					   struct pci_bus *bus, int all);
>>>>  	int (*get_pe_state)(struct pnv_phb *phb, int pe_no);
>>>>  	void (*freeze_pe)(struct pnv_phb *phb, int pe_no);
>>>>  	int (*unfreeze_pe)(struct pnv_phb *phb, int pe_no, int opt);
>>>>
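
For readers following the kref change, the lifetime rule from the cover letter
(reference count = number of child PCI devices + 1, PE released when the last
reference is dropped) can be sketched in user-space C as below. This is a toy
model, not the patch: struct toy_pe, toy_pe_alloc[] and the toy_pe_*() helpers
are hypothetical, and a plain integer stands in for struct kref.

```c
#include <assert.h>

#define TOY_TOTAL_PE 8

/* Toy PE; names are illustrative, not the kernel's struct pnv_ioda_pe. */
struct toy_pe {
	int refcount;	/* stands in for struct kref */
	int pe_number;
};

/* One flag per PE number; stands in for the pe_alloc bitmap. */
int toy_pe_alloc[TOY_TOTAL_PE];

/* Claim a PE number; the PE itself holds the first reference. */
void toy_pe_init(struct toy_pe *pe, int pe_number)
{
	pe->refcount = 1;
	pe->pe_number = pe_number;
	toy_pe_alloc[pe_number] = 1;
}

/* Each child device that joins the PE takes one reference. */
void toy_pe_get(struct toy_pe *pe)
{
	pe->refcount++;
}

/* Release callback; the patch also tears down DMA, segments and PELT here. */
void toy_pe_release(struct toy_pe *pe)
{
	toy_pe_alloc[pe->pe_number] = 0;
}

/* Drop one reference; returns 1 if this call released the PE. */
int toy_pe_put(struct toy_pe *pe)
{
	assert(pe->refcount > 0);
	if (--pe->refcount > 0)
		return 0;
	toy_pe_release(pe);
	return 1;
}
```

With two child devices the count reaches 3, and only the third put returns the
PE number to the free pool.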

Thanks,
Gavin


* Re: [PATCH v4 10/21] powerpc/powernv: Fundamental reset for PCI bus reset
  2015-05-11  7:17         ` Alexey Kardashevskiy
@ 2015-05-12  0:04           ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-12  0:04 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Mon, May 11, 2015 at 05:17:42PM +1000, Alexey Kardashevskiy wrote:
>On 05/11/2015 04:47 PM, Gavin Shan wrote:
>>On Sun, May 10, 2015 at 12:12:18AM +1000, Alexey Kardashevskiy wrote:
>>>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>>>Function pnv_pci_reset_secondary_bus() is used to reset the specified
>>>>PCI bus, which is led by a root complex or PCI bridge. That means
>>>>the function shouldn't be called on the PCI root bus and the patch
>>>>removes the logic for that case.
>>>>
>>>>Also, some adapters beneath the indicated PCI bus may require
>>>>fundamental reset in order to successfully reload their firmware
>>>>after the reset. The patch translates hot reset to fundamental reset
>>>>for that case.
>>>>
>>>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>---
>>>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 35 +++++++++++++++++++++-------
>>>>  1 file changed, 26 insertions(+), 9 deletions(-)
>>>>
>>>>diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>>>index 3c01095..58e4dcf 100644
>>>>--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>>>+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>>>@@ -888,18 +888,35 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>>>>  	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>>>>  }
>>>>
>>>>-void pnv_pci_reset_secondary_bus(struct pci_dev *dev)
>>>
>>>
>>>Why changing dev to pdev? Keeping "dev" could make the patch simpler.
>>>
>>
>>In the early stage when I wrote the EEH code, I used "dev" to refer to the PCI
>>device, which isn't precise enough. Strictly, "dev" means "struct device"
>>while "pdev" stands for "struct pci_dev". That's why I changed it.
>
>
>The rest of the file and the kernel overall use "dev" for pci_dev just fine.
>I would not bother.
>

Ok. I'll keep it.

>>>>+static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
>>>>  {
>>>>-	struct pci_controller *hose;
>>>>+	int *freset = data;
>>>>
>>>>-	if (pci_is_root_bus(dev->bus)) {
>>>>-		hose = pci_bus_to_host(dev->bus);
>>>>-		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
>>>>-		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
>>>>-	} else {
>>>>-		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
>>>>-		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
>>>>+	/*
>>>>+	 * Stop the iteration immediately if there is any
>>>>+	 * one PCI device requesting fundamental reset
>>>>+	 */
>>>>+	*freset |= pdev->needs_freset;
>>>>+	return *freset;
>>>>+}
>>>>+
>>>>+void pnv_pci_reset_secondary_bus(struct pci_dev *pdev)
>>>>+{
>>>>+	int option = EEH_RESET_HOT;
>>>>+	int freset = 0;
>>>>+
>>>>+	/* Check if there're any PCI devices asking for fundamental reset */
>>>>+	if (pdev->subordinate) {
>>>>+		pci_walk_bus(pdev->subordinate,
>>>>+			     pnv_pci_dev_reset_type,
>>>>+			     &freset);
>>>>+		if (freset)
>>>>+			option = EEH_RESET_FUNDAMENTAL;
>>>>  	}
>>>>+
>>>>+	/* Issue the requested type of reset */
>>>>+	pnv_eeh_bridge_reset(pdev, option);
>>>>+	pnv_eeh_bridge_reset(pdev, EEH_RESET_DEACTIVATE);
>>>>  }
>>>>
>>>>  /**
>>>>
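
The reset-type selection in the patch relies on pci_walk_bus() stopping as
soon as the callback returns non-zero. A toy user-space version of that logic
in C is sketched below; struct toy_dev, walk_bus(), check_freset() and
pick_reset() are hypothetical names, and the recursive walk is a
simplification, not how the kernel iterates buses.

```c
#include <assert.h>
#include <stddef.h>

enum reset_type { RESET_HOT, RESET_FUNDAMENTAL };

/* Toy device tree; field names are illustrative only. */
struct toy_dev {
	int needs_freset;
	struct toy_dev *child;		/* first device on subordinate bus */
	struct toy_dev *sibling;	/* next device on the same bus */
};

int visited;	/* how many devices the callback saw */

/* Depth-first walk that stops once the callback returns non-zero. */
int walk_bus(struct toy_dev *first,
	     int (*cb)(struct toy_dev *, void *), void *data)
{
	struct toy_dev *d;
	int ret;

	for (d = first; d; d = d->sibling) {
		ret = cb(d, data);
		if (ret)
			return ret;
		ret = walk_bus(d->child, cb, data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Same shape as the patch's callback: accumulate the flag, return it. */
int check_freset(struct toy_dev *d, void *data)
{
	int *freset = data;

	visited++;
	*freset |= d->needs_freset;
	return *freset;
}

/* Hot reset by default; fundamental if any device below asks for it. */
enum reset_type pick_reset(struct toy_dev *bridge)
{
	int freset = 0;

	if (bridge->child)
		walk_bus(bridge->child, check_freset, &freset);
	return freset ? RESET_FUNDAMENTAL : RESET_HOT;
}
```

Returning the accumulated flag from the callback is what ends the walk early:
devices after the first one that needs a fundamental reset are never visited.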

Thanks,
Gavin



* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
  2015-05-12  0:03           ` Gavin Shan
@ 2015-05-12  0:53             ` Alexey Kardashevskiy
  -1 siblings, 0 replies; 184+ messages in thread
From: Alexey Kardashevskiy @ 2015-05-12  0:53 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linuxppc-dev, linux-pci, benh, bhelgaas

On 05/12/2015 10:03 AM, Gavin Shan wrote:
> On Mon, May 11, 2015 at 05:02:08PM +1000, Alexey Kardashevskiy wrote:
>> On 05/11/2015 04:25 PM, Gavin Shan wrote:
>>> On Sat, May 09, 2015 at 10:43:23PM +1000, Alexey Kardashevskiy wrote:
>>>> On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>>>> The original code doesn't support releasing PEs dynamically, meaning
>>>>> that PE and the associated resources (IO, M32, M64 and DMA) can't
>>>>> be released when unplugging a PCI adapter from one hotpluggable slot.
>>>>>
>>>>> The patch takes an object-oriented approach and introduces a reference
>>>>> count to the PE, which is initialized to 1 and increased by 1 when a
>>>>> new PCI device joins the PE. Once the last PCI device leaves the
>>>>> PE, the PE is released together with its associated
>>>>> (IO, M32, M64, DMA) resources.
>>>>
>>>>
>>>> Too little commit log for non-trivial non-cut-n-paste 30KB patch...
>>>>
>>>
>>> Ok. I'll add more details in next revision.
>>>
>>>>>
>>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>> ---
>>>>>   arch/powerpc/include/asm/pci-bridge.h     |   3 +
>>>>>   arch/powerpc/kernel/pci-hotplug.c         |   5 +
>>>>>   arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>>>>>   arch/powerpc/platforms/powernv/pci.h      |   4 +-
>>>>>   4 files changed, 432 insertions(+), 238 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>>>> index 5367eb3..a6ad4b1 100644
>>>>> --- a/arch/powerpc/include/asm/pci-bridge.h
>>>>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>>>>> @@ -31,6 +31,9 @@ struct pci_controller_ops {
>>>>>   	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>>>>>   	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>>>>>   	void		(*reset_secondary_bus)(struct pci_dev *dev);
>>>>> +
>>>>> +	/* Called when PCI device is released */
>>>>> +	void		(*release_device)(struct pci_dev *);
>>>>>   };
>>>>>
>>>>>   /*
>>>>> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>>>>> index 7ed85a6..0040343 100644
>>>>> --- a/arch/powerpc/kernel/pci-hotplug.c
>>>>> +++ b/arch/powerpc/kernel/pci-hotplug.c
>>>>> @@ -29,6 +29,11 @@
>>>>>    */
>>>>>   void pcibios_release_device(struct pci_dev *dev)
>>>>>   {
>>>>> +	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>>>> +
>>>>> +	if (hose->controller_ops.release_device)
>>>>> +		hose->controller_ops.release_device(dev);
>>>>> +
>>>>>   	eeh_remove_device(dev);
>>>>>   }
>>>>>
>>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>> index 910fb67..ef8c216 100644
>>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>> @@ -12,6 +12,8 @@
>>>>>   #undef DEBUG
>>>>>
>>>>>   #include <linux/kernel.h>
>>>>> +#include <linux/atomic.h>
>>>>> +#include <linux/kref.h>
>>>>>   #include <linux/pci.h>
>>>>>   #include <linux/crash_dump.h>
>>>>>   #include <linux/debugfs.h>
>>>>> @@ -47,6 +49,8 @@
>>>>>   /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>>>>>   #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>>>>>
>>>>> +static void pnv_ioda_release_pe(struct kref *kref);
>>>>> +
>>>>>   static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>>>>>   			    const char *fmt, ...)
>>>>>   {
>>>>> @@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>>>>>   		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>>>>>   }
>>>>>
>>>>> -static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
>>>>> +static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>>>>>   {
>>>>> -	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>>>> -		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>>>> -			__func__, pe_no, phb->hose->global_number);
>>>>> +	if (!pe)
>>>>> +		return;
>>>>> +
>>>>> +	kref_get(&pe->kref);
>>>>> +}
>>>>> +
>>>>> +static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
>>>>> +{
>>>>> +	unsigned int count;
>>>>> +
>>>>> +	if (!pe)
>>>>>   		return;
>>>>> +
>>>>> +	/*
>>>>> +	 * The count is initialized to 1 and increased with 1 when
>>>>> +	 * a new PCI device is bound with the PE. Once the last PCI
>>>>> +	 * device is leaving from the PE, the PE is going to be
>>>>> +	 * released.
>>>>> +	 */
>>>>> +	count = atomic_read(&pe->kref.refcount);
>>>>> +	if (count == 2)
>>>>> +		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
>>>>> +	else
>>>>> +		kref_put(&pe->kref, pnv_ioda_release_pe);
>>>>
>>>>
>>>> What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
>>>>
>>>
>>> Yeah, that would be a problem. But it shouldn't happen because PCI
>>> devices join the parent PE# in a strictly serialized way.
>>> The same applies when detaching PCI devices from their parent PE.
>>
>>
>> oookay. Another thing then - why is this kref counter initialized to 1?
>> It would make sense if you did something special when the counter becomes 1
>> after decrement but you do not.
>>
>> Also, this kref thing makes sense if you do kref_put() in multiple places and
>> do not know which one will be the last one so you pass the callback to all of
>> them. Here you do kref_put/sub in one place and you read the counter - so you
>> can call pnv_ioda_release_pe() directly. And it feels like a simple atomic_t
>> would do the job just fine. If you still feel that the counter should start
>> from 1, there are atomic_dec_if_positive() and atomic_inc_not_zero() and
>> others.
>>
>
> It's a good question, actually. The counter is initialized to 1 when the PE
> is reserved because of an M64 requirement, or allocated in the non-M64 case.
> If we reserve or allocate a PE#, one thing is certain: the PCI bus has at
> least one PCI device (including a PCI bridge). After the PE# is reserved
> or allocated, the PCI device joins the PE, which increases the counter
> by 1. So the counter is 2 when the PE contains one PCI device, and 3 when
> it contains two. One reason for this design is that we only need to
> decrease the counter if we have to release the PE in the window between
> PE reservation/allocation and the first PCI device joining. I think you're
> correct that we can call pnv_ioda_release_pe() directly in this window.
> That way, the counter always reflects the number of PCI devices
> the PE contains.


Good :) I believe it was something different 2-3 versions ago and evolved
to this, so you did not notice it straight away :)
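
To make the outcome of this sub-thread concrete, here is a minimal userspace
sketch of a counter that always equals the number of devices in the PE. It uses
C11 atomics rather than kref or the kernel's atomic_t, and fake_pe,
fake_pe_get(), fake_pe_put() and fake_pe_release() are hypothetical names, not
the code from the patch. The decrement and the test for "last device" happen in
one atomic step, so there is no window between reading the counter and
subtracting from it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pnv_ioda_pe; only the counter matters. */
struct fake_pe {
	atomic_int device_count;	/* number of devices currently in the PE */
	bool released;			/* set once the resources are torn down */
};

static void fake_pe_init(struct fake_pe *pe)
{
	assert(pe != NULL);
	atomic_init(&pe->device_count, 0);
	pe->released = false;
}

/* Hypothetical release hook; the real one would free IO/M32/M64/DMA. */
static void fake_pe_release(struct fake_pe *pe)
{
	pe->released = true;
}

/* A device joins the PE. */
static void fake_pe_get(struct fake_pe *pe)
{
	atomic_fetch_add(&pe->device_count, 1);
}

/* A device leaves the PE. Returns true if this call released the PE. */
static bool fake_pe_put(struct fake_pe *pe)
{
	/* fetch_sub returns the previous value: decrement and test are atomic. */
	if (atomic_fetch_sub(&pe->device_count, 1) == 1) {
		fake_pe_release(pe);
		return true;
	}
	return false;
}
```

With this shape, a PE that was reserved but never got a device is torn down by
calling the release function directly. The sketch does not guard against a get
racing with the final put; it relies on the serialization described earlier in
the thread.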


>
>>>>> +}
>>>>> +
>>>>> +static void pnv_pci_release_device(struct pci_dev *pdev)
>>>>> +{
>>>>> +	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>>>> +	struct pnv_phb *phb = hose->private_data;
>>>>> +	struct pci_dn *pdn = pci_get_pdn(pdev);
>>>>> +	struct pnv_ioda_pe *pe;
>>>>> +
>>>>> +	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>> +		pe = &phb->ioda.pe_array[pdn->pe_number];
>>>>> +		pnv_ioda_pe_put(pe);
>>>>> +		pdn->pe_number = IODA_INVALID_PE;
>>>>>   	}
>>>>> +}
>>>>>
>>>>> -	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
>>>>> -		pr_warn("%s: PE %d was assigned on PHB#%x\n",
>>>>> -			__func__, pe_no, phb->hose->global_number);
>>>>> +static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
>>>>> +{
>>>>> +	struct pnv_phb *phb = pe->phb;
>>>>> +	int index, count;
>>>>> +	unsigned long tbl_addr, tbl_size;
>>>>> +
>>>>> +	/* No DMA capability for slave PEs */
>>>>> +	if (pe->flags & PNV_IODA_PE_SLAVE)
>>>>> +		return;
>>>>> +
>>>>> +	/* Bypass DMA window */
>>>>> +	if (phb->type == PNV_PHB_IODA2 &&
>>>>> +	    pe->tce_bypass_enabled &&
>>>>> +	    pe->tce32_table &&
>>>>> +	    pe->tce32_table->set_bypass)
>>>>> +		pe->tce32_table->set_bypass(pe->tce32_table, false);
>>>>> +
>>>>> +	/* 32-bits DMA window */
>>>>> +	count = pe->tce32_seg_end - pe->tce32_seg_start;
>>>>> +	tbl_addr = pe->tce32_table->it_base;
>>>>> +	if (!count)
>>>>>   		return;
>>>>> +
>>>>> +	/* Free IOMMU table */
>>>>> +	iommu_free_table(pe->tce32_table,
>>>>> +			 of_node_full_name(phb->hose->dn));
>>>>> +
>>>>> +	/* Deconfigure TCE table */
>>>>> +	switch (phb->type) {
>>>>> +	case PNV_PHB_IODA1:
>>>>> +		for (index = 0; index < count; index++)
>>>>> +			opal_pci_map_pe_dma_window(phb->opal_id,
>>>>> +						   pe->pe_number,
>>>>> +						   pe->tce32_seg_start + index,
>>>>> +						   1,
>>>>> +						   __pa(tbl_addr) +
>>>>> +						   index * TCE32_TABLE_SIZE,
>>>>> +						   0,
>>>>> +						   0x1000);
>>>>> +		bitmap_clear(phb->ioda.tce32_segmap,
>>>>> +			     pe->tce32_seg_start,
>>>>> +			     count);
>>>>> +		tbl_size = TCE32_TABLE_SIZE * count;
>>>>> +		break;
>>>>> +	case PNV_PHB_IODA2:
>>>>> +		opal_pci_map_pe_dma_window(phb->opal_id,
>>>>> +					   pe->pe_number,
>>>>> +					   pe->pe_number << 1,
>>>>> +					   1,
>>>>> +					   __pa(tbl_addr),
>>>>> +					   0,
>>>>> +					   0x1000);
>>>>> +		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
>>>>> +		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
>>>>> +		break;
>>>>> +	default:
>>>>> +		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
>>>>> +		return;
>>>>> +	}
>>>>> +
>>>>> +	/* Free memory of IOMMU table */
>>>>> +	free_pages(tbl_addr, get_order(tbl_size));
>>>>
>>>>
>>>> You just programmed the table address to TVT and then you are releasing the
>>>> pages. It does not seem right, it will leave garbage in TVT. Also, I am
>>>> adding helpers to alloc/free TCE pages in DDW patchset, you could reuse bits
>>>> from there (I'll post v10 soon, you'll be in copy and you'll have to review
>>>> that ;) ).
>>>>
>>>
>>> I assume you're talking about the TVE. I don't understand how garbage would be
>>> left in the TVE. opal_pci_map_pe_dma_window(), which is handled by skiboot,
>>> clears the TVE when "tce_table_size" is zero. The pages previously allocated
>>> for the TCE table are released to the buddy system and can then be allocated
>>> by somebody else (from buddy or slab).
>>
>> opal_pci_map_pe_dma_window() takes __pa(tbl_addr) which points to some memory
>> which is still allocated. This value goes to a table (which has 2 entries per
>> PE, one for 32bit DMA window and one for bypass/hugewindow) which PHB uses to
>> get the actual TCE table address. What is the name of this table? :) Anyway,
>> you write an address there and then you call free_pages() so after
>> free_pages(), the value in that TVE/TVT/whatever table is a garbage.
>>
>
> I haven't looked into your DDW code yet. Without the DDW patchset, the bypass
> TVE (window) isn't supposed to have a corresponding TCE table. I guess you might
> change that behaviour in your DDW patchset, and I'll take a close look at it.
> For the DMA32 window, which is the name of the table, the TVE is cleared by
> skiboot when the "tce_table_size" argument is zero.
>
> 	opal_pci_map_pe_dma_window(phb->opal_id,
> 				   pe->pe_number,
> 				   pe->pe_number << 1,
> 				   1,
> 				   __pa(tbl_addr),
> 				   0,			<<<< "tce_table_size".
> 				   0x1000);


Then please, when you pass tce_table_size==0, also pass zero address/zero
page size/zero levels, unless you have a very good reason to pass non-zero
values for these. What you have now is confusing: it looks like you are
initializing the table, and it is not obvious that "0" is the size and not
some flags.

When people see this (which does the same thing, please correct me if I am
wrong), they will have no questions about what you are actually trying to do:

  	opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
  				   pe->pe_number << 1, 0, 0, 0, 0);
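
The teardown order can be sketched in userspace as follows. This is a model
only: fake_tve, fake_map_dma_window() and fake_release_dma() are hypothetical
names, and the rule that a zero size clears the entry is taken from the
description of skiboot's behaviour earlier in this thread, not from the firmware
source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of one TVE slot; not the hardware layout. */
struct fake_tve {
	uint64_t table_addr;
	uint64_t table_size;
};

/*
 * Hypothetical stand-in for the firmware call. As described in this
 * thread, a zero table size clears the entry whatever else is passed.
 */
static int fake_map_dma_window(struct fake_tve *tve, uint16_t levels,
			       uint64_t addr, uint64_t size,
			       uint64_t page_size)
{
	(void)levels;
	(void)page_size;

	if (size == 0) {
		tve->table_addr = 0;
		tve->table_size = 0;
		return 0;
	}

	tve->table_addr = addr;
	tve->table_size = size;
	return 0;
}

static void fake_release_dma(struct fake_tve *tve, void **table)
{
	assert(tve != NULL && table != NULL);

	/* Step 1: clear the window; all-zero arguments make the intent plain. */
	fake_map_dma_window(tve, 0, 0, 0, 0);

	/* Step 2: only now hand the backing memory back to the allocator. */
	free(*table);
	*table = NULL;
}
```

Passing zeros for every window argument leaves the entry in the same state as
passing a stale address with a zero size, but a reader no longer has to know
which positional argument is the size.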



-- 
Alexey

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
  2015-05-12  0:53             ` Alexey Kardashevskiy
@ 2015-05-12  1:25               ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-12  1:25 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Gavin Shan, linuxppc-dev, linux-pci, benh, bhelgaas

On Tue, May 12, 2015 at 10:53:29AM +1000, Alexey Kardashevskiy wrote:
>On 05/12/2015 10:03 AM, Gavin Shan wrote:
>>On Mon, May 11, 2015 at 05:02:08PM +1000, Alexey Kardashevskiy wrote:
>>>On 05/11/2015 04:25 PM, Gavin Shan wrote:
>>>>On Sat, May 09, 2015 at 10:43:23PM +1000, Alexey Kardashevskiy wrote:
>>>>>On 05/01/2015 04:02 PM, Gavin Shan wrote:
>>>>>>The original code doesn't support releasing PEs dynamically, meaning
>>>>>>that PE and the associated resources (IO, M32, M64 and DMA) can't
>>>>>>be released when unplugging a PCI adapter from one hotpluggable slot.
>>>>>>
>>>>>>The patch takes an object-oriented approach and introduces a reference
>>>>>>count to the PE, which is initialized to 1 and increased by 1 when a
>>>>>>new PCI device joins the PE. Once the last PCI device leaves the
>>>>>>PE, the PE is released together with its associated
>>>>>>(IO, M32, M64, DMA) resources.
>>>>>
>>>>>
>>>>>Too little commit log for non-trivial non-cut-n-paste 30KB patch...
>>>>>
>>>>
>>>>Ok. I'll add more details in next revision.
>>>>
>>>>>>
>>>>>>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>>>>---
>>>>>>  arch/powerpc/include/asm/pci-bridge.h     |   3 +
>>>>>>  arch/powerpc/kernel/pci-hotplug.c         |   5 +
>>>>>>  arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++++++++++++++++++-----------
>>>>>>  arch/powerpc/platforms/powernv/pci.h      |   4 +-
>>>>>>  4 files changed, 432 insertions(+), 238 deletions(-)
>>>>>>
>>>>>>diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>>>>>>index 5367eb3..a6ad4b1 100644
>>>>>>--- a/arch/powerpc/include/asm/pci-bridge.h
>>>>>>+++ b/arch/powerpc/include/asm/pci-bridge.h
>>>>>>@@ -31,6 +31,9 @@ struct pci_controller_ops {
>>>>>>  	resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type);
>>>>>>  	void		(*setup_bridge)(struct pci_bus *, unsigned long);
>>>>>>  	void		(*reset_secondary_bus)(struct pci_dev *dev);
>>>>>>+
>>>>>>+	/* Called when PCI device is released */
>>>>>>+	void		(*release_device)(struct pci_dev *);
>>>>>>  };
>>>>>>
>>>>>>  /*
>>>>>>diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
>>>>>>index 7ed85a6..0040343 100644
>>>>>>--- a/arch/powerpc/kernel/pci-hotplug.c
>>>>>>+++ b/arch/powerpc/kernel/pci-hotplug.c
>>>>>>@@ -29,6 +29,11 @@
>>>>>>   */
>>>>>>  void pcibios_release_device(struct pci_dev *dev)
>>>>>>  {
>>>>>>+	struct pci_controller *hose = pci_bus_to_host(dev->bus);
>>>>>>+
>>>>>>+	if (hose->controller_ops.release_device)
>>>>>>+		hose->controller_ops.release_device(dev);
>>>>>>+
>>>>>>  	eeh_remove_device(dev);
>>>>>>  }
>>>>>>
>>>>>>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>>index 910fb67..ef8c216 100644
>>>>>>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>>@@ -12,6 +12,8 @@
>>>>>>  #undef DEBUG
>>>>>>
>>>>>>  #include <linux/kernel.h>
>>>>>>+#include <linux/atomic.h>
>>>>>>+#include <linux/kref.h>
>>>>>>  #include <linux/pci.h>
>>>>>>  #include <linux/crash_dump.h>
>>>>>>  #include <linux/debugfs.h>
>>>>>>@@ -47,6 +49,8 @@
>>>>>>  /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
>>>>>>  #define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
>>>>>>
>>>>>>+static void pnv_ioda_release_pe(struct kref *kref);
>>>>>>+
>>>>>>  static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>>>>>>  			    const char *fmt, ...)
>>>>>>  {
>>>>>>@@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>>>>>>  		(IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>>>>>>  }
>>>>>>
>>>>>>-static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no)
>>>>>>+static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe)
>>>>>>  {
>>>>>>-	if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) {
>>>>>>-		pr_warn("%s: Invalid PE %d on PHB#%x\n",
>>>>>>-			__func__, pe_no, phb->hose->global_number);
>>>>>>+	if (!pe)
>>>>>>+		return;
>>>>>>+
>>>>>>+	kref_get(&pe->kref);
>>>>>>+}
>>>>>>+
>>>>>>+static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe)
>>>>>>+{
>>>>>>+	unsigned int count;
>>>>>>+
>>>>>>+	if (!pe)
>>>>>>  		return;
>>>>>>+
>>>>>>+	/*
>>>>>>+	 * The count is initialized to 1 and increased with 1 when
>>>>>>+	 * a new PCI device is bound with the PE. Once the last PCI
>>>>>>+	 * device is leaving from the PE, the PE is going to be
>>>>>>+	 * released.
>>>>>>+	 */
>>>>>>+	count = atomic_read(&pe->kref.refcount);
>>>>>>+	if (count == 2)
>>>>>>+		kref_sub(&pe->kref, 2, pnv_ioda_release_pe);
>>>>>>+	else
>>>>>>+		kref_put(&pe->kref, pnv_ioda_release_pe);
>>>>>
>>>>>
>>>>>What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()?
>>>>>
>>>>
>>>>Yeah, that would be a problem. But it shouldn't happen because the
>>>>PCI devices join the parent PE# in strictly serialized mode.
>>>>The same applies when detaching PCI devices from their parent PE.
>>>
>>>
>>>oookay. Another thing then - why is this kref counter initialized to 1?
>>>It would make sense if you did something special when the counter becomes 1
>>>after decrement but you do not.
>>>
>>>Also, this kref thing makes sense if you do kref_put() in multiple places and
>>>do not know which one will be the last one so you pass the callback to all of
>>>them. Here you do kref_put/sub in one place and you read the counter - so you
>>>can call pnv_ioda_release_pe() directly. And it feels like a simple atomic_t
>>>would do the job just fine. If you still feel that the counter should start
>>>from 1, there are atomic_dec_if_positive() and atomic_inc_not_zero() and
>>>others.
>>>
>>
>>It's a good question, actually. The counter is initialized to 1 when the PE
>>is reserved because of the M64 requirement, or allocated in the non-M64 case.
>>If we reserve or allocate a PE#, one thing is certain: the PCI bus has
>>at least one PCI device (including a PCI bridge). After the PE# is reserved
>>or allocated, the PCI device joins the PE, which increases the counter
>>by 1. It means the counter is 2 when the PE contains one PCI
>>device, and 3 when there are 2 devices. One reason for this design is that
>>we only need to decrease the counter if we have to release this PE in the
>>window between PE reservation/allocation and the first PCI device joining. I
>>think you're correct that we can call pnv_ioda_release_pe() in this window.
>>In this way, the counter always reflects the number of PCI devices
>>the PE contains.
>
>
>Good :) I believe it was something different 2-3 versions ago and evolved into
>this, so you did not notice it straight away :)
>

Thanks :)
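
The plain atomic counter suggested above can be sketched in user-space C11.
This is a minimal sketch, not the patch's code: struct pe, pe_get(), pe_put()
and pe_release() are hypothetical stand-ins for struct pnv_ioda_pe and its
helpers, and the counter is assumed to start at 0 so that it always equals the
number of devices in the PE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pnv_ioda_pe; not the kernel type. */
struct pe {
	atomic_int device_count;	/* number of PCI devices in the PE */
	bool released;			/* set once resources are torn down */
};

static void pe_init(struct pe *pe)
{
	atomic_init(&pe->device_count, 0);
	pe->released = false;
}

/* Stand-in for pnv_ioda_release_pe(): would free IO/M32/M64/DMA here. */
static void pe_release(struct pe *pe)
{
	pe->released = true;
}

/* A device joins the PE. */
static void pe_get(struct pe *pe)
{
	atomic_fetch_add(&pe->device_count, 1);
}

/*
 * A device leaves the PE. atomic_fetch_sub() returns the value held
 * before the decrement, so the "was this the last device?" check and
 * the decrement are one atomic step; there is no window between a
 * separate read and a later subtract.
 */
static void pe_put(struct pe *pe)
{
	int old = atomic_fetch_sub(&pe->device_count, 1);

	assert(old > 0);
	if (old == 1)
		pe_release(pe);
}
```

With this shape, the release callback runs exactly once, when the last device
leaves, and no special case for a count of 2 is needed.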

>>
>>>>>>+}
>>>>>>+
>>>>>>+static void pnv_pci_release_device(struct pci_dev *pdev)
>>>>>>+{
>>>>>>+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
>>>>>>+	struct pnv_phb *phb = hose->private_data;
>>>>>>+	struct pci_dn *pdn = pci_get_pdn(pdev);
>>>>>>+	struct pnv_ioda_pe *pe;
>>>>>>+
>>>>>>+	if (pdn && pdn->pe_number != IODA_INVALID_PE) {
>>>>>>+		pe = &phb->ioda.pe_array[pdn->pe_number];
>>>>>>+		pnv_ioda_pe_put(pe);
>>>>>>+		pdn->pe_number = IODA_INVALID_PE;
>>>>>>  	}
>>>>>>+}
>>>>>>
>>>>>>-	if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) {
>>>>>>-		pr_warn("%s: PE %d was assigned on PHB#%x\n",
>>>>>>-			__func__, pe_no, phb->hose->global_number);
>>>>>>+static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe)
>>>>>>+{
>>>>>>+	struct pnv_phb *phb = pe->phb;
>>>>>>+	int index, count;
>>>>>>+	unsigned long tbl_addr, tbl_size;
>>>>>>+
>>>>>>+	/* No DMA capability for slave PEs */
>>>>>>+	if (pe->flags & PNV_IODA_PE_SLAVE)
>>>>>>+		return;
>>>>>>+
>>>>>>+	/* Bypass DMA window */
>>>>>>+	if (phb->type == PNV_PHB_IODA2 &&
>>>>>>+	    pe->tce_bypass_enabled &&
>>>>>>+	    pe->tce32_table &&
>>>>>>+	    pe->tce32_table->set_bypass)
>>>>>>+		pe->tce32_table->set_bypass(pe->tce32_table, false);
>>>>>>+
>>>>>>+	/* 32-bits DMA window */
>>>>>>+	count = pe->tce32_seg_end - pe->tce32_seg_start;
>>>>>>+	tbl_addr = pe->tce32_table->it_base;
>>>>>>+	if (!count)
>>>>>>  		return;
>>>>>>+
>>>>>>+	/* Free IOMMU table */
>>>>>>+	iommu_free_table(pe->tce32_table,
>>>>>>+			 of_node_full_name(phb->hose->dn));
>>>>>>+
>>>>>>+	/* Deconfigure TCE table */
>>>>>>+	switch (phb->type) {
>>>>>>+	case PNV_PHB_IODA1:
>>>>>>+		for (index = 0; index < count; index++)
>>>>>>+			opal_pci_map_pe_dma_window(phb->opal_id,
>>>>>>+						   pe->pe_number,
>>>>>>+						   pe->tce32_seg_start + index,
>>>>>>+						   1,
>>>>>>+						   __pa(tbl_addr) +
>>>>>>+						   index * TCE32_TABLE_SIZE,
>>>>>>+						   0,
>>>>>>+						   0x1000);
>>>>>>+		bitmap_clear(phb->ioda.tce32_segmap,
>>>>>>+			     pe->tce32_seg_start,
>>>>>>+			     count);
>>>>>>+		tbl_size = TCE32_TABLE_SIZE * count;
>>>>>>+		break;
>>>>>>+	case PNV_PHB_IODA2:
>>>>>>+		opal_pci_map_pe_dma_window(phb->opal_id,
>>>>>>+					   pe->pe_number,
>>>>>>+					   pe->pe_number << 1,
>>>>>>+					   1,
>>>>>>+					   __pa(tbl_addr),
>>>>>>+					   0,
>>>>>>+					   0x1000);
>>>>>>+		tbl_size = (1ul << ilog2(phb->ioda.m32_pci_base));
>>>>>>+		tbl_size = (tbl_size >> IOMMU_PAGE_SHIFT_4K) * 8;
>>>>>>+		break;
>>>>>>+	default:
>>>>>>+		pe_warn(pe, "Unsupported PHB type %d\n", phb->type);
>>>>>>+		return;
>>>>>>+	}
>>>>>>+
>>>>>>+	/* Free memory of IOMMU table */
>>>>>>+	free_pages(tbl_addr, get_order(tbl_size));
>>>>>
>>>>>
>>>>>You just programmed the table address to TVT and then you are releasing the
>>>>>pages. It does not seem right, it will leave garbage in TVT. Also, I am
>>>>>adding helpers to alloc/free TCE pages in DDW patchset, you could reuse bits
>>>>>from there (I'll post v10 soon, you'll be in copy and you'll have to review
>>>>>that ;) ).
>>>>>
>>>>
>>>>I assume you're talking about the TVE. I don't understand how garbage will be
>>>>left in the TVE. opal_pci_map_pe_dma_window(), which is handled by skiboot,
>>>>clears the TVE when "tce_table_size" is zero. The pages previously allocated
>>>>for the TCE table are released to the buddy system, where they can be
>>>>allocated by somebody else (from buddy or slab).
>>>
>>>opal_pci_map_pe_dma_window() takes __pa(tbl_addr) which points to some memory
>>>which is still allocated. This value goes to a table (which has 2 entries per
>>>PE, one for 32bit DMA window and one for bypass/hugewindow) which PHB uses to
>>>get the actual TCE table address. What is the name of this table? :) Anyway,
>>>you write an address there and then you call free_pages() so after
>>>free_pages(), the value in that TVE/TVT/whatever table is garbage.
>>>
>>
>>I haven't looked into your DDW code yet. Before the DDW patchset, the bypass
>>TVE (window) isn't supposed to have a corresponding TCE table. I guess you might
>>change that behaviour in your DDW patchset, and I'll take a close look at that.
>>For the DMA32 window, which is the name of the table, the TVE is cleared by
>>skiboot when the "tce_table_size" argument is zero.
>>
>>	opal_pci_map_pe_dma_window(phb->opal_id,
>>				   pe->pe_number,
>>				   pe->pe_number << 1,
>>				   1,
>>				   __pa(tbl_addr),
>>				   0,			<<<< "tce_table_size".
>>				   0x1000);
>
>
>Then please, when you pass tce_table_size==0, also pass zero address/zero
>page size/zero levels, unless you have very good reason to pass non-zero
>values for these. What you have now is confusing - it looks like you are
>initializing the table - it is not obvious that "0" is the size and not some
>flags.
>
>When people see this (which does the same thing, please correct me if I am
>wrong), they will not have questions about what you are actually trying to do:
>
> 	opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
> 				   pe->pe_number << 1, 0, 0, 0, 0);
>

Sure, with more zeroed parameters to the OPAL call, the purpose will be
clearer. I also checked the skiboot implementation of this function, and it
should work. I'll use the code as you're suggesting.
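
A minimal sketch of the teardown order discussed above, in user-space C: clear
the window entry with all-zero arguments first, and only then free the table
memory. map_pe_dma_window(), struct tve and tvt[] are hypothetical models of
the firmware call and the table it programs; the only behaviour taken from this
thread is that a zero table size clears the entry.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_WINDOWS 8

/* Hypothetical model of one TVE: base address and size of a TCE table. */
struct tve {
	uint64_t tce_table_addr;
	uint64_t tce_table_size;
};

static struct tve tvt[NUM_WINDOWS];

/*
 * Stand-in for the firmware call. Assumption taken from the thread:
 * a zero tce_table_size clears the entry. Not the real OPAL interface.
 */
static int map_pe_dma_window(unsigned int window, uint64_t addr, uint64_t size)
{
	if (window >= NUM_WINDOWS)
		return -1;
	if (size == 0) {
		tvt[window].tce_table_addr = 0;
		tvt[window].tce_table_size = 0;
		return 0;
	}
	tvt[window].tce_table_addr = addr;
	tvt[window].tce_table_size = size;
	return 0;
}

/* Tear down a window: clear the entry with all-zero arguments, then free. */
static int release_dma_window(unsigned int window, void *table)
{
	int rc = map_pe_dma_window(window, 0, 0);

	if (rc)
		return rc;	/* entry may still reference the table; keep it */
	free(table);
	return 0;
}
```

Passing zeros for every table-related argument makes it obvious at the call
site that the entry is being cleared, not programmed.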

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-04 21:14                 ` Benjamin Herrenschmidt
@ 2015-05-13 23:35                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-13 23:35 UTC (permalink / raw)
  To: Pantelis Antoniou, Rob Herring
  Cc: Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas, Grant Likely,
	devicetree

On Tue, 2015-05-05 at 07:14 +1000, Benjamin Herrenschmidt wrote:
> So the "trivial" way to do it (and the way we have implemented the FW
> side so far) is to have the FW simply "flatten" the subtree below the
> slot and pass it to Linux, with the intent of expanding it back below
> the slot node.
> 
> This is what Gavin's proposed patches do.
> 
> The overlay mechanism adds all sorts of features that we don't seem to
> need and would make the above more complex.

Guys, I never got a final answer from you on this. Are we OK with adding
a way to just expand a subtree, or do you insist that we use the
overlay mechanism?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-13 23:35                   ` Benjamin Herrenschmidt
  (?)
@ 2015-05-14  0:18                       ` Rob Herring
  -1 siblings, 0 replies; 184+ messages in thread
From: Rob Herring @ 2015-05-14  0:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Pantelis Antoniou, Gavin Shan, linuxppc-dev, linux-pci,
	Bjorn Helgaas, Grant Likely, devicetree

On Wed, May 13, 2015 at 6:35 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2015-05-05 at 07:14 +1000, Benjamin Herrenschmidt wrote:
>> So the "trivial" way to do it (and the way we have implemented the FW
>> side so far) is to have the FW simply "flatten" the subtree below the
>> slot and pass it to Linux, with the intent of expanding it back below
>> the slot node.
>>
>> This is what Gavin's proposed patches do.
>>
>> The overlay mechanism adds all sorts of features that we don't seem to
>> need and would make the above more complex.
>
> Guys, I never got a final answer from you on this. Are we OK with adding
> a way to just expand a subtree, or do you insist that we use the
> overlay mechanism?

I haven't decided really.

The main thing with the current patch is I don't really like the added
complexity to unflatten_dt_node. It is already a fairly complex
function. Perhaps removing "hybrid" as discussed will help?

If there are things we can do to make overlays easier to use in your
use case, I'd like to hear ideas. I don't really buy that being more
complex than needed is an obstacle. That is very often the case with
common, scalable solutions. I want to see a simple case be
simple to support.

Rob

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  0:18                       ` Rob Herring
  (?)
@ 2015-05-14  0:54                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-14  0:54 UTC (permalink / raw)
  To: Rob Herring
  Cc: Pantelis Antoniou, Gavin Shan, linuxppc-dev, linux-pci,
	Bjorn Helgaas, Grant Likely, devicetree

On Wed, 2015-05-13 at 19:18 -0500, Rob Herring wrote:

> I haven't decided really.
> 
> The main thing with the current patch is I don't really like the added
> complexity to unflatten_dt_node. It is already a fairly complex
> function. Perhaps removing "hybrid" as discussed will help?

I agree, we should be able to make that much simpler, I was planning on
sorting that out with Gavin.

> If there are things we can do to make overlays easier to use in your
> use case, I'd like to hear ideas. I don't really buy that being more
> complex than needed is an obstacle. That is very often the case with
> common, scalable solutions. I want to see a simple case be
> simple to support.

Well, it's a LOT more complex from the FW perspective for a bunch of
features we don't really need. The DT update in our case is purely
informational, to avoid keeping wrong/outdated DT bits; it has little
functional impact (though it might matter a bit for interrupt routing
through bridges).

However, I am also pursuing an approach on FW side using a generation
count in our nodes and properties which we could use to generate
arbitrary overlays if we know what generation Linux has.

There might actually be a usage scenario for a generic way for our
firmware to convey DT updates to Linux for other reasons.

A few things that I don't find in the overlay code (but maybe I haven't
looked at it hard enough):

 - Can it remove nodes/properties?

 - Can it "commit" a changeset so it's permanently part of the main DT?
We will never have a concept of "revertible" changesets; if we need a
subsequent update, we will get a new overlay from FW that will remove
what needs to be removed and add what needs to be added.

I.e., our current mechanism without overlay is fairly simple:

  - On PCI unplug, we remove all nodes below the slot (from Linux),
the FW does the equivalent internally.

  - On PCI re-plug, the FW internally builds new nodes and sends a
new subtree as an FDT that we can expand/attach.
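
The unplug/re-plug mechanism above can be modelled in a few lines of user-space
C. This is only a sketch under stated assumptions: struct node is a hypothetical
simplification rather than the kernel's struct device_node, and the FDT
expansion step is assumed to have already produced the new subtree.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified node; not the kernel's struct device_node. */
struct node {
	char name[32];
	struct node *child;	/* first child */
	struct node *sibling;	/* next sibling */
};

static struct node *node_new(const char *name)
{
	struct node *n = calloc(1, sizeof(*n));

	if (n)
		strncpy(n->name, name, sizeof(n->name) - 1);
	return n;
}

/* Free a node, its siblings, and everything below them. */
static void subtree_free(struct node *n)
{
	while (n) {
		struct node *next = n->sibling;

		subtree_free(n->child);
		free(n);
		n = next;
	}
}

/* Unplug: drop every node below the slot; the slot itself stays. */
static void slot_unplug(struct node *slot)
{
	subtree_free(slot->child);
	slot->child = NULL;
}

/* Re-plug: attach a freshly built subtree below the (empty) slot. */
static int slot_replug(struct node *slot, struct node *subtree)
{
	if (slot->child)
		return -1;	/* must be unplugged first */
	slot->child = subtree;
	return 0;
}

/* Count the nodes in a subtree, for inspection. */
static int subtree_count(const struct node *n)
{
	int total = 0;

	for (; n; n = n->sibling)
		total += 1 + subtree_count(n->child);
	return total;
}
```

Nothing here is revertible: each re-plug replaces whatever was below the slot
with a new subtree, which matches the "no undo" behaviour described in this
message.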

Now we could consider that subtree as a changeset that can be undone,
but that wouldn't work for boot time. And subsequent updates wouldn't
have that concept of "undoing" anyway.

I.e., conceptually, what overlays do today is quite rooted around the idea
of having a fixed "base" DT and some pre-compiled DTB overlays that
get added/removed. The design completely ignores the idea of a FW that
maintains a "live" tree which we want to keep in sync, which is what we
want to do here, or what we could do with a "live" Open Firmware
implementation.

Now we might be able to reconcile them, but it feels to me that the
overlay/changeset stuff is too rooted in the first concept...

Ben.


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
@ 2015-05-14  0:54                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-14  0:54 UTC (permalink / raw)
  To: Rob Herring
  Cc: Pantelis Antoniou, Gavin Shan, linuxppc-dev, linux-pci,
	Bjorn Helgaas, Grant Likely, devicetree

On Wed, 2015-05-13 at 19:18 -0500, Rob Herring wrote:

> I haven't decided really.
> 
> The main thing with the current patch is I don't really like the added
> complexity to unflatten_dt_node. It is already a fairly complex
> function. Perhaps removing of "hybrid" as discussed will help?

I agree, we should be able to make that much simpler, I was planning on
sorting that out with Gavin.

> If there are things we can do to make overlays easier to use in your
> use case, I'd like to hear ideas. I don't really buy that being more
> complex than needed is an obstacle. That is very often the case to
> have common, scale-able solutions. I want to see a simple case be
> simple to support.

Well, it's a LOT more complex from the FW perspective for a bunch of
features we don't really need, in a way because the DT update in our
case is just purely informational to avoid keeping wrong/outdated DT
bits, it has little functional impact (it might have a bit for interrupt
routing through bridges though).

However, I am also pursuing an approach on FW side using a generation
count in our nodes and properties which we could use to generate
arbitrary overlays if we know what generation linux has.

There might actually be a usage scenario for a generic way for our
firmware to convey DT updates to Linux for other reasons.

A few things that I don't find in the overlay code (but maybe I haven't
looked at it hard enough):

 - Can it remove nodes/properties ?

 - Can it "commit" a changeset so it's permanently part of the main DT ?
We will never have a concept of "revertable" changesets, if we need a
subsequent update, we will get a new overlay from FW that will remove
what needs to be removed and add what needs to be added.

IE, our current mechanism without overlay is fairly simple:

  - On PCI unplug, we remove all nodes below the slot (from linux),
the FW does the equivalent internally.

  - On PCI re-plug, the FW internally builds new nodes and sends a
new subtree as an FDT that we can expand/attach.
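
The two steps above can be sketched on a toy pointer-based tree. This is
only an illustration under assumed names (`struct dt_node`, `dt_new()`,
`dt_attach()`, `dt_remove_below()` are all hypothetical); the kernel's own
node structure and attach/detach paths carry reference counts, properties,
and notifiers that are left out here:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy node: just enough linkage to show unplug and re-plug. */
struct dt_node {
	const char *name;
	struct dt_node *parent, *child, *sibling;
};

/* Allocate a detached node; returns NULL if allocation fails. */
struct dt_node *dt_new(const char *name)
{
	struct dt_node *n = calloc(1, sizeof(*n));

	if (n)
		n->name = name;
	return n;
}

/* Re-plug: hang 'n' (and whatever is below it) under 'parent'. */
void dt_attach(struct dt_node *parent, struct dt_node *n)
{
	assert(parent && n && !n->parent);
	n->parent = parent;
	n->sibling = parent->child;
	parent->child = n;
}

/*
 * Unplug: free every node below 'slot', children before their parents.
 * The slot node itself stays. Returns the number of nodes freed.
 */
size_t dt_remove_below(struct dt_node *slot)
{
	struct dt_node *c = slot->child;
	size_t freed = 0;

	while (c) {
		struct dt_node *next = c->sibling;

		freed += dt_remove_below(c) + 1;
		free(c);
		c = next;
	}
	slot->child = NULL;
	return freed;
}
```

Nothing here needs to remember what was attached earlier: the slot is simply
emptied and refilled, which is why this path has no notion of reverting.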

Now we could consider that subtree as a changeset that can be undone,
but that wouldn't work for boot time. And subsequent updates wouldn't
have that concept of "undoing" anyway.

IE. conceptually, what overlays do today is quite rooted around the idea
of having a fixed "base" DT and some pre-compiled DTB overlays that
get added/removed. The design completely ignores the idea of a FW that
maintains a "live" tree which we want to keep in sync, which is what we
want to do here, or what we could do with a "live" open firmware
implementation.

Now we might be able to reconcile them, but it feels to me that the
overlay/changeset stuff is too rooted in the first concept...

Ben.



^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  0:54                           ` Benjamin Herrenschmidt
@ 2015-05-14  6:23                             ` Pantelis Antoniou
  -1 siblings, 0 replies; 184+ messages in thread
From: Pantelis Antoniou @ 2015-05-14  6:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

Hi Ben,

Sorry for taking this long to respond, but I am working on the same problem right
now. I thought I might have something to show, but not yet :)

My PCI overlay case is different. In my case there is no firmware, and
the blob is provided as an overlay.

The idea is that for a given PCI bus, when a PCI device with a matching
device id, vendor id is probed a matching overlay should be applied.

The trickiness lies in the fact that the target is different each time,
and in how to handle generational issues (i.e. what happens if the PCI
device is removed before the overlay is applied, what happens when
multiple applications should happen in parallel, etc.)
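
A rough sketch of the matching step and of one way to catch the "removed
before the overlay is applied" case, assuming a per-slot generation counter
that is bumped on every plug and unplug. All names and IDs below
(`overlay_table`, `find_overlay()`, `struct slot_state`, the `.dtbo` names)
are made up for illustration, and locking for parallel applications is left
out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table mapping PCI IDs to overlay blobs. */
struct pci_overlay_match {
	uint16_t vendor;
	uint16_t device;
	const char *overlay;
};

static const struct pci_overlay_match overlay_table[] = {
	{ 0x1234, 0x0001, "example-fpga.dtbo" },	/* placeholder IDs */
	{ 0x1234, 0x0002, "example-i2c.dtbo" },
};

/* Return the overlay name for a probed device, or NULL if none matches. */
const char *find_overlay(uint16_t vendor, uint16_t device)
{
	size_t i;

	for (i = 0; i < sizeof(overlay_table) / sizeof(overlay_table[0]); i++) {
		if (overlay_table[i].vendor == vendor &&
		    overlay_table[i].device == device)
			return overlay_table[i].overlay;
	}
	return NULL;
}

/* Per-slot state; the generation changes on every plug and unplug. */
struct slot_state {
	unsigned long generation;
	bool present;
};

/* Record a plug event and return the generation the prober should keep. */
unsigned long slot_plug(struct slot_state *s)
{
	assert(s);
	s->present = true;
	return ++s->generation;
}

/* Record an unplug event. */
void slot_unplug(struct slot_state *s)
{
	assert(s);
	s->present = false;
	s->generation++;
}

/* True only if the device seen at probe time is still the one in the slot. */
bool slot_may_apply(const struct slot_state *s, unsigned long seen)
{
	assert(s);
	return s->present && s->generation == seen;
}
```

The prober keeps the value returned by `slot_plug()` and checks it with
`slot_may_apply()` just before applying the overlay, so an overlay chosen for
a device that has since been pulled (or swapped) is dropped.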


> On May 14, 2015, at 03:54 , Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> On Wed, 2015-05-13 at 19:18 -0500, Rob Herring wrote:
> 
>> I haven't decided really.
>> 
>> The main thing with the current patch is I don't really like the added
>> complexity to unflatten_dt_node. It is already a fairly complex
>> function. Perhaps removing of "hybrid" as discussed will help?
> 
> I agree, we should be able to make that much simpler, I was planning on
> sorting that out with Gavin.
> 

I think using overlays should cover your case without any issues.
I don’t like messing with the unflatten method TBH.

>> If there are things we can do to make overlays easier to use in your
>> use case, I'd like to hear ideas. I don't really buy that being more
>> complex than needed is an obstacle. That is very often the case to
>> have common, scale-able solutions. I want to see a simple case be
>> simple to support.
> 
> Well, it's a LOT more complex from the FW perspective for a bunch of
> features we don't really need, in a way because the DT update in our
> case is just purely informational to avoid keeping wrong/outdated DT
> bits, it has little functional impact (it might have a bit for interrupt
> routing through bridges though).
> 
> However, I am also pursuing an approach on FW side using a generation
> count in our nodes and properties which we could use to generate
> arbitrary overlays if we know what generation linux has.
> 
> There might actually be a usage scenario for a generic way for our
> firmware to convey DT updates to Linux for other reasons.
> 
> A few things that I don't find in the overlay code (but maybe I haven't
> looked at it hard enough):
> 
> - Can it remove nodes/properties ?
> 

Yes.

> - Can it "commit" a changeset so it's permanently part of the main DT ?
> We will never have a concept of "revertable" changesets, if we need a
> subsequent update, we will get a new overlay from FW that will remove
> what needs to be removed and add what needs to be added.
> 

The overlay when applied is a part of the kernel DT tree.
It is trivial to add a mechanism that simply commits everything and
tosses away the revert information.

Note that in that case you have to make provisions for the unflatten
blob to not be freed or for the device tree nodes/properties to be
dynamically allocated.
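
To illustrate what "commit everything and toss away the revert information"
could mean, here is a toy model in which the tree is just one presence flag
per node id and the changeset is a log of previous states. The names
(`struct changeset`, `cs_apply()`, `cs_revert()`, `cs_commit()`) are
hypothetical and are not the kernel's changeset API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TREE_NODES 16	/* toy tree: one "present" flag per node id */
#define LOG_MAX    8

enum cs_action { CS_ATTACH, CS_DETACH };

struct changeset {
	struct {
		unsigned int node;
		bool was_present;	/* state to restore on revert */
	} log[LOG_MAX];
	size_t count;			/* entries that can still be reverted */
};

/* Apply one action to the tree and log the previous state of the node. */
bool cs_apply(struct changeset *cs, bool *tree, enum cs_action action,
	      unsigned int node)
{
	assert(node < TREE_NODES);
	if (cs->count == LOG_MAX)
		return false;

	cs->log[cs->count].node = node;
	cs->log[cs->count].was_present = tree[node];
	cs->count++;
	tree[node] = (action == CS_ATTACH);
	return true;
}

/* Undo every logged action, newest first, leaving the log empty. */
void cs_revert(struct changeset *cs, bool *tree)
{
	while (cs->count > 0) {
		cs->count--;
		tree[cs->log[cs->count].node] =
			cs->log[cs->count].was_present;
	}
}

/* Keep the tree as it is now and discard the revert information. */
void cs_commit(struct changeset *cs)
{
	cs->count = 0;
}
```

After `cs_commit()`, a later `cs_revert()` has nothing to undo, which is the
behaviour a firmware-driven update would want.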

> IE, our current mechanism without overlay is fairly simple:
> 
>  - On PCI unplug, we remove all nodes below the slot (from linux),
> the FW does the equivalent internally.
> 

If you use an overlay, you just revert it and everything would
be as it was before, without anything hanging below the slot node.

Note that the ‘remove all nodes below the slot’ does not work for my case.

That is because there are devices being instantiated under the slot
(i2c busses, i2c devices, FPGAs etc) that need to be removed from the
system.

>  - On PCI re-plug, the FW internally builds new nodes and sends a
> new subtree as an FDT that we can expand/attach.
> 

You can easily send a DT blob containing an overlay from firmware.

It can even be easy, since you might not have to recreate the full blob
each time; instead you can use flat device tree methods to populate the
few properties that change each time.

> Now we could consider that subtree as a changeset that can be undone,
> but that wouldn't work for boot time. And subsequent updates wouldn't
> have that concept of "undoing" anyway.
> 

I have posted another patch that does boot-time DT quirks, which are
non-revertable.

https://lkml.org/lkml/2015/2/18/258

> IE. conceptually, what overlays do today is quite rooted around the idea
> of having a fixed "base" DT and some pre-compiled DTB overlays that
> get added/removed. The design completely ignores the idea of a FW that
> maintains a "live" tree which we want to keep in sync, which is what we
> want to do here, or what we could do with a "live" open firmware
> implementation.
> 
> Now we might be able to reconcile them, but it feels to me that the
> overlay/changeset stuff is too rooted in the first concept…
> 

The first DT overlays use case (beaglebone capes) is what got the concept
started.

Right now it is a generic mechanism to apply modifications to the kernel
live tree, with the possibility to revert them.

> Ben.
> 
> 

Regards

— Pantelis

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  6:23                             ` Pantelis Antoniou
@ 2015-05-14  6:46                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-14  6:46 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

On Thu, 2015-05-14 at 09:23 +0300, Pantelis Antoniou wrote:

> > A few things that I don't find in the overlay code (but maybe I haven't
> > looked at it hard enough):
> > 
> > - Can it remove nodes/properties ?
> > 
> 
> Yes.

Ok, I've missed that when looking at the overlay code then, I'll have to
give it a closer look.

> > - Can it "commit" a changeset so it's permanently part of the main DT ?
> > We will never have a concept of "revertable" changesets, if we need a
> > subsequent update, we will get a new overlay from FW that will remove
> > what needs to be removed and add what needs to be added.
> > 
> 
> The overlay when applied is a part of the kernel DT tree.
> It is trivial to add a mechanism that simply commits everything and
> tosses away the revert information.
> 
> Note that in that case you have to make provisions for the unflatten
> blob to not be freed or for the device tree nodes/properties to be
> dynamically allocated.

I think it makes sense to do the dynamic thing anyway...

> > IE, our current mechanism without overlay is fairly simple:
> > 
> >  - On PCI unplug, we remove all nodes below the slot (from linux),
> > the FW does the equivalent internally.
> > 
> 
> If you use an overlay, you just revert it and everything would
> be as it was before, without anything hanging below the slot node.

Except that doesn't work for the boot time content which we get
from the firmware as part of the initial FDT (and we can't change that
without breaking backward compatibility).

> Note that the ‘remove all nodes below the slot’ does not work for my case.
> 
> That is because there are devices being instantiated under the slot
> (i2c busses, i2c devices, FPGAs etc) that need to be removed from the
> system.

Right, while in my case there aren't any. It's just the standard OF PCI
representation generated by FW; the main thing is that it might have
some enriched properties for some known cable cards of external drawers
that are good to have.

> >  - On PCI re-plug, the FW internally builds new nodes and sends a
> > new subtree as an FDT that we can expand/attach.
> > 
> 
> You can easily send a DT blob containing an overlay from firmware.

Sending one is easy. Building it is not :-)

> It can even be easy, since you might not have to recreate the full blob
> each time; instead you can use flat device tree methods to populate the
> few properties that change each time.

No, we basically have our internal tree in the firmware in a format
similar to Linux, ie, a pointer based tree. We can "flatten" it of
course, but generating an overlay is trickier. We can, it's just more
work and we are running out of time (I basically have to cut that FW in
the next few days, then we'll be stuck with whatever interfaces we
created, I have a bit of time to fix bugs after that but that's about
it).

> > Now we could consider that subtree as a changeset that can be undone,
> > but that wouldn't work for boot time. And subsequent updates wouldn't
> > have that concept of "undoing" anyway.
> > 
> 
> I have posted another patch that does boot-time DT quirks, which are
> non-revertable.
> 
> https://lkml.org/lkml/2015/2/18/258

Not sure how that applies in my case ... I can't change the
representation of the PCI subtree, this is standard OFW representation,
I can't change the FW to make it an overlay-like thing at boot time,
that would break existing kernels.

> > IE. conceptually, what overlays do today is quite rooted around the idea
> > of having a fixed "base" DT and some pre-compiled DTB overlays that
> > get added/removed. The design completely ignores the idea of a FW that
> > maintains a "live" tree which we want to keep in sync, which is what we
> > want to do here, or what we could do with a "live" open firmware
> > implementation.
> > 
> > Now we might be able to reconcile them, but it feels to me that the
> > overlay/changeset stuff is too rooted in the first concept…
> > 
> 
> The first DT overlays use case (beaglebone capes) is what got the concept
> started.
> 
> Right now it is a generic mechanism to apply modifications to the kernel
> live tree, with the possibility to revert them.

Yes, but as I said, it's not really thought out in terms of keeping the
kernel tree in sync with an external dynamically generated tree. Maybe
we can fix it, but it's more complex...

Ben.

> > Ben.
> > 
> > 
> 
> Regards
> 
> — Pantelis

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  6:46                               ` Benjamin Herrenschmidt
@ 2015-05-14  7:04                                 ` Pantelis Antoniou
  -1 siblings, 0 replies; 184+ messages in thread
From: Pantelis Antoniou @ 2015-05-14  7:04 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

Hi Ben,

> On May 14, 2015, at 09:46 , Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> On Thu, 2015-05-14 at 09:23 +0300, Pantelis Antoniou wrote:
> 
>>> A few things that I don't find in the overlay code (but maybe I haven't
>>> looked at it hard enough):
>>> 
>>> - Can it remove nodes/properties ?
>>> 
>> 
>> Yes.
> 
> Ok, I've missed that when looking at the overlay code then, I'll have to
> give it a closer look.
> 

Ok, let me be more specific. It used to be able to do it ;)
The problem was the format used (a ‘-’ prefix to the name).

Since I didn’t have a clear use for it, I was asked to drop it by Grant.

The removal capability is of course there for the revert case.

>>> - Can it "commit" a changeset so it's permanently part of the main DT ?
>>> We will never have a concept of "revertable" changesets, if we need a
>>> subsequent update, we will get a new overlay from FW that will remove
>>> what needs to be removed and add what needs to be added.
>>> 
>> 
>> The overlay when applied is a part of the kernel DT tree.
>> It is trivial to add a mechanism that simply commits everything and
>> tosses away the revert information.
>> 
>> Note that in that case you have to make provisions for the unflatten
>> blob to not be freed or for the device tree nodes/properties to be
>> dynamically allocated.
> 
> I think it makes sense to do the dynamic thing anyway...
> 
>>> IE, our current mechanism without overlay is fairly simple:
>>> 
>>> - On PCI unplug, we remove all nodes below the slot (from linux),
>>> the FW does the equivalent internally.
>>> 
>> 
>> If you use an overlay, you just revert it and everything would
>> be as it was before, without anything hanging below the slot node.
> 
> Except that doesn't work for the boot time content which we get
> from the firmware as part of the initial FDT (and we can't change that
> without breaking backward compatibility).
> 

OK

>> Note that the ‘remove all nodes below the slot’ does not work for my case.
>> 
>> That is because there are devices being instantiated under the slot
>> (i2c busses, i2c devices, FPGAs etc) that need to be removed from the
>> system.
> 
> Right while in my case, there isn't, it's just the standard OF PCI
> representation generated by FW, the main thing is that it might have
> some enriched properties for some known cable cards of external drawers
> that are good to have.
> 

I see.

>>> - On PCI re-plug, the FW internally builds new nodes and sends a
>>> new subtree as an FDT that we can expand/attach.
>>> 
>> 
>> You can easily send a DT blob containing an overlay from firmware.
> 
> Sending one is easy. Building it is not :-)
> 

Heh, true ;)

>> It can even be easy, since you might not have to recreate the full blob
>> each time; instead you can use flat device tree methods to populate the
>> few properties that change each time.
> 
> No, we basically have our internal tree in the firmware in a format
> similar to Linux, ie, a pointer based tree. We can "flatten" it of
> course, but generating an overlay is trickier. We can, it's just more
> work and we are running out of time (I basically have to cut that FW in
> the next few days, then we'll be stuck with whatever interfaces we
> created, I have a bit of time to fix bugs after that but that's about
> it).
> 

Hmm, since you just want to transmit a whole subtree things are a bit
simpler.

You don’t need any of the fixups, and your target node is known.

So your overlay is simply:

/ {
	fragment@0 {
		target-path = "/foo";
		__overlay__ {
			/* contents of the slot */
		};
	};
};

I think it’s possible to just bit-mangle a blob (in pseudo code).

	/* precompiled blob of the overlay above, with an empty __overlay__ */
	const u8 template_overlay_blob[] = { <compiled blob of the above> };

	/* flatten the firmware's pointer-based slot subtree */
	flatten_slot(slot_blob);

	/* new blob with room for the template plus the slot contents */
	overlay_blob = allocate_new_blob(template_overlay_blob, slot_blob);

	overlay_node = find_node(overlay_blob, "/fragment@0/__overlay__");
	target_prop = find_prop(overlay_blob, "/fragment@0/target-path");

	inject_slot_blob(overlay_blob, overlay_node, slot_blob);
	modify_slot_target(overlay_blob, target_prop, slot_target);
	
I don’t think you need to re-flatten anything, shuffling bits around with
memmove should work.
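
The "shuffling bits around with memmove" step can be sketched on plain byte
buffers. `splice_blob()` is a hypothetical helper and the buffers are not
real FDTs. On an actual blob the header size fields would also have to be
updated, and the property names of the injected subtree would have to
resolve against the strings block of the combined blob; the sketch ignores
both and only shows the byte movement:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of 'tmpl' with 'payload' inserted at byte
 * offset 'at'. The combined length is stored in '*out_len'. Returns NULL
 * if allocation fails. The caller frees the result.
 */
unsigned char *splice_blob(const unsigned char *tmpl, size_t tmpl_len,
			   size_t at,
			   const unsigned char *payload, size_t payload_len,
			   size_t *out_len)
{
	unsigned char *buf;

	assert(tmpl && payload && out_len);
	assert(at <= tmpl_len);

	buf = malloc(tmpl_len + payload_len);
	if (!buf)
		return NULL;

	memcpy(buf, tmpl, tmpl_len);
	/* Slide the tail of the template up to open a gap at 'at'. */
	memmove(buf + at + payload_len, buf + at, tmpl_len - at);
	/* Drop the payload into the gap. */
	memcpy(buf + at, payload, payload_len);

	*out_len = tmpl_len + payload_len;
	return buf;
}
```

Here `at` would be the offset just inside the empty `__overlay__` node of the
template, and the payload the flattened slot contents.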

>>> Now we could consider that subtree as a changeset that can be undone,
>>> but that wouldn't work for boot time. And subsequent updates wouldn't
>>> have that concept of "undoing" anyway.
>>> 
>> 
>> I have posted another patch that does boot-time DT quirks, which are
>> non-revertable.
>> 
>> https://lkml.org/lkml/2015/2/18/258
> 
> Not sure how that applies in my case ... I can't change the
> representation of the PCI subtree, this is standard OFW representation,
> I can't change the FW to make it an overlay-like thing at boot time,
> that would break existing kernels.
> 

The idea is to append the ‘quirk’ to the already booting device tree blob.

Another idea floating around was to simply concatenate the booting blob with
any overlay blobs you want applied at boot time.

>>> IE. conceptually, what overlays do today is quite rooted around the idea
>>> of having a fixed "base" DT and some pre-compiled DTB overlays that
>>> get added/removed. The design completely ignores the idea of a FW that
>>> maintains a "live" tree which we want to keep in sync, which is what we
>>> want to do here, or what we could do with a "live" open firmware
>>> implementation.
>>> 
>>> Now we might be able to reconcile them, but it feels to me that the
>>> overlay/changeset stuff is too rooted in the first concept…
>>> 
>> 
>> The first DT overlays use case (beaglebone capes) is what got the concept
>> started.
>> 
>> Right now it is a generic mechanism to apply modifications to the kernel
>> live tree, with the possibility to revert them.
> 
> Yes, but as I said, it's not really thought out in terms of keeping the
> kernel tree in sync with an external dynamically generated tree. Maybe
> we can fix it, but it's more complex…
> 

Yes it is, unfortunately.

> Ben.
> 
>>> Ben.
>>> 
>>> 
>> 
>> Regards
>> 
>> — Pantelis
> 
> 

Regards

— Pantelis

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  7:04                                 ` Pantelis Antoniou
  (?)
@ 2015-05-14  7:14                                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-14  7:14 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

On Thu, 2015-05-14 at 10:04 +0300, Pantelis Antoniou wrote:

> Hmm, since you just want to transmit a whole subtree things are a bit
> simpler.
> 
> You don’t need any of the fixups, and your target node is known.
> 
> So your overlay is simply:
> 
> / {
> 	fragment@0 {
> 		target-path = "/foo";
> 		__overlay__ {
> 			/* contents of the slot */
> 		};
> 	};
> };
>
> I think it’s possible to just bit-mangle a blob (in pseudo code).
> 
> 	const u8 template_overlay_blob[] = { <compiled blob of the above> };
> 
> 	flatten_slot(slot_blob);
> 
> 	overlay_blob = allocate_new_blob(template_overlay_blob, slot_blob);
> 
> 	overlay_node = find_node(overlay_blob, "/fragment@0/__overlay__");
> 	target_prop = find_prop(overlay_blob, "/fragment@0/target-path");
> 
> 	inject_slot_blob(overlay_blob, overlay_node, slot_blob);
> 	modify_slot_target(overlay_blob, target_prop, slot_target);
> 	
> I don’t think you need to re-flatten anything, shuffling bits around with
> memmove should work.

Fairly gross :-)

But yeah generating the overlay doesn't necessarily scare me, I can
generate a temp tree that is the overlay in which I "copy" the subtree
(or in my internal ptr-based representation I could have a concept of
alias which I follow while flattening).
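
A toy sketch of the alias idea, assuming a pointer-based tree roughly like
the one described above. struct node, its alias field and flatten() are
hypothetical names, properties are left out, and the output is a readable
"name{...}" string rather than a real flattened blob.

```c
#include <stddef.h>

struct node {
	const char *name;
	struct node *child;	/* first child */
	struct node *sibling;	/* next sibling */
	struct node *alias;	/* if set, flatten that node's children instead */
};

/* Append s at pos, truncating safely; returns the untruncated position. */
static size_t emit(char *out, size_t len, size_t pos, const char *s)
{
	while (*s) {
		if (pos + 1 < len)
			out[pos] = *s;
		pos++;
		s++;
	}
	if (len)
		out[pos < len ? pos : len - 1] = '\0';
	return pos;
}

/* Emit "name{children}" for n, following n->alias for the children. */
static size_t flatten(const struct node *n, char *out, size_t len, size_t pos)
{
	const struct node *src = n->alias ? n->alias : n;
	const struct node *c;

	pos = emit(out, len, pos, n->name);
	pos = emit(out, len, pos, "{");
	for (c = src->child; c; c = c->sibling)
		pos = flatten(c, out, len, pos);
	return emit(out, len, pos, "}");
}
```

With an __overlay__ node whose alias points at the slot node, flattening the
temporary overlay tree emits the slot's children without copying them.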

That leaves me with these problems:

 - No support for removing nodes, so that needs to be added back to
the format and to Linux, unless I continue removing by hand in the PCI
hotplug code itself

 - No support for "committing" the overlay which needs to be added as
well.
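
To make the "revert" versus "commit" distinction concrete, here is a toy
model, not the kernel's changeset code. All names (struct slot,
struct changeset, cs_attach(), cs_revert(), cs_commit()) are hypothetical.
A changeset logs what it attached, revert walks the log backwards, and
commit simply discards the log so the nodes become permanent.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NODES 16

/* Toy stand-in for the children hanging below one slot node. */
struct slot {
	const char *child[MAX_NODES];
	size_t nr;
};

/* Revert log: remembers which children this changeset attached. */
struct changeset {
	const char *added[MAX_NODES];
	size_t nr;
};

/* Attach a child and record it so the change can be undone later. */
static int cs_attach(struct changeset *cs, struct slot *s, const char *name)
{
	if (s->nr == MAX_NODES || cs->nr == MAX_NODES)
		return -1;
	s->child[s->nr++] = name;
	cs->added[cs->nr++] = name;
	return 0;
}

/* Revert: detach everything the changeset attached, newest first. */
static void cs_revert(struct changeset *cs, struct slot *s)
{
	while (cs->nr) {
		const char *name = cs->added[--cs->nr];
		size_t i;

		for (i = s->nr; i-- > 0; ) {
			if (strcmp(s->child[i], name) == 0) {
				memmove(&s->child[i], &s->child[i + 1],
					(s->nr - i - 1) * sizeof(s->child[0]));
				s->nr--;
				break;
			}
		}
	}
}

/* Commit: keep the tree as it is and throw the revert log away. */
static void cs_commit(struct changeset *cs)
{
	cs->nr = 0;
}
```

After cs_commit() a later cs_revert() has nothing to undo, which is the
behaviour wanted for subtrees that firmware will replace wholesale.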

> >>> Now we could consider that subtree as a changeset that can be undone,
> >>> but that wouldn't work for boot time. And subsequent updates wouldn't
> >>> have that concept of "undoing" anyway.
> >>> 
> >> 
> >> I have posted another patch that does boot-time DT quirk which are
> >> non-revertable.
> >> 
> >> https://lkml.org/lkml/2015/2/18/258
> > 
> > Not sure how that applies in my case ... I can't change the
> > representation of the PCI subtree, this is standard OFW representation,
> > I can't change the FW to make it an overlay-like thing at boot time,
> > that would break existing kernels.
> > 
> 
> The idea is to append the ‘quirk’ to the already booting device tree blob.

I know but that's not how things work for me. At boot time the FW passes
me one tree that contains all the PCI stuff it has probed.

> Another idea floating around was to simple concatenate the booting blob with
> any overlay blobs you want applied at boot time.

Sure but I don't get overlay blobs at boot time.

> >>> IE. conceptually, what overlays do today is quite rooted around the idea
> >>> of having a fixed "base" DT and some pre-compiled DTB overlays that
> >>> get added/removed. The design completely ignore the idea of a FW that
> >>> maintains a "live" tree which we want to keep in sync, which is what we
> >>> want to do here, or what we could do with a "live" open firmware
> >>> implementation.
> >>> 
> >>> Now we might be able to reconcile them, but it feels to me that the
> >>> overlay/changeset stuff is too rooted in the first concept…
> >>> 
> >> 
> >> The first DT overlays use case (beaglebone capes) is what got the concept
> >> started.
> >> 
> >> Right now is a generic mechanism to apply modifications to the kernel
> >> live tree, with the possibility to revert them.
> > 
> > Yes but as I said it's not really thought in term of keeping the kernel
> > tree in sync with an external dynamically generated tree. Maybe we can
> > fix it, but it's more complex…
> > 
> 
> Yes it is, unfortunately.

Right. Which makes the solution of just passing my bit of tree as a blob,
which I expand in Linux where I want it, rather than an overlay, tempting
if we can make Gavin's patch more palatable (removing the hybrid stuff
etc...).

Cheers,
Ben.

> > Ben.
> > 
> >>> Ben.
> >>> 
> >>> 
> >> 
> >> Regards
> >> 
> >> — Pantelis
> > 
> > 
> 
> Regards
> 
> — Pantelis



^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  7:14                                     ` Benjamin Herrenschmidt
  (?)
@ 2015-05-14  7:19                                       ` Pantelis Antoniou
  -1 siblings, 0 replies; 184+ messages in thread
From: Pantelis Antoniou @ 2015-05-14  7:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

Hi Ben,

> On May 14, 2015, at 10:14 , Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> On Thu, 2015-05-14 at 10:04 +0300, Pantelis Antoniou wrote:
> 
>> Hmm, since you just want to transmit a whole subtree things are a bit
>> simpler.
>> 
>> You don’t need any of the fixups, and your target node is known.
>> 
>> So your overlay is simply:
>> 
>> / {
>> 	fragment@0 {
>> 		target-path = "/foo";
>> 		__overlay__ {
>> 			/* contents of the slot */
>> 		};
>> 	};
>> };
>> 
>> I think it’s possible to just bit-mangle a blob (in pseudo code).
>> 
>> 	const u8 template_overlay_blob[] = { <compiled blob of the above> };
>> 
>> 	flatten_slot(slot_blob);
>> 
>> 	overlay_blob = allocate_new_blob(template_overlay_blob, slot_blob);
>> 
>> 	overlay_node = find_node(overlay_blob, "/fragment@0/__overlay__");
>> 	target_prop = find_prop(overlay_blob, "/fragment@0/target-path");
>> 
>> 	inject_slot_blob(overlay_blob, overlay_node, slot_blob);
>> 	modify_slot_target(overlay_blob, target_prop, slot_target);
>> 	
>> I don’t think you need to re-flatten anything, shuffling bits around with
>> memmove should work.
> 
> Fairly gross :-)
> 

You don’t want to know how sausages are made, but they are delicious :)

> But yeah generating the overlay doesn't necessarily scare me, I can
> generate a temp tree that is the overlay in which I "copy" the subtree
> (or in my internal ptr-based representation I could have a concept of
> alias which I follow while flattening).
> 
> That leaves me with these problems:
> 
> - No support for removing of nodes, so that needs to be added back to
> the format and to Linux unless I continue removing by hand in the PCI
> hotplug code itself
> 

What kind of nodes/properties do you need to remove at _application_ time?

What you describe is inserting a bunch of properties and nodes under
a slot’s device node. Reverting the overlay removes them all just fine.

> - No support for "committing" the overlay which needs to be added as
> well.
> 

That’s the easiest part.

>>>>> Now we could consider that subtree as a changeset that can be undone,
>>>>> but that wouldn't work for boot time. And subsequent updates wouldn't
>>>>> have that concept of "undoing" anyway.
>>>>> 
>>>> 
>>>> I have posted another patch that does boot-time DT quirk which are
>>>> non-revertable.
>>>> 
>>>> https://lkml.org/lkml/2015/2/18/258
>>> 
>>> Not sure how that applies in my case ... I can't change the
>>> representation of the PCI subtree, this is standard OFW representation,
>>> I can't change the FW to make it an overlay-like thing at boot time,
>>> that would break existing kernels.
>>> 
>> 
>> The idea is to append the ‘quirk’ to the already booting device tree blob.
> 
> I know but that's not how things work for me. At boot time the FW passes
> me one tree that contains all the PCI stuff it has probed.
> 
>> Another idea floating around was to simple concatenate the booting blob with
>> any overlay blobs you want applied at boot time.
> 
> Sure but I don't get overlay blobs at boot time.
> 
>>>>> IE. conceptually, what overlays do today is quite rooted around the idea
>>>>> of having a fixed "base" DT and some pre-compiled DTB overlays that
>>>>> get added/removed. The design completely ignore the idea of a FW that
>>>>> maintains a "live" tree which we want to keep in sync, which is what we
>>>>> want to do here, or what we could do with a "live" open firmware
>>>>> implementation.
>>>>> 
>>>>> Now we might be able to reconcile them, but it feels to me that the
>>>>> overlay/changeset stuff is too rooted in the first concept…
>>>>> 
>>>> 
>>>> The first DT overlays use case (beaglebone capes) is what got the concept
>>>> started.
>>>> 
>>>> Right now is a generic mechanism to apply modifications to the kernel
>>>> live tree, with the possibility to revert them.
>>> 
>>> Yes but as I said it's not really thought in term of keeping the kernel
>>> tree in sync with an external dynamically generated tree. Maybe we can
>>> fix it, but it's more complex…
>>> 
>> 
>> Yes it is, unfortunately.
> 
> Right. Which makes the solution of just passing my bit of tree as a blob
> which I expand in Linux where I want it rather than an overlay tempting
> if we can make Gavin patch more palatable (removing the hybrid stuff
> etc...).
> 

I see. Well, how about this?

Who said you have to do the whole blob dance in the firmware?

You can just as easily pass the blob as it is to the Linux kernel and
the kernel there can convert it to an overlay and apply it.
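
As a minimal illustration of that conversion step, here is a sketch that
wraps a slot's contents in the single-fragment overlay shown earlier in the
thread. It works on DTS source text rather than on a flattened blob, purely
to keep the example short; wrap_as_overlay() is a hypothetical helper, not
an existing kernel function.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: wrap the DTS text of a slot's contents in a
 * single-fragment overlay aimed at target_path.
 * Returns 0 on success, -1 if out is too small.
 */
static int wrap_as_overlay(char *out, size_t len,
			   const char *target_path, const char *body)
{
	int n = snprintf(out, len,
			 "/ {\n"
			 "\tfragment@0 {\n"
			 "\t\ttarget-path = \"%s\";\n"
			 "\t\t__overlay__ {\n"
			 "%s"
			 "\t\t};\n"
			 "\t};\n"
			 "};\n",
			 target_path, body);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Since the target node is known and no phandle fixups are needed, the wrapper
is fixed apart from the target path and the slot contents.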

> Cheers,
> Ben.
> 
>>> Ben.
>>> 
>>>>> Ben.
>>>>> 
>>>>> 
>>>> 
>>>> Regards
>>>> 
>>>> — Pantelis
>>> 
>>> 
>> 
>> Regards
>> 
>> — Pantelis
> 
> 

Regards

— Pantelis

^ permalink raw reply	[flat|nested] 184+ messages in thread


* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  7:19                                       ` Pantelis Antoniou
  (?)
@ 2015-05-14  7:25                                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-14  7:25 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

On Thu, 2015-05-14 at 10:19 +0300, Pantelis Antoniou wrote:

> 
> You don’t want to know how sausages are made, but they are delicious :)

 ... most of the time :)

> > But yeah generating the overlay doesn't necessarily scare me, I can
> > generate a temp tree that is the overlay in which I "copy" the subtree
> > (or in my internal ptr-based representation I could have a concept of
> > alias which I follow while flattening).
> > 
> > That leaves me with these problems:
> > 
> > - No support for removing of nodes, so that needs to be added back to
> > the format and to Linux unless I continue removing by hand in the PCI
> > hotplug code itself
> > 
> 
> What kind of nodes/properties you need to remove at _application_ time?

Well, if we stick to removing by hand in Linux for the unplug case, then
none.

> What you describe is inserting a bunch of properties and nodes under
> a slot’s device node. Reverting the overlay removes them all just fine.

Except that still doesn't work for boot time :-)

So I would have to do a special case on unplug:

	if (slot->dt_is_overlay) /* set to false at boot */
		remove_subtree_myself();
	else
		undo_overlay(slot->overlay);

> > - No support for "committing" the overlay which needs to be added as
> > well.
> > 
> 
> That’s the easiest part.

Yeah, I will need to get my head around the code a bit more but it
doesn't seem too scary.

> I see. Well, how about this?
> 
> Who said you have to do the whole blob dance in the firmware?
> 
> You can just as easily pass the blob as it is to the linux kernel and
> the kernel there can convert it to an overlay and apply it.

That's not that pretty, but we can do that too, which solves the problem of
fixing the FW interface.

There is however an argument to be made for having the FW be able to
generate arbitrary overlays, in case we ever want to pass more "property"
updates or node updates to Linux at runtime.

A few cases have crept up on the radar, like updating the pstate tables
or VPD information ...

If we go down that path, then I would implement a concept of generation
count in the firmware, so I can generate an overlay that includes all the
changes since the last "generation" given to Linux.
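
As a rough illustration of the generation count idea, the sketch below keeps
a small log of changes in the firmware and picks out everything newer than
the generation Linux last saw. All names here (struct dt_log, dt_log_record,
dt_log_since, MAX_CHANGES) are hypothetical and not part of skiboot or the
kernel; a real implementation would also have to encode removals, which is
the open problem discussed in this thread.

```c
#include <stddef.h>

#define MAX_CHANGES 16

/* Hypothetical record of one device-tree update made by the firmware. */
struct dt_change {
	unsigned int generation;	/* generation in which the change was made */
	const char *path;		/* node or property that changed */
};

struct dt_log {
	unsigned int generation;	/* current generation, bumped per change */
	size_t count;
	struct dt_change changes[MAX_CHANGES];
};

/* Record a change; returns the new generation, or 0 if the log is full. */
static unsigned int dt_log_record(struct dt_log *log, const char *path)
{
	if (log->count == MAX_CHANGES)
		return 0;
	log->generation++;
	log->changes[log->count].generation = log->generation;
	log->changes[log->count].path = path;
	log->count++;
	return log->generation;
}

/*
 * Collect pointers to every change newer than last_seen (at most max
 * entries) and return how many were found. These are the changes the
 * next overlay would have to carry.
 */
static size_t dt_log_since(const struct dt_log *log, unsigned int last_seen,
			   const struct dt_change **out, size_t max)
{
	size_t i, n = 0;

	for (i = 0; i < log->count && n < max; i++)
		if (log->changes[i].generation > last_seen)
			out[n++] = &log->changes[i];
	return n;
}
```

Linux would hand back the last generation it applied, so a missed update
only means the next overlay is larger rather than that the two trees drift
apart.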

However that requires supporting removal of nodes/properties. So I'm
tempted to keep that feature on the back burner and go with an ad-hoc
interface for PCI for now.

Ben.



^ permalink raw reply	[flat|nested] 184+ messages in thread


* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  7:25                                           ` Benjamin Herrenschmidt
@ 2015-05-14  7:29                                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-14  7:29 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree


> So I would have to do a special case on unplug:
> 
> 	if (slot->dt_is_overlay) /* set to false at boot */
> 		remove_subtree_myself();
> 	else
> 		undo_overlay(slot->overlay);

Of course I just inverted the polarity of the if () in the example :-)

But you get the idea...
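
For reference, a minimal sketch of the unplug special case with the polarity
corrected: subtrees that arrived as an overlay are reverted through the
overlay code, and boot-time subtrees are removed by hand. The struct layout,
the overlay_id field, the return codes and the empty stubs are assumptions
made for illustration; only the names dt_is_overlay, remove_subtree_myself
and undo_overlay come from the snippet above.

```c
#include <stdbool.h>

/* Hypothetical slot state; dt_is_overlay follows the snippet in the mail. */
struct slot {
	bool dt_is_overlay;	/* false for subtrees the firmware passed at boot */
	int overlay_id;		/* stand-in for whatever handle the overlay code returns */
};

enum unplug_action { REMOVED_BY_HAND, OVERLAY_UNDONE };

/* Stub: would walk the slot's subtree and detach each node by hand. */
static void remove_subtree_myself(struct slot *slot)
{
	(void)slot;
}

/* Stub: would revert a previously applied overlay. */
static void undo_overlay(int overlay_id)
{
	(void)overlay_id;
}

static enum unplug_action slot_unplug(struct slot *slot)
{
	if (slot->dt_is_overlay) {
		/* Hotplugged at runtime: the overlay code knows how to undo it. */
		undo_overlay(slot->overlay_id);
		return OVERLAY_UNDONE;
	}
	/* Present at boot: there is no overlay to revert. */
	remove_subtree_myself(slot);
	return REMOVED_BY_HAND;
}
```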

^ permalink raw reply	[flat|nested] 184+ messages in thread


* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  7:25                                           ` Benjamin Herrenschmidt
  (?)
@ 2015-05-14  7:34                                               ` Pantelis Antoniou
  -1 siblings, 0 replies; 184+ messages in thread
From: Pantelis Antoniou @ 2015-05-14  7:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

Hi Ben,

> On May 14, 2015, at 10:25 , Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> On Thu, 2015-05-14 at 10:19 +0300, Pantelis Antoniou wrote:
> 
>> 
>> You don’t want to know how sausages are made, but they are delicious :)
> 
> ... most of the time :)
> 
>>> But yeah generating the overlay doesn't necessarily scare me, I can
>>> generate a temp tree that is the overlay in which I "copy" the subtree
>>> (or in my internal ptr-based representation I could have a concept of
>>> alias which I follow while flattening).
>>> 
>>> That leaves me with these problems:
>>> 
>>> - No support for removing of nodes, so that needs to be added back to
>>> the format and to Linux unless I continue removing by hand in the PCI
>>> hotplug code itself
>>> 
>> 
>> What kind of nodes/properties you need to remove at _application_ time?
> 
> Well, if we stick to removing by hand in Linux for the unplug case, then
> none.
> 

OK

>> What you describe is inserting a bunch of properties and nodes under
>> a slot’s device node. Reverting the overlay removes them all just fine.
> 
> Except that still doesn't work for boot time :-)
> 
> So I would have to do a special case on unplug:
> 
> 	if (slot->dt_is_overlay) /* set to false at boot */
> 		remove_subtree_myself();
> 	else
> 		undo_overlay(slot->overlay);
> 

OK, in that case you do require removal. But in any case it’s the ‘negative’
of an already applied one, either at boot time or not.

Modifying the overlay code to apply a ‘negative’ property should do the trick.

Is that correct?

>>> - No support for "committing" the overlay which needs to be added as
>>> well.
>>> 
>> 
>> That’s the easiest part.
> 
> Yeah, I will need to get my head around the code a bit more but it
> doesn't seem too scary.
> 
>> I see. Well, how about this?
>> 
>> Who said you have to do the whole blob dance in the firmware?
>> 
>> You can just as easily pass the blob as it is to the linux kernel and
>> the kernel there can convert it to an overlay and apply it.
> 
> That's not that pretty but we can do that too which solve the problem of
> fixing the FW interface.
> 
> There is however an argument to be made in having the FW be able to
> generate arbitrary overlays. If we ever want to pass more "property"
> updates or node updates to Linux at runtime.
> 
> A few cases have crept up on the radar, like updating the pstate tables
> or VPD informations ...
> 
> If we go down that path, then I would implement a concept of generation
> count in the firmware, so I can generate an overlay that include all the
> changes since the last "generation" given to Linux.
> 

I will probably need that generation count myself for my PCI use case.

> However that requires supporting removal of nodes/properties. So I'm
> tempted to keep that feature on the back burner and go with an ad-hoc
> interface for PCI for now.
> 

I see. Bonne chance :)

> Ben.
> 
> 

Regards

— Pantelis


^ permalink raw reply	[flat|nested] 184+ messages in thread


* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  7:34                                               ` Pantelis Antoniou
  (?)
@ 2015-05-14  7:47                                                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-14  7:47 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

On Thu, 2015-05-14 at 10:34 +0300, Pantelis Antoniou wrote:

> >> What you describe is inserting a bunch of properties and nodes under
> >> a slot’s device node. Reverting the overlay removes them all just fine.
> > 
> > Except that still doesn't work for boot time :-)
> > 
> > So I would have to do a special case on unplug:
> > 
> > 	if (slot->dt_is_overlay) /* set to false at boot */
> > 		remove_subtree_myself();
> > 	else
> > 		undo_overlay(slot->overlay);
> > 
> 
> OK, in that case you do require removal. But in any case it’s the ‘negative’
> of an already applied one, either at boot time or not.

Sort-of, unless we have a way in the overlay to simply specify node
removal statements so we don't have to explicitly remove all properties
(or even all children).

> Modifying the overlay code to apply a ‘negative’ property should do the trick.
> 
> Is that correct?

I would do negative nodes and let Linux imply the properties (or even
children).

But yes, that would probably do.
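
To make the "negative node" idea concrete, here is a small sketch in which
removing one node implicitly drops its properties and all of its children,
so the overlay only has to name the node itself. The struct node type and
the helpers are hypothetical and far simpler than the kernel's
struct device_node; properties are reduced to a counter.

```c
#include <stdlib.h>

/* Hypothetical minimal live-tree node; not the kernel's struct device_node. */
struct node {
	struct node *child;	/* first child */
	struct node *sibling;	/* next sibling */
	int nr_props;		/* stand-in for the property list */
};

/* Allocate a node and link it under parent (if any). */
static struct node *node_new(struct node *parent, int nr_props)
{
	struct node *n = calloc(1, sizeof(*n));

	if (!n)
		return NULL;
	n->nr_props = nr_props;
	if (parent) {
		n->sibling = parent->child;
		parent->child = n;
	}
	return n;
}

/* Free a detached subtree; returns the number of nodes released. */
static int node_free_subtree(struct node *n)
{
	int freed = 1;

	while (n->child) {
		struct node *c = n->child;

		n->child = c->sibling;
		freed += node_free_subtree(c);
	}
	free(n);
	return freed;
}

/* Apply a "negative node": unlink target from parent, then drop everything under it. */
static int node_remove(struct node *parent, struct node *target)
{
	struct node **pp;

	for (pp = &parent->child; *pp; pp = &(*pp)->sibling) {
		if (*pp == target) {
			*pp = target->sibling;
			return node_free_subtree(target);
		}
	}
	return 0;	/* target is not a direct child of parent */
}
```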

 .../...

> I will probably need that generation count myself for my PCI use case.
> 
> > However that requires supporting removal of nodes/properties. So I'm
> > tempted to keep that feature on the back burner and go with an ad-hoc
> > interface for PCI for now.
> > 
> 
> I see. Bonne chance :)

Merci :)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 184+ messages in thread


* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  7:47                                                   ` Benjamin Herrenschmidt
  (?)
@ 2015-05-14 11:02                                                     ` Pantelis Antoniou
  -1 siblings, 0 replies; 184+ messages in thread
From: Pantelis Antoniou @ 2015-05-14 11:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

Hi Ben,

> On May 14, 2015, at 10:47 , Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 

[snip]

So I spent some time thinking about your use case and I think it boils down
to this:

I have a live tree in the firmware, I have made changes and I need to reflect
those changes to the live tree in the kernel.

Sounds like ‘how do I generate a patch for getting those two in sync’. No?

I can see where this might be useful for others as well.
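
A very reduced sketch of such a patch generator follows. It assumes each
tree can be listed as a sorted array of node paths and emits the add/remove
steps that turn the kernel's view into the firmware's view. The names
(dt_diff, struct dt_patch_entry, DT_ADD, DT_REMOVE) are made up for this
example, and property value changes are not handled at all.

```c
#include <stddef.h>
#include <string.h>

enum dt_op { DT_ADD, DT_REMOVE };

struct dt_patch_entry {
	enum dt_op op;
	const char *path;
};

/*
 * Compare two sorted arrays of node paths and write the steps that turn
 * the kernel's view into the firmware's view. Returns the number of
 * entries written, never more than max.
 */
static size_t dt_diff(const char *const *fw, size_t nfw,
		      const char *const *kern, size_t nkern,
		      struct dt_patch_entry *out, size_t max)
{
	size_t i = 0, j = 0, n = 0;

	while ((i < nfw || j < nkern) && n < max) {
		int cmp;

		if (i == nfw)
			cmp = 1;	/* only kernel entries left */
		else if (j == nkern)
			cmp = -1;	/* only firmware entries left */
		else
			cmp = strcmp(fw[i], kern[j]);

		if (cmp == 0) {
			/* Present on both sides: nothing to do. */
			i++;
			j++;
		} else if (cmp < 0) {
			/* Firmware has it, kernel does not. */
			out[n].op = DT_ADD;
			out[n++].path = fw[i++];
		} else {
			/* Kernel has it, firmware no longer does. */
			out[n].op = DT_REMOVE;
			out[n++].path = kern[j++];
		}
	}
	return n;
}
```

The DT_REMOVE entries are exactly the part the current overlay format cannot
express, which is why removal support keeps coming up in this thread.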

I think we really need to create a liblivedt like we have libfdt since
we have a number of projects going about using/manipulating DT at runtime.

1. The Linux kernel, with its own live tree implementation.
2. The device tree compiler, which has a custom live tree implementation.
3. Your weird and wonderful (or wacky) firmware.
4. U-Boot does use DT now, but it does so with libfdt. I believe this is suboptimal.
5. barebox does DT as well.

I think most of what we want to do with DT can be abstracted into a library that
all of those projects can use.

What are your thoughts?

Regards

— Pantelis

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14 11:02                                                     ` Pantelis Antoniou
@ 2015-05-14 23:25                                                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-05-14 23:25 UTC (permalink / raw)
  To: Pantelis Antoniou
  Cc: Rob Herring, Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas,
	Grant Likely, devicetree

On Thu, 2015-05-14 at 14:02 +0300, Pantelis Antoniou wrote:
> Hi Ben,
> 
> > On May 14, 2015, at 10:47 , Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > 
> 
> [snip]
> 
> So I spent some time thinking about your use case and I think it boils down
> to this:
> 
> I have a live tree in the firmware, I have made changes and I need to reflect
> those changes to the live tree in the kernel.
> 
> Sounds like 'how do I generate a patch for getting those two in sync'. No?

More or less.

> I can see where this might be useful for others as well.
> 
> I think we really need to create a liblivedt like we have libfdt since
> we have a number of projects going about using/manipulating DT at runtime.
> 
> 1. The Linux kernel, with its own live tree implementation.
> 2. The device tree compiler, which has its own custom live tree implementation.
> 3. Your weird and wonderful (or wacky) firmware.
> 4. u-boot does use DT now, but it does so with libfdt. I believe this is suboptimal.
> 5. barebox does DT as well.
> 
> I think most of what we want to do with DT can be abstracted into a library
> that all of those projects can use.
> 
> What are your thoughts?

Well, we have at least two implementations, the kernel one and the one
in our OPAL firmware:

https://github.com/open-power/skiboot/blob/master/include/device.h
https://github.com/open-power/skiboot/blob/master/core/device.c

The latter uses some nice Rusty tricks (tm) for multiple argument
functions.

It would make sense to do a library somewhere, yes. However, I need to
cut my firmware API pretty much today, so I think for now I'll stick
to something ad-hoc for the PCI hotplug code that just passes the
bit of FDT with the new devices, and leave the "grand project" of live
sync of the tree for later.

There are other implementations of live DT in various Open Firmware
variants out there, most are in Forth which I suggest you don't bother
with unless you enjoy pain, but I think at least one of these is
actually in C.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-01  6:03   ` Gavin Shan
@ 2015-05-15  1:27     ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-15  1:27 UTC (permalink / raw)
  To: Gavin Shan
  Cc: linuxppc-dev, linux-pci, benh, bhelgaas, Grant Likely, Rob Herring

On Fri, May 01, 2015 at 04:03:06PM +1000, Gavin Shan wrote:
>The requirement was raised while developing the PCI hotplug feature
>for the PowerPC PowerNV platform, which runs on top of skiboot firmware.
>When a PCI adapter is plugged into a PCI slot, the firmware rescans the
>slot and builds an FDT (Flat Device Tree) blob, which is sent to the
>PowerNV PCI hotplug driver for processing. The newly constructed device
>nodes from the FDT blob are expected to be attached to the device
>node of the PCI slot. Unfortunately, it seems we don't have an API
>to support the scenario. The patch intends to support it with the newly
>introduced function of_fdt_add_subtree(); the design behind it is
>as follows:
>
>   * When the sub-tree FDT blob, which is owned by firmware, is
>     received by the kernel, it is copied into a dynamically
>     allocated blob. After that, the FDT blob owned by firmware
>     isn't touched.
>   * Rework unflatten_dt_node() so that the device nodes at the current
>     and deeper depths are constructed from the FDT blob. All such
>     device nodes are marked with the flag OF_DYNAMIC_HYBIRD, which is
>     similar to OF_DYNAMIC. A device node with this flag set can be
>     freed, but in a different way from OF_DYNAMIC device nodes.
>   * of_fdt_add_subtree() is the newly introduced API that does the work.
>
>

There has already been a lot of discussion on how to reuse overlays for my case.
Thanks to all for your time on this. I spent some time thinking about it
last night and this morning, and I would like to summarize it below. Almost
all of the ideas come from you guys; I'm just documenting them. If there are
obvious problems in the following summary, please let me know so that I can
fix them as early as possible.

================
SKIBOOT & KERNEL
================

The idea came from Ben and I plan to implement it as follows:

- One counter is maintained by skiboot: cur_counter = 0;
- PCI hot plugging happens as follows:
  * Kernel gets skiboot's cur_counter with the OPAL API, which is (x).
  * Skiboot does the hot plugging and rescans the slot, then populates
    the device-tree nodes with (++cur_counter), which means the
    counter becomes (x+1).
  * Kernel retrieves the FDT overlay blob describing the device-tree
    changes since the last token (x). Kernel unflattens it and applies
    the changes as an overlay. The slot simply records the overlay (IDR)
    ID for the device-tree change.
- PCI hot unplugging happens as follows:
  * If the slot has a valid IDR ID, simply revert the overlay. Otherwise,
    the device nodes were unflattened at boot time and we just remove them
    as we do now. Note that those device nodes can't be freed because
    the memory chunks they consume are allocated from memblock or
    reserved by skiboot.
- Some questions/problems:
  * I don't understand how kexec can figure out the device-tree with
    the changes applied by overlays. I assume kexec simply uses the
    FDT blob from skiboot as seen by the first kernel during a fresh
    boot.
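
The plug/unplug bookkeeping described in the list above can be modelled in
a few lines of user-space C. This is only a sketch under stated assumptions:
mock_overlay_create() and mock_overlay_destroy() are hypothetical stand-ins
for the kernel overlay calls, and struct slot with its fields is invented
for illustration. It is not the PowerNV hotplug driver code.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OVERLAY (-1)

/* Hypothetical per-slot state; field names are illustrative only. */
struct slot {
	int overlay_id;		/* ID of the applied overlay, or NO_OVERLAY */
	bool has_boot_nodes;	/* children came from the boot-time tree */
	bool occupied;		/* an adapter is currently present */
};

static int next_id = 1;		/* next overlay ID handed out by the mock */
static int live_overlays;	/* number of overlays currently applied */

/* Hypothetical stand-ins for the kernel's overlay create/destroy calls. */
static int mock_overlay_create(void)
{
	live_overlays++;
	return next_id++;
}

static void mock_overlay_destroy(int id)
{
	(void)id;
	live_overlays--;
}

static void slot_init(struct slot *s, bool populated_at_boot)
{
	s->overlay_id = NO_OVERLAY;
	s->has_boot_nodes = populated_at_boot;
	s->occupied = populated_at_boot;
}

/* Hot plug: apply an overlay and remember its ID in the slot. */
static void slot_plug(struct slot *s)
{
	assert(!s->occupied);
	s->overlay_id = mock_overlay_create();
	s->occupied = true;
}

/* Hot unplug: revert the overlay if there is one, else detach boot nodes. */
static void slot_unplug(struct slot *s)
{
	assert(s->occupied);
	if (s->overlay_id != NO_OVERLAY) {
		mock_overlay_destroy(s->overlay_id);
		s->overlay_id = NO_OVERLAY;
	} else {
		/* Boot-time nodes are only detached; their memory stays. */
		s->has_boot_nodes = false;
	}
	s->occupied = false;
}
```

A slot that was populated at boot takes the "detach only" path on its first
unplug; every later plug/unplug cycle goes through the overlay ID.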

Kernel APIs:
  * id = of_overlay_create(); of_overlay_destroy(id)
Skiboot API:
  * int64_t opal_get_overlay_dt(uint64_t *counter, void *blob, uint32_t len)
    blob == NULL: get the current counter.
    blob != NULL: get the overlay blob for changes since (*counter). Skiboot
    also returns the latest counter.
    The memory chunk for "blob" is always owned by the kernel, which doesn't
    know in advance how much memory the overlay FDT blob needs. So we have to
    try in discrete steps of 64KB, which is PAGE_SIZE. After the overlay blob
    has been unflattened and applied, the memory chunk can be freed. We don't
    have to keep it for reverting the changes introduced by the overlay.
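
A rough user-space sketch of the "try in 64KB steps" retrieval might look
like the following. Everything here is an assumption made for illustration:
mock_get_overlay_dt() only imitates the proposed opal_get_overlay_dt(), the
MOCK_TOO_SMALL return code is invented (the summary does not say how skiboot
reports a short buffer), and the 16-chunk cap is arbitrary.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE	(64 * 1024)	/* retry step, 64KB as in the summary */
#define MAX_CHUNKS	16		/* arbitrary cap for this sketch */
#define MOCK_OK		0
#define MOCK_TOO_SMALL	(-1)		/* invented error code */

/* Mock firmware state: generation counter and size of the pending overlay. */
static uint64_t fw_counter = 1;
static uint32_t fw_blob_size = 150 * 1024;

/* Hypothetical stand-in for the proposed opal_get_overlay_dt(). */
static int64_t mock_get_overlay_dt(uint64_t *counter, void *blob, uint32_t len)
{
	if (!blob) {			/* query only: report current counter */
		*counter = fw_counter;
		return MOCK_OK;
	}
	if (len < fw_blob_size)
		return MOCK_TOO_SMALL;	/* caller has to retry with more room */
	memset(blob, 0, fw_blob_size);	/* pretend to write the overlay blob */
	*counter = fw_counter;		/* hand back the latest counter */
	return MOCK_OK;
}

/*
 * Fetch the overlay describing the changes since *counter, growing the
 * buffer one chunk at a time. Returns a malloc'ed buffer that the caller
 * frees once the overlay has been applied, or NULL on failure.
 */
static void *fetch_overlay(uint64_t *counter, uint32_t *len_out)
{
	uint32_t len;

	for (len = CHUNK_SIZE; len <= MAX_CHUNKS * CHUNK_SIZE; len += CHUNK_SIZE) {
		uint64_t since = *counter;
		void *buf = malloc(len);

		if (!buf)
			return NULL;
		if (mock_get_overlay_dt(&since, buf, len) == MOCK_OK) {
			*counter = since;
			*len_out = len;
			return buf;
		}
		free(buf);		/* too small: drop it and try a bigger one */
	}
	return NULL;
}
```

An alternative design would let the firmware report the required size
directly so that a single allocation suffices; the summary above does not
specify that, so the sketch sticks to the stepwise retry.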

=============================
OVERLAY FDT BLOB FROM SKIBOOT
=============================

ROOT {
   A {
      target=<target node's phandle>
      __overlay__ {

      }
   }

   B {
      target=<target node's phandle>
      __overlay__ {

      }
   }
}
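
To illustrate how a blob of that shape would be consumed, here is a small
self-contained C model: each fragment names its target by phandle and the
children of its __overlay__ node get attached to that target. The struct
layouts, MAX_CHILDREN and the function names are hypothetical and far simpler
than the kernel's struct device_node; this is a sketch of the idea, not the
overlay implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 8

/* Toy live-tree node; much simpler than the kernel's device_node. */
struct node {
	const char *name;
	uint32_t phandle;		/* 0 means "no phandle" */
	struct node *child[MAX_CHILDREN];
	size_t nr_children;
};

/* One fragment (A or B in the layout above). */
struct fragment {
	uint32_t target;		/* phandle of the node to modify */
	struct node *overlay;		/* the __overlay__ node */
};

/* Depth-first search for the node carrying the given phandle. */
static struct node *find_by_phandle(struct node *root, uint32_t phandle)
{
	size_t i;

	if (phandle == 0)
		return NULL;
	if (root->phandle == phandle)
		return root;
	for (i = 0; i < root->nr_children; i++) {
		struct node *hit = find_by_phandle(root->child[i], phandle);

		if (hit)
			return hit;
	}
	return NULL;
}

/*
 * Attach every child of the fragment's __overlay__ node to the target.
 * Returns 0 on success, -1 if the target is missing or has no room.
 */
static int apply_fragment(struct node *root, const struct fragment *frag)
{
	struct node *target = find_by_phandle(root, frag->target);
	size_t i;

	if (!target)
		return -1;
	if (target->nr_children + frag->overlay->nr_children > MAX_CHILDREN)
		return -1;
	for (i = 0; i < frag->overlay->nr_children; i++)
		target->child[target->nr_children++] = frag->overlay->child[i];
	return 0;
}
```

In the hotplug case the target would be the PCI slot's node and the
__overlay__ children would be the nodes of the newly plugged devices.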

Thanks,
Gavin

>Cc: Grant Likely <grant.likely@linaro.org>
>Cc: Rob Herring <robh+dt@kernel.org>
>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>---
> drivers/of/dynamic.c   |  19 +++++--
> drivers/of/fdt.c       | 133 ++++++++++++++++++++++++++++++++++++++++---------
> include/linux/of.h     |   2 +
> include/linux/of_fdt.h |   1 +
> 4 files changed, 127 insertions(+), 28 deletions(-)
>
>diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
>index 3351ef4..f562080 100644
>--- a/drivers/of/dynamic.c
>+++ b/drivers/of/dynamic.c
>@@ -330,13 +330,22 @@ void of_node_release(struct kobject *kobj)
> 		return;
> 	}
>
>-	if (!of_node_check_flag(node, OF_DYNAMIC))
>+	/* Release the subtree */
>+	if (node->subtree) {
>+		kfree(node->subtree);
>+		node->subtree = NULL;
>+	}
>+
>+	if (!of_node_check_flag(node, OF_DYNAMIC) &&
>+	    !of_node_check_flag(node, OF_DYNAMIC_HYBIRD))
> 		return;
>
> 	while (prop) {
> 		struct property *next = prop->next;
>-		kfree(prop->name);
>-		kfree(prop->value);
>+		if (of_node_check_flag(node, OF_DYNAMIC)) {
>+			kfree(prop->name);
>+			kfree(prop->value);
>+		}
> 		kfree(prop);
> 		prop = next;
>
>@@ -345,7 +354,9 @@ void of_node_release(struct kobject *kobj)
> 			node->deadprops = NULL;
> 		}
> 	}
>-	kfree(node->full_name);
>+
>+	if (of_node_check_flag(node, OF_DYNAMIC))
>+		kfree(node->full_name);
> 	kfree(node->data);
> 	kfree(node);
> }
>diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
>index cde35c5d01..7659560 100644
>--- a/drivers/of/fdt.c
>+++ b/drivers/of/fdt.c
>@@ -28,6 +28,10 @@
> #include <asm/setup.h>  /* for COMMAND_LINE_SIZE */
> #include <asm/page.h>
>
>+#include "of_private.h"
>+
>+static int cur_node_depth;
>+
> /*
>  * of_fdt_limit_memory - limit the number of regions in the /memory node
>  * @limit: maximum entries
>@@ -168,20 +172,20 @@ static void *unflatten_dt_alloc(void **mem, unsigned long size,
>  * @dad: Parent struct device_node
>  * @fpsize: Size of the node path up at the current depth.
>  */
>-static void * unflatten_dt_node(void *blob,
>-				void *mem,
>-				int *poffset,
>-				struct device_node *dad,
>-				struct device_node **nodepp,
>-				unsigned long fpsize,
>-				bool dryrun)
>+static void *unflatten_dt_node(void *blob,
>+			       void *mem,
>+			       int *poffset,
>+			       struct device_node *dad,
>+			       struct device_node **nodepp,
>+			       unsigned long fpsize,
>+			       bool dryrun,
>+			       bool dynamic)
> {
> 	const __be32 *p;
> 	struct device_node *np;
> 	struct property *pp, **prev_pp = NULL;
> 	const char *pathp;
> 	unsigned int l, allocl;
>-	static int depth = 0;
> 	int old_depth;
> 	int offset;
> 	int has_name = 0;
>@@ -219,12 +223,18 @@ static void * unflatten_dt_node(void *blob,
> 		}
> 	}
>
>-	np = unflatten_dt_alloc(&mem, sizeof(struct device_node) + allocl,
>+	if (dynamic)
>+		np = kzalloc(sizeof(struct device_node) + allocl, GFP_KERNEL);
>+	else
>+		np = unflatten_dt_alloc(&mem,
>+				sizeof(struct device_node) + allocl,
> 				__alignof__(struct device_node));
> 	if (!dryrun) {
> 		char *fn;
> 		of_node_init(np);
> 		np->full_name = fn = ((char *)np) + sizeof(*np);
>+		if (dynamic)
>+			of_node_set_flag(np, OF_DYNAMIC_HYBIRD);
> 		if (new_format) {
> 			/* rebuild full path for new format */
> 			if (dad && dad->parent) {
>@@ -267,8 +277,12 @@ static void * unflatten_dt_node(void *blob,
> 		}
> 		if (strcmp(pname, "name") == 0)
> 			has_name = 1;
>-		pp = unflatten_dt_alloc(&mem, sizeof(struct property),
>-					__alignof__(struct property));
>+
>+		if (dynamic)
>+			pp = kzalloc(sizeof(struct property), GFP_KERNEL);
>+		else
>+			pp = unflatten_dt_alloc(&mem, sizeof(struct property),
>+						__alignof__(struct property));
> 		if (!dryrun) {
> 			/* We accept flattened tree phandles either in
> 			 * ePAPR-style "phandle" properties, or the
>@@ -309,8 +323,13 @@ static void * unflatten_dt_node(void *blob,
> 		if (pa < ps)
> 			pa = p1;
> 		sz = (pa - ps) + 1;
>-		pp = unflatten_dt_alloc(&mem, sizeof(struct property) + sz,
>-					__alignof__(struct property));
>+
>+		if (dynamic)
>+			pp = kzalloc(sizeof(struct property) + sz, GFP_KERNEL);
>+		else
>+			pp = unflatten_dt_alloc(&mem,
>+						sizeof(struct property) + sz,
>+						__alignof__(struct property));
> 		if (!dryrun) {
> 			pp->name = "name";
> 			pp->length = sz;
>@@ -334,13 +353,21 @@ static void * unflatten_dt_node(void *blob,
> 			np->type = "<NULL>";
> 	}
>
>-	old_depth = depth;
>-	*poffset = fdt_next_node(blob, *poffset, &depth);
>-	if (depth < 0)
>-		depth = 0;
>-	while (*poffset > 0 && depth > old_depth)
>-		mem = unflatten_dt_node(blob, mem, poffset, np, NULL,
>-					fpsize, dryrun);
>+	old_depth = cur_node_depth;
>+	*poffset = fdt_next_node(blob, *poffset, &cur_node_depth);
>+	while (*poffset > 0) {
>+		if (cur_node_depth < old_depth)
>+			break;
>+
>+		if (cur_node_depth == old_depth)
>+			mem = unflatten_dt_node(blob, mem, poffset,
>+						dad, NULL, fpsize,
>+						dryrun, dynamic);
>+		else if (cur_node_depth > old_depth)
>+			mem = unflatten_dt_node(blob, mem, poffset,
>+						np, NULL, fpsize,
>+						dryrun, dynamic);
>+	}
>
> 	if (*poffset < 0 && *poffset != -FDT_ERR_NOTFOUND)
> 		pr_err("unflatten: error %d processing FDT\n", *poffset);
>@@ -379,8 +406,8 @@ static void * unflatten_dt_node(void *blob,
>  * for the resulting tree
>  */
> static void __unflatten_device_tree(void *blob,
>-			     struct device_node **mynodes,
>-			     void * (*dt_alloc)(u64 size, u64 align))
>+				struct device_node **mynodes,
>+				void * (*dt_alloc)(u64 size, u64 align))
> {
> 	unsigned long size;
> 	int start;
>@@ -405,7 +432,9 @@ static void __unflatten_device_tree(void *blob,
>
> 	/* First pass, scan for size */
> 	start = 0;
>-	size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL, NULL, 0, true);
>+	cur_node_depth = 1;
>+	size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL,
>+						NULL, 0, true, false);
> 	size = ALIGN(size, 4);
>
> 	pr_debug("  size is %lx, allocating...\n", size);
>@@ -420,7 +449,8 @@ static void __unflatten_device_tree(void *blob,
>
> 	/* Second pass, do actual unflattening */
> 	start = 0;
>-	unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false);
>+	cur_node_depth = 1;
>+	unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false, false);
> 	if (be32_to_cpup(mem + size) != 0xdeadbeef)
> 		pr_warning("End of tree marker overwritten: %08x\n",
> 			   be32_to_cpup(mem + size));
>@@ -448,6 +478,61 @@ void of_fdt_unflatten_tree(unsigned long *blob,
> }
> EXPORT_SYMBOL_GPL(of_fdt_unflatten_tree);
>
>+static void populate_sysfs_for_child_nodes(struct device_node *parent)
>+{
>+	struct device_node *child;
>+
>+	for_each_child_of_node(parent, child) {
>+		__of_attach_node_sysfs(child);
>+		populate_sysfs_for_child_nodes(child);
>+	}
>+}
>+
>+/**
>+ * of_fdt_add_subtree - Create sub-tree of device nodes
>+ * @parent: parent device node to which the sub-tree will attach
>+ * @blob: flat device tree blob representing the sub-tree
>+ *
>+ * Copy over the FDT blob, which is passed from firmware, and then
>+ * unflatten the sub-tree.
>+ */
>+void of_fdt_add_subtree(struct device_node *parent, void *blob)
>+{
>+	int start = 0;
>+
>+	/* Validate the header */
>+	if (!blob || fdt_check_header(blob)) {
>+		pr_err("%s: Invalid device-tree blob header at 0x%p\n",
>+		       __func__, blob);
>+		return;
>+	}
>+
>+	/* Free the flat blob for last time lazily */
>+	if (parent->subtree) {
>+		kfree(parent->subtree);
>+		parent->subtree = NULL;
>+	}
>+
>+	/* Copy over the flat blob */
>+	parent->subtree = kzalloc(fdt_totalsize(blob), GFP_KERNEL);
>+	if (!parent->subtree) {
>+		pr_err("%s: Cannot copy over device-tree blob\n",
>+		       __func__);
>+		return;
>+	}
>+
>+	memcpy(parent->subtree, blob, fdt_totalsize(blob));
>+
>+	/* Unflatten it */
>+	mutex_lock(&of_mutex);
>+	cur_node_depth = 1;
>+	unflatten_dt_node(parent->subtree, NULL, &start, parent, NULL,
>+			  strlen(parent->full_name), false, true);
>+	populate_sysfs_for_child_nodes(parent);
>+	mutex_unlock(&of_mutex);
>+}
>+EXPORT_SYMBOL(of_fdt_add_subtree);
>+
> /* Everything below here references initial_boot_params directly. */
> int __initdata dt_root_addr_cells;
> int __initdata dt_root_size_cells;
>diff --git a/include/linux/of.h b/include/linux/of.h
>index ddeaae6..ac50b02 100644
>--- a/include/linux/of.h
>+++ b/include/linux/of.h
>@@ -60,6 +60,7 @@ struct device_node {
> 	struct	device_node *sibling;
> 	struct	kobject kobj;
> 	unsigned long _flags;
>+	void	*subtree;
> 	void	*data;
> #if defined(CONFIG_SPARC)
> 	const char *path_component_name;
>@@ -222,6 +223,7 @@ static inline unsigned long of_read_ulong(const __be32 *cell, int size)
> #define OF_DETACHED	2 /* node has been detached from the device tree */
> #define OF_POPULATED	3 /* device already created for the node */
> #define OF_POPULATED_BUS	4 /* of_platform_populate recursed to children of this node */
>+#define OF_DYNAMIC_HYBIRD	5 /* similar to OF_DYNAMIC, but partially */
>
> #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags)
> #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags)
>diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
>index 587ee50..1fb47d7 100644
>--- a/include/linux/of_fdt.h
>+++ b/include/linux/of_fdt.h
>@@ -39,6 +39,7 @@ extern int of_fdt_match(const void *blob, unsigned long node,
> 			const char *const *compat);
> extern void of_fdt_unflatten_tree(unsigned long *blob,
> 			       struct device_node **mynodes);
>+extern void of_fdt_add_subtree(struct device_node *parent, void *blob);
>
> /* TBD: Temporary export of fdt globals - remove when code fully merged */
> extern int __initdata dt_root_addr_cells;
>-- 
>2.1.0
>


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
@ 2015-05-15  1:27     ` Gavin Shan
  0 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-05-15  1:27 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linux-pci, Grant Likely, Rob Herring, bhelgaas, linuxppc-dev

On Fri, May 01, 2015 at 04:03:06PM +1000, Gavin Shan wrote:
>The requirement was raised while developing the PCI hotplug feature
>for the PowerPC PowerNV platform, which runs on top of skiboot firmware.
>When a PCI adapter is plugged into a PCI slot, the firmware rescans the
>slot and builds an FDT (Flat Device Tree) blob, which is sent to the
>PowerNV PCI hotplug driver for processing. The newly constructed device
>nodes from the FDT blob are expected to be attached to the device
>node of the PCI slot. Unfortunately, it seems we don't have an API
>to support the scenario. The patch intends to support it with the newly
>introduced function of_fdt_add_subtree(); the design behind it is
>as follows:
>
>   * When the sub-tree FDT blob, which is owned by firmware, is
>     received by the kernel, it is copied into a dynamically
>     allocated blob. After that, the FDT blob owned by firmware
>     isn't touched.
>   * Rework unflatten_dt_node() so that the device nodes at the current
>     and deeper depths are constructed from the FDT blob. All such
>     device nodes are marked with the flag OF_DYNAMIC_HYBIRD, which is
>     similar to OF_DYNAMIC. A device node with this flag set can be
>     freed, but in a different way from OF_DYNAMIC device nodes.
>   * of_fdt_add_subtree() is the newly introduced API that does the work.
>
>

There has already been a lot of discussion on how to reuse overlays for my case.
Thanks to all for your time on this. I spent some time thinking about it
last night and this morning, and I would like to summarize it below. Almost
all of the ideas come from you guys; I'm just documenting them. If there are
obvious problems in the following summary, please let me know so that I can
fix them as early as possible.

================
SKIBOOT & KERNEL
================

The idea came from Ben and I plan to implement it as follows:

- One counter is maintained by skiboot: cur_counter = 0;
- PCI hot plugging happens as follows:
  * Kernel gets skiboot's cur_counter with the OPAL API, which is (x).
  * Skiboot does the hot plugging and rescans the slot, then populates
    the device-tree nodes with (++cur_counter), which means the
    counter becomes (x+1).
  * Kernel retrieves the FDT overlay blob describing the device-tree
    changes since the last token (x). Kernel unflattens it and applies
    the changes as an overlay. The slot simply records the overlay (IDR)
    ID for the device-tree change.
- PCI hot unplugging happens as follows:
  * If the slot has a valid IDR ID, simply revert the overlay. Otherwise,
    the device nodes were unflattened at boot time and we just remove them
    as we do now. Note that those device nodes can't be freed because
    the memory chunks they consume are allocated from memblock or
    reserved by skiboot.
- Some questions/problems:
  * I don't understand how kexec can figure out the device-tree with
    the changes applied by overlays. I assume kexec simply uses the
    FDT blob from skiboot as seen by the first kernel during a fresh
    boot.

Kernel APIs:
  * id = of_overlay_create(); of_overlay_destroy(id)
Skiboot API:
  * int64_t opal_get_overlay_dt(uint64_t *counter, void *blob, uint32_t len)
    blob == NULL: get the current counter.
    blob != NULL: get the overlay blob for changes since (*counter). Skiboot
    also returns the latest counter.
    The memory chunk for "blob" is always owned by the kernel, which doesn't
    know in advance how much memory the overlay FDT blob needs. So we have to
    try in discrete steps of 64KB, which is PAGE_SIZE. After the overlay blob
    has been unflattened and applied, the memory chunk can be freed. We don't
    have to keep it for reverting the changes introduced by the overlay.

=============================
OVERLAY FDT BLOB FROM SKIBOOT
=============================

ROOT {
   A {
      target=<target node's phandle>
      __overlay__ {

      }
   }

   B {
      target=<target node's phandle>
      __overlay__ {

      }
   }
}

Thanks,
Gavin

>Cc: Grant Likely <grant.likely@linaro.org>
>Cc: Rob Herring <robh+dt@kernel.org>
>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>---
> drivers/of/dynamic.c   |  19 +++++--
> drivers/of/fdt.c       | 133 ++++++++++++++++++++++++++++++++++++++++---------
> include/linux/of.h     |   2 +
> include/linux/of_fdt.h |   1 +
> 4 files changed, 127 insertions(+), 28 deletions(-)
>
>diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
>index 3351ef4..f562080 100644
>--- a/drivers/of/dynamic.c
>+++ b/drivers/of/dynamic.c
>@@ -330,13 +330,22 @@ void of_node_release(struct kobject *kobj)
> 		return;
> 	}
>
>-	if (!of_node_check_flag(node, OF_DYNAMIC))
>+	/* Release the subtree */
>+	if (node->subtree) {
>+		kfree(node->subtree);
>+		node->subtree = NULL;
>+	}
>+
>+	if (!of_node_check_flag(node, OF_DYNAMIC) &&
>+	    !of_node_check_flag(node, OF_DYNAMIC_HYBIRD))
> 		return;
>
> 	while (prop) {
> 		struct property *next = prop->next;
>-		kfree(prop->name);
>-		kfree(prop->value);
>+		if (of_node_check_flag(node, OF_DYNAMIC)) {
>+			kfree(prop->name);
>+			kfree(prop->value);
>+		}
> 		kfree(prop);
> 		prop = next;
>
>@@ -345,7 +354,9 @@ void of_node_release(struct kobject *kobj)
> 			node->deadprops = NULL;
> 		}
> 	}
>-	kfree(node->full_name);
>+
>+	if (of_node_check_flag(node, OF_DYNAMIC))
>+		kfree(node->full_name);
> 	kfree(node->data);
> 	kfree(node);
> }
>diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
>index cde35c5d01..7659560 100644
>--- a/drivers/of/fdt.c
>+++ b/drivers/of/fdt.c
>@@ -28,6 +28,10 @@
> #include <asm/setup.h>  /* for COMMAND_LINE_SIZE */
> #include <asm/page.h>
>
>+#include "of_private.h"
>+
>+static int cur_node_depth;
>+
> /*
>  * of_fdt_limit_memory - limit the number of regions in the /memory node
>  * @limit: maximum entries
>@@ -168,20 +172,20 @@ static void *unflatten_dt_alloc(void **mem, unsigned long size,
>  * @dad: Parent struct device_node
>  * @fpsize: Size of the node path up at the current depth.
>  */
>-static void * unflatten_dt_node(void *blob,
>-				void *mem,
>-				int *poffset,
>-				struct device_node *dad,
>-				struct device_node **nodepp,
>-				unsigned long fpsize,
>-				bool dryrun)
>+static void *unflatten_dt_node(void *blob,
>+			       void *mem,
>+			       int *poffset,
>+			       struct device_node *dad,
>+			       struct device_node **nodepp,
>+			       unsigned long fpsize,
>+			       bool dryrun,
>+			       bool dynamic)
> {
> 	const __be32 *p;
> 	struct device_node *np;
> 	struct property *pp, **prev_pp = NULL;
> 	const char *pathp;
> 	unsigned int l, allocl;
>-	static int depth = 0;
> 	int old_depth;
> 	int offset;
> 	int has_name = 0;
>@@ -219,12 +223,18 @@ static void * unflatten_dt_node(void *blob,
> 		}
> 	}
>
>-	np = unflatten_dt_alloc(&mem, sizeof(struct device_node) + allocl,
>+	if (dynamic)
>+		np = kzalloc(sizeof(struct device_node) + allocl, GFP_KERNEL);
>+	else
>+		np = unflatten_dt_alloc(&mem,
>+				sizeof(struct device_node) + allocl,
> 				__alignof__(struct device_node));
> 	if (!dryrun) {
> 		char *fn;
> 		of_node_init(np);
> 		np->full_name = fn = ((char *)np) + sizeof(*np);
>+		if (dynamic)
>+			of_node_set_flag(np, OF_DYNAMIC_HYBIRD);
> 		if (new_format) {
> 			/* rebuild full path for new format */
> 			if (dad && dad->parent) {
>@@ -267,8 +277,12 @@ static void * unflatten_dt_node(void *blob,
> 		}
> 		if (strcmp(pname, "name") == 0)
> 			has_name = 1;
>-		pp = unflatten_dt_alloc(&mem, sizeof(struct property),
>-					__alignof__(struct property));
>+
>+		if (dynamic)
>+			pp = kzalloc(sizeof(struct property), GFP_KERNEL);
>+		else
>+			pp = unflatten_dt_alloc(&mem, sizeof(struct property),
>+						__alignof__(struct property));
> 		if (!dryrun) {
> 			/* We accept flattened tree phandles either in
> 			 * ePAPR-style "phandle" properties, or the
>@@ -309,8 +323,13 @@ static void * unflatten_dt_node(void *blob,
> 		if (pa < ps)
> 			pa = p1;
> 		sz = (pa - ps) + 1;
>-		pp = unflatten_dt_alloc(&mem, sizeof(struct property) + sz,
>-					__alignof__(struct property));
>+
>+		if (dynamic)
>+			pp = kzalloc(sizeof(struct property) + sz, GFP_KERNEL);
>+		else
>+			pp = unflatten_dt_alloc(&mem,
>+						sizeof(struct property) + sz,
>+						__alignof__(struct property));
> 		if (!dryrun) {
> 			pp->name = "name";
> 			pp->length = sz;
>@@ -334,13 +353,21 @@ static void * unflatten_dt_node(void *blob,
> 			np->type = "<NULL>";
> 	}
>
>-	old_depth = depth;
>-	*poffset = fdt_next_node(blob, *poffset, &depth);
>-	if (depth < 0)
>-		depth = 0;
>-	while (*poffset > 0 && depth > old_depth)
>-		mem = unflatten_dt_node(blob, mem, poffset, np, NULL,
>-					fpsize, dryrun);
>+	old_depth = cur_node_depth;
>+	*poffset = fdt_next_node(blob, *poffset, &cur_node_depth);
>+	while (*poffset > 0) {
>+		if (cur_node_depth < old_depth)
>+			break;
>+
>+		if (cur_node_depth == old_depth)
>+			mem = unflatten_dt_node(blob, mem, poffset,
>+						dad, NULL, fpsize,
>+						dryrun, dynamic);
>+		else if (cur_node_depth > old_depth)
>+			mem = unflatten_dt_node(blob, mem, poffset,
>+						np, NULL, fpsize,
>+						dryrun, dynamic);
>+	}
>
> 	if (*poffset < 0 && *poffset != -FDT_ERR_NOTFOUND)
> 		pr_err("unflatten: error %d processing FDT\n", *poffset);
>@@ -379,8 +406,8 @@ static void * unflatten_dt_node(void *blob,
>  * for the resulting tree
>  */
> static void __unflatten_device_tree(void *blob,
>-			     struct device_node **mynodes,
>-			     void * (*dt_alloc)(u64 size, u64 align))
>+				struct device_node **mynodes,
>+				void * (*dt_alloc)(u64 size, u64 align))
> {
> 	unsigned long size;
> 	int start;
>@@ -405,7 +432,9 @@ static void __unflatten_device_tree(void *blob,
>
> 	/* First pass, scan for size */
> 	start = 0;
>-	size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL, NULL, 0, true);
>+	cur_node_depth = 1;
>+	size = (unsigned long)unflatten_dt_node(blob, NULL, &start, NULL,
>+						NULL, 0, true, false);
> 	size = ALIGN(size, 4);
>
> 	pr_debug("  size is %lx, allocating...\n", size);
>@@ -420,7 +449,8 @@ static void __unflatten_device_tree(void *blob,
>
> 	/* Second pass, do actual unflattening */
> 	start = 0;
>-	unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false);
>+	cur_node_depth = 1;
>+	unflatten_dt_node(blob, mem, &start, NULL, mynodes, 0, false, false);
> 	if (be32_to_cpup(mem + size) != 0xdeadbeef)
> 		pr_warning("End of tree marker overwritten: %08x\n",
> 			   be32_to_cpup(mem + size));
>@@ -448,6 +478,61 @@ void of_fdt_unflatten_tree(unsigned long *blob,
> }
> EXPORT_SYMBOL_GPL(of_fdt_unflatten_tree);
>
>+static void populate_sysfs_for_child_nodes(struct device_node *parent)
>+{
>+	struct device_node *child;
>+
>+	for_each_child_of_node(parent, child) {
>+		__of_attach_node_sysfs(child);
>+		populate_sysfs_for_child_nodes(child);
>+	}
>+}
>+
>+/**
>+ * of_fdt_add_subtree - Create sub-tree of device nodes
>+ * @parent: parent device node to which the sub-tree will attach
>+ * @blob: flat device tree blob representing the sub-tree
>+ *
>+ * Copy the FDT blob passed from firmware, then unflatten the
>+ * sub-tree it describes.
>+ */
>+void of_fdt_add_subtree(struct device_node *parent, void *blob)
>+{
>+	int start = 0;
>+
>+	/* Validate the header */
>+	if (!blob || fdt_check_header(blob)) {
>+		pr_err("%s: Invalid device-tree blob header at 0x%p\n",
>+		       __func__, blob);
>+		return;
>+	}
>+
>+	/* Lazily free the flat blob left over from the previous call */
>+	if (parent->subtree) {
>+		kfree(parent->subtree);
>+		parent->subtree = NULL;
>+	}
>+
>+	/* Copy over the flat blob */
>+	parent->subtree = kzalloc(fdt_totalsize(blob), GFP_KERNEL);
>+	if (!parent->subtree) {
>+		pr_err("%s: Cannot copy over device-tree blob\n",
>+		       __func__);
>+		return;
>+	}
>+
>+	memcpy(parent->subtree, blob, fdt_totalsize(blob));
>+
>+	/* Unflatten it */
>+	mutex_lock(&of_mutex);
>+	cur_node_depth = 1;
>+	unflatten_dt_node(parent->subtree, NULL, &start, parent, NULL,
>+			  strlen(parent->full_name), false, true);
>+	populate_sysfs_for_child_nodes(parent);
>+	mutex_unlock(&of_mutex);
>+}
>+EXPORT_SYMBOL(of_fdt_add_subtree);
>+
> /* Everything below here references initial_boot_params directly. */
> int __initdata dt_root_addr_cells;
> int __initdata dt_root_size_cells;
>diff --git a/include/linux/of.h b/include/linux/of.h
>index ddeaae6..ac50b02 100644
>--- a/include/linux/of.h
>+++ b/include/linux/of.h
>@@ -60,6 +60,7 @@ struct device_node {
> 	struct	device_node *sibling;
> 	struct	kobject kobj;
> 	unsigned long _flags;
>+	void	*subtree;
> 	void	*data;
> #if defined(CONFIG_SPARC)
> 	const char *path_component_name;
>@@ -222,6 +223,7 @@ static inline unsigned long of_read_ulong(const __be32 *cell, int size)
> #define OF_DETACHED	2 /* node has been detached from the device tree */
> #define OF_POPULATED	3 /* device already created for the node */
> #define OF_POPULATED_BUS	4 /* of_platform_populate recursed to children of this node */
>+#define OF_DYNAMIC_HYBIRD	5 /* similar to OF_DYNAMIC, but partially */
>
> #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags)
> #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags)
>diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
>index 587ee50..1fb47d7 100644
>--- a/include/linux/of_fdt.h
>+++ b/include/linux/of_fdt.h
>@@ -39,6 +39,7 @@ extern int of_fdt_match(const void *blob, unsigned long node,
> 			const char *const *compat);
> extern void of_fdt_unflatten_tree(unsigned long *blob,
> 			       struct device_node **mynodes);
>+extern void of_fdt_add_subtree(struct device_node *parent, void *blob);
>
> /* TBD: Temporary export of fdt globals - remove when code fully merged */
> extern int __initdata dt_root_addr_cells;
>-- 
>2.1.0
>

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-05-14  0:54                           ` Benjamin Herrenschmidt
@ 2015-06-07  7:54                               ` Grant Likely
  -1 siblings, 0 replies; 184+ messages in thread
From: Grant Likely @ 2015-06-07  7:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Rob Herring
  Cc: Pantelis Antoniou, Gavin Shan, linuxppc-dev, linux-pci,
	Bjorn Helgaas, devicetree

Sorry for not weighing in earlier, I've had other work keeping me away.

My short answer: don't use overlays. They're not what you need. Generic
CONFIG_OF_DYNAMIC should be all that is required to make changes in your
use case.

Overlays are a specific API for applying a set of changes in a
revertible way, but as you say, it is a lot more complicated. However,
overlays are built on top of the of_changeset API, which is a lot
simpler. It doesn't do any phandle resolution, and it doesn't do any
tracking. It takes a set of changes (attach node, detach node, add
property, remove property) and applies them to the live tree. At that
point the changes are permanent*.

It is documented in Documentation/devicetree/changesets.txt
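
To make the record-then-apply model concrete, here is a toy userspace
sketch. It is not kernel code and does not use the of_changeset API:
every name in it (toy_node, toy_changeset, cs_add, cs_apply, cs_revert)
is made up for illustration, and "attaching" a node is reduced to
setting a flag. It only shows the shape of the idea: actions are
queued first, applied in order, and can be undone by walking the list
backwards.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names come from the kernel. */
enum cs_action { CS_ATTACH_NODE, CS_DETACH_NODE };

struct toy_node {
	const char *name;
	int attached;		/* 1 while the node is in the "live tree" */
};

struct cs_entry {
	enum cs_action action;
	struct toy_node *node;
};

#define CS_MAX 8

struct toy_changeset {
	struct cs_entry entries[CS_MAX];
	size_t count;
};

/* Record an action; nothing touches the tree until cs_apply(). */
static int cs_add(struct toy_changeset *cs, enum cs_action action,
		  struct toy_node *node)
{
	if (cs->count >= CS_MAX)
		return -1;
	cs->entries[cs->count].action = action;
	cs->entries[cs->count].node = node;
	cs->count++;
	return 0;
}

/* Perform one entry, or its inverse when invert is non-zero. */
static void cs_do(const struct cs_entry *e, int invert)
{
	int attach = (e->action == CS_ATTACH_NODE);

	if (invert)
		attach = !attach;
	e->node->attached = attach;
}

/* Apply every recorded action in the order it was added. */
static void cs_apply(struct toy_changeset *cs)
{
	size_t i;

	for (i = 0; i < cs->count; i++)
		cs_do(&cs->entries[i], 0);
}

/* Revert: walk the list backwards, applying the inverse of each action. */
static void cs_revert(struct toy_changeset *cs)
{
	size_t i;

	for (i = cs->count; i > 0; i--)
		cs_do(&cs->entries[i - 1], 1);
}
```

A caller would queue its actions with cs_add(), call cs_apply() once,
and only ever call cs_revert() if it wants overlay-like behaviour.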

Ideally, I want all DT changes to go through the changeset API so that
the lifecycle issues are dealt with in one place. It also defers firing
notifiers until after the entire changeset is applied. With
of_attach_node/of_detach_node the notifiers are sent immediately after
each change, when the tree may be in an inconsistent state. For example,
a driver may expect child nodes that haven't been added yet.
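
The difference between immediate and deferred notification can be
shown with a small userspace model. Again this is only a sketch under
stated assumptions: toy_tree, notify(), build_immediate() and
build_deferred() are hypothetical names, and the "listener" simply
records how many children were visible the first time it was called.

```c
#include <assert.h>

/* Toy model only: a parent node plus a count of attached children. */
struct toy_tree {
	int parent_attached;
	int nr_children;
};

/* Children visible to the listener at its first notification. */
static int seen_children = -1;

static void notify(const struct toy_tree *t)
{
	if (seen_children < 0)
		seen_children = t->nr_children;
}

/* Immediate style: notify after every single change. */
static void build_immediate(struct toy_tree *t, int n)
{
	int i;

	t->parent_attached = 1;
	notify(t);		/* listener sees a parent with no children */
	for (i = 0; i < n; i++) {
		t->nr_children++;
		notify(t);
	}
}

/* Deferred style: make all changes first, then notify once. */
static void build_deferred(struct toy_tree *t, int n)
{
	int i;

	t->parent_attached = 1;
	for (i = 0; i < n; i++)
		t->nr_children++;
	notify(t);		/* listener sees the finished subtree */
}
```

In the immediate case the listener's first view is a parent with zero
children; in the deferred case it sees all of them at once.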

Comments below...

* There is an API for reverting a changeset, which simply undoes the
  changes in reverse order. The overlay code uses it, but you
  won't need it.

On Thu, 14 May 2015 10:54:31 +1000, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2015-05-13 at 19:18 -0500, Rob Herring wrote:
> 
> > I haven't decided really.
> > 
> > The main thing with the current patch is I don't really like the added
> > complexity to unflatten_dt_node. It is already a fairly complex
> > function. Perhaps removing of "hybrid" as discussed will help?
> 
> I agree, we should be able to make that much simpler, I was planning on
> sorting that out with Gavin.

Ditto here. I don't want to have any new kinds of nodes created either.
They are either OF_DYNAMIC, and therefore freeable, or they are not and
cannot be freed.
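
The two-state rule can be sketched in a few lines of userspace C. This
is an illustration, not the kernel's implementation: TOY_DYNAMIC,
toy_node_alloc() and toy_node_release() are hypothetical stand-ins for
the OF_DYNAMIC flag and the node lifecycle code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flag standing in for OF_DYNAMIC. */
#define TOY_DYNAMIC	(1UL << 0)

struct toy_node {
	unsigned long flags;
};

/* Heap-allocated nodes are marked dynamic at creation time. */
static struct toy_node *toy_node_alloc(void)
{
	struct toy_node *n = calloc(1, sizeof(*n));

	if (n)
		n->flags |= TOY_DYNAMIC;
	return n;
}

/* Returns 1 if the node was freed, 0 if it is not freeable. */
static int toy_node_release(struct toy_node *n)
{
	if (!(n->flags & TOY_DYNAMIC))
		return 0;	/* e.g. a node from the boot-time tree */
	free(n);
	return 1;
}
```

There is no third, partially freeable state in this model: a node
either carries the flag from the moment it is allocated or never does.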

> > If there are things we can do to make overlays easier to use in your
> > use case, I'd like to hear ideas. I don't really buy that being more
> > complex than needed is an obstacle. That is very often the case to
> > have common, scale-able solutions. I want to see a simple case be
> > simple to support.
> 
> Well, it's a LOT more complex from the FW perspective for a bunch of
> features we don't really need, in a way because the DT update in our
> case is just purely informational to avoid keeping wrong/outdated DT
> bits, it has little functional impact (it might have a bit for interrupt
> routing through bridges though).
> 
> However, I am also pursuing an approach on FW side using a generation
> count in our nodes and properties which we could use to generate
> arbitrary overlays if we know what generation linux has.
> 
> There might actual be a usage scenario for a generic way for our
> firwmare to convey DT updates to Linux for other reasons.
> 
> A few things that I don't find in the overlay code (but maybe I haven't
> looked at it hard enough):
> 
>  - Can it remove nodes/properties ?

Overlays: No, because I asked Pantelis to drop them.
Changeset: yes, absolutely

>  - Can it "commit" a changeset so it's permanently part of the main DT ?
> We will never have a concept of "revertable" changesets, if we need a
> subsequent update, we will get a new overlay from FW that will remove
> what needs to be removed and add what needs to be added.

Yes

> IE, our current mechanism without overlay is fairly simple:
> 
>   - On PCI unplug, we remove all nodes below the slot (from linux),
> the FW does the equivalent internally.
> 
>   - On PCI re-plug, the FW internally builds new nodes and sends a
> new subtree as an FDT that we can expand/attach.
> 
> Now we could consider that subtree as a changeset that can be undone,
> but that wouldn't work for boot time. And subsequent updates wouldn't
> have that concept of "undoing" anyway.
> 
> IE. conceptually, what overlays do today is quite rooted around the idea
> of having a fixed "base" DT and some pre-compiled DTB overlays that
> get added/removed. The design completely ignore the idea of a FW that
> maintains a "live" tree which we want to keep in sync, which is what we
> want to do here, or what we could do with a "live" open firmware
> implementation.

Right, which is exactly the reason for the changeset/overlay split.
Overlays assume a fixed base, and that overlays are kind of like plug-in
modules. changeset makes no such assumption.

g.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-06-07  7:54                               ` Grant Likely
@ 2015-06-08 20:57                                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2015-06-08 20:57 UTC (permalink / raw)
  To: Grant Likely
  Cc: Rob Herring, Pantelis Antoniou, Gavin Shan, linuxppc-dev,
	linux-pci, Bjorn Helgaas, devicetree

On Sun, 2015-06-07 at 08:54 +0100, Grant Likely wrote:
> > IE. conceptually, what overlays do today is quite rooted around the idea
> > of having a fixed "base" DT and some pre-compiled DTB overlays that
> > get added/removed. The design completely ignore the idea of a FW that
> > maintains a "live" tree which we want to keep in sync, which is what we
> > want to do here, or what we could do with a "live" open firmware
> > implementation.
> 
> Right, which is exactly the reason for the changeset/overlay split.
> Overlays assume a fixed base, and that overlays are kind of like plug-in
> modules. changeset makes no such assumption.

So you suggest we create a function that takes an fdt and an "anchor" as input,
and expands that FDT below that anchor, but does so by using the changeset API
under the hood ?

Even that looks somewhat tricky (turn that bit of FDT into a pile of changeset
actions), however, I can see how we could create a new function inside changeset
to attach a subtree.

Ie. of_attach_subtree() (which could have its own reconfig action but we
don't care that much yet), which takes an expanded subtree and an anchor, and
calls of_attach_node() in effect for all nodes in there.

We could then have a two pass mechanism in our hotplug code:

 - Expand the bit of fdt into a separate tree
 - Use of_attach_subtree to "add" that subtree to the main tree
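
The second pass above can be sketched in userspace C as follows. This
is only a model of the proposal: of_attach_subtree() does not exist
yet, and toy_node, toy_attach_node() and toy_attach_subtree() are
hypothetical names. The subtree is assumed to be already expanded
(first pass done), with its internal child/sibling links in place.

```c
#include <assert.h>
#include <stddef.h>

/* Toy node with the usual parent / first-child / next-sibling links. */
struct toy_node {
	const char *name;
	struct toy_node *parent;
	struct toy_node *child;		/* first child */
	struct toy_node *sibling;	/* next sibling */
	int attached;
};

static int attach_calls;

/* Hypothetical per-node attach: set the parent and mark the node live. */
static void toy_attach_node(struct toy_node *np, struct toy_node *parent)
{
	np->parent = parent;
	np->attached = 1;
	attach_calls++;
}

/* Pre-order walk, so a parent is always attached before its children. */
static void attach_walk(struct toy_node *np, struct toy_node *parent)
{
	struct toy_node *c;

	toy_attach_node(np, parent);
	for (c = np->child; c; c = c->sibling)
		attach_walk(c, np);
}

/* Hook the expanded subtree under the anchor, then attach every node. */
static void toy_attach_subtree(struct toy_node *anchor,
			       struct toy_node *root)
{
	root->sibling = anchor->child;
	anchor->child = root;
	attach_walk(root, anchor);
}
```

For a slot with one bridge and two functions below it, the walk would
call toy_attach_node() three times: bridge first, then each function.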

What do you think ?

Ben.



^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-06-08 20:57                                   ` Benjamin Herrenschmidt
@ 2015-06-08 21:34                                       ` Grant Likely
  -1 siblings, 0 replies; 184+ messages in thread
From: Grant Likely @ 2015-06-08 21:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Rob Herring, Pantelis Antoniou, Gavin Shan, linuxppc-dev,
	linux-pci, Bjorn Helgaas, devicetree

On Mon, Jun 8, 2015 at 9:57 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Sun, 2015-06-07 at 08:54 +0100, Grant Likely wrote:
>> > IE. conceptually, what overlays do today is quite rooted around the idea
>> > of having a fixed "base" DT and some pre-compiled DTB overlays that
>> > get added/removed. The design completely ignore the idea of a FW that
>> > maintains a "live" tree which we want to keep in sync, which is what we
>> > want to do here, or what we could do with a "live" open firmware
>> > implementation.
>>
>> Right, which is exactly the reason for the changeset/overlay split.
>> Overlays assume a fixed base, and that overlays are kind of like plug-in
>> modules. changeset makes no such assumption.
>
> So you suggest we create a function that takes an fdt and an "anchor" as input,
> and expands that FDT below that anchor, but does so by using the changeset API
> under the hood ?
>
> Even that looks somewhat tricky (turn that bit of FDT into a pile of changeset
> actions), however, I can see how we could create a new function inside changeset
> to attach a subtree.
>
> Ie. of_attach_subtree() (which could have it's own reconfig action but we
> don't care that much yet), which takes an expanded subtree and an anchor, and
> calls of_attach_node() in effect for all nodes in there.
>
> We could then have a two pass mechanism in our hotplug code:
>
>  - Expand the bit of fdt into a separate tree
>  - Use of_attach_subtree to "add" that subtree to the main tree
>
> What do you think ?

I like that.

g.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [PATCH v4 19/21] drivers/of: Support adding sub-tree
  2015-06-08 21:34                                       ` Grant Likely
  (?)
@ 2015-06-10  6:55                                       ` Gavin Shan
  -1 siblings, 0 replies; 184+ messages in thread
From: Gavin Shan @ 2015-06-10  6:55 UTC (permalink / raw)
  To: Grant Likely
  Cc: Benjamin Herrenschmidt, Rob Herring, Pantelis Antoniou,
	Gavin Shan, linuxppc-dev, linux-pci, Bjorn Helgaas, devicetree

On Mon, Jun 08, 2015 at 10:34:13PM +0100, Grant Likely wrote:
>On Mon, Jun 8, 2015 at 9:57 PM, Benjamin Herrenschmidt
><benh@kernel.crashing.org> wrote:
>> On Sun, 2015-06-07 at 08:54 +0100, Grant Likely wrote:
>>> > IE. conceptually, what overlays do today is quite rooted around the idea
>>> > of having a fixed "base" DT and some pre-compiled DTB overlays that
>>> > get added/removed. The design completely ignore the idea of a FW that
>>> > maintains a "live" tree which we want to keep in sync, which is what we
>>> > want to do here, or what we could do with a "live" open firmware
>>> > implementation.
>>>
>>> Right, which is exactly the reason for the changeset/overlay split.
>>> Overlays assume a fixed base, and that overlays are kind of like plug-in
>>> modules. changeset makes no such assumption.
>>
>> So you suggest we create a function that takes an fdt and an "anchor" as input,
>> and expands that FDT below that anchor, but does so by using the changeset API
>> under the hood ?
>>
>> Even that looks somewhat tricky (turn that bit of FDT into a pile of changeset
>> actions), however, I can see how we could create a new function inside changeset
>> to attach a subtree.
>>
>> Ie. of_attach_subtree() (which could have it's own reconfig action but we
>> don't care that much yet), which takes an expanded subtree and an anchor, and
>> calls of_attach_node() in effect for all nodes in there.
>>
>> We could then have a two pass mechanism in our hotplug code:
>>
>>  - Expand the bit of fdt into a separate tree
>>  - Use of_attach_subtree to "add" that subtree to the main tree
>>
>> What do you think ?
>
>I like that.
>

Thanks, Grant and Ben. I'm currently collecting more feedback for v5, so
this is something I will address in v6.

Thanks,
Gavin

>g.
>

^ permalink raw reply	[flat|nested] 184+ messages in thread

end of thread, other threads:[~2015-06-10  6:55 UTC | newest]

Thread overview: 184+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-01  6:02 [PATCH v4 00/21] PowerPC/PowerNV: PCI Slot Management Gavin Shan
2015-05-01  6:02 ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 01/21] pci: Add pcibios_setup_bridge() Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-07 22:12   ` Bjorn Helgaas
2015-05-07 22:12     ` Bjorn Helgaas
2015-05-11  1:59     ` Gavin Shan
2015-05-11  1:59       ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 02/21] powerpc/powernv: Enable M64 on P7IOC Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-09  0:18   ` Alexey Kardashevskiy
2015-05-09  0:18     ` Alexey Kardashevskiy
2015-05-11  4:37     ` Gavin Shan
2015-05-11  4:37       ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 03/21] powerpc/powernv: M64 support improvement Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-09 10:24   ` Alexey Kardashevskiy
2015-05-09 10:24     ` Alexey Kardashevskiy
2015-05-11  4:47     ` Gavin Shan
2015-05-11  4:47       ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 04/21] powerpc/powernv: Improve IO and M32 mapping Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-09 10:53   ` Alexey Kardashevskiy
2015-05-09 10:53     ` Alexey Kardashevskiy
2015-05-11  4:52     ` Gavin Shan
2015-05-11  4:52       ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 05/21] powerpc/powernv: Improve DMA32 segment assignment Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 06/21] powerpc/powernv: Create PEs dynamically Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-09 11:43   ` Alexey Kardashevskiy
2015-05-09 11:43     ` Alexey Kardashevskiy
2015-05-11  4:55     ` Gavin Shan
2015-05-11  4:55       ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 07/21] powerpc/powernv: Release " Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-09 12:43   ` Alexey Kardashevskiy
2015-05-09 12:43     ` Alexey Kardashevskiy
2015-05-11  6:25     ` Gavin Shan
2015-05-11  6:25       ` Gavin Shan
2015-05-11  7:02       ` Alexey Kardashevskiy
2015-05-11  7:02         ` Alexey Kardashevskiy
2015-05-12  0:03         ` Gavin Shan
2015-05-12  0:03           ` Gavin Shan
2015-05-12  0:53           ` Alexey Kardashevskiy
2015-05-12  0:53             ` Alexey Kardashevskiy
2015-05-12  1:25             ` Gavin Shan
2015-05-12  1:25               ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 08/21] powerpc/powernv: Drop pnv_ioda_setup_dev_PE() Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-09 12:45   ` Alexey Kardashevskiy
2015-05-09 12:45     ` Alexey Kardashevskiy
2015-05-01  6:02 ` [PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-09 13:41   ` Alexey Kardashevskiy
2015-05-09 13:41     ` Alexey Kardashevskiy
2015-05-11  6:45     ` Gavin Shan
2015-05-11  6:45       ` Gavin Shan
2015-05-11  7:16       ` Alexey Kardashevskiy
2015-05-11  7:16         ` Alexey Kardashevskiy
2015-05-01  6:02 ` [PATCH v4 10/21] powerpc/powernv: Fundamental reset for PCI bus reset Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-09 14:12   ` Alexey Kardashevskiy
2015-05-09 14:12     ` Alexey Kardashevskiy
2015-05-11  6:47     ` Gavin Shan
2015-05-11  6:47       ` Gavin Shan
2015-05-11  7:17       ` Alexey Kardashevskiy
2015-05-11  7:17         ` Alexey Kardashevskiy
2015-05-12  0:04         ` Gavin Shan
2015-05-12  0:04           ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 11/21] powerpc/pci: Don't scan empty slot Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-01  6:02 ` [PATCH v4 12/21] powerpc/pci: Move pcibios_find_pci_bus() around Gavin Shan
2015-05-01  6:02   ` Gavin Shan
2015-05-01  6:03 ` [PATCH v4 13/21] powerpc/powernv: Introduce pnv_pci_poll() Gavin Shan
2015-05-01  6:03   ` Gavin Shan
2015-05-09 14:30   ` Alexey Kardashevskiy
2015-05-09 14:30     ` Alexey Kardashevskiy
2015-05-11  7:19     ` Gavin Shan
2015-05-11  7:19       ` Gavin Shan
2015-05-01  6:03 ` [PATCH v4 14/21] powerpc/powernv: Functions to get/reset PCI slot status Gavin Shan
2015-05-01  6:03   ` Gavin Shan
2015-05-09 14:44   ` Alexey Kardashevskiy
2015-05-09 14:44     ` Alexey Kardashevskiy
2015-05-01  6:03 ` [PATCH v4 15/21] powerpc/pci: Delay creating pci_dn Gavin Shan
2015-05-01  6:03   ` Gavin Shan
2015-05-09 14:55   ` Alexey Kardashevskiy
2015-05-09 14:55     ` Alexey Kardashevskiy
2015-05-11  7:21     ` Gavin Shan
2015-05-11  7:21       ` Gavin Shan
2015-05-01  6:03 ` [PATCH v4 16/21] powerpc/pci: Create eeh_dev while " Gavin Shan
2015-05-01  6:03   ` Gavin Shan
2015-05-09 15:08   ` Alexey Kardashevskiy
2015-05-09 15:08     ` Alexey Kardashevskiy
2015-05-11  7:24     ` Gavin Shan
2015-05-11  7:24       ` Gavin Shan
2015-05-01  6:03 ` [PATCH v4 17/21] powerpc/pci: Export traverse_pci_device_nodes() Gavin Shan
2015-05-01  6:03   ` Gavin Shan
2015-05-01  6:03 ` [PATCH v4 18/21] powerpc/pci: Update bridge windows on PCI plugging Gavin Shan
2015-05-01  6:03   ` Gavin Shan
2015-05-01  6:03 ` [PATCH v4 19/21] drivers/of: Support adding sub-tree Gavin Shan
2015-05-01  6:03   ` Gavin Shan
2015-05-01 12:54   ` Rob Herring
2015-05-01 12:54     ` Rob Herring
2015-05-01 15:22     ` Benjamin Herrenschmidt
2015-05-01 15:22       ` Benjamin Herrenschmidt
2015-05-01 18:46       ` Rob Herring
2015-05-01 18:46         ` Rob Herring
2015-05-01 22:57         ` Benjamin Herrenschmidt
2015-05-01 22:57           ` Benjamin Herrenschmidt
2015-05-01 23:29           ` Benjamin Herrenschmidt
2015-05-01 23:29             ` Benjamin Herrenschmidt
2015-05-02  2:48             ` Benjamin Herrenschmidt
2015-05-02  2:48               ` Benjamin Herrenschmidt
2015-05-04  1:30               ` Gavin Shan
2015-05-04  1:30                 ` Gavin Shan
2015-05-04  4:51                 ` Benjamin Herrenschmidt
2015-05-04  4:51                   ` Benjamin Herrenschmidt
2015-05-04  0:23             ` Gavin Shan
2015-05-04  0:23               ` Gavin Shan
     [not found]           ` <1430521038.7979.70.camel-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
2015-05-04 16:41             ` Pantelis Antoniou
2015-05-04 16:41               ` Pantelis Antoniou
2015-05-04 16:41               ` Pantelis Antoniou
2015-05-04 21:14               ` Benjamin Herrenschmidt
2015-05-04 21:14                 ` Benjamin Herrenschmidt
2015-05-13 23:35                 ` Benjamin Herrenschmidt
2015-05-13 23:35                   ` Benjamin Herrenschmidt
     [not found]                   ` <1431560124.20218.91.camel-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
2015-05-14  0:18                     ` Rob Herring
2015-05-14  0:18                       ` Rob Herring
2015-05-14  0:18                       ` Rob Herring
     [not found]                       ` <CAL_JsqKqTa5eg3eOqx3bkeNdO_920WwDiRbQaxwWLEWpCypFmA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-05-14  0:54                         ` Benjamin Herrenschmidt
2015-05-14  0:54                           ` Benjamin Herrenschmidt
2015-05-14  0:54                           ` Benjamin Herrenschmidt
2015-05-14  6:23                           ` Pantelis Antoniou
2015-05-14  6:23                             ` Pantelis Antoniou
2015-05-14  6:46                             ` Benjamin Herrenschmidt
2015-05-14  6:46                               ` Benjamin Herrenschmidt
2015-05-14  7:04                               ` Pantelis Antoniou
2015-05-14  7:04                                 ` Pantelis Antoniou
     [not found]                                 ` <3988EABE-3DE9-4E1C-9778-22E35138E359-wVdstyuyKrO8r51toPun2/C9HSW9iNxf@public.gmane.org>
2015-05-14  7:14                                   ` Benjamin Herrenschmidt
2015-05-14  7:14                                     ` Benjamin Herrenschmidt
2015-05-14  7:14                                     ` Benjamin Herrenschmidt
2015-05-14  7:19                                     ` Pantelis Antoniou
2015-05-14  7:19                                       ` Pantelis Antoniou
2015-05-14  7:19                                       ` Pantelis Antoniou
     [not found]                                       ` <75F026CA-5AC1-4106-B2F0-AB0D006DEF5A-wVdstyuyKrO8r51toPun2/C9HSW9iNxf@public.gmane.org>
2015-05-14  7:25                                         ` Benjamin Herrenschmidt
2015-05-14  7:25                                           ` Benjamin Herrenschmidt
2015-05-14  7:25                                           ` Benjamin Herrenschmidt
2015-05-14  7:29                                           ` Benjamin Herrenschmidt
2015-05-14  7:29                                             ` Benjamin Herrenschmidt
     [not found]                                           ` <1431588358.4160.42.camel-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
2015-05-14  7:34                                             ` Pantelis Antoniou
2015-05-14  7:34                                               ` Pantelis Antoniou
2015-05-14  7:34                                               ` Pantelis Antoniou
     [not found]                                               ` <D7FC0542-DD1A-428F-8E75-81620C6D83DC-wVdstyuyKrO8r51toPun2/C9HSW9iNxf@public.gmane.org>
2015-05-14  7:47                                                 ` Benjamin Herrenschmidt
2015-05-14  7:47                                                   ` Benjamin Herrenschmidt
2015-05-14  7:47                                                   ` Benjamin Herrenschmidt
2015-05-14 11:02                                                   ` Pantelis Antoniou
2015-05-14 11:02                                                     ` Pantelis Antoniou
2015-05-14 11:02                                                     ` Pantelis Antoniou
2015-05-14 23:25                                                     ` Benjamin Herrenschmidt
2015-05-14 23:25                                                       ` Benjamin Herrenschmidt
     [not found]                           ` <1431564871.4160.8.camel-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
2015-06-07  7:54                             ` Grant Likely
2015-06-07  7:54                               ` Grant Likely
     [not found]                               ` <20150607075422.6ECE9C40A12-WNowdnHR2B42iJbIjFUEsiwD8/FfD2ys@public.gmane.org>
2015-06-08 20:57                                 ` Benjamin Herrenschmidt
2015-06-08 20:57                                   ` Benjamin Herrenschmidt
     [not found]                                   ` <1433797073.4526.163.camel-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>
2015-06-08 21:34                                     ` Grant Likely
2015-06-08 21:34                                       ` Grant Likely
2015-06-10  6:55                                       ` Gavin Shan
2015-05-03 23:28     ` Gavin Shan
2015-05-03 23:28       ` Gavin Shan
2015-05-15  1:27   ` Gavin Shan
2015-05-15  1:27     ` Gavin Shan
2015-05-01  6:03 ` [PATCH v4 20/21] powerpc/powernv: Select OF_DYNAMIC Gavin Shan
2015-05-01  6:03   ` Gavin Shan
2015-05-01  6:03 ` [PATCH v4 21/21] pci/hotplug: PowerPC PowerNV PCI hotplug driver Gavin Shan
2015-05-01  6:03   ` Gavin Shan
2015-05-09 15:54   ` Alexey Kardashevskiy
2015-05-09 15:54     ` Alexey Kardashevskiy
2015-05-11  7:38     ` Gavin Shan
2015-05-11  7:38       ` Gavin Shan
2015-05-08 23:59 ` [PATCH v4 00/21] PowerPC/PowerNV: PCI Slot Management Alexey Kardashevskiy
2015-05-08 23:59   ` Alexey Kardashevskiy
2015-05-11  7:40   ` Gavin Shan
2015-05-11  7:40     ` Gavin Shan
