linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH V9 00/18] Enable SRIOV on PowerNV
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

This patchset enables SRIOV on POWER8.

The general idea is to put each VF into an individual PE and to allocate the
required resources such as MMIO/DMA/MSI. The major difficulty comes from the
MMIO allocation and the adjustment of the PF's IOV BAR.

On P8, we use the M64BT to cover a PF's IOV BAR, which lets an individual VF
sit in its own PE. This gives more flexibility, but at the same time it
imposes some restrictions on the PF's IOV BAR size and alignment.

To achieve this, we need to make some adjustments to the PCI devices'
resources (a worked example follows the list below).
1. Expand the IOV BAR properly.
   Done by pnv_pci_ioda_fixup_iov_resources().
2. Shift the IOV BAR properly.
   Done by pnv_pci_vf_resource_shift().
3. Calculate the IOV BAR alignment in an arch-dependent function instead of
   from an individual VF BAR size.
   Done by pnv_pcibios_sriov_resource_alignment().
4. Take the IOV BAR alignment into consideration in the sizing and assigning.
   This is achieved by commit: "PCI: Take additional IOV BAR alignment in
   sizing and assigning"
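
For illustration only (the numbers are made up, not taken from the tested
hardware): suppose a PF has one IOV BAR whose per-VF size is 1MB with
total_VFs = 16, on a PHB whose M64 BAR supports total_pe = 256 segments.
The IOV BAR would normally be 16MB; here it is expanded to 256 * 1MB = 256MB
so that each M64 segment maps exactly one VF-sized piece to one PE. And
since an M64 BAR must be size-aligned, the IOV BAR's alignment becomes 256MB
instead of the 1MB VF BAR size.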

Test Environment:
       The SRIOV devices tested are Emulex Lancer (10df:e220) and
       Mellanox ConnectX-3 (15b3:1003) on POWER8.

Example of passing a VF through to a guest with vfio:
	1. Unbind the original driver and bind to the vfio-pci driver
	   echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
	   echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
	   Note: this should be done for each device in the same iommu_group
	2. Start qemu and pass the device through with vfio
	   /home/ywywyang/git/qemu-impreza/ppc64-softmmu/qemu-system-ppc64 \
		   -M pseries -m 2048 -enable-kvm -nographic \
		   -drive file=/home/ywywyang/kvm/fc19.img \
		   -monitor telnet:localhost:5435,server,nowait -boot cd \
		   -device "spapr-pci-vfio-host-bridge,id=CXGB3,iommu=26,index=6"

Verify that the guest really sees this exact VF:
	1. ping from a machine in the same subnet (the broadcast domain)
	2. run arp -n on that machine
	   9.115.251.20             ether   00:00:c9:df:ed:bf   C eth0
	3. run ifconfig in the guest
	   # ifconfig eth1
	   eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
	        inet 9.115.251.20  netmask 255.255.255.0  broadcast 9.115.251.255
	        inet6 fe80::200:c9ff:fedf:edbf  prefixlen 64  scopeid 0x20<link>
	        ether 00:00:c9:df:ed:bf  txqueuelen 1000 (Ethernet)
	        RX packets 175  bytes 13278 (12.9 KiB)
	        RX errors 0  dropped 0  overruns 0  frame 0
	        TX packets 58  bytes 9276 (9.0 KiB)
	        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
	4. Check that they show the same MAC address.

	Note: make sure you shut down the other network interfaces in the guest.

---
v9:
   * make the change log terminology consistent:
     PF's IOV BAR -> the SRIOV BAR in PF
     VF's BAR -> the normal BAR in VF's view
   * rename all newly introduced functions from _sriov_ to _iov_
   * rename the document to Documentation/powerpc/pci_iov_resource_on_powernv.txt
   * add the vendor id and device id of the tested devices
   * change the return value from -EINVAL to -ENOSYS for pci_iov_virtfn_bus()
     and pci_iov_virtfn_devfn() when called on a PF or when SRIOV is not
     configured
   * rebase on 3.18-rc2 and tested
v8:
   * use the weak function pcibios_sriov_resource_size() instead of a flag to
     retrieve the IOV BAR size.
   * add a document Documentation/powerpc/pci_resource.txt to explain the
     design.
   * make pci_iov_virtfn_bus()/pci_iov_virtfn_devfn() not inline.
   * extract a function res_to_dev_res(), making it more general for
     retrieving the additional size and alignment
   * fix a lock contention introduced in "powerpc/pci: Refactor pci_dn".
     The root cause is that pci_get_slot() takes pci_bus_sem and leads to a
     deadlock.
v7:
   * add IORESOURCE_ARCH flag for IOV BAR on powernv platform.
   * when the IOV BAR has the IORESOURCE_ARCH flag, its size is retrieved
     directly from hardware. If not, it is calculated as usual.
   * reorder the patch set, group them by subsystem:
     PCI, powerpc, powernv
   * rebase it on 3.16-rc6
v6:
   * remove the pcibios_enable_sriov()/pcibios_disable_sriov() weak functions;
     similar functionality is moved to
     pnv_pci_enable_device_hook()/pnv_pci_disable_device_hook(). When the PF
     is enabled, the platform will try its best to allocate resources for VFs.
   * remove the pcibios_sriov_resource_size weak function
   * the VF BAR size is retrieved directly from hardware in virtfn_add()
v5:
   * merge the SRIOV-related platform functions into machdep_calls and
     wrap them in one CONFIG_PCI_IOV macro
   * define IODA_INVALID_M64 to replace (-1)
     use this value to indicate that an m64_wins entry is not used
   * rename pnv_pci_release_dev_dma() to pnv_pci_ioda2_release_dma_pe();
     this function is the counterpart of pnv_pci_ioda2_setup_dma_pe()
   * change dev_info() to dev_dbg() in pnv_pci_ioda_fixup_iov_resources()
     to reduce kernel log noise
   * release M64 window in pnv_pci_ioda2_release_dma_pe()
v4:
   * code format fixes, e.g. not exceeding 80 chars
   * in commit "ppc/pnv: Add function to deconfig a PE"
     check that the bus has a bridge before printing its name
     remove a PE from its own PELTV
   * change the function name for sriov resource size/alignment
   * rebase on 3.16-rc3
   * VFs no longer rely on device nodes
     Per Grant Likely's comments, the kernel should be able to handle the
     lack of a device_node gracefully. Gavin restructured pci_dn so that a
     VF has a pci_dn even when the VF's device_node is not provided by
     firmware.
   * clean up all the patch titles to follow one style
   * fix return value for pci_iov_virtfn_bus/pci_iov_virtfn_devfn
v3:
   * change the return type of virtfn_bus/virtfn_devfn to int
     rename these two functions to pci_iov_virtfn_bus/pci_iov_virtfn_devfn
   * remove the second parameter of pcibios_sriov_disable()
   * use data instead of pe in "ppc/pnv: allocate pe->iommu_table dynamically"
   * rename __pci_sriov_resource_size to pcibios_sriov_resource_size
   * rename __pci_sriov_resource_alignment to pcibios_sriov_resource_alignment
v2:
   * change the return value of virtfn_bus/virtfn_devfn to 0
   * move some TCE-related macro definitions to
     arch/powerpc/platforms/powernv/pci.h
   * fix __pci_sriov_resource_alignment() on the powernv platform
     During the sizing stage, the IOV BAR is truncated to 0, which affects
     the allocation order. Fix it so that BARs are allocated in order of
     their alignment.
v1:
   * improve the change log for
     "PCI: Add weak __pci_sriov_resource_size() interface"
     "PCI: Add weak __pci_sriov_resource_alignment() interface"
     "PCI: take additional IOV BAR alignment in sizing and assigning"
   * wrap VF PE code in CONFIG_PCI_IOV
   * ran a regression test on P7.

Gavin Shan (1):
  powerpc/pci: Refactor pci_dn

Wei Yang (17):
  PCI/IOV: Export interface to retrieve VF's BDF
  PCI: Add weak pcibios_iov_resource_alignment() interface
  PCI: Add weak pcibios_iov_resource_size() interface
  PCI: Take additional PF's IOV BAR alignment in sizing and assigning
  powerpc/pci: Add PCI resource alignment documentation
  powerpc/pci: Don't unset pci resources for VFs
  powerpc/pci: Define pcibios_disable_device() on powerpc
  powerpc/pci: remove pci_dn->pcidev field
  powerpc/powernv: Use pci_dn in PCI config accessor
  powerpc/powernv: Allocate pe->iommu_table dynamically
  powerpc/powernv: Expand VF resources according to the number of
    total_pe
  powerpc/powernv: Implement pcibios_iov_resource_alignment() on
    powernv
  powerpc/powernv: Implement pcibios_iov_resource_size() on powernv
  powerpc/powernv: Shift VF resource with an offset
  powerpc/powernv: Allocate VF PE
  powerpc/powernv: Expanding IOV BAR, with m64_per_iov supported
  powerpc/powernv: Group VF PE when IOV BAR is big on PHB3

 .../powerpc/pci_iov_resource_on_powernv.txt        |   75 ++
 arch/powerpc/include/asm/device.h                  |    3 +
 arch/powerpc/include/asm/iommu.h                   |    3 +
 arch/powerpc/include/asm/machdep.h                 |   13 +-
 arch/powerpc/include/asm/pci-bridge.h              |   24 +-
 arch/powerpc/kernel/pci-common.c                   |   39 +
 arch/powerpc/kernel/pci-hotplug.c                  |    3 +
 arch/powerpc/kernel/pci_dn.c                       |  257 ++++++-
 arch/powerpc/platforms/powernv/eeh-powernv.c       |   14 +-
 arch/powerpc/platforms/powernv/pci-ioda.c          |  744 +++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c               |   87 +--
 arch/powerpc/platforms/powernv/pci.h               |   13 +-
 drivers/pci/iov.c                                  |   60 +-
 drivers/pci/setup-bus.c                            |   85 ++-
 include/linux/pci.h                                |   19 +
 15 files changed, 1332 insertions(+), 107 deletions(-)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

-- 
1.7.9.5


* [PATCH V9 01/18] PCI/IOV: Export interface to retrieve VF's BDF
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

When implementing SR-IOV on the PowerNV platform, some resources must be
reserved for VFs, which don't exist at boot time. To match resources with
VFs, the code needs to get a VF's BDF in advance.

This patch exports the interfaces that retrieve a VF's BDF:
   * Make virtfn_bus() an exported interface
   * Make virtfn_devfn() an exported interface
   * Rename them with more specific names
   * Code cleanup in pci_sriov_resource_alignment()
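
A minimal usage sketch (not part of this patch; it assumes an SR-IOV capable
PF whose VF offset/stride have already been set up) of how a platform could
walk the BDFs the VFs will occupy:

/* Illustrative only: print the BDF each VF of this PF would get. */
static void show_vf_bdfs(struct pci_dev *pf)
{
	int i, busnr, devfn;

	for (i = 0; i < pci_sriov_get_totalvfs(pf); i++) {
		busnr = pci_iov_virtfn_bus(pf, i);
		devfn = pci_iov_virtfn_devfn(pf, i);
		if (busnr < 0 || devfn < 0)
			return;	/* not a PF, or SRIOV not configured */
		dev_info(&pf->dev, "VF#%d -> %02x:%02x.%d\n",
			 i, busnr, PCI_SLOT(devfn), PCI_FUNC(devfn));
	}
}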

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   22 +++++++++++++---------
 include/linux/pci.h |   11 +++++++++++
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4d109c0..5e8091b 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -19,14 +19,18 @@
 
 #define VIRTFN_ID_LEN	16
 
-static inline u8 virtfn_bus(struct pci_dev *dev, int id)
+int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
 				    dev->sriov->stride * id) >> 8);
 }
 
-static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * id) & 0xff;
 }
@@ -69,7 +73,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	struct pci_bus *bus;
 
 	mutex_lock(&iov->dev->sriov->lock);
-	bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
+	bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
 	if (!bus)
 		goto failed;
 
@@ -77,7 +81,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	if (!virtfn)
 		goto failed0;
 
-	virtfn->devfn = virtfn_devfn(dev, id);
+	virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
 	virtfn->vendor = dev->vendor;
 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
 	pci_setup_device(virtfn);
@@ -140,8 +144,8 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 	struct pci_sriov *iov = dev->sriov;
 
 	virtfn = pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
-					     virtfn_bus(dev, id),
-					     virtfn_devfn(dev, id));
+					     pci_iov_virtfn_bus(dev, id),
+					     pci_iov_virtfn_devfn(dev, id));
 	if (!virtfn)
 		return;
 
@@ -216,7 +220,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	iov->offset = offset;
 	iov->stride = stride;
 
-	if (virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
+	if (pci_iov_virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
 		dev_err(&dev->dev, "SR-IOV: bus number out of range\n");
 		return -ENOMEM;
 	}
@@ -516,7 +520,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 	if (!reg)
 		return 0;
 
-	 __pci_read_base(dev, type, &tmp, reg);
+	__pci_read_base(dev, type, &tmp, reg);
 	return resource_alignment(&tmp);
 }
 
@@ -546,7 +550,7 @@ int pci_iov_bus_range(struct pci_bus *bus)
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (!dev->is_physfn)
 			continue;
-		busnr = virtfn_bus(dev, dev->sriov->total_VFs - 1);
+		busnr = pci_iov_virtfn_bus(dev, dev->sriov->total_VFs - 1);
 		if (busnr > max)
 			max = busnr;
 	}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 5be8db4..3ed7c66 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1654,6 +1654,9 @@ int pci_ext_cfg_avail(void);
 void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
 
 #ifdef CONFIG_PCI_IOV
+int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
+
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 int pci_num_vf(struct pci_dev *dev);
@@ -1661,6 +1664,14 @@ int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
 #else
+static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
+static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
-- 
1.7.9.5


* [PATCH V9 02/18] PCI: Add weak pcibios_iov_resource_alignment() interface
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

The alignment of a PF's IOV BAR is designed to be the size of an individual
VF BAR. This works fine for many platforms, but on the PowerNV platform it
needs some change.

The original alignment works since, at the sizing and assigning stage, the
requirement comes from an individual VF BAR size rather than from the PF's
whole IOV BAR. This is why the original code just uses the individual VF BAR
size as the alignment.

On the PowerNV platform, the whole PF IOV BAR must be aligned to a hardware
segment. Because of this, the alignment of the PF's IOV BAR has to be
calculated separately.

This patch introduces a weak pcibios_iov_resource_alignment() interface,
which gives platforms a chance to implement a specific method to calculate
the PF's IOV BAR alignment.
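
As a sketch only (an assumption for illustration; the real PowerNV
implementation arrives later in this series), a platform override could
simply demand that the whole IOV BAR be aligned to its own size so it can
sit on a hardware (M64) segment boundary:

/* Hypothetical platform override of the weak function above. */
resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev,
					       int resno,
					       resource_size_t align)
{
	/* align the entire IOV BAR, not just one VF BAR */
	return resource_size(&pdev->resource[resno]);
}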

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   11 ++++++++++-
 include/linux/pci.h |    3 +++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5e8091b..4d1685d 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -501,6 +501,12 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno,
 		4 * (resno - PCI_IOV_RESOURCES);
 }
 
+resource_size_t __weak pcibios_iov_resource_alignment(struct pci_dev *dev,
+		int resno, resource_size_t align)
+{
+	return align;
+}
+
 /**
  * pci_sriov_resource_alignment - get resource alignment for VF BAR
  * @dev: the PCI device
@@ -515,13 +521,16 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 {
 	struct resource tmp;
 	enum pci_bar_type type;
+	resource_size_t align;
 	int reg = pci_iov_resource_bar(dev, resno, &type);
 
 	if (!reg)
 		return 0;
 
 	__pci_read_base(dev, type, &tmp, reg);
-	return resource_alignment(&tmp);
+	align = resource_alignment(&tmp);
+
+	return pcibios_iov_resource_alignment(dev, resno, align);
 }
 
 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3ed7c66..bbf8058 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1159,6 +1159,9 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 void pci_setup_bridge(struct pci_bus *bus);
 resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 					 unsigned long type);
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev,
+						 int resno,
+						 resource_size_t align);
 
 #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
 #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
-- 
1.7.9.5


* [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

When setting up a VF's BARs in virtfn_add(), the code divides the PF's total
IOV BAR size by the total_VFs number. This is true for most cases, but may
not be correct on some specific platforms.

For example, on the PowerNV platform the PF's IOV BAR size is expanded in
order to meet a hardware alignment, which means the original method no
longer works.

This patch introduces a weak pcibios_iov_resource_size() interface, which
gives platforms a chance to implement a specific method to calculate the VF
BAR resource size.
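
A sketch of a platform override (hypothetical: pnv_vf_bar_size() is an
invented stand-in for wherever the platform recorded the real per-VF size
when it expanded the IOV BAR):

/* Return the per-VF size saved at IOV BAR expansion time, or 0 to
 * fall back to the generic size/total_VFs calculation above. */
resource_size_t pcibios_iov_resource_size(struct pci_dev *pdev, int resno)
{
	return pnv_vf_bar_size(pdev, resno - PCI_IOV_RESOURCES);
}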

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   27 +++++++++++++++++++++++++--
 include/linux/pci.h |    5 +++++
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4d1685d..6866830 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -61,6 +61,30 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
 		pci_remove_bus(virtbus);
 }
 
+resource_size_t __weak pcibios_iov_resource_size(struct pci_dev *dev, int resno)
+{
+	return 0;
+}
+
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{
+	resource_size_t size;
+	struct pci_sriov *iov;
+
+	if (!dev->is_physfn)
+		return 0;
+
+	size = pcibios_iov_resource_size(dev, resno);
+	if (size != 0)
+		return size;
+
+	iov = dev->sriov;
+	size = resource_size(dev->resource + resno);
+	do_div(size, iov->total_VFs);
+
+	return size;
+}
+
 static int virtfn_add(struct pci_dev *dev, int id, int reset)
 {
 	int i;
@@ -96,8 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 			continue;
 		virtfn->resource[i].name = pci_name(virtfn);
 		virtfn->resource[i].flags = res->flags;
-		size = resource_size(res);
-		do_div(size, iov->total_VFs);
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
 		virtfn->resource[i].start = res->start + size * id;
 		virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
 		rc = request_resource(res, &virtfn->resource[i]);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index bbf8058..2f5b454 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1162,6 +1162,8 @@ resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev,
 						 int resno,
 						 resource_size_t align);
+resource_size_t pcibios_iov_resource_size(struct pci_dev *dev,
+		                            int resno);
 
 #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
 #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
@@ -1666,6 +1668,7 @@ int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 #else
 static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
 {
@@ -1685,6 +1688,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
 { return 0; }
 static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 { return 0; }
+static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{ return 0; }
 #endif
 
 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
-- 
1.7.9.5


* [PATCH V9 04/18] PCI: Take additional PF's IOV BAR alignment in sizing and assigning
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

At the resource sizing/assigning stage, resources are divided into two
lists, a requested list and an additional list, but the alignment of the
additional IOV BAR is not taken into the sizing and assigning procedure.

This was reasonable in the original implementation, since the IOV BAR's
alignment is usually just a VF BAR's alignment, which means the alignment
is already taken into consideration. This rule may be violated on some
platforms though, e.g. the PowerNV platform.

This patch takes the additional IOV BAR alignment into the sizing and
assigning stage explicitly. When system MMIO space is not enough, the PF's
IOV BAR alignment will not contribute to the bridge; when system MMIO space
is enough, the additional alignment will contribute to the bridge.

It also takes advantage of pci_dev_resource::min_align to store this
additional alignment.
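
To make the effect concrete with purely illustrative numbers: with a 1MB VF
BAR, total_VFs = 16 and total_pe = 256, the expanded IOV BAR is 256MB and
must be 256MB-aligned. Without add_align, the bridge window above it would
only be aligned to the usual min_align, so the 256MB requirement is recorded
in pci_dev_resource::min_align and fed into calculate_memsize() when MMIO
space allows.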

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/setup-bus.c |   85 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 71 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 0482235..05c7df0 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -99,8 +99,8 @@ static void remove_from_list(struct list_head *head,
 	}
 }
 
-static resource_size_t get_res_add_size(struct list_head *head,
-					struct resource *res)
+static struct pci_dev_resource *res_to_dev_res(struct list_head *head,
+					       struct resource *res)
 {
 	struct pci_dev_resource *dev_res;
 
@@ -109,17 +109,37 @@ static resource_size_t get_res_add_size(struct list_head *head,
 			int idx = res - &dev_res->dev->resource[0];
 
 			dev_printk(KERN_DEBUG, &dev_res->dev->dev,
-				 "res[%d]=%pR get_res_add_size add_size %llx\n",
+				 "res[%d]=%pR res_to_dev_res add_size %llx min_align %llx\n",
 				 idx, dev_res->res,
-				 (unsigned long long)dev_res->add_size);
+				 (unsigned long long)dev_res->add_size,
+				 (unsigned long long)dev_res->min_align);
 
-			return dev_res->add_size;
+			return dev_res;
 		}
 	}
 
-	return 0;
+	return NULL;
+}
+
+static resource_size_t get_res_add_size(struct list_head *head,
+					struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->add_size : 0;
+}
+
+static resource_size_t get_res_add_align(struct list_head *head,
+		struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->min_align : 0;
 }
 
+
 /* Sort resources by alignment */
 static void pdev_sort_resources(struct pci_dev *dev, struct list_head *head)
 {
@@ -368,8 +388,9 @@ static void __assign_resources_sorted(struct list_head *head,
 	LIST_HEAD(save_head);
 	LIST_HEAD(local_fail_head);
 	struct pci_dev_resource *save_res;
-	struct pci_dev_resource *dev_res, *tmp_res;
+	struct pci_dev_resource *dev_res, *tmp_res, *dev_res2;
 	unsigned long fail_type;
+	resource_size_t add_align, align;
 
 	/* Check if optional add_size is there */
 	if (!realloc_head || list_empty(realloc_head))
@@ -384,10 +405,38 @@ static void __assign_resources_sorted(struct list_head *head,
 	}
 
 	/* Update res in head list with add_size in realloc_head list */
-	list_for_each_entry(dev_res, head, list)
+	list_for_each_entry_safe(dev_res, tmp_res, head, list) {
 		dev_res->res->end += get_res_add_size(realloc_head,
 							dev_res->res);
 
+		/*
+		 * There are two kinds of additional resources in the list:
+		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
+		 * 2. SRIOV resource   -- IORESOURCE_SIZEALIGN
+		 * Here just fix the additional alignment for bridge
+		 */
+		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
+			continue;
+
+		add_align = get_res_add_align(realloc_head, dev_res->res);
+
+		/* Reorder the list by their alignment */
+		if (add_align > dev_res->res->start) {
+			dev_res->res->start = add_align;
+			dev_res->res->end = add_align +
+				            resource_size(dev_res->res);
+
+			list_for_each_entry(dev_res2, head, list) {
+				align = pci_resource_alignment(dev_res2->dev,
+							       dev_res2->res);
+				if (add_align > align)
+					list_move_tail(&dev_res->list,
+						       &dev_res2->list);
+			}
+		}
+
+	}
+
 	/* Try updated head list with add_size added */
 	assign_requested_resources_sorted(head, &local_fail_head);
 
@@ -930,6 +979,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	struct resource *b_res = find_free_bus_resource(bus,
 					mask | IORESOURCE_PREFETCH, type);
 	resource_size_t children_add_size = 0;
+	resource_size_t children_add_align = 0;
+	resource_size_t add_align = 0;
 
 	if (!b_res)
 		return -ENOSPC;
@@ -954,6 +1005,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			/* put SRIOV requested res to the optional list */
 			if (realloc_head && i >= PCI_IOV_RESOURCES &&
 					i <= PCI_IOV_RESOURCE_END) {
+				add_align = max(pci_resource_alignment(dev, r), add_align);
 				r->end = r->start - 1;
 				add_to_list(realloc_head, dev, r, r_size, 0/* don't care */);
 				children_add_size += r_size;
@@ -984,19 +1036,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			if (order > max_order)
 				max_order = order;
 
-			if (realloc_head)
+			if (realloc_head) {
 				children_add_size += get_res_add_size(realloc_head, r);
+				children_add_align = get_res_add_align(realloc_head, r);
+				add_align = max(add_align, children_add_align);
+			}
 		}
 	}
 
 	min_align = calculate_mem_align(aligns, max_order);
 	min_align = max(min_align, window_alignment(bus, b_res->flags));
 	size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), min_align);
+	add_align = max(min_align, add_align);
 	if (children_add_size > add_size)
 		add_size = children_add_size;
 	size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
 		calculate_memsize(size, min_size, add_size,
-				resource_size(b_res), min_align);
+				resource_size(b_res), add_align);
 	if (!size0 && !size1) {
 		if (b_res->start || b_res->end)
 			dev_info(&bus->self->dev, "disabling bridge window %pR to %pR (unused)\n",
@@ -1008,10 +1064,11 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	b_res->end = size0 + min_align - 1;
 	b_res->flags |= IORESOURCE_STARTALIGN;
 	if (size1 > size0 && realloc_head) {
-		add_to_list(realloc_head, bus->self, b_res, size1-size0, min_align);
-		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx\n",
-			   b_res, &bus->busn_res,
-			   (unsigned long long)size1-size0);
+		add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
+		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window "
+				 "%pR to %pR add_size %llx add_align %llx\n", b_res,
+				 &bus->busn_res, (unsigned long long)size1-size0,
+				 add_align);
 	}
 	return 0;
 }
-- 
1.7.9.5


* [PATCH V9 05/18] powerpc/pci: Add PCI resource alignment documentation
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

In order to enable SRIOV on the PowerNV platform, the PF's IOV BAR needs to
be adjusted:
    1. size expanded
    2. aligned to the M64BT size

This patch documents why this change is needed and how it is done.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 .../powerpc/pci_iov_resource_on_powernv.txt        |   75 ++++++++++++++++++++
 1 file changed, 75 insertions(+)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
new file mode 100644
index 0000000..8b3f346
--- /dev/null
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -0,0 +1,75 @@
+Wei Yang <weiyang@linux.vnet.ibm.com>
+26 Aug 2014
+
+This document describes the hardware requirements for PCI MMIO resource
+sizing and assignment on the PowerNV platform and how the generic PCI code
+handles these requirements.
+
+1. Hardware requirements on the PowerNV platform
+The PowerNV platform (IODA2 version) has 16 M64 BARs, which are used to map
+MMIO ranges to PE#. Each M64 BAR covers one MMIO range, and this range is
+divided evenly into *total_pe* pieces, one piece corresponding to one PE.
+
+We decided to leverage the M64 BARs to map VFs to individual PEs, since the
+BARs of SRIOV VFs all share the same size.
+
+Doing so introduces another problem: *total_pe* is usually bigger than
+total_VFs. If we map one IOV BAR directly to one M64 BAR, some part of the
+M64 BAR will map to another device's MMIO range.
+
+     0      1                     total_VFs - 1
+     +------+------+-     -+------+------+
+     |      |      |  ...  |      |      |
+     +------+------+-     -+------+------+
+
+                           IOV BAR
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           M64 BAR
+
+		Figure 1.0 Direct map IOV BAR
+
+As Figure 1.0 indicates, the range [total_VFs, total_pe - 1] of the M64 BAR
+may map to the MMIO range of some other device.
+
+The solution we currently have is to expand the IOV BAR to *total_pe* pieces.
+
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           IOV BAR
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           M64 BAR
+
+		Figure 1.1 Map expanded IOV BAR
+
+Expanding the IOV BAR ensures that the whole M64 range will not affect
+other devices.
+
+2. How the generic PCI code handles it
+So far this looks workable, but another problem arises. The M64 BAR start
+address needs to be size-aligned, while the original generic PCI code
+assigns the IOV BAR aligned only to the individual VF BAR size.
+
+Since an SRIOV VF BAR usually has the same size as its PF BAR, the original
+generic PCI code did not need to count the IOV BAR alignment separately (it
+is the same as the PF's). With the PowerNV change, the alignment of the IOV
+BAR becomes its total size, so we need to count it in.
+
+From:
+	alignment(IOV BAR) = size(VF BAR) = size(PF BAR)
+To:
+	alignment(IOV BAR) = size(IOV BAR)
+
+The commit "PCI: Take additional IOV BAR alignment in sizing and assigning"
+introduces add_align to track the alignment required by the IOV BAR and
+uses it to meet this requirement.
-- 
1.7.9.5


* [PATCH V9 06/18] powerpc/pci: Don't unset pci resources for VFs
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

If we're going to reassign resources with the flag PCI_REASSIGN_ALL_RSRC,
all resources will be cleaned out at device header fixup time and then get
reassigned by the PCI core. However, the VF resources won't be reassigned,
so we shouldn't clean them out.

This patch adds a condition: if the pci_dev is a VF, skip the resource
unset process.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-common.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index e5dad9a..399d813 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -789,6 +789,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
 		       pci_name(dev));
 		return;
 	}
+
+	if (dev->is_virtfn)
+		return;
+
 	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
 		struct resource *res = dev->resource + i;
 		struct pci_bus_region reg;
-- 
1.7.9.5


* [PATCH V9 07/18] powerpc/pci: Define pcibios_disable_device() on powerpc
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

When a driver disables a pci_dev, pci_disable_device() calls
pcibios_disable_device(), which is platform-dependent. This gives
flexibility to platforms.

This patch defines this weak function for the powerpc architecture.
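
A sketch of how a platform might wire up the hook (assumed usage, not part
of this patch; the real powernv hook appears later in the series):

/* Hypothetical: tear down platform state (e.g. VF PEs) on disable. */
static void pnv_pci_disable_device_hook(struct pci_dev *pdev)
{
	/* release resources set up by the enable-time hook */
}

void __init pnv_pci_init(void)
{
	/* ... existing initialization ... */
	ppc_md.pcibios_disable_device_hook = pnv_pci_disable_device_hook;
}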

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h |    5 ++++-
 arch/powerpc/kernel/pci-common.c   |    8 ++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 307347f..8242262 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -240,7 +240,10 @@ struct machdep_calls {
 
 	/* Called when pci_enable_device() is called. Returns 0 to
 	 * allow assignment/enabling of the device. */
-	int  (*pcibios_enable_device_hook)(struct pci_dev *);
+	int (*pcibios_enable_device_hook)(struct pci_dev *);
+
+	/* Called when pci_disable_device() is called. */
+	void (*pcibios_disable_device_hook)(struct pci_dev *);
 
 	/* Called after scan and before resource survey */
 	void (*pcibios_fixup_phb)(struct pci_controller *hose);
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 399d813..17acfa7 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1452,6 +1452,14 @@ int pcibios_enable_device(struct pci_dev *dev, int mask)
 	return pci_enable_resources(dev, mask);
 }
 
+void pcibios_disable_device(struct pci_dev *dev)
+{
+	if (ppc_md.pcibios_disable_device_hook)
+		ppc_md.pcibios_disable_device_hook(dev);
+
+	return;
+}
+
 resource_size_t pcibios_io_space_offset(struct pci_controller *hose)
 {
 	return (unsigned long) hose->io_base_virt - _IO_BASE;
-- 
1.7.9.5


* [PATCH V9 08/18] powerpc/pci: Refactor pci_dn
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev

From: Gavin Shan <gwshan@linux.vnet.ibm.com>

pci_dn is the extension of a PCI device node and is created from the
device node. Unfortunately, VFs are enabled dynamically by the PF's
driver and don't have corresponding device nodes or pci_dn structures.
This patch refactors pci_dn to support VFs:

   * pci_dn is organized as a hierarchical tree. A VF's pci_dn is put
     on the child list of the pci_dn of the PF's bridge; any other
     device's pci_dn is put on the child list of the pci_dn of its
     upstream bridge.

   * A VF's pci_dn is expected to be created dynamically when the
     final fixups are applied to the PF, and destroyed when the PF's
     pci_dev instance is released. The pci_dn of any other device is
     still created from its device node as before.

   * For one particular PCI device (VF or not), its pci_dn can be
     found via pdev->dev.archdata.firmware_data, PCI_DN(devnode),
     or the parent's child list. The fast path (fetching pci_dn
     through the PCI device instance) is populated at early fixup
     time.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/device.h     |    3 +
 arch/powerpc/include/asm/pci-bridge.h |   14 +-
 arch/powerpc/kernel/pci-hotplug.c     |    3 +
 arch/powerpc/kernel/pci_dn.c          |  246 ++++++++++++++++++++++++++++++++-
 4 files changed, 261 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h
index 38faede..29992cd 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -34,6 +34,9 @@ struct dev_archdata {
 #ifdef CONFIG_SWIOTLB
 	dma_addr_t		max_direct_dma_addr;
 #endif
+#ifdef CONFIG_PPC64
+	void			*firmware_data;
+#endif
 #ifdef CONFIG_EEH
 	struct eeh_dev		*edev;
 #endif
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 4ca90a3..757d7bb 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -89,6 +89,7 @@ struct pci_controller {
 
 #ifdef CONFIG_PPC64
 	unsigned long buid;
+	void *firmware_data;
 #endif	/* CONFIG_PPC64 */
 
 	void *private_data;
@@ -150,9 +151,13 @@ static inline int isa_vaddr_is_ioport(void __iomem *address)
 struct iommu_table;
 
 struct pci_dn {
+	int     flags;
+#define PCI_DN_FLAG_IOV_VF     0x01
+
 	int	busno;			/* pci bus number */
 	int	devfn;			/* pci device and function number */
 
+	struct  pci_dn *parent;
 	struct  pci_controller *phb;	/* for pci devices */
 	struct	iommu_table *iommu_table;	/* for phb's or bridges */
 	struct	device_node *node;	/* back-pointer to the device_node */
@@ -169,14 +174,19 @@ struct pci_dn {
 #ifdef CONFIG_PPC_POWERNV
 	int	pe_number;
 #endif
+	struct list_head child_list;
+	struct list_head list;
 };
 
 /* Get the pointer to a device_node's pci_dn */
 #define PCI_DN(dn)	((struct pci_dn *) (dn)->data)
 
+extern struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+					   int devfn);
 extern struct pci_dn *pci_get_pdn(struct pci_dev *pdev);
-
-extern void * update_dn_pci_info(struct device_node *dn, void *data);
+extern struct pci_dn *add_dev_pci_info(struct pci_dev *pdev);
+extern void remove_dev_pci_info(struct pci_dev *pdev);
+extern void *update_dn_pci_info(struct device_node *dn, void *data);
 
 static inline int pci_device_from_OF_node(struct device_node *np,
 					  u8 *bus, u8 *devfn)
diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
index 5b78917..af60efe 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -30,6 +30,9 @@
 void pcibios_release_device(struct pci_dev *dev)
 {
 	eeh_remove_device(dev);
+
+	/* Release firmware data */
+	remove_dev_pci_info(dev);
 }
 
 /**
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 1f61fab..fa966ae 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -32,12 +32,222 @@
 #include <asm/ppc-pci.h>
 #include <asm/firmware.h>
 
+/*
+ * The function is used to find the firmware data of one
+ * specific PCI device, which is attached to the indicated
+ * PCI bus. For VFs, their firmware data is linked to that
+ * one of PF's bridge. For other devices, their firmware
+ * data is linked to that of their bridge.
+ */
+static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus)
+{
+	struct pci_bus *pbus;
+	struct device_node *dn;
+	struct pci_dn *pdn;
+
+	/*
+	 * We probably have virtual bus which doesn't
+	 * have associated bridge.
+	 */
+	pbus = bus;
+	while (pbus) {
+		if (pci_is_root_bus(pbus) || pbus->self)
+			break;
+
+		pbus = pbus->parent;
+	}
+
+	/*
+	 * Except virtual bus, all PCI buses should
+	 * have device nodes.
+	 */
+	dn = pci_bus_to_OF_node(pbus);
+	pdn = dn ? PCI_DN(dn) : NULL;
+
+	return pdn;
+}
+
+struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+				    int devfn)
+{
+	struct device_node *dn = NULL;
+	struct pci_dn *parent, *pdn;
+	struct pci_dev *pdev = NULL;
+
+	/* Fast path: fetch from PCI device */
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		if (pdev->devfn == devfn) {
+			if (pdev->dev.archdata.firmware_data)
+				return pdev->dev.archdata.firmware_data;
+
+			dn = pci_device_to_OF_node(pdev);
+			break;
+		}
+	}
+
+	/* Fast path: fetch from device node */
+	pdn = dn ? PCI_DN(dn) : NULL;
+	if (pdn)
+		return pdn;
+
+	/* Slow path: fetch from firmware data hierarchy */
+	parent = pci_bus_to_pdn(bus);
+	if (!parent)
+		return NULL;
+
+	list_for_each_entry(pdn, &parent->child_list, list) {
+		if (pdn->busno == bus->number &&
+                    pdn->devfn == devfn)
+		    pdn->devfn == devfn)
+			return pdn;
+	}
+	return NULL;
+}
+
 struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
 {
-	struct device_node *dn = pci_device_to_OF_node(pdev);
-	if (!dn)
+	struct device_node *dn;
+	struct pci_dn *parent, *pdn;
+
+	/* Search device directly */
+	if (pdev->dev.archdata.firmware_data)
+		return pdev->dev.archdata.firmware_data;
+
+	/* Check device node */
+	dn = pci_device_to_OF_node(pdev);
+	pdn = dn ? PCI_DN(dn) : NULL;
+	if (pdn)
+		return pdn;
+
+	/*
+	 * VFs don't have device nodes. We hook their
+	 * firmware data to PF's bridge.
+	 */
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
 		return NULL;
-	return PCI_DN(dn);
+
+	list_for_each_entry(pdn, &parent->child_list, list) {
+		if (pdn->busno == pdev->bus->number &&
+		    pdn->devfn == pdev->devfn)
+			return pdn;
+	}
+
+	return NULL;
+}
+
+static struct pci_dn *add_one_dev_pci_info(struct pci_dn *parent,
+					   struct pci_dev *pdev,
+					   int busno, int devfn)
+{
+	struct pci_dn *pdn;
+
+	/* Except PHB, we always have parent firmware data */
+	if (!parent)
+		return NULL;
+
+	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
+	if (!pdn) {
+		pr_warn("%s: Out of memory !\n", __func__);
+		return NULL;
+	}
+
+	pdn->phb = parent->phb;
+	pdn->parent = parent;
+	pdn->busno = busno;
+	pdn->devfn = devfn;
+#ifdef CONFIG_PPC_POWERNV
+	pdn->pe_number = IODA_INVALID_PE;
+#endif
+	INIT_LIST_HEAD(&pdn->child_list);
+	INIT_LIST_HEAD(&pdn->list);
+	list_add_tail(&pdn->list, &parent->child_list);
+
+	/*
+	 * If we already have PCI device instance, lets
+	 * bind them.
+	 */
+	if (pdev)
+		pdev->dev.archdata.firmware_data = pdn;
+
+	return pdn;
+}
+
+struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
+{
+#ifdef CONFIG_PCI_IOV
+	struct pci_dn *parent, *pdn;
+	int i;
+
+	/* Only support IOV for now */
+	if (!pdev->is_physfn)
+		return pci_get_pdn(pdev);
+
+	/* Check if VFs have been populated */
+	pdn = pci_get_pdn(pdev);
+	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
+		return NULL;
+
+	pdn->flags |= PCI_DN_FLAG_IOV_VF;
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
+		return NULL;
+
+	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
+		pdn = add_one_dev_pci_info(parent, NULL,
+					   pci_iov_virtfn_bus(pdev, i),
+					   pci_iov_virtfn_devfn(pdev, i));
+		if (!pdn) {
+			pr_warn("%s: Cannot create firmware data "
+				"for VF#%d of %s\n",
+				__func__, i, pci_name(pdev));
+			return NULL;
+		}
+	}
+#endif
+
+	return pci_get_pdn(pdev);
+}
+
+void remove_dev_pci_info(struct pci_dev *pdev)
+{
+#ifdef CONFIG_PCI_IOV
+	struct pci_dn *parent;
+	struct pci_dn *pdn, *tmp;
+	int i;
+
+	/* Only support IOV PF for now */
+	if (!pdev->is_physfn)
+		return;
+
+	/* Check if VFs have been populated */
+	pdn = pci_get_pdn(pdev);
+	if (!pdn || !(pdn->flags & PCI_DN_FLAG_IOV_VF))
+		return;
+
+	pdn->flags &= ~PCI_DN_FLAG_IOV_VF;
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
+		return;
+
+	/*
+	 * We might introduce flag to pci_dn in future
+	 * so that we can release VF's firmware data in
+	 * a batch mode.
+	 */
+	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
+		list_for_each_entry_safe(pdn, tmp,
+			&parent->child_list, list) {
+			if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
+			    pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
+				continue;
+
+			if (!list_empty(&pdn->list))
+				list_del(&pdn->list);
+			kfree(pdn);
+		}
+	}
+#endif
 }
 
 /*
@@ -49,6 +259,7 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	struct pci_controller *phb = data;
 	const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
 	const __be32 *regs;
+	struct device_node *parent;
 	struct pci_dn *pdn;
 
 	pdn = zalloc_maybe_bootmem(sizeof(*pdn), GFP_KERNEL);
@@ -70,6 +281,15 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	}
 
 	pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
+
+	/* Attach to parent node */
+	INIT_LIST_HEAD(&pdn->child_list);
+	INIT_LIST_HEAD(&pdn->list);
+	parent = of_get_parent(dn);
+	pdn->parent = parent ? PCI_DN(parent) : NULL;
+	if (pdn->parent)
+		list_add_tail(&pdn->list, &pdn->parent->child_list);
+
 	return NULL;
 }
 
@@ -150,6 +370,7 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	if (pdn) {
 		pdn->devfn = pdn->busno = -1;
 		pdn->phb = phb;
+		phb->firmware_data = pdn;
 	}
 
 	/* Update dn->phb ptrs for new phb and children devices */
@@ -173,3 +394,22 @@ void __init pci_devs_phb_init(void)
 	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
 		pci_devs_phb_init_dynamic(phb);
 }
+
+static void pci_dev_pdn_create(struct pci_dev *pdev)
+{
+	add_dev_pci_info(pdev);
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_create);
+
+static void pci_dev_pdn_setup(struct pci_dev *pdev)
+{
+	struct pci_dn *pdn;
+
+	if (pdev->dev.archdata.firmware_data)
+		return;
+
+	/* Setup the fast path */
+	pdn = pci_get_pdn(pdev);
+	pdev->dev.archdata.firmware_data = pdn;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
-- 
1.7.9.5


* [PATCH V9 09/18] powerpc/pci: remove pci_dn->pcidev field
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

The field pci_dn->pcidev is assigned but not used.

This patch removes this field.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    1 -
 arch/powerpc/platforms/powernv/pci-ioda.c |    1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 757d7bb..063d79d 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -166,7 +166,6 @@ struct pci_dn {
 
 	bool	force_32bit_msi;
 
-	struct	pci_dev *pcidev;	/* back-pointer to the pci device */
 #ifdef CONFIG_EEH
 	struct eeh_dev *edev;		/* eeh device */
 #endif
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 468a0f2..df49dc6 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -727,7 +727,6 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 				pci_name(dev));
 			continue;
 		}
-		pdn->pcidev = dev;
 		pdn->pe_number = pe->pe_number;
 		pe->dma_weight += pnv_ioda_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-- 
1.7.9.5


* [PATCH V9 10/18] powerpc/powernv: Use pci_dn in PCI config accessor
From: Wei Yang @ 2014-11-02 15:41 UTC
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

The PCI config accessors rely on the device node. Unfortunately, VFs don't
have corresponding device nodes, so we have to switch to pci_dn for PCI
config access.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/eeh-powernv.c |   14 +++++-
 arch/powerpc/platforms/powernv/pci.c         |   69 ++++++++++----------------
 arch/powerpc/platforms/powernv/pci.h         |    4 +-
 3 files changed, 40 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 1d19e79..c63b6c1 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -419,21 +419,31 @@ static inline bool powernv_eeh_cfg_blocked(struct device_node *dn)
 static int powernv_eeh_read_config(struct device_node *dn,
 				   int where, int size, u32 *val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn)) {
 		*val = 0xFFFFFFFF;
 		return PCIBIOS_SET_FAILED;
 	}
 
-	return pnv_pci_cfg_read(dn, where, size, val);
+	return pnv_pci_cfg_read(pdn, where, size, val);
 }
 
 static int powernv_eeh_write_config(struct device_node *dn,
 				    int where, int size, u32 val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn))
 		return PCIBIOS_SET_FAILED;
 
-	return pnv_pci_cfg_write(dn, where, size, val);
+	return pnv_pci_cfg_write(pdn, where, size, val);
 }
 
 /**
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b2187d0..f8dbb3f 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -368,9 +368,9 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
 	spin_unlock_irqrestore(&phb->lock, flags);
 }
 
-static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
-				     struct device_node *dn)
+static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
 {
+	struct pnv_phb *phb = pdn->phb->private_data;
 	u8	fstate;
 	__be16	pcierr;
 	int	pe_no;
@@ -381,7 +381,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	 * setup that yet. So all ER errors should be mapped to
 	 * reserved PE.
 	 */
-	pe_no = PCI_DN(dn)->pe_number;
+	pe_no = pdn->pe_number;
 	if (pe_no == IODA_INVALID_PE) {
 		if (phb->type == PNV_PHB_P5IOC2)
 			pe_no = 0;
@@ -409,8 +409,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 
 	cfg_dbg(" -> EEH check, bdfn=%04x PE#%d fstate=%x\n",
-		(PCI_DN(dn)->busno << 8) | (PCI_DN(dn)->devfn),
-		pe_no, fstate);
+		(pdn->busno << 8) | (pdn->devfn), pe_no, fstate);
 
 	/* Clear the frozen state if applicable */
 	if (fstate == OPAL_EEH_STOPPED_MMIO_FREEZE ||
@@ -427,10 +426,9 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 }
 
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 	s64 rc;
@@ -464,10 +462,9 @@ int pnv_pci_cfg_read(struct device_node *dn,
 	return PCIBIOS_SUCCESSFUL;
 }
 
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 
@@ -491,18 +488,17 @@ int pnv_pci_cfg_write(struct device_node *dn,
 }
 
 #if CONFIG_EEH
-static bool pnv_pci_cfg_check(struct pci_controller *hose,
-			      struct device_node *dn)
+static bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	struct eeh_dev *edev = NULL;
-	struct pnv_phb *phb = hose->private_data;
+	struct pnv_phb *phb = pdn->phb->private_data;
 
 	/* EEH not enabled ? */
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
 		return true;
 
 	/* PE reset or device removed ? */
-	edev = of_node_to_eeh_dev(dn);
+	edev = pdn->edev;
 	if (edev) {
 		if (edev->pe &&
 		    (edev->pe->state & EEH_PE_CFG_BLOCKED))
@@ -515,8 +511,7 @@ static bool pnv_pci_cfg_check(struct pci_controller *hose,
 	return true;
 }
 #else
-static inline pnv_pci_cfg_check(struct pci_controller *hose,
-				struct device_node *dn)
+static inline pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	return true;
 }
@@ -526,32 +521,26 @@ static int pnv_pci_read_config(struct pci_bus *bus,
 			       unsigned int devfn,
 			       int where, int size, u32 *val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
 	*val = 0xFFFFFFFF;
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_read(dn, where, size, val);
-	if (phb->flags & PNV_PHB_FLAG_EEH) {
+	ret = pnv_pci_cfg_read(pdn, where, size, val);
+	phb = pdn->phb->private_data;
+	if (phb->flags & PNV_PHB_FLAG_EEH && pdn->edev) {
 		if (*val == EEH_IO_ERROR_VALUE(size) &&
-		    eeh_dev_check_failure(of_node_to_eeh_dev(dn)))
+		    eeh_dev_check_failure(pdn->edev))
                         return PCIBIOS_DEVICE_NOT_FOUND;
 	} else {
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 	}
 
 	return ret;
@@ -561,27 +550,21 @@ static int pnv_pci_write_config(struct pci_bus *bus,
 				unsigned int devfn,
 				int where, int size, u32 val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_write(dn, where, size, val);
+	ret = pnv_pci_cfg_write(pdn, where, size, val);
+	phb = pdn->phb->private_data;
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 
 	return ret;
 }
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 34d29eb..9dc4143 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -219,9 +219,9 @@ extern struct pnv_eeh_ops ioda_eeh_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val);
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val);
 extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 				      void *tce_mem, u64 tce_size,
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V9 11/18] powerpc/powernv: Allocate pe->iommu_table dynamically
  2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
                   ` (9 preceding siblings ...)
  2014-11-02 15:41 ` [PATCH V9 10/18] powerpc/powernv: Use pci_dn in PCI config accessor Wei Yang
@ 2014-11-02 15:41 ` Wei Yang
  2014-11-02 15:41 ` [PATCH V9 12/18] powerpc/powernv: Expand VF resources according to the number of total_pe Wei Yang
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-02 15:41 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

Currently, the iommu_table of a PE is a static field. This is a problem
when iommu_free_table() is called.

This patch allocates the iommu_table dynamically and stores a back-pointer
to the owning PE in the new tbl->data field.
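
As a rough standalone sketch (plain userspace C, not the kernel code), the
change swaps an embedded table plus container_of() for a dynamically
allocated table with an explicit back-pointer; the struct names mirror the
patch, everything else is illustrative:

#include <stdio.h>
#include <stdlib.h>

/* Cut-down stand-ins for the kernel structures. */
struct iommu_table {
	void *data;			/* back-pointer to the owning PE */
};

struct pnv_ioda_pe {
	int pe_number;
	struct iommu_table *tce32_table; /* was: struct iommu_table tce32_table; */
};

int main(void)
{
	struct pnv_ioda_pe pe = { .pe_number = 42 };

	/* Allocate the table dynamically, as the PE setup paths now do. */
	pe.tce32_table = calloc(1, sizeof(*pe.tce32_table));
	if (!pe.tce32_table)
		return 1;
	pe.tce32_table->data = &pe;

	/* The invalidate path recovers the PE from tbl->data instead of
	 * container_of(tbl, struct pnv_ioda_pe, tce32_table). */
	struct iommu_table *tbl = pe.tce32_table;
	struct pnv_ioda_pe *owner = tbl->data;
	printf("table belongs to PE#%d\n", owner->pe_number);

	/* The table can now be freed without freeing the PE itself. */
	free(pe.tce32_table);
	pe.tce32_table = NULL;
	return 0;
}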

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/iommu.h          |    3 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   26 ++++++++++++++------------
 arch/powerpc/platforms/powernv/pci.h      |    2 +-
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 42632c7..0fedacb 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,9 @@ struct iommu_table {
 	struct iommu_group *it_group;
 #endif
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
+#ifdef CONFIG_PPC_POWERNV
+	void           *data;
+#endif
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index df49dc6..6d35ed9 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -785,6 +785,10 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 		return;
 	}
 
+	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+			GFP_KERNEL, hose->node);
+	pe->tce32_table->data = pe;
+
 	/* Associate it with all child devices */
 	pnv_ioda_setup_same_PE(bus, pe);
 
@@ -858,7 +862,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -885,7 +889,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, pe->tce32_table);
 	}
 	*pdev->dev.dma_mask = dma_mask;
 	return 0;
@@ -922,9 +926,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (add_to_iommu_group)
 			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       pe->tce32_table);
 		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, pe->tce32_table);
 
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1014,8 +1018,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
 				 __be64 *startp, __be64 *endp, bool rm)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	struct pnv_phb *phb = pe->phb;
 
 	if (phb->type == PNV_PHB_IODA1)
@@ -1081,7 +1084,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
 
@@ -1119,8 +1122,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -1165,10 +1167,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pe->tce_bypass_base = 1ull << 59;
 
 	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32_table->set_bypass = pnv_pci_ioda2_set_bypass;
 
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(pe->tce32_table, true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1216,7 +1218,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 			IOMMU_PAGE_SHIFT_4K);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 9dc4143..0f24a14 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -53,7 +53,7 @@ struct pnv_ioda_pe {
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
 	int			tce32_seg;
 	int			tce32_segcount;
-	struct iommu_table	tce32_table;
+	struct iommu_table	*tce32_table;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V9 12/18] powerpc/powernv: Expand VF resources according to the number of total_pe
  2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
                   ` (10 preceding siblings ...)
  2014-11-02 15:41 ` [PATCH V9 11/18] powerpc/powernv: Allocate pe->iommu_table dynamically Wei Yang
@ 2014-11-02 15:41 ` Wei Yang
  2014-11-02 15:41 ` [PATCH V9 13/18] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv Wei Yang
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-02 15:41 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

On PHB3, the PF's IOV BAR will be covered by an M64 BAR for better PE
isolation. The total_pe number usually differs from total_VFs, which leads
to a conflict between the MMIO space and the PE numbers.

For example, if total_VFs is 128 and total_pe is 256, the second half of
the M64 BAR space will be part of other PCI devices, which may already
belong to other PEs.

This patch expands the PF's IOV BAR to reserve total_pe copies of the VF
BAR size, which prevents the conflict.
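
A hedged numeric sketch of the expansion (standalone C; the 1M per-VF BAR
size is an assumption, the VF/PE counts are from the example above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t vf_bar_size = 1 << 20;	/* assumed 1M per-VF BAR */
	unsigned total_VFs = 128;
	unsigned total_pe = 256;

	uint64_t default_size  = vf_bar_size * total_VFs;
	uint64_t expanded_size = vf_bar_size * total_pe;

	/* Without expansion, M64 segments 128..255 would fall outside the
	 * IOV BAR and overlap whatever is assigned after it; reserving
	 * total_pe segments keeps every PE's segment inside the BAR. */
	printf("default:  %llu MiB\n", (unsigned long long)(default_size >> 20));
	printf("expanded: %llu MiB\n", (unsigned long long)(expanded_size >> 20));
	return 0;
}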

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h        |    4 +++
 arch/powerpc/include/asm/pci-bridge.h     |    3 ++
 arch/powerpc/kernel/pci-common.c          |    5 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   56 +++++++++++++++++++++++++++++
 4 files changed, 68 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 8242262..86d47ca 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -254,6 +254,10 @@ struct machdep_calls {
 	/* Reset the secondary bus of bridge */
 	void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
 
+#ifdef CONFIG_PCI_IOV
+	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
+#endif /* CONFIG_PCI_IOV */
+
 	/* Called to shutdown machine specific hardware not already controlled
 	 * by other drivers.
 	 */
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 063d79d..ff26045 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -172,6 +172,9 @@ struct pci_dn {
 #define IODA_INVALID_PE		(-1)
 #ifdef CONFIG_PPC_POWERNV
 	int	pe_number;
+#ifdef CONFIG_PCI_IOV
+	u16     max_vfs;		/* number of VFs IOV BAR expanded */
+#endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
 	struct list_head list;
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 17acfa7..4e3d87d 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1645,6 +1645,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
 	if (ppc_md.pcibios_fixup_phb)
 		ppc_md.pcibios_fixup_phb(hose);
 
+#ifdef CONFIG_PCI_IOV
+	if (ppc_md.pcibios_fixup_sriov)
+		ppc_md.pcibios_fixup_sriov(bus);
+#endif /* CONFIG_PCI_IOV */
+
 	/* Configure PCI Express settings */
 	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
 		struct pci_bus *child;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6d35ed9..ef279d3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1601,6 +1601,59 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
 static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { }
 #endif /* CONFIG_PCI_MSI */
 
+#ifdef CONFIG_PCI_IOV
+static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
+{
+	struct pci_controller *hose;
+	struct pnv_phb *phb;
+	struct resource *res;
+	int i;
+	resource_size_t size;
+	struct pci_dn *pdn;
+
+	if (!pdev->is_physfn || pdev->is_added)
+		return;
+
+	hose = pci_bus_to_host(pdev->bus);
+	phb = hose->private_data;
+
+	pdn = pci_get_pdn(pdev);
+	pdn->max_vfs = 0;
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &pdev->resource[i];
+		if (!res->flags || res->parent)
+			continue;
+		if (!pnv_pci_is_mem_pref_64(res->flags)) {
+			dev_warn(&pdev->dev, " Skipping expanding IOV BAR %pR on %s\n",
+				 res, pci_name(pdev));
+			continue;
+		}
+
+		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
+		size = pci_iov_resource_size(pdev, i);
+		res->end = res->start + size * phb->ioda.total_pe - 1;
+		dev_dbg(&pdev->dev, "                       %pR\n", res);
+	}
+	pdn->max_vfs = phb->ioda.total_pe;
+}
+
+static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
+{
+	struct pci_dev *pdev;
+	struct pci_bus *b;
+
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		b = pdev->subordinate;
+
+		if (b)
+			pnv_pci_ioda_fixup_sriov(b);
+
+		pnv_pci_ioda_fixup_iov_resources(pdev);
+	}
+}
+#endif /* CONFIG_PCI_IOV */
+
 /*
  * This function is supposed to be called on basis of PE from top
  * to bottom style. So the I/O or MMIO segment assigned to
@@ -1983,6 +2036,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
 	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
+#ifdef CONFIG_PCI_IOV
+	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
+#endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 
 	/* Reset IODA tables to a clean state */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V9 13/18] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv
  2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
                   ` (11 preceding siblings ...)
  2014-11-02 15:41 ` [PATCH V9 12/18] powerpc/powernv: Expand VF resources according to the number of total_pe Wei Yang
@ 2014-11-02 15:41 ` Wei Yang
  2014-11-02 15:41 ` [PATCH V9 14/18] powerpc/powernv: Implement pcibios_iov_resource_size() " Wei Yang
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-02 15:41 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

This patch implements pcibios_iov_resource_alignment() on the PowerNV
platform.

On the PowerNV platform, there are 3 cases for the IOV BAR:
1. initial state: the IOV BAR size is a multiple of the VF BAR size
2. after expansion: the IOV BAR size is enlarged to meet the M64 segment size
3. sizing stage: the IOV BAR is truncated to 0

pnv_pci_iov_resource_alignment() handles these three cases respectively.
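
A minimal userspace sketch of that three-case logic (the helper name and
the values are illustrative, mirroring pnv_pci_iov_resource_alignment() in
the diff below):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t resource_size_t;

/* Prefer the current (possibly expanded) IOV BAR size; during sizing the
 * BAR is truncated to 0, so fall back to max_vfs * per-VF alignment, and
 * finally to the default alignment. */
static resource_size_t iov_align(resource_size_t cur_bar_size,
				 unsigned max_vfs, resource_size_t align)
{
	if (cur_bar_size)		/* cases 1 and 2 */
		return cur_bar_size;
	if (max_vfs)			/* case 3, after the IOV fixup ran */
		return (resource_size_t)max_vfs * align;
	return align;			/* case 3, before the fixup */
}

int main(void)
{
	/* Truncated BAR, 256 PEs, 1M per-VF alignment -> 256M alignment. */
	printf("%llu MiB\n",
	       (unsigned long long)(iov_align(0, 256, 1 << 20) >> 20));
	return 0;
}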

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h        |    3 +++
 arch/powerpc/kernel/pci-common.c          |   14 ++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c |   20 ++++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 86d47ca..15a13e6 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -256,6 +256,9 @@ struct machdep_calls {
 
 #ifdef CONFIG_PCI_IOV
 	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
+	resource_size_t (*pcibios_iov_resource_alignment)(struct pci_dev *,
+			                                    int resno,
+							    resource_size_t align);
 #endif /* CONFIG_PCI_IOV */
 
 	/* Called to shutdown machine specific hardware not already controlled
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 4e3d87d..581e67b 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -131,6 +131,20 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
 	pci_reset_secondary_bus(dev);
 }
 
+#ifdef CONFIG_PCI_IOV
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev,
+						 int resno,
+						 resource_size_t align)
+{
+	if (ppc_md.pcibios_iov_resource_alignment)
+		return ppc_md.pcibios_iov_resource_alignment(pdev,
+							       resno,
+							       align);
+
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static resource_size_t pcibios_io_size(const struct pci_controller *hose)
 {
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index ef279d3..880d76d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1830,6 +1830,25 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
 	return phb->ioda.io_segsize;
 }
 
+#ifdef CONFIG_PCI_IOV
+static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
+							    int resno,
+							    resource_size_t align)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	resource_size_t iov_align;
+
+	iov_align = resource_size(&pdev->resource[resno]);
+	if (iov_align)
+		return iov_align;
+
+	if (pdn->max_vfs)
+		return pdn->max_vfs * align;
+
+	return align;
+}
+#endif /* CONFIG_PCI_IOV */
+
 /* Prevent enabling devices for which we couldn't properly
  * assign a PE
  */
@@ -2038,6 +2057,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
 #ifdef CONFIG_PCI_IOV
 	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
+	ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
 #endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V9 14/18] powerpc/powernv: Implement pcibios_iov_resource_size() on powernv
  2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
                   ` (12 preceding siblings ...)
  2014-11-02 15:41 ` [PATCH V9 13/18] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv Wei Yang
@ 2014-11-02 15:41 ` Wei Yang
  2014-11-02 15:41 ` [PATCH V9 15/18] powerpc/powernv: Shift VF resource with an offset Wei Yang
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-02 15:41 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

On the PowerNV platform, the PF's IOV BAR size is expanded, which differs
from the normal case.

This patch retrieves the VF BAR size by dividing the total size by the
expanded number of VFs.
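
A small sketch of the division (standalone C with assumed sizes; the kernel
uses do_div() for the 64-by-32 division):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t expanded_bar = (uint64_t)256 << 20;	/* assumed 256M IOV BAR */
	unsigned max_vfs = 256;				/* expanded VF count */

	/* Per-VF size is the expanded total divided by the expanded count,
	 * not by total_VFs as in the generic case. */
	uint64_t vf_bar_size = max_vfs ? expanded_bar / max_vfs : 0;
	printf("VF BAR size: %llu KiB\n", (unsigned long long)(vf_bar_size >> 10));
	return 0;
}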

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h        |    1 +
 arch/powerpc/kernel/pci-common.c          |    8 ++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c |   15 +++++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 15a13e6..d971874 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -259,6 +259,7 @@ struct machdep_calls {
 	resource_size_t (*pcibios_iov_resource_alignment)(struct pci_dev *,
 			                                    int resno,
 							    resource_size_t align);
+	resource_size_t (*pcibios_iov_resource_size)(struct pci_dev *, int resno);
 #endif /* CONFIG_PCI_IOV */
 
 	/* Called to shutdown machine specific hardware not already controlled
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 581e67b..a2a96d3 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -143,6 +143,14 @@ resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev,
 
 	return 0;
 }
+
+resource_size_t pcibios_iov_resource_size(struct pci_dev *pdev, int resno)
+{
+	if (ppc_md.pcibios_iov_resource_size)
+		return ppc_md.pcibios_iov_resource_size(pdev, resno);
+
+	return 0;
+}
 #endif /* CONFIG_PCI_IOV */
 
 static resource_size_t pcibios_io_size(const struct pci_controller *hose)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 880d76d..f7abba3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1847,6 +1847,20 @@ static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
 
 	return align;
 }
+
+static resource_size_t pnv_pci_iov_resource_size(struct pci_dev *pdev, int resno)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	resource_size_t size = 0;
+
+	if (!pdn->max_vfs)
+		return size;
+
+	size = resource_size(pdev->resource + resno);
+	do_div(size, pdn->max_vfs);
+
+	return size;
+}
 #endif /* CONFIG_PCI_IOV */
 
 /* Prevent enabling devices for which we couldn't properly
@@ -2058,6 +2072,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 #ifdef CONFIG_PCI_IOV
 	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
 	ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
+	ppc_md.pcibios_iov_resource_size = pnv_pci_iov_resource_size;
 #endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V9 15/18] powerpc/powernv: Shift VF resource with an offset
  2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
                   ` (13 preceding siblings ...)
  2014-11-02 15:41 ` [PATCH V9 14/18] powerpc/powernv: Implement pcibios_iov_resource_size() " Wei Yang
@ 2014-11-02 15:41 ` Wei Yang
  2014-11-02 15:41 ` [PATCH V9 16/18] powerpc/powernv: Allocate VF PE Wei Yang
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-02 15:41 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

On the PowerNV platform, a resource's position in M64 implies the PE# the
resource belongs to. In some cases, a resource must be adjusted to locate
it at the correct position in M64.

This patch introduces a function to shift the 'real' PF IOV BAR address
according to an offset.
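
A hedged sketch of the shift arithmetic (standalone C; the addresses and
the offset are made up):

#include <stdint.h>
#include <stdio.h>

struct resource { uint64_t start, end; };

int main(void)
{
	struct resource res = {
		.start = 0x100000000ull,
		.end   = 0x100000000ull + (256ull << 20) - 1,
	};
	uint64_t vf_size = 1 << 20;	/* assumed per-VF BAR size */
	int offset = 16;		/* PE# of the first VF */

	/* Moving the start by 'offset' VF slots makes VF#0 land in the M64
	 * segment of PE#offset; the disable path passes -offset to undo it. */
	res.start += vf_size * offset;
	printf("shifted start: 0x%llx\n", (unsigned long long)res.start);
	return 0;
}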

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |   31 +++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f7abba3..5034a3c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -14,6 +14,7 @@
 #include <linux/kernel.h>
 #include <linux/pci.h>
 #include <linux/crash_dump.h>
+#include <linux/pci_regs.h>
 #include <linux/debugfs.h>
 #include <linux/delay.h>
 #include <linux/string.h>
@@ -644,6 +645,36 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
 	return 10;
 }
 
+#ifdef CONFIG_PCI_IOV
+static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
+{
+	struct pci_dn *pdn = pci_get_pdn(dev);
+	int i;
+	struct resource *res;
+	resource_size_t size;
+
+	if (!dev->is_physfn)
+		return;
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &dev->resource[i];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
+		size = pci_iov_resource_size(dev, i);
+		res->start += size*offset;
+
+		dev_info(&dev->dev, "                 %pR\n", res);
+		pci_update_resource(dev, i);
+	}
+	pdn->max_vfs -= offset;
+}
+#endif /* CONFIG_PCI_IOV */
+
 #if 0
 static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 {
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V9 16/18] powerpc/powernv: Allocate VF PE
  2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
                   ` (14 preceding siblings ...)
  2014-11-02 15:41 ` [PATCH V9 15/18] powerpc/powernv: Shift VF resource with an offset Wei Yang
@ 2014-11-02 15:41 ` Wei Yang
  2014-11-02 15:41 ` [PATCH V9 17/18] powerpc/powernv: Expanding IOV BAR, with m64_per_iov supported Wei Yang
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-02 15:41 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

VFs are created when the PCI device is enabled.

This patch tries its best to assign the maximum resources and PEs for the
VFs when the PCI device is enabled: enough M64 windows are assigned to
cover the IOV BAR, and the IOV BAR is shifted to match the PE# indicated
by M64. The VF's pdn->pdev and pdn->pe_number are fixed up accordingly.
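
A toy model of the PE allocation retry in pnv_pci_sriov_enable() (standalone
C; the array stands in for the phb's pe_alloc bitmap and the numbers are
arbitrary):

#include <stdio.h>

#define TOTAL_PE 16

/* Stand-in for bitmap_find_next_zero_area(): first index of 'num'
 * consecutive free PEs, or TOTAL_PE on failure. */
static int find_free_area(const int *alloc, int num)
{
	for (int start = 0; start + num <= TOTAL_PE; start++) {
		int ok = 1;
		for (int i = 0; i < num; i++)
			if (alloc[start + i]) { ok = 0; break; }
		if (ok)
			return start;
	}
	return TOTAL_PE;
}

int main(void)
{
	int pe_alloc[TOTAL_PE] = { [0] = 1, [1] = 1, [5] = 1 };
	int vf_num = 8, offset;

	/* Shrink vf_num until a contiguous block of PE numbers fits. */
	while ((offset = find_free_area(pe_alloc, vf_num)) >= TOTAL_PE) {
		if (--vf_num == 0) {
			printf("Failed to enable VF\n");
			return 1;
		}
	}
	for (int i = 0; i < vf_num; i++)
		pe_alloc[offset + i] = 1;
	printf("enabled %d VFs starting at PE#%d\n", vf_num, offset);
	return 0;
}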

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    4 +
 arch/powerpc/kernel/pci_dn.c              |   11 +
 arch/powerpc/platforms/powernv/pci-ioda.c |  460 ++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c      |   18 ++
 arch/powerpc/platforms/powernv/pci.h      |    7 +
 5 files changed, 487 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index ff26045..8d8d40a 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -174,6 +174,10 @@ struct pci_dn {
 	int	pe_number;
 #ifdef CONFIG_PCI_IOV
 	u16     max_vfs;		/* number of VFs IOV BAR expanded */
+	u16     vf_pes;			/* VF PE# under this PF */
+	int     offset;			/* PE# for the first VF PE */
+#define IODA_INVALID_M64        (-1)
+	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index fa966ae..dbc2f55 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -216,6 +216,17 @@ void remove_dev_pci_info(struct pci_dev *pdev)
 	struct pci_dn *pdn, *tmp;
 	int i;
 
+	/*
+	 * VFs and VF PEs are created/released dynamically, so we need to
+	 * bind/unbind them. Otherwise, when SRIOV is re-enabled, the VF
+	 * and VF PE would be mismatched.
+	 */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		pdn->pe_number = IODA_INVALID_PE;
+		return;
+	}
+
 	/* Only support IOV PF for now */
 	if (!pdev->is_physfn)
 		return;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5034a3c..649f49d4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -45,6 +45,9 @@
 #include "powernv.h"
 #include "pci.h"
 
+/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
+#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
 {
@@ -57,11 +60,18 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 	vaf.fmt = fmt;
 	vaf.va = &args;
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV)
 		strlcpy(pfix, dev_name(&pe->pdev->dev), sizeof(pfix));
-	else
+	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
 		sprintf(pfix, "%04x:%02x     ",
 			pci_domain_nr(pe->pbus), pe->pbus->number);
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		sprintf(pfix, "%04x:%02x:%2x.%d",
+			pci_domain_nr(pe->parent_dev->bus),
+			(pe->rid & 0xff00) >> 8,
+			PCI_SLOT(pe->rid), PCI_FUNC(pe->rid));
+#endif /* CONFIG_PCI_IOV*/
 
 	printk("%spci %s: [PE# %.3d] %pV",
 	       level, pfix, pe->pe_number, &vaf);
@@ -508,6 +518,89 @@ static struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
 }
 #endif /* CONFIG_PCI_MSI */
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *parent;
+	uint8_t bcomp, dcomp, fcomp;
+	int64_t rc;
+	long rid_end, rid;
+
+	/* Currently, we just deconfigure the VF PE; the bus PE will always be there. */
+	if (pe->pbus) {
+		int count;
+
+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
+		parent = pe->pbus->self;
+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
+			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
+		else
+			count = 1;
+
+		switch(count) {
+		case  1: bcomp = OpalPciBusAll;         break;
+		case  2: bcomp = OpalPciBus7Bits;       break;
+		case  4: bcomp = OpalPciBus6Bits;       break;
+		case  8: bcomp = OpalPciBus5Bits;       break;
+		case 16: bcomp = OpalPciBus4Bits;       break;
+		case 32: bcomp = OpalPciBus3Bits;       break;
+		default:
+			pr_err("%s: Number of subordinate busses %d"
+			       " unsupported\n",
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
+			/* Do an exact match only */
+			bcomp = OpalPciBusAll;
+		}
+		rid_end = pe->rid + (count << 8);
+	} else {
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+			parent = pe->pdev->bus->self;
+		bcomp = OpalPciBusAll;
+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
+		rid_end = pe->rid + 1;
+	}
+
+	/* Clear the reverse map */
+	for (rid = pe->rid; rid < rid_end; rid++)
+		phb->ioda.pe_rmap[rid] = 0;
+
+	/* Release from all parents PELT-V */
+	while (parent) {
+		struct pci_dn *pdn = pci_get_pdn(parent);
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
+						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+			/* XXX What to do in case of error ? */
+		}
+		parent = parent->bus->self;
+	}
+
+	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+
+	/* Dissociate PE in PELT */
+	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
+				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
+	if (rc)
+		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
+
+	pe->pbus = NULL;
+	pe->pdev = NULL;
+	pe->parent_dev = NULL;
+
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *parent;
@@ -536,13 +629,19 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 		default:
 			pr_err("%s: Number of subordinate busses %d"
 			       " unsupported\n",
-			       pci_name(pe->pbus->self), count);
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
 			/* Do an exact match only */
 			bcomp = OpalPciBusAll;
 		}
 		rid_end = pe->rid + (count << 8);
 	} else {
-		parent = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+#endif /* CONFIG_PCI_IOV */
+			parent = pe->pdev->bus->self;
 		bcomp = OpalPciBusAll;
 		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
 		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
@@ -878,6 +977,314 @@ static void pnv_pci_ioda_setup_PEs(void)
 	}
 }
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    i;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		if (pdn->m64_wins[i] == IODA_INVALID_M64)
+			continue;
+		opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
+		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+	}
+
+	return 0;
+}
+
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	unsigned int           win;
+	struct resource       *res;
+	int                    i;
+	int64_t                rc;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	/* Initialize the m64_wins to IODA_INVALID_M64 */
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = pdev->resource + PCI_IOV_RESOURCES + i;
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		do {
+			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+					phb->ioda.m64_bar_idx + 1, 0);
+
+			if (win >= phb->ioda.m64_bar_idx + 1)
+				goto m64_failed;
+		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+		pdn->m64_wins[i] = win;
+
+		/* Map the M64 here */
+		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+						 OPAL_M64_WINDOW_TYPE,
+						 pdn->m64_wins[i],
+						 res->start,
+						 0, /* unused */
+						 resource_size(res));
+		if (rc != OPAL_SUCCESS) {
+			pr_err("Failed to map M64 BAR #%d: %lld\n", win, rc);
+			goto m64_failed;
+		}
+
+		rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
+		if (rc != OPAL_SUCCESS) {
+			pr_err("Failed to enable M64 BAR #%d: %llx\n", win, rc);
+			goto m64_failed;
+		}
+	}
+	return 0;
+
+m64_failed:
+	pnv_pci_vf_release_m64(pdev);
+	return -EBUSY;
+}
+
+static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct iommu_table    *tbl;
+	unsigned long         addr;
+	int64_t               rc;
+
+	bus = dev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	tbl = pe->tce32_table;
+	addr = tbl->it_base;
+
+	opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+				   pe->pe_number << 1, 1, __pa(addr),
+				   0, 0x1000);
+
+	rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+				        pe->pe_number,
+				        (pe->pe_number << 1) + 1,
+				        pe->tce_bypass_base,
+				        0);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
+
+	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	free_pages(addr, get_order(TCE32_TABLE_SIZE));
+	pe->tce32_table = NULL;
+}
+
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe, *pe_n;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+
+	if (!pdev->is_physfn)
+		return;
+
+	pdn = pci_get_pdn(pdev);
+	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
+		if (pe->parent_dev != pdev)
+			continue;
+
+		pnv_pci_ioda2_release_dma_pe(pdev, pe);
+
+		/* Remove from list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_del(&pe->list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_ioda_deconfigure_pe(phb, pe);
+
+		pnv_ioda_free_pe(phb, pe->pe_number);
+	}
+}
+
+void pnv_pci_sriov_disable(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	struct pci_sriov      *iov;
+	u16 vf_num;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+	iov = pdev->sriov;
+	vf_num = pdn->vf_pes;
+
+	/* Release VF PEs */
+	pnv_ioda_release_vf_PE(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+
+		/* Release M64 BARs */
+		pnv_pci_vf_release_m64(pdev);
+
+		/* Release PE numbers */
+		bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->offset = 0;
+	}
+
+	return;
+}
+
+static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
+				       struct pnv_ioda_pe *pe);
+static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe;
+	int                    pe_num;
+	u16                    vf_index;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (!pdev->is_physfn)
+		return;
+
+	/* Reserve PE for each VF */
+	for (vf_index = 0; vf_index < vf_num; vf_index++) {
+		pe_num = pdn->offset + vf_index;
+
+		pe = &phb->ioda.pe_array[pe_num];
+		pe->pe_number = pe_num;
+		pe->phb = phb;
+		pe->flags = PNV_IODA_PE_VF;
+		pe->pbus = NULL;
+		pe->parent_dev = pdev;
+		pe->tce32_seg = -1;
+		pe->mve_number = -1;
+		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
+			   pci_iov_virtfn_devfn(pdev, vf_index);
+
+		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%d\n",
+			hose->global_number, pdev->bus->number,
+			PCI_SLOT(pci_iov_virtfn_devfn(pdev, vf_index)),
+			PCI_FUNC(pci_iov_virtfn_devfn(pdev, vf_index)), pe_num);
+
+		if (pnv_ioda_configure_pe(phb, pe)) {
+			/* XXX What do we do here ? */
+			if (pe_num)
+				pnv_ioda_free_pe(phb, pe_num);
+			pe->pdev = NULL;
+			continue;
+		}
+
+		pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+				GFP_KERNEL, hose->node);
+		pe->tce32_table->data = pe;
+
+		/* Put PE to the list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_add_tail(&pe->list, &phb->ioda.pe_list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+
+	}
+}
+
+int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    ret;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		/* Calculate available PE for required VFs */
+		mutex_lock(&phb->ioda.pe_alloc_mutex);
+try_again:
+		pdn->offset = bitmap_find_next_zero_area(
+			phb->ioda.pe_alloc, phb->ioda.total_pe,
+			0, vf_num, 0);
+		if (pdn->offset >= phb->ioda.total_pe) {
+			vf_num--;
+			if (vf_num)
+				goto try_again;
+
+			mutex_unlock(&phb->ioda.pe_alloc_mutex);
+			pr_info("Failed to enable VF\n");
+			pdn->offset = 0;
+			return -EBUSY;
+		}
+		bitmap_set(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->vf_pes = vf_num;
+		mutex_unlock(&phb->ioda.pe_alloc_mutex);
+
+		/* Assign M64 BAR accordingly */
+		ret = pnv_pci_vf_assign_m64(pdev);
+		if (ret) {
+			pr_info("No enough M64 resource\n");
+			goto m64_failed;
+		}
+
+		/* Do some magic shift */
+		pnv_pci_vf_resource_shift(pdev, pdn->offset);
+	}
+
+	/* Setup VF PEs */
+	pnv_ioda_setup_vf_PE(pdev, vf_num);
+
+	return 0;
+
+m64_failed:
+	bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+	pdn->offset = 0;
+
+	return ret;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev)
 {
 	struct pci_dn *pdn = pci_get_pdn(pdev);
@@ -1070,9 +1477,6 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	int64_t rc;
 	void *addr;
 
-	/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
-#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
-
 	/* XXX FIXME: Handle 64-bit only DMA devices */
 	/* XXX FIXME: Provide 64-bit DMA facilities & non-4K TCE tables etc.. */
 	/* XXX FIXME: Allocate multi-level tables on PHB3 */
@@ -1135,12 +1539,19 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_PAIR);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	return;
  fail:
@@ -1267,12 +1678,19 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	/* Also create a bypass window */
 	pnv_pci_ioda2_setup_bypass_pe(phb, pe);
@@ -1915,9 +2333,22 @@ static int pnv_pci_enable_device_hook(struct pci_dev *dev)
 	if (!pdn || pdn->pe_number == IODA_INVALID_PE)
 		return -EINVAL;
 
+#ifdef CONFIG_PCI_IOV
+	if (dev->is_physfn)
+		pnv_pci_sriov_enable(dev, pci_sriov_get_totalvfs(dev));
+#endif
 	return 0;
 }
 
+static void pnv_pci_disable_device_hook(struct pci_dev *dev)
+{
+#ifdef CONFIG_PCI_IOV
+	if (dev->is_physfn)
+		pnv_pci_sriov_disable(dev);
+#endif
+	return;
+}
+
 static u32 pnv_ioda_bdfn_to_pe(struct pnv_phb *phb, struct pci_bus *bus,
 			       u32 devfn)
 {
@@ -1983,6 +2414,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	phb->hub_id = hub_id;
 	phb->opal_id = phb_id;
 	phb->type = ioda_type;
+	mutex_init(&phb->ioda.pe_alloc_mutex);
 
 	/* Detect specific models for error handling */
 	if (of_device_is_compatible(np, "ibm,p7ioc-pciex"))
@@ -2043,6 +2475,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	INIT_LIST_HEAD(&phb->ioda.pe_dma_list);
 	INIT_LIST_HEAD(&phb->ioda.pe_list);
+	mutex_init(&phb->ioda.pe_list_mutex);
 
 	/* Calculate how many 32-bit TCE segments we have */
 	phb->ioda.tce32_count = phb->ioda.m32_pci_base >> 28;
@@ -2098,6 +2531,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	 */
 	ppc_md.pcibios_fixup = pnv_pci_ioda_fixup;
 	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
+	ppc_md.pcibios_disable_device_hook = pnv_pci_disable_device_hook;
 	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
 #ifdef CONFIG_PCI_IOV
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index f8dbb3f..580c3d7 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -716,6 +716,24 @@ static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 {
 	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
 	struct pnv_phb *phb = hose->private_data;
+#ifdef CONFIG_PCI_IOV
+	struct pnv_ioda_pe *pe;
+	struct pci_dn *pdn;
+
+	/* Fix the VF pdn PE number */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		WARN_ON(pdn->pe_number != IODA_INVALID_PE);
+		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
+			if (pe->rid == ((pdev->bus->number << 8) |
+			    (pdev->devfn & 0xff))) {
+				pdn->pe_number = pe->pe_number;
+				pe->pdev = pdev;
+				break;
+			}
+		}
+	}
+#endif /* CONFIG_PCI_IOV */
 
 	/* If we have no phb structure, try to setup a fallback based on
 	 * the device-tree (RTAS PCI for example)
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 0f24a14..3556984 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -23,6 +23,7 @@ enum pnv_phb_model {
 #define PNV_IODA_PE_BUS_ALL	(1 << 2)	/* PE has subordinate buses	*/
 #define PNV_IODA_PE_MASTER	(1 << 3)	/* Master PE in compound case	*/
 #define PNV_IODA_PE_SLAVE	(1 << 4)	/* Slave PE in compound case	*/
+#define PNV_IODA_PE_VF		(1 << 5)	/* PE for one VF 		*/
 
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
@@ -34,6 +35,9 @@ struct pnv_ioda_pe {
 	 * entire bus (& children). In the former case, pdev
 	 * is populated, in the later case, pbus is.
 	 */
+#ifdef CONFIG_PCI_IOV
+	struct pci_dev          *parent_dev;
+#endif
 	struct pci_dev		*pdev;
 	struct pci_bus		*pbus;
 
@@ -165,6 +169,8 @@ struct pnv_phb {
 
 			/* PE allocation bitmap */
 			unsigned long		*pe_alloc;
+			/* PE allocation mutex */
+			struct mutex		pe_alloc_mutex;
 
 			/* M32 & IO segment maps */
 			unsigned int		*m32_segmap;
@@ -179,6 +185,7 @@ struct pnv_phb {
 			 * on the sequence of creation
 			 */
 			struct list_head	pe_list;
+			struct mutex            pe_list_mutex;
 
 			/* Reverse map of PEs, will have to extend if
 			 * we are to support more than 256 PEs, indexed
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V9 17/18] powerpc/powernv: Expanding IOV BAR, with m64_per_iov supported
  2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
                   ` (15 preceding siblings ...)
  2014-11-02 15:41 ` [PATCH V9 16/18] powerpc/powernv: Allocate VF PE Wei Yang
@ 2014-11-02 15:41 ` Wei Yang
  2014-11-02 15:41 ` [PATCH V9 18/18] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3 Wei Yang
  2014-11-18 23:11 ` [PATCH V9 00/18] Enable SRIOV on PowerNV Gavin Shan
  18 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-02 15:41 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

The M64 aperture size is limited on PHB3. When the IOV BAR is too big, it
exceeds the limit and fails to be assigned.

This patch introduces a different expansion strategy based on the IOV BAR
size:

If the IOV BAR size is smaller than 64M, expand it to total_pe VF BAR copies.
If the IOV BAR size is bigger than 64M, round the VF count up to a power of 2.
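
A numeric sketch of the two strategies (standalone C; the VF BAR size and
the counts are assumptions):

#include <stdint.h>
#include <stdio.h>

static unsigned roundup_pow_of_two(unsigned n)
{
	unsigned r = 1;
	while (r < n)
		r <<= 1;
	return r;
}

int main(void)
{
	uint64_t vf_bar_size = 128ull << 20;	/* assumed 128M VF BAR */
	unsigned total_vfs = 63, total_pe = 256;

	/* Below 64M, expand to total_pe segments as before; above 64M,
	 * only round the VF count up to a power of two, to keep the
	 * expanded BAR within the M64 aperture limit. */
	unsigned mul = (vf_bar_size > (1 << 26))
		     ? roundup_pow_of_two(total_vfs)	/* -> 64 */
		     : total_pe;			/* -> 256 */
	printf("expand IOV BAR to %u * VF BAR = %llu MiB\n", mul,
	       (unsigned long long)((mul * vf_bar_size) >> 20));
	return 0;
}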

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   31 +++++++++++++++++++++++++++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 8d8d40a..a336c7a 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -176,6 +176,8 @@ struct pci_dn {
 	u16     max_vfs;		/* number of VFs IOV BAR expanded */
 	u16     vf_pes;			/* VF PE# under this PF */
 	int     offset;			/* PE# for the first VF PE */
+#define M64_PER_IOV 4
+	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
 	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 649f49d4..a4e78ab 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2059,6 +2059,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	int i;
 	resource_size_t size;
 	struct pci_dn *pdn;
+	int mul, total_vfs;
 
 	if (!pdev->is_physfn || pdev->is_added)
 		return;
@@ -2069,6 +2070,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	pdn = pci_get_pdn(pdev);
 	pdn->max_vfs = 0;
 
+	total_vfs = pci_sriov_get_totalvfs(pdev);
+	pdn->m64_per_iov = 1;
+	mul = phb->ioda.total_pe;
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &pdev->resource[i];
+		if (!res->flags || res->parent)
+			continue;
+		if (!pnv_pci_is_mem_pref_64(res->flags)) {
+			dev_warn(&pdev->dev, " non M64 IOV BAR %pR on %s\n",
+					res, pci_name(pdev));
+			continue;
+		}
+
+		size = pci_iov_resource_size(pdev, i);
+
+		/* bigger than 64M */
+		if (size > (1 << 26)) {
+			dev_info(&pdev->dev, "PowerNV: VF BAR[%d] size "
+					"is bigger than 64M, roundup power2\n", i);
+			pdn->m64_per_iov = M64_PER_IOV;
+			mul = __roundup_pow_of_two(total_vfs);
+			break;
+		}
+	}
+
 	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
 		res = &pdev->resource[i];
 		if (!res->flags || res->parent)
@@ -2081,10 +2108,10 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 
 		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
 		size = pci_iov_resource_size(pdev, i);
-		res->end = res->start + size * phb->ioda.total_pe - 1;
+		res->end = res->start + size * mul - 1;
 		dev_dbg(&pdev->dev, "                       %pR\n", res);
 	}
-	pdn->max_vfs = phb->ioda.total_pe;
+	pdn->max_vfs = mul;
 }
 
 static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V9 18/18] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3
  2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
                   ` (16 preceding siblings ...)
  2014-11-02 15:41 ` [PATCH V9 17/18] powerpc/powernv: Expanding IOV BAR, with m64_per_iov supported Wei Yang
@ 2014-11-02 15:41 ` Wei Yang
  2014-11-18 23:11 ` [PATCH V9 00/18] Enable SRIOV on PowerNV Gavin Shan
  18 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-02 15:41 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev; +Cc: Wei Yang

When the IOV BAR is big, each one is covered by 4 M64 windows. This leads
to several VF PEs sitting in one PE in terms of M64.

This patch groups VF PEs according to the M64 allocation.
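
A small sketch of the grouping arithmetic used by the M64 assignment
(standalone C; vf_num is an arbitrary example):

#include <stdio.h>

#define M64_PER_IOV 4

static unsigned roundup_pow_of_two(unsigned n)
{
	unsigned r = 1;
	while (r < n)
		r <<= 1;
	return r;
}

int main(void)
{
	unsigned vf_num = 13;	/* assumed number of VFs being enabled */

	/* Up to M64_PER_IOV windows per IOV BAR, each window covering a
	 * contiguous group of VFs, as in pnv_pci_vf_assign_m64(). */
	unsigned vf_groups = (vf_num <= M64_PER_IOV) ? vf_num : M64_PER_IOV;
	unsigned vf_per_group = (vf_num <= M64_PER_IOV)
			      ? 1 : roundup_pow_of_two(vf_num) / M64_PER_IOV;

	printf("%u groups, %u VFs per group\n", vf_groups, vf_per_group);
	return 0;
}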

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  188 +++++++++++++++++++++++------
 2 files changed, 149 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index a336c7a..aaf5a31 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -179,7 +179,7 @@ struct pci_dn {
 #define M64_PER_IOV 4
 	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
-	int     m64_wins[PCI_SRIOV_NUM_BARS];
+	int     m64_wins[PCI_SRIOV_NUM_BARS][M64_PER_IOV];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a4e78ab..ea02131 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -984,26 +984,27 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pci_dn         *pdn;
-	int                    i;
+	int                    i, j;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
 
-	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		if (pdn->m64_wins[i] == IODA_INVALID_M64)
-			continue;
-		opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
-		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
-		pdn->m64_wins[i] = IODA_INVALID_M64;
-	}
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		for (j = 0; j < M64_PER_IOV; j++) {
+			if (pdn->m64_wins[i][j] == IODA_INVALID_M64)
+				continue;
+			opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 0);
+			clear_bit(pdn->m64_wins[i][j], &phb->ioda.m64_bar_alloc);
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+		}
 
 	return 0;
 }
 
-static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
@@ -1011,17 +1012,33 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 	struct pci_dn         *pdn;
 	unsigned int           win;
 	struct resource       *res;
-	int                    i;
+	int                    i, j;
 	int64_t                rc;
+	int                    total_vfs;
+	resource_size_t        size, start;
+	int                    pe_num;
+	int                    vf_groups;
+	int                    vf_per_group;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
+	total_vfs = pci_sriov_get_totalvfs(pdev);
 
 	/* Initialize the m64_wins to IODA_INVALID_M64 */
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-		pdn->m64_wins[i] = IODA_INVALID_M64;
+		for (j = 0; j < M64_PER_IOV; j++)
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+
+	if (pdn->m64_per_iov == M64_PER_IOV) {
+		vf_groups = (vf_num <= M64_PER_IOV) ? vf_num: M64_PER_IOV;
+		vf_per_group = (vf_num <= M64_PER_IOV)? 1:
+			__roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+	} else {
+		vf_groups = 1;
+		vf_per_group = 1;
+	}
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = pdev->resource + PCI_IOV_RESOURCES + i;
@@ -1031,33 +1048,61 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 		if (!pnv_pci_is_mem_pref_64(res->flags))
 			continue;
 
-		do {
-			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
-					phb->ioda.m64_bar_idx + 1, 0);
-
-			if (win >= phb->ioda.m64_bar_idx + 1)
-				goto m64_failed;
-		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+		for (j = 0; j < vf_groups; j++) {
+			do {
+				win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+						phb->ioda.m64_bar_idx + 1, 0);
+
+				if (win >= phb->ioda.m64_bar_idx + 1)
+					goto m64_failed;
+			} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+			pdn->m64_wins[i][j] = win;
+
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				size = pci_iov_resource_size(pdev,
+								   PCI_IOV_RESOURCES + i);
+				size = size * vf_per_group;
+				start = res->start + size * j;
+			} else {
+				size = resource_size(res);
+				start = res->start;
+			}
 
-		pdn->m64_wins[i] = win;
+			/* Map the M64 here */
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				pe_num = pdn->offset + j;
+				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+						pe_num, OPAL_M64_WINDOW_TYPE,
+						pdn->m64_wins[i][j], 0);
+			}
 
-		/* Map the M64 here */
-		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+			rc = opal_pci_set_phb_mem_window(phb->opal_id,
 						 OPAL_M64_WINDOW_TYPE,
-						 pdn->m64_wins[i],
-						 res->start,
+						 pdn->m64_wins[i][j],
+						 start,
 						 0, /* unused */
-						 resource_size(res));
-		if (rc != OPAL_SUCCESS) {
-			pr_err("Failed to map M64 BAR #%d: %lld\n", win, rc);
-			goto m64_failed;
-		}
+						 size);
 
-		rc = opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
-		if (rc != OPAL_SUCCESS) {
-			pr_err("Failed to enable M64 BAR #%d: %llx\n", win, rc);
-			goto m64_failed;
+
+			if (rc != OPAL_SUCCESS) {
+				pr_err("Failed to set M64 BAR #%d: %lld\n",
+						win, rc);
+				goto m64_failed;
+			}
+
+			if (pdn->m64_per_iov == M64_PER_IOV)
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 2);
+			else
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 1);
+
+			if (rc != OPAL_SUCCESS) {
+				pr_err("Failed to enable M64 BAR #%d: %llx\n",
+						win, rc);
+				goto m64_failed;
+			}
 		}
 	}
 	return 0;
@@ -1099,22 +1144,53 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 	pe->tce32_table = NULL;
 }
 
-static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pnv_ioda_pe    *pe, *pe_n;
 	struct pci_dn         *pdn;
+	u16                    vf_index;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
 
 	if (!pdev->is_physfn)
 		return;
 
-	pdn = pci_get_pdn(pdev);
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++)
+			for (vf_index = vf_group * vf_per_group;
+				vf_index < (vf_group + 1) * vf_per_group &&
+				vf_index < vf_num;
+				vf_index++)
+				for (vf_index1 = vf_group * vf_per_group;
+					vf_index1 < (vf_group + 1) * vf_per_group &&
+					vf_index1 < vf_num;
+					vf_index1++){
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_REMOVE_PE_FROM_DOMAIN);
+
+					if (rc)
+					    pr_warn("%s: Failed to unlink same"
+						" group PE#%d(%lld)\n", __func__,
+						pdn->offset + vf_index1, rc);
+				}
+	}
+
 	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
 		if (pe->parent_dev != pdev)
 			continue;
@@ -1149,10 +1225,11 @@ void pnv_pci_sriov_disable(struct pci_dev *pdev)
 	vf_num = pdn->vf_pes;
 
 	/* Release VF PEs */
-	pnv_ioda_release_vf_PE(pdev);
+	pnv_ioda_release_vf_PE(pdev, vf_num);
 
 	if (phb->type == PNV_PHB_IODA2) {
-		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+		if (pdn->m64_per_iov == 1)
+			pnv_pci_vf_resource_shift(pdev, -pdn->offset);
 
 		/* Release M64 BARs */
 		pnv_pci_vf_release_m64(pdev);
@@ -1176,6 +1253,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 	int                    pe_num;
 	u16                    vf_index;
 	struct pci_dn         *pdn;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
@@ -1223,7 +1301,36 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+	}
 
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++)
+			for (vf_index = vf_group * vf_per_group;
+				vf_index < (vf_group + 1) * vf_per_group &&
+				vf_index < vf_num;
+				vf_index++)
+				for (vf_index1 = vf_group * vf_per_group;
+					vf_index1 < (vf_group + 1) * vf_per_group &&
+					vf_index1 < vf_num;
+					vf_index1++) {
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_ADD_PE_TO_DOMAIN);
+
+					if (rc)
+					    pr_warn("%s: Failed to link same "
+						"group PE#%d(%lld)\n",
+						__func__,
+						pdn->offset + vf_index1, rc);
+			}
 	}
 }
 
@@ -1262,14 +1369,15 @@ try_again:
 		mutex_unlock(&phb->ioda.pe_alloc_mutex);
 
 		/* Assign M64 BAR accordingly */
-		ret = pnv_pci_vf_assign_m64(pdev);
+		ret = pnv_pci_vf_assign_m64(pdev, vf_num);
 		if (ret) {
 			pr_info("No enough M64 resource\n");
 			goto m64_failed;
 		}
 
 		/* Do some magic shift */
-		pnv_pci_vf_resource_shift(pdev, pdn->offset);
+		if (pdn->m64_per_iov == 1)
+			pnv_pci_vf_resource_shift(pdev, pdn->offset);
 	}
 
 	/* Setup VF PEs */
-- 
1.7.9.5


* Re: [PATCH V9 00/18] Enable SRIOV on PowerNV
  2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
                   ` (17 preceding siblings ...)
  2014-11-02 15:41 ` [PATCH V9 18/18] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3 Wei Yang
@ 2014-11-18 23:11 ` Gavin Shan
  2014-11-18 23:40   ` Bjorn Helgaas
  18 siblings, 1 reply; 39+ messages in thread
From: Gavin Shan @ 2014-11-18 23:11 UTC (permalink / raw)
  To: Wei Yang; +Cc: bhelgaas, linux-pci, benh, linuxppc-dev, gwshan

On Sun, Nov 02, 2014 at 11:41:16PM +0800, Wei Yang wrote:

Hello Bjorn,

Did you have available bandwidth to review it? :-)

Thanks,
Gavin

>[... full cover letter quoted; snipped ...]


* Re: [PATCH V9 00/18] Enable SRIOV on PowerNV
  2014-11-18 23:11 ` [PATCH V9 00/18] Enable SRIOV on PowerNV Gavin Shan
@ 2014-11-18 23:40   ` Bjorn Helgaas
  0 siblings, 0 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2014-11-18 23:40 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linux-pci, Wei Yang, Benjamin Herrenschmidt, linuxppc-dev

On Tue, Nov 18, 2014 at 4:11 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
> On Sun, Nov 02, 2014 at 11:41:16PM +0800, Wei Yang wrote:
>
> Hello Bjorn,
>
> Did you have available bandwidth to review it? :-)

I'm working on it right now :)

>>[... full cover letter quoted; snipped ...]


* Re: [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
  2014-11-02 15:41 ` [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface Wei Yang
@ 2014-11-19  1:12   ` Bjorn Helgaas
  2014-11-19  2:15     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2014-11-19  1:12 UTC (permalink / raw)
  To: Wei Yang
  Cc: benh, linux-pci, linuxppc-dev, gwshan, Donald Dutile, Myron Stowe

[+cc Don, Myron]

Hi Wei,

On Sun, Nov 02, 2014 at 11:41:19PM +0800, Wei Yang wrote:
> When retrieving the VF IOV BAR in virtfn_add(), we divide the PF's total IOV
> BAR size by the totalVFs number. This is true for most cases, but may not
> be correct on some specific platforms.
> 
> For example, on the PowerNV platform, in order to fit the PF's IOV BAR to a
> hardware alignment, the PF's IOV BAR size is expanded. This means the original
> method doesn't work.
> 
> This patch introduces a weak pcibios_iov_resource_size() interface, which
> gives the platform a chance to implement a specific method to calculate the
> VF BAR resource size.
> 
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  drivers/pci/iov.c   |   27 +++++++++++++++++++++++++--
>  include/linux/pci.h |    5 +++++
>  2 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4d1685d..6866830 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -61,6 +61,30 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
>  		pci_remove_bus(virtbus);
>  }
>  
> +resource_size_t __weak pcibios_iov_resource_size(struct pci_dev *dev, int resno)
> +{
> +	return 0;
> +}
> +
> +resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
> +{
> +	resource_size_t size;
> +	struct pci_sriov *iov;
> +
> +	if (!dev->is_physfn)
> +		return 0;
> +
> +	size = pcibios_iov_resource_size(dev, resno);
> +	if (size != 0)
> +		return size;
> +
> +	iov = dev->sriov;
> +	size = resource_size(dev->resource + resno);
> +	do_div(size, iov->total_VFs);
> +
> +	return size;
> +}
> +
>  static int virtfn_add(struct pci_dev *dev, int id, int reset)
>  {
>  	int i;
> @@ -96,8 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>  			continue;
>  		virtfn->resource[i].name = pci_name(virtfn);
>  		virtfn->resource[i].flags = res->flags;
> -		size = resource_size(res);
> -		do_div(size, iov->total_VFs);
> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>  		virtfn->resource[i].start = res->start + size * id;

Can you help me understand this?

We have previously called sriov_init() on the PF.  There, we sized the VF
BARs, which are in the PF's SR-IOV Capability (SR-IOV spec sec 3.3.14).
The size we discover is the amount of space required by a single VF, so
sriov_init() adjusts PF->resource[PCI_IOV_RESOURCES + i] by multiplying
that size by PCI_SRIOV_TOTAL_VF, so this PF resource is now big enough to 
hold the VF BAR[i] areas for all the possible VFs.

Now we're in virtfn_add(), setting up a new VF.  The usual BARs at config
space 0x10, etc., are read-only zeroes for a VF, so we don't size them the
usual way.  Instead, we carve out a slice of the
PF->resource[PCI_IOV_RESOURCES + i] area.

I thought the starting address of the VF BARn memory aperture was
prescribed by the spec in sec 2.1.1.1 and shown in figure 2-1:

  BARx VFy starting address = VF BARx + (y - 1) * (VF BARx aperture size)

That's basically what the existing code does.  We don't save the VF BARx
aperture size, so we recompute it by undoing the multiplication we did in
sriov_init().

But you're computing the starting address using a different VF BARx
aperture size.  How does that work?  I assumed this calculation was built
into the PCI device, but obviously I'm missing something.

To make it concrete, here's a made-up example:

  PF SR-IOV Capability
    TotalVFs = 4
    NumVFs = 4
    System Page Size = 4KB
    VF BAR0 = [mem 0x00000000-0x00000fff] (4KB at address 0)

  PF  pci_dev->resource[7] = [mem 0x00000000-0x00003fff] (16KB)
  VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
  VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
  VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
  VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]

You're changing this so we might use 16KB as the VF BAR0 aperture size
instead of 4KB, which would result in the following:

  VF1 pci_dev->resource[0] = [mem 0x00000000-0x00003fff]
  VF2 pci_dev->resource[0] = [mem 0x00004000-0x00007fff]
  VF3 pci_dev->resource[0] = [mem 0x00008000-0x0000bfff]
  VF4 pci_dev->resource[0] = [mem 0x0000c000-0x0000ffff]

But you didn't change sriov_init(), so the PF resource[7] is only 16KB, not
the 64KB the VFs need.  And I assume the VF address decoder is wired to
claim the 4KB regions at 0x0000, 0x1000, 0x2000, 0x3000, not the ones at
0x0000, 0x4000, 0x8000, 0xc000.

Bjorn

>  		virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
>  		rc = request_resource(res, &virtfn->resource[i]);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index bbf8058..2f5b454 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1162,6 +1162,8 @@ resource_size_t pcibios_window_alignment(struct pci_bus *bus,
>  resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev,
>  						 int resno,
>  						 resource_size_t align);
> +resource_size_t pcibios_iov_resource_size(struct pci_dev *dev,
> +		                            int resno);
>  
>  #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
>  #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
> @@ -1666,6 +1668,7 @@ int pci_num_vf(struct pci_dev *dev);
>  int pci_vfs_assigned(struct pci_dev *dev);
>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
> +resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>  #else
>  static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
>  {
> @@ -1685,6 +1688,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
>  { return 0; }
>  static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  { return 0; }
> +static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
> +{ return 0; }
>  #endif
>  
>  #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
> -- 
> 1.7.9.5
> 


* Re: [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
  2014-11-19  1:12   ` Bjorn Helgaas
@ 2014-11-19  2:15     ` Benjamin Herrenschmidt
  2014-11-19  3:21       ` Wei Yang
  0 siblings, 1 reply; 39+ messages in thread
From: Benjamin Herrenschmidt @ 2014-11-19  2:15 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Wei Yang, linux-pci, linuxppc-dev, gwshan, Donald Dutile, Myron Stowe

On Tue, 2014-11-18 at 18:12 -0700, Bjorn Helgaas wrote:
> 
> Can you help me understand this?
> 
> We have previously called sriov_init() on the PF.  There, we sized the VF
> BARs, which are in the PF's SR-IOV Capability (SR-IOV spec sec 3.3.14).
> The size we discover is the amount of space required by a single VF, so
> sriov_init() adjusts PF->resource[PCI_IOV_RESOURCES + i] by multiplying
> that size by PCI_SRIOV_TOTAL_VF, so this PF resource is now big enough to 
> hold the VF BAR[i] areas for all the possible VFs.

So I'll let Richard (Wei) answer on the details but I'll just chime in
about the "big picture". This isn't about changing the spacing between VFs
which is handled by the system page size.

This is about the way we create MMIO windows from the CPU to the VF BARs.

Basically, we have a (limited) set of 64-bit windows we can create that
are divided into equal-sized segments (256 of them), each segment assigned
in HW to one of our Partitionable Endpoints (aka domains).

So even if we only ever create 16 VFs for a device, we need to use one of
these windows in its entirety, which will use 256*VF_size and thus allocate
that much space. Also, the window has to be naturally aligned.

We can then assign the VF BAR to a spot inside that window that corresponds
to the range of PEs that we have assigned to that device (which typically
isn't going to be the beginning of the window).
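
To put rough numbers on it (the per-VF size is made up for illustration):
with a 1M per-VF BAR, such a window consumes 256 * 1M = 256M of naturally
aligned address space, even if only 16 VFs (16M worth) are ever created.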

Cheers,
Ben.


* Re: [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
  2014-11-19  2:15     ` Benjamin Herrenschmidt
@ 2014-11-19  3:21       ` Wei Yang
  2014-11-19  4:26         ` Bjorn Helgaas
  0 siblings, 1 reply; 39+ messages in thread
From: Wei Yang @ 2014-11-19  3:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wei Yang, Myron Stowe, linux-pci, gwshan, Donald Dutile,
	Bjorn Helgaas, linuxppc-dev

On Wed, Nov 19, 2014 at 01:15:32PM +1100, Benjamin Herrenschmidt wrote:
>[... snip ...]

Bjorn & Ben,

Let me try to explain it; thanks to Ben for his helpful explanation. We
are not trying to change the spacing between VFs.

As mentioned by Ben, we use some HW to map the MMIO space to PEs. But the HW
must map 256 equal-sized segments. This leads to a situation like
this:

   +------+------+        +------+------+------+------+
   |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
   +------+------+        +------+------+------+------+

Suppose N = 254 and the HW maps these 256 segments to their corresponding PE#s.
Then it introduces one problem: PF#A and PF#B have already been assigned
to some PE#. We can't map one MMIO range to two different PE#s.
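
Roughly, what the HW does for an MMIO address is something like this
(names invented, just to illustrate the segment mapping):

	u64 seg_size = win_size / 256;		/* 256 equal-sized segments */
	int pe = seg_to_pe[(addr - win_base) / seg_size];

(For a segmented M64 window, the segment number *is* the PE number, so
seg_to_pe[] is just the identity mapping there.)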

What we have done is "expand the IOV BAR" to cover all 256 HW segments.
By doing so, the MMIO range will look like this:

   +------+------+        +------+------+------+------+------+------+
   |VF#0  |VF#1  |   ...  |      |VF#N-1|blank |blank |PF#A  |PF#B  |
   +------+------+        +------+------+------+------+------+------+

We do some trickery to "expand" the IOV BAR, to make sure there is no
overlap between the VFs' PEs and the PFs' PEs.

Then this leads to the IOV BAR size change from:
   
   IOV BAR size = (VF BAR aperture size) * VF_number

 to:
   
   IOV BAR size = (VF BAR aperture size) * 256

This is the reason we need a platform-dependent method to get the VF BAR size.
Otherwise the VF BAR size would not be correct.

Now let's take a look at your example again.

  PF SR-IOV Capability
    TotalVFs = 4
    NumVFs = 4
    System Page Size = 4KB
    VF BAR0 = [mem 0x00000000-0x00000fff] (4KB at address 0)

  PF  pci_dev->resource[7] = [mem 0x00000000-0x00003fff] (16KB)
  VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
  VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
  VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
  VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]

The difference after our expansion is that the IOV BAR size is 256*4KB instead
of 16KB. So it will look like this:

  PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)
  VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
  VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
  VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
  VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
  ...
  and 252 4KB segments are left unused.

So the start address and the size of each VF BAR will not change, but the PF's
IOV BAR will be expanded.

>Cheers,
>Ben.
>

-- 
Richard Yang
Help you, Help me


* Re: [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
  2014-11-19  3:21       ` Wei Yang
@ 2014-11-19  4:26         ` Bjorn Helgaas
  2014-11-19  9:27           ` Wei Yang
  0 siblings, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2014-11-19  4:26 UTC (permalink / raw)
  To: Wei Yang
  Cc: Benjamin Herrenschmidt, linux-pci, linuxppc-dev, gwshan,
	Donald Dutile, Myron Stowe

On Wed, Nov 19, 2014 at 11:21:00AM +0800, Wei Yang wrote:
> [... snip ...]
> 
> Bjorn & Ben,
> 
> Let me try to explain it; thanks to Ben for his helpful explanation. We
> are not trying to change the spacing between VFs.
> 
> As mentioned by Ben, we use some HW to map the MMIO space to PEs.

We need some documentation with pictures about what a PE is.  I did find
this:

https://events.linuxfoundation.org/images/stories/slides/lfcs2013_yang.pdf

which looks like a good start, although there's not quite enough text for
me to understand, and it doesn't have much about MMIO space.

> But the HW
> must map 256 equal-sized segments. This leads to a situation like
> this:
> 
>    +------+------+        +------+------+------+------+
>    |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
>    +------+------+        +------+------+------+------+
> 
> Suppose N = 254 and the HW maps these 256 segments to their corresponding PE#s.

I guess these 256 segments are regions of CPU physical address space, and
they are being mapped to bus address space?  Is there some relationship
between a PE and part of the bus address space?

> Then it introduces one problem: PF#A and PF#B have already been assigned
> to some PE#. We can't map one MMIO range to two different PE#s.
> 
> What we have done is "expand the IOV BAR" to cover all 256 HW segments.
> By doing so, the MMIO range will look like this:
> 
>    +------+------+        +------+------+------+------+------+------+
>    |VF#0  |VF#1  |   ...  |      |VF#N-1|blank |blank |PF#A  |PF#B  |
>    +------+------+        +------+------+------+------+------+------+
> 
> We do some trickery to "expand" the IOV BAR, to make sure there is no
> overlap between the VFs' PEs and the PFs' PEs.

The language here is tricky.  You're not actually *expanding* the IOV BAR.
The IOV BAR is a hardware thing and its size is determined by normal BAR
sizing and the number of VFs.  What you're doing is reserving additional
space for that BAR, and the additional space will be unused.  That's all
fine; we just need a way to describe it accurately.

> Then this leads to the IOV BAR size change from:
>    
>    IOV BAR size = (VF BAR aperture size) * VF_number
> 
>  to:
>    
>    IOV BAR size = (VF BAR aperture size) * 256
> 
> This is the reason we need a platform-dependent method to get the VF BAR size.
> Otherwise the VF BAR size would not be correct.
> 
> Now let's take a look at your example again.
> 
>   PF SR-IOV Capability
>     TotalVFs = 4
>     NumVFs = 4
>     System Page Size = 4KB
>     VF BAR0 = [mem 0x00000000-0x00000fff] (4KB at address 0)
> 
>   PF  pci_dev->resource[7] = [mem 0x00000000-0x00003fff] (16KB)
>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
> 
> The difference after our expansion is that the IOV BAR size is 256*4KB instead
> of 16KB. So it will look like this:
> 
>   PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)

Is the idea that you want this resource to be big enough to cover all 256
segments?  I think I'm OK with increasing the size of the PF resources to
prevent overlap.  That part shouldn't be too ugly.

>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
>   ...
>   and 252 4KB segments are left unused.
> 
> So the start address and the size of each VF BAR will not change, but the
> PF's IOV BAR will be expanded.

I'm really dubious about this change to use pci_iov_resource_size().  I
think you might be doing that because if you increase the PF resource size,
dividing that increased size by total_VFs will give you garbage.  E.g., in
the example above, you would compute "size = 1024KB / 4", which would make
the VF BARs appear to be 256KB instead of 4KB as they should be.

I think it would be better to solve that problem by decoupling the PF
resource size and the VF BAR size.  For example, we could keep track of the
VF BAR size explicitly in struct pci_sriov, instead of computing it from
the PF resource size and total_VFs.  This would keep the VF BAR size
completely platform-independent.
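
Something along these lines, maybe (untested sketch; the field and
constant names here are invented):

	/* drivers/pci/pci.h */
	struct pci_sriov {
		...
		resource_size_t	barsz[PCI_SRIOV_NUM_BARS];  /* VF BAR size */
		...
	};

	/* sriov_init(): record the single-VF size before scaling the PF
	 * resource, so nothing needs to divide by total_VFs afterwards.
	 */
	iov->barsz[i] = resource_size(res);
	res->end = res->start + resource_size(res) * total - 1;

Then pci_iov_resource_size() could simply return the saved barsz[] entry
for the given resno, no matter how much extra space the platform reserves
in the PF resource.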

Bjorn


* Re: [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
  2014-11-19  4:26         ` Bjorn Helgaas
@ 2014-11-19  9:27           ` Wei Yang
  2014-11-19 17:23             ` Bjorn Helgaas
  0 siblings, 1 reply; 39+ messages in thread
From: Wei Yang @ 2014-11-19  9:27 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Wei Yang, Benjamin Herrenschmidt, Myron Stowe, linux-pci, gwshan,
	Donald Dutile, linuxppc-dev

On Tue, Nov 18, 2014 at 09:26:01PM -0700, Bjorn Helgaas wrote:
>On Wed, Nov 19, 2014 at 11:21:00AM +0800, Wei Yang wrote:
>> [... snip ...]
>> 
>> Bjorn & Ben,
>> 
>> Let me try to explain it; thanks to Ben for his helpful explanation. We
>> are not trying to change the spacing between VFs.
>> 
>> As mentioned by Ben, we use some HW to map the MMIO space to PEs.
>
>We need some documentation with pictures about what a PE is.  I did find
>this:
>
>https://events.linuxfoundation.org/images/stories/slides/lfcs2013_yang.pdf
>
>which looks like a good start, although there's not quite enough text for
>me to understand, and it doesn't have much about MMIO space.

Yes, that slide is from 2 years ago and describes the P7 platform.

The current solution is for P8, where we chose a somewhat different
mechanism. In particular, the MMIO manipulation used in the current
implementation was not used back then.

>> But the HW
>> must map 256 equal-sized segments. This leads to a situation like
>> this:
>> 
>>    +------+------+        +------+------+------+------+
>>    |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
>>    +------+------+        +------+------+------+------+
>> 
>> Suppose N = 254 and the HW maps these 256 segments to their corresponding PE#s.
>
>I guess these 256 segments are regions of CPU physical address space, and
>they are being mapped to bus address space?  Is there some relationship
>between a PE and part of the bus address space?
>

A PE is an entity for EEH; it may include a whole bus or one PCI device.

When a device hits an error, we need to identify which PE it belongs to,
so we have some HW to map between PE# and MMIO/DMA/MSI addresses.

The HW mentioned in my previous mail is the one that maps an MMIO address
to a PE#, and this HW must map a range as 256 equal segments. And yes, this
is mapped to bus address space.

>> Then it introduces one problem: PF#A and PF#B have already been assigned
>> to some PE#. We can't map one MMIO range to two different PE#s.
>> 
>> What we have done is "expand the IOV BAR" to cover all 256 HW segments.
>> By doing so, the MMIO range will look like this:
>> 
>>    +------+------+        +------+------+------+------+------+------+
>>    |VF#0  |VF#1  |   ...  |      |VF#N-1|blank |blank |PF#A  |PF#B  |
>>    +------+------+        +------+------+------+------+------+------+
>> 
>> We do some trickery to "expand" the IOV BAR, to make sure there is no
>> overlap between the VFs' PEs and the PFs' PEs.
>
>The language here is tricky.  You're not actually *expanding* the IOV BAR.
>The IOV BAR is a hardware thing and its size is determined by normal BAR
>sizing and the number of VFs.  What you're doing is reserving additional
>space for that BAR, and the additional space will be unused.  That's all
>fine; we just need a way to describe it accurately.
>

Yes, you are right; my wording was not exact.

What I am doing is reserving more space for the IOV BAR. I will make the
changelog more precise in the next version.

>> Then this leads to the IOV BAR size change from:
>>    
>>    IOV BAR size = (VF BAR aperture size) * VF_number
>> 
>>  to:
>>    
>>    IOV BAR size = (VF BAR aperture size) * 256
>> 
>> This is the reason we need a platform-dependent method to get the VF BAR size.
>> Otherwise the VF BAR size would not be correct.
>> 
>> Now let's take a look at your example again.
>> 
>>   PF SR-IOV Capability
>>     TotalVFs = 4
>>     NumVFs = 4
>>     System Page Size = 4KB
>>     VF BAR0 = [mem 0x00000000-0x00000fff] (4KB at address 0)
>> 
>>   PF  pci_dev->resource[7] = [mem 0x00000000-0x00003fff] (16KB)
>>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
>> 
>> The difference after our expansion is that the IOV BAR size is 256*4KB instead
>> of 16KB. So it will look like this:
>> 
>>   PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)
>
>Is the idea that you want this resource to be big enough to cover all 256
>segments?  I think I'm OK with increasing the size of the PF resources to
>prevent overlap.  That part shouldn't be too ugly.
>

Yes, big enough to cover all 256 segments.

Sorry for making it ugly :-(

>>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
>>   ...
>>   and 252 4KB segments are left unused.
>> 
>> So the start address and the size of each VF BAR will not change, but the
>> PF's IOV BAR will be expanded.
>
>I'm really dubious about this change to use pci_iov_resource_size().  I
>think you might be doing that because if you increase the PF resource size,
>dividing that increased size by total_VFs will give you garbage.  E.g., in
>the example above, you would compute "size = 1024KB / 4", which would make
>the VF BARs appear to be 256KB instead of 4KB as they should be.
>

Yes, your understanding is correct.

>I think it would be better to solve that problem by decoupling the PF
>resource size and the VF BAR size.  For example, we could keep track of the
>VF BAR size explicitly in struct pci_sriov, instead of computing it from
>the PF resource size and total_VFs.  This would keep the VF BAR size
>completely platform-independent.
>

Hmm... this is another solution.

If you prefer this one, I will make a change accordingly.

Thanks for your comments :-)

>Bjorn

-- 
Richard Yang
Help you, Help me


* Re: [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
  2014-11-19  9:27           ` Wei Yang
@ 2014-11-19 17:23             ` Bjorn Helgaas
  2014-11-19 20:51               ` Benjamin Herrenschmidt
  2014-11-20  5:39               ` Wei Yang
  0 siblings, 2 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2014-11-19 17:23 UTC (permalink / raw)
  To: Wei Yang
  Cc: Benjamin Herrenschmidt, linux-pci, linuxppc-dev, gwshan,
	Donald Dutile, Myron Stowe

On Wed, Nov 19, 2014 at 05:27:40PM +0800, Wei Yang wrote:
> On Tue, Nov 18, 2014 at 09:26:01PM -0700, Bjorn Helgaas wrote:
> >On Wed, Nov 19, 2014 at 11:21:00AM +0800, Wei Yang wrote:
> >> On Wed, Nov 19, 2014 at 01:15:32PM +1100, Benjamin Herrenschmidt wrote:
> >> >On Tue, 2014-11-18 at 18:12 -0700, Bjorn Helgaas wrote:

> >> But the HW
> >> must map 256 equal-sized segments. This leads to a situation like
> >> this:
> >> 
> >>    +------+------+        +------+------+------+------+
> >>    |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
> >>    +------+------+        +------+------+------+------+
> >> 
> >> Suppose N = 254 and the HW maps these 256 segments to their corresponding PE#s.
> >
> >I guess these 256 segments are regions of CPU physical address space, and
> >they are being mapped to bus address space?  Is there some relationship
> >between a PE and part of the bus address space?
> >
> 
> A PE is an entity for EEH; it may include a whole bus or one PCI device.

Yes, I've read that many times.  What's missing is the connection between a
PE and the things in the PCI specs (buses, devices, functions, MMIO address
space, DMA, MSI, etc.)  Presumably the PE structure imposes constraints on
how the core uses the standard PCI elements, but we don't really have a
clear description of those constraints yet.

> When a device hits an error, we need to identify which PE it belongs to,
> so we have some HW to map between PE# and MMIO/DMA/MSI addresses.
> 
> The HW mentioned in my previous mail is the one that maps an MMIO address
> to a PE#, and this HW must map a range as 256 equal segments. And yes, this
> is mapped to bus address space.
> ...

> >> The difference after our expansion is that the IOV BAR size is 256*4KB instead
> >> of 16KB. So it will look like this:
> >> 
> >>   PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)
> >
> >Is the idea that you want this resource to be big enough to cover all 256
> >segments?  I think I'm OK with increasing the size of the PF resources to
> >prevent overlap.  That part shouldn't be too ugly.
> >
> 
> Yes, big enough to cover all 256 segments.
> 
> Sorry for making it ugly :-(

I didn't mean that what you did was ugly.  I meant that increasing the size
of the PF resource can be done cleanly.

By the way, when you do this, it would be nice if the dmesg showed the
standard PF IOV BAR sizing, and then a separate line showing the resource
expansion to deal with the PE constraints.  I don't think even the standard
output is very clear -- I think we currently get something like this:

  pci 0000:00:00.0 reg 0x174: [mem 0x00000000-0x00000fff]

But that is only the size of a single VF BAR aperture.  Then sriov_init()
multiplies that by the number of possible VFs, but I don't think we print
the overall size of that PF resource.  I think we should, because it's
misleading to print only the smaller piece.  Maybe something like this:

  pci 0000:00:00.0 VF BAR0: [mem 0x00000000-0x00003fff] (for 4 VFs)

And then you could do something like:

  pci 0000:00:00.0 VF BAR0: [mem 0x00000000-0x000fffff] (expanded for PE alignment)

> >>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
> >>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
> >>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
> >>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
> >>   ...
> >>   and 252 4KB space leave not used.
> >> 
> >> So the start address and the size of each VF BAR will not change, but the
> >> PF's IOV BAR will be expanded.
> >
> >I'm really dubious about this change to use pci_iov_resource_size().  I
> >think you might be doing that because if you increase the PF resource size,
> >dividing that increased size by total_VFs will give you garbage.  E.g., in
> >the example above, you would compute "size = 1024KB / 4", which would make
> >the VF BARs appear to be 256KB instead of 4KB as they should be.
> 
> Yes, your understanding is correct.
> 
> >I think it would be better to solve that problem by decoupling the PF
> >resource size and the VF BAR size.  For example, we could keep track of the
> >VF BAR size explicitly in struct pci_sriov, instead of computing it from
> >the PF resource size and total_VFs.  This would keep the VF BAR size
> >completely platform-independent.
> 
> Hmm... this is another solution.
> 
> If you prefer this one, I will make a change accordingly.

Yes, I definitely prefer to track the VF BAR size explicitly.  I think that
will make the code much clearer.

Bjorn


* Re: [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
  2014-11-19 17:23             ` Bjorn Helgaas
@ 2014-11-19 20:51               ` Benjamin Herrenschmidt
  2014-11-20  5:40                 ` Wei Yang
  2014-11-20  5:39               ` Wei Yang
  1 sibling, 1 reply; 39+ messages in thread
From: Benjamin Herrenschmidt @ 2014-11-19 20:51 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Wei Yang, linux-pci, linuxppc-dev, gwshan, Donald Dutile, Myron Stowe

On Wed, 2014-11-19 at 10:23 -0700, Bjorn Helgaas wrote:
> 
> Yes, I've read that many times.  What's missing is the connection between a
> PE and the things in the PCI specs (buses, devices, functions, MMIO address
> space, DMA, MSI, etc.)  Presumably the PE structure imposes constraints on
> how the core uses the standard PCI elements, but we don't really have a
> clear description of those constraints yet.

Right, a "PE" is in fact a HW concept in our bridges; it is essentially a
shared isolation state between DMA, MMIO, MSIs, PCIe error messages, ...
for a given "domain" or set of PCI functions.

The techniques by which the HW resources are mapped to PEs, and the
associated constraints, differ slightly from one generation of our chips to
the next. In general, P7 follows an architecture known as "IODA" and P8
"IODA2". I'm trying to get that spec made available via OpenPower but
that hasn't happened yet.

In this case we mostly care about IODA2 (P8), so I'll give a quick
description here. Wei, feel free to copy/paste that into a bit of doco
to throw into Documentation/powerpc/ along with your next spin of the patch.

The concept of "PE" is a way to group the various resources associated
with a device or a set of devices to provide isolation between partitions
(ie. filtering of DMA, MSIs etc...) and to provide a mechanism to freeze
a device that is causing errors in order to limit the possibility of
propagation of bad data.

There is thus, in HW, a table of "PE" states that contains a pair of
"frozen" state bits (one for MMIO and one for DMA, they get set together
but can be cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return an all-1's value. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze etc...
but that's not critical.
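
Conceptually, per PE, something like this (illustrative only, not the
real HW layout):

	struct pe_state {
		bool	mmio_frozen;	/* loads return 1's, stores dropped */
		bool	dma_frozen;	/* set together with mmio_frozen,
					 * but cleared independently */
	};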

The interesting part is how the various types of PCIe transactions (MMIO,
DMA,...) are matched to their corresponding PEs.

I will provide a rough description of what we have on P8 (IODA2). Keep
in mind that this is all per PHB (host bridge). Each PHB is a completely
separate HW entity which replicates the entire logic, and so has its own
set of PEs, etc.

First, P8 has 256 PEs per PHB.

 * Inbound

For DMA, MSIs and inbound PCIe error messages, we have a table (in memory but
accessed in HW by the chip) that provides a direct correspondence between
a PCIe RID (bus/dev/fn) and a PE number. We call this the RTT. (A small
conceptual sketch follows this list.)

 - For DMA we then provide an entire address space for each PE that can contain
two "windows", selected by the value of PCI address bit 59. Each window can then
be configured to be remapped via a "TCE table" (iommu translation table), which
has various configurable characteristics which we can describe another day.

 - For MSIs, we have two windows in the address space (one at the top of the 32-bit
space and one much higher) which, via a combination of the address and MSI value,
will result in one of the 2048 interrupts per bridge being triggered. There's
a PE value in the interrupt controller descriptor table as well which is compared
with the PE obtained from the RTT to "authorize" the device to emit that specific
interrupt.

 - Error messages just use the RTT.
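
Conceptually the inbound matching is just a table lookup; in pseudo-C
(purely illustrative, the names here are made up):

	/* Illustration only: inbound matching on IODA2. The RTT maps a
	 * 16-bit PCIe RID (bus/dev/fn) directly to one of the 256 PE
	 * numbers; the chip walks this in-memory table on our behalf. */
	static u8 rtt[1 << 16];			/* RID -> PE# */

	static u8 pe_for_rid(u8 bus, u8 devfn)
	{
		u16 rid = (bus << 8) | devfn;	/* standard RID encoding */

		return rtt[rid];
	}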

 * Outbound. That's where the tricky part is.

The PHB basically has a concept of "windows" from the CPU address space to the
PCI address space. There is one M32 window and 16 M64 windows. They have different
characteristics. First, what they have in common: they are configured to forward a
configurable portion of the CPU address space to the PCIe bus, and must be a
naturally aligned power of two in size. The rest is different:

  - The M32 window:

    * It is limited to 4G in size

    * It drops the top bits of the address (above the size) and replaces them with
a configurable value. This is typically used to generate 32-bit PCIe accesses. We
configure that window at boot from FW and don't touch it from Linux; it's usually
set to forward a 2G portion of address space from the CPU to PCIe,
0x8000_0000..0xffff_ffff. (Note: the top 64K are actually reserved for MSIs but
this is not a problem at this point; we just need to ensure Linux doesn't assign
anything there. The M32 logic ignores that, however, and will forward into that
space if we try.)

    * It is divided into 256 segments of equal size. A table in the chip provides
a PE# for each of these 256 segments. That essentially allows portions of the MMIO
space to be assigned to PEs at segment granularity. For a 2G window, a segment is 8M.

Now, this is the "main" window we use in Linux today (excluding SR-IOV). We
basically use the trick of forcing the bridge MMIO windows onto a segment
alignment/granularity so that the space behind a bridge can be assigned to a PE.

Ideally we would like to be able to have individual functions in PEs, but that
would mean using a completely different address allocation scheme, where individual
function BARs can be "grouped" to fit in one or more segments....
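
To make the segmenting concrete, the PE lookup for an M32 access boils down
to something like this (again purely illustrative pseudo-C, not real code):

	/* 2G M32 window split into 256 equal segments of 8M each; SW fills
	 * in the per-segment PE# table, HW does the equivalent lookup. */
	#define M32_SIZE	0x80000000UL			/* 2G */
	#define M32_SEGMENTS	256
	#define M32_SEG_SIZE	(M32_SIZE / M32_SEGMENTS)	/* 8M */

	static u8 m32_segment_pe[M32_SEGMENTS];

	/* 'offset' is the offset of the access within the M32 window */
	static u8 pe_for_m32_offset(unsigned long offset)
	{
		return m32_segment_pe[offset / M32_SEG_SIZE];
	}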

 - The M64 windows.

   * Their smallest size is 1M

   * They do not translate addresses (the address on PCIe is the same as the
address on the PowerBus; there is a way to also set the top 14 bits, which are
not conveyed by the PowerBus, but we don't use this).

   * They can be configured to be segmented or not. When segmented, they have
256 segments, but they are not remapped: the segment number *is* the PE
number. When not segmented, the PE number can be specified for the entire
window.

   * They support overlaps, in which case there is a well-defined ordering of
matching (I don't remember offhand whether the lower- or higher-numbered
window takes priority, but basically it's well defined).

We have code (fairly new compared to the M32 stuff) that exploits that for
large BARs in 64-bit space:

We create a single big M64 that covers the entire region of address space that
has been assigned by FW for the PHB (about 64G; ignore the space for the M32,
it comes out of a different "reserve"). We configure that window as segmented.

Then we do the same thing as with M32, using the bridge alignment trick, to
match to those giant segments.

Since we cannot remap, we have two additional constraints:

  - We do the PE# allocation *after* the 64-bit space has been assigned, since
the segments used directly determine the PE#. We then "update" the M32 PE#
for the devices that use both 32-bit and 64-bit spaces, or assign the remaining
PE#s to 32-bit-only devices.

  - We cannot "group" segments in HW so if a device ends up using more than
one segment, we end up with more than one PE#. There is a HW mechanism to
make the freeze state cascade to "companion" PEs but that only work for PCIe
error messages (typically used so that if you freeze a switch, it freezes all
its children). So we do it in SW. We lose a bit of effectiveness of EEH in that
case, but that's the best we found. So when any of the PEs freezes, we freeze
the other ones for that "domain". We thus introduce the concept of "master PE"
which is the one used for DMA, MSIs etc... and "secondary PEs" that are used
for the remaining M64 segments.

We would like to investigate using additional M64s in "single PE" mode to
overlay specific BARs to work around some of that; for devices with very
large BARs (some GPUs), for example, it would make sense, but we haven't
done it yet.

Finally, the plan to use M64s for SR-IOV, which we described a bit already,
consists of using those M64s. So for a given IOV BAR, we need to effectively
reserve the entire 256 segments (256 * IOV BAR size) and then "position" the
BAR to start at the beginning of a free range of segments/PEs inside that M64.

The goal is of course to be able to give a separate PE for each VF...
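
In pseudo-C, the resource arithmetic that implies is roughly the following
(just an illustration of the scheme, not the patch code):

	/* A segmented M64 has 256 segments, so the window covering an IOV
	 * BAR must span 256 * (single VF BAR size), whatever NumVFs ends
	 * up being. */
	static resource_size_t iov_bar_reserved_size(resource_size_t vf_bar_size)
	{
		return 256 * vf_bar_size;
	}

	/* "Position" the IOV BAR so that VF0 lands in segment (== PE#)
	 * 'first_free_pe' of the M64 window starting at 'm64_base'. */
	static resource_size_t iov_bar_start(resource_size_t m64_base,
					     resource_size_t vf_bar_size,
					     unsigned int first_free_pe)
	{
		return m64_base + (resource_size_t)first_free_pe * vf_bar_size;
	}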

I hope that helps clarify things a bit ...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V9 08/18] powrepc/pci: Refactor pci_dn
  2014-11-02 15:41 ` [PATCH V9 08/18] powrepc/pci: Refactor pci_dn Wei Yang
@ 2014-11-19 23:30   ` Bjorn Helgaas
  2014-11-20  1:02     ` Gavin Shan
  2014-11-20  7:20     ` Wei Yang
  0 siblings, 2 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2014-11-19 23:30 UTC (permalink / raw)
  To: Wei Yang; +Cc: linux-pci, benh, linuxppc-dev, gwshan

On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:
> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
> 
> pci_dn is the extension of PCI device node and it's created from
> device node. Unfortunately, VFs that are enabled dynamically by
> PF's driver and they don't have corresponding device nodes, and
> pci_dn. The patch refactors pci_dn to support VFs:
> 
>    * pci_dn is organized as a hierarchy tree. VF's pci_dn is put
>      to the child list of pci_dn of PF's bridge. pci_dn of other
>      device put to the child list of pci_dn of its upstream bridge.
> 
>    * VF's pci_dn is expected to be created dynamically when applying
>      final fixup to PF. VF's pci_dn will be destroyed when releasing
>      PF's pci_dev instance. pci_dn of other device is still created
>      from device node as before.
> 
>    * For one particular PCI device (VF or not), its pci_dn can be
>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
>      or parent's list. The fast path (fetching pci_dn through PCI
>      device instance) is populated during early fixup time.
> 
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
> ...

> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
> +{
> +#ifdef CONFIG_PCI_IOV
> +	struct pci_dn *parent, *pdn;
> +	int i;
> +
> +	/* Only support IOV for now */
> +	if (!pdev->is_physfn)
> +		return pci_get_pdn(pdev);
> +
> +	/* Check if VFs have been populated */
> +	pdn = pci_get_pdn(pdev);
> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
> +		return NULL;
> +
> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
> +	parent = pci_bus_to_pdn(pdev->bus);
> +	if (!parent)
> +		return NULL;
> +
> +	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
> +		pdn = add_one_dev_pci_info(parent, NULL,
> +					   pci_iov_virtfn_bus(pdev, i),
> +					   pci_iov_virtfn_devfn(pdev, i));

I'm not sure this makes sense, but I certainly don't know this code, so
maybe I'm missing something.

pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on
pdev->sriov->stride and pdev->sriov->offset.  These are read from VF Stride
and First VF Offset in the SR-IOV capability by sriov_init(), which is
called before add_dev_pci_info():

  pci_scan_child_bus
    pci_scan_slot
      pci_scan_single_device
	pci_device_add
	  pci_init_capabilities
	    pci_iov_init(PF)
	      sriov_init(PF, pos)
		pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0)
		pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset)
		pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride)
		iov->offset = offset
		iov->stride = stride

  pci_bus_add_devices
    pci_bus_add_device
      pci_fixup_device(pci_fixup_final)
	add_dev_pci_info
	  pci_iov_virtfn_bus
	    return ... + sriov->offset + (sriov->stride * id) ...

But both First VF Offset and VF Stride change when ARI Capable Hierarchy or
NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10).  We set NumVFs to zero in
sriov_init() above.  We will change NumVFs to something different when a
driver calls pci_enable_sriov():

  pci_enable_sriov
    sriov_enable
      pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn)

Now First VF Offset and VF Stride have changed from what they were when we
called pci_iov_virtfn_bus() above.

> +		if (!pdn) {
> +			pr_warn("%s: Cannot create firmware data "
> +				"for VF#%d of %s\n",
> +				__func__, i, pci_name(pdev));
> +			return NULL;
> +		}
> +	}
> +#endif
> +
> +	return pci_get_pdn(pdev);
> +}
> ...

> +static void pci_dev_pdn_create(struct pci_dev *pdev)
> +{
> +	add_dev_pci_info(pdev);
> +}
> +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_create);

There are no other callers of add_dev_pci_info(), so it seems pointless to
declare it in arch/powerpc/include/asm/pci-bridge.h and add this wrapper
around it.

> +
> +static void pci_dev_pdn_setup(struct pci_dev *pdev)
> +{
> +	struct pci_dn *pdn;
> +
> +	if (pdev->dev.archdata.firmware_data)
> +		return;
> +
> +	/* Setup the fast path */
> +	pdn = pci_get_pdn(pdev);
> +	pdev->dev.archdata.firmware_data = pdn;
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
> -- 
> 1.7.9.5
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V9 01/18] PCI/IOV: Export interface for retrieve VF's BDF
  2014-11-02 15:41 ` [PATCH V9 01/18] PCI/IOV: Export interface for retrieve VF's BDF Wei Yang
@ 2014-11-19 23:35   ` Bjorn Helgaas
  0 siblings, 0 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2014-11-19 23:35 UTC (permalink / raw)
  To: Wei Yang; +Cc: linux-pci, benh, linuxppc-dev, gwshan

On Sun, Nov 02, 2014 at 11:41:17PM +0800, Wei Yang wrote:
> When implementing the SR-IOV on PowerNV platform, some resource reservation is
> needed for VFs which don't exist at the bootup stage. To do the match between
> resources and VFs, the code need to get the VF's BDF in advance.
> 
> In this patch, it exports the interface to retrieve VF's BDF:
>    * Make the virtfn_bus as an interface
>    * Make the virtfn_devfn as an interface
>    * Rename them with more specific name
>    * Code cleanup in pci_sriov_resource_alignment()
> 
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  drivers/pci/iov.c   |   22 +++++++++++++---------
>  include/linux/pci.h |   11 +++++++++++
>  2 files changed, 24 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4d109c0..5e8091b 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -19,14 +19,18 @@
>  
>  #define VIRTFN_ID_LEN	16
>  
> -static inline u8 virtfn_bus(struct pci_dev *dev, int id)
> +int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
>  {
> +	if (!dev->is_physfn)
> +		return -EINVAL;
>  	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
>  				    dev->sriov->stride * id) >> 8);
>  }
>  
> -static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
>  {
> +	if (!dev->is_physfn)
> +		return -EINVAL;
>  	return (dev->devfn + dev->sriov->offset +
>  		dev->sriov->stride * id) & 0xff;
>  }

I'm concerned about exporting these because they depend on First VF Offset
and VF Stride from the SR-IOV Capability, and those values change when the
ARI Capability Hierarchy setting or the NumVFs setting change (SR-IOV spec
sec 3.3.9, 3.3.10).  The caller doesn't necessarily know about this
connection and may not be able to deal with the change.

I outlined one possible problem with this in patch 08/18.

Bjorn

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V9 08/18] powrepc/pci: Refactor pci_dn
  2014-11-19 23:30   ` Bjorn Helgaas
@ 2014-11-20  1:02     ` Gavin Shan
  2014-11-20  7:25       ` Wei Yang
  2014-11-20  7:20     ` Wei Yang
  1 sibling, 1 reply; 39+ messages in thread
From: Gavin Shan @ 2014-11-20  1:02 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev, gwshan

On Wed, Nov 19, 2014 at 04:30:24PM -0700, Bjorn Helgaas wrote:
>On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:
>> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> 
>> pci_dn is the extension of PCI device node and it's created from
>> device node. Unfortunately, VFs that are enabled dynamically by
>> PF's driver and they don't have corresponding device nodes, and
>> pci_dn. The patch refactors pci_dn to support VFs:
>> 
>>    * pci_dn is organized as a hierarchy tree. VF's pci_dn is put
>>      to the child list of pci_dn of PF's bridge. pci_dn of other
>>      device put to the child list of pci_dn of its upstream bridge.
>> 
>>    * VF's pci_dn is expected to be created dynamically when applying
>>      final fixup to PF. VF's pci_dn will be destroyed when releasing
>>      PF's pci_dev instance. pci_dn of other device is still created
>>      from device node as before.
>> 
>>    * For one particular PCI device (VF or not), its pci_dn can be
>>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
>>      or parent's list. The fast path (fetching pci_dn through PCI
>>      device instance) is populated during early fixup time.
>> 
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>> ...
>
>> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
>> +{
>> +#ifdef CONFIG_PCI_IOV
>> +	struct pci_dn *parent, *pdn;
>> +	int i;
>> +
>> +	/* Only support IOV for now */
>> +	if (!pdev->is_physfn)
>> +		return pci_get_pdn(pdev);
>> +
>> +	/* Check if VFs have been populated */
>> +	pdn = pci_get_pdn(pdev);
>> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
>> +		return NULL;
>> +
>> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
>> +	parent = pci_bus_to_pdn(pdev->bus);
>> +	if (!parent)
>> +		return NULL;
>> +
>> +	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
>> +		pdn = add_one_dev_pci_info(parent, NULL,
>> +					   pci_iov_virtfn_bus(pdev, i),
>> +					   pci_iov_virtfn_devfn(pdev, i));
>
>I'm not sure this makes sense, but I certainly don't know this code, so
>maybe I'm missing something.
>

For ARI, Richard had some patches to fix the issue from firmware side.

>pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on
>pdev->sriov->stride and pdev->sriov->offset.  These are read from VF Stride
>and First VF Offset in the SR-IOV capability by sriov_init(), which is
>called before add_dev_pci_info():
>
>  pci_scan_child_bus
>    pci_scan_slot
>      pci_scan_single_device
>	pci_device_add
>	  pci_init_capabilities
>	    pci_iov_init(PF)
>	      sriov_init(PF, pos)
>		pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0)
>		pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset)
>		pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride)
>		iov->offset = offset
>		iov->stride = stride
>
>  pci_bus_add_devices
>    pci_bus_add_device
>      pci_fixup_device(pci_fixup_final)
>	add_dev_pci_info
>	  pci_iov_virtfn_bus
>	    return ... + sriov->offset + (sriov->stride * id) ...
>
>But both First VF Offset and VF Stride change when ARI Capable Hierarchy or
>NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10).  We set NumVFs to zero in
>sriov_init() above.  We will change NumVFs to something different when a
>driver calls pci_enable_sriov():
>
>  pci_enable_sriov
>    sriov_enable
>      pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn)
>
>Now First VF Offset and VF Stride have changed from what they were when we
>called pci_iov_virtfn_bus() above.
>

This is the case we missed: First VF Offset and VF Stride can change when
the PF's number of VFs is changed. It means the BDFN (Bus/Device/Function
Number) for a VF can't be determined until the PF's number of VFs has been
populated and updated in HW (i.e. before calling virtfn_add()).

The dynamically created pci_dn is currently used in the PCI config accessors.
That means we have to get it ready before the first PCI config request to the
VF in pci_setup_device(). In an old revision of the code, we had a weak
function called from pci_alloc_dev(), which gave the platform a chance to
create the pci_dn. I think we have to switch back to the old way in order to
fix the problem you caught. However, the old way comes at the cost of one
more weak function, which you're probably unhappy to see.

  sriov_enable()
    virtfn_add()
      virtfn_add_bus()
      pci_alloc_dev()
      pci_setup_device()

>> +		if (!pdn) {
>> +			pr_warn("%s: Cannot create firmware data "
>> +				"for VF#%d of %s\n",
>> +				__func__, i, pci_name(pdev));
>> +			return NULL;
>> +		}
>> +	}
>> +#endif
>> +
>> +	return pci_get_pdn(pdev);
>> +}
>> ...
>
>> +static void pci_dev_pdn_create(struct pci_dev *pdev)
>> +{
>> +	add_dev_pci_info(pdev);
>> +}
>> +DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_create);
>
>There are no other callers of add_dev_pci_info(), so it seems pointless to
>declare it in arch/powerpc/include/asm/pci-bridge.h and add this wrapper
>around it.
>

Yep and will fix.

Thanks,
Gavin

>> +
>> +static void pci_dev_pdn_setup(struct pci_dev *pdev)
>> +{
>> +	struct pci_dn *pdn;
>> +
>> +	if (pdev->dev.archdata.firmware_data)
>> +		return;
>> +
>> +	/* Setup the fast path */
>> +	pdn = pci_get_pdn(pdev);
>> +	pdev->dev.archdata.firmware_data = pdn;
>> +}
>> +DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
>> -- 
>> 1.7.9.5
>> 
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
  2014-11-19 17:23             ` Bjorn Helgaas
  2014-11-19 20:51               ` Benjamin Herrenschmidt
@ 2014-11-20  5:39               ` Wei Yang
  1 sibling, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-20  5:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Wei Yang, Benjamin Herrenschmidt, Myron Stowe, linux-pci, gwshan,
	Donald Dutile, linuxppc-dev

On Wed, Nov 19, 2014 at 10:23:50AM -0700, Bjorn Helgaas wrote:
>On Wed, Nov 19, 2014 at 05:27:40PM +0800, Wei Yang wrote:
>> On Tue, Nov 18, 2014 at 09:26:01PM -0700, Bjorn Helgaas wrote:
>> >On Wed, Nov 19, 2014 at 11:21:00AM +0800, Wei Yang wrote:
>> >> On Wed, Nov 19, 2014 at 01:15:32PM +1100, Benjamin Herrenschmidt wrote:
>> >> >On Tue, 2014-11-18 at 18:12 -0700, Bjorn Helgaas wrote:
>
>> >> But the HW
>> >> must map 256 segments with the same size. This will lead a situation like
>> >> this.
>> >> 
>> >>    +------+------+        +------+------+------+------+
>> >>    |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
>> >>    +------+------+        +------+------+------+------+
>> >> 
>> >> Suppose N = 254 and the HW map these 256 segments to their corresponding PE#.
>> >
>> >I guess these 256 segments are regions of CPU physical address space, and
>> >they are being mapped to bus address space?  Is there some relationship
>> >between a PE and part of the bus address space?
>> >
>> 
>> PE is an entity for EEH, which may include a whole bus or one pci device.
>
>Yes, I've read that many times.  What's missing is the connection between a
>PE and the things in the PCI specs (buses, devices, functions, MMIO address
>space, DMA, MSI, etc.)  Presumably the PE structure imposes constraints on
>how the core uses the standard PCI elements, but we don't really have a
>clear description of those constraints yet.
>
>> When some device got some error, we need to identify which PE it belongs to.
>> So we have some HW to map between PE# and MMIO/DMA/MSI address.
>> 
>> The HW mentioned in previous letter is the one to map MMIO address to a PE#.
>> While this HW must map a range with 256 equal segments. And yes, this is
>> mapped to bus address space.
>> ...
>
>> >> The difference after our expanding is the IOV BAR size is 256*4KB instead of
>> >> 16KB. So it will look like this:
>> >> 
>> >>   PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)
>> >
>> >Is the idea that you want this resource to be big enough to cover all 256
>> >segments?  I think I'm OK with increasing the size of the PF resources to
>> >prevent overlap.  That part shouldn't be too ugly.
>> >
>> 
>> Yes, big enough to cover all 256 segments.
>> 
>> Sorry for making it ugly :-(
>
>I didn't mean that what you did was ugly.  I meant that increasing the size
>of the PF resource can be done cleanly.
>
>By the way, when you do this, it would be nice if the dmesg showed the
>standard PF IOV BAR sizing, and then a separate line showing the resource
>expansion to deal with the PE constraints.  I don't think even the standard
>output is very clear -- I think we currently get something like this:
>
>  pci 0000:00:00.0 reg 0x174: [mem 0x00000000-0x00000fff]
>
>But that is only the size of a single VF BAR aperture.  Then sriov_init()
>multiplies that by the number of possible VFs, but I don't think we print
>the overall size of that PF resource.  I think we should, because it's
>misleading to print only the smaller piece.  Maybe something like this:
>
>  pci 0000:00:00.0 VF BAR0: [mem 0x00000000-0x00003fff] (for 4 VFs)
>
>And then you could do something like:
>
>  pci 0000:00:00.0 VF BAR0: [mem 0x00000000-0x000fffff] (expanded for PE alignment)
>

Got it, I will add messages to reflect this.
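
Maybe something along these lines (just a sketch, not final code; the
helper name and where exactly it is called from are still to be decided):

	/* Sketch: report the overall VF BAR aperture for BAR index 'i',
	 * roughly as suggested above. */
	static void sriov_report_vf_bar(struct pci_dev *dev, int i, u16 total_VFs)
	{
		struct resource *res = &dev->resource[i + PCI_IOV_RESOURCES];

		dev_info(&dev->dev, "VF BAR%d: %pR (for %d VFs)\n",
			 i, res, total_VFs);
	}

and on powernv a second line after the expansion, e.g.
"VF BAR%d: %pR (expanded for PE alignment)".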

>> >>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>> >>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>> >>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>> >>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
>> >>   ...
>> >>   and 252 4KB space leave not used.
>> >> 
>> >> So the start address and the size of VF will not change, but the PF's IOV BAR
>> >> will be expanded.
>> >
>> >I'm really dubious about this change to use pci_iov_resource_size().  I
>> >think you might be doing that because if you increase the PF resource size,
>> >dividing that increased size by total_VFs will give you garbage.  E.g., in
>> >the example above, you would compute "size = 1024KB / 4", which would make
>> >the VF BARs appear to be 256KB instead of 4KB as they should be.
>> 
>> Yes, your understanding is correct.
>> 
>> >I think it would be better to solve that problem by decoupling the PF
>> >resource size and the VF BAR size.  For example, we could keep track of the
>> >VF BAR size explicitly in struct pci_sriov, instead of computing it from
>> >the PF resource size and total_VFs.  This would keep the VF BAR size
>> >completely platform-independent.
>> 
>> Hmm... this is another solution.
>> 
>> If you prefer this one, I will make a change accordingly.
>
>Yes, I definitely prefer to track the VF BAR size explicitly.  I think that
>will make the code much clearer.

Got it.

>
>Bjorn

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface
  2014-11-19 20:51               ` Benjamin Herrenschmidt
@ 2014-11-20  5:40                 ` Wei Yang
  0 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-20  5:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wei Yang, Myron Stowe, linux-pci, gwshan, Donald Dutile,
	Bjorn Helgaas, linuxppc-dev

On Thu, Nov 20, 2014 at 07:51:40AM +1100, Benjamin Herrenschmidt wrote:
>On Wed, 2014-11-19 at 10:23 -0700, Bjorn Helgaas wrote:
>> 
>> Yes, I've read that many times.  What's missing is the connection between a
>> PE and the things in the PCI specs (buses, devices, functions, MMIO address
>> space, DMA, MSI, etc.)  Presumably the PE structure imposes constraints on
>> how the core uses the standard PCI elements, but we don't really have a
>> clear description of those constraints yet.
>
>Right, a "PE" is a HW concept in fact in our bridges, that essentially is
>a shared isolation state between DMA, MMIO, MSIs, PCIe error messages,...
>for a given "domain" or set of PCI functions.
>
>The techniques of how the HW resources are mapped to PE and associated
>constraints are slightly different from one generation of our chips to
>the next. In general, P7 follows an architecture known as "IODA" and P8
>"IODA2". I'm trying to get that spec made available via OpenPower but
>that hasn't happened yet.
>
>In this case we mostly care about IODA2 (P8), so I'll give a quick
>description here. Wei, feel free to copy/paste that into a bit of doco
>to throw into Documentation/powerpc/ along with your next spin of the patch.
>

Got it.

I will add more description in the Documentation/powerpc/ directory.


-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V9 08/18] powrepc/pci: Refactor pci_dn
  2014-11-19 23:30   ` Bjorn Helgaas
  2014-11-20  1:02     ` Gavin Shan
@ 2014-11-20  7:20     ` Wei Yang
  2014-11-20 19:05       ` Bjorn Helgaas
  1 sibling, 1 reply; 39+ messages in thread
From: Wei Yang @ 2014-11-20  7:20 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev, gwshan

On Wed, Nov 19, 2014 at 04:30:24PM -0700, Bjorn Helgaas wrote:
>On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:
>> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> 
>> pci_dn is the extension of PCI device node and it's created from
>> device node. Unfortunately, VFs that are enabled dynamically by
>> PF's driver and they don't have corresponding device nodes, and
>> pci_dn. The patch refactors pci_dn to support VFs:
>> 
>>    * pci_dn is organized as a hierarchy tree. VF's pci_dn is put
>>      to the child list of pci_dn of PF's bridge. pci_dn of other
>>      device put to the child list of pci_dn of its upstream bridge.
>> 
>>    * VF's pci_dn is expected to be created dynamically when applying
>>      final fixup to PF. VF's pci_dn will be destroyed when releasing
>>      PF's pci_dev instance. pci_dn of other device is still created
>>      from device node as before.
>> 
>>    * For one particular PCI device (VF or not), its pci_dn can be
>>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
>>      or parent's list. The fast path (fetching pci_dn through PCI
>>      device instance) is populated during early fixup time.
>> 
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>> ...
>
>> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
>> +{
>> +#ifdef CONFIG_PCI_IOV
>> +	struct pci_dn *parent, *pdn;
>> +	int i;
>> +
>> +	/* Only support IOV for now */
>> +	if (!pdev->is_physfn)
>> +		return pci_get_pdn(pdev);
>> +
>> +	/* Check if VFs have been populated */
>> +	pdn = pci_get_pdn(pdev);
>> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
>> +		return NULL;
>> +
>> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
>> +	parent = pci_bus_to_pdn(pdev->bus);
>> +	if (!parent)
>> +		return NULL;
>> +
>> +	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
>> +		pdn = add_one_dev_pci_info(parent, NULL,
>> +					   pci_iov_virtfn_bus(pdev, i),
>> +					   pci_iov_virtfn_devfn(pdev, i));
>
>I'm not sure this makes sense, but I certainly don't know this code, so
>maybe I'm missing something.
>
>pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on
>pdev->sriov->stride and pdev->sriov->offset.  These are read from VF Stride
>and First VF Offset in the SR-IOV capability by sriov_init(), which is
>called before add_dev_pci_info():
>
>  pci_scan_child_bus
>    pci_scan_slot
>      pci_scan_single_device
>	pci_device_add
>	  pci_init_capabilities
>	    pci_iov_init(PF)
>	      sriov_init(PF, pos)
>		pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0)
>		pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset)
>		pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride)
>		iov->offset = offset
>		iov->stride = stride
>
>  pci_bus_add_devices
>    pci_bus_add_device
>      pci_fixup_device(pci_fixup_final)
>	add_dev_pci_info
>	  pci_iov_virtfn_bus
>	    return ... + sriov->offset + (sriov->stride * id) ...
>
>But both First VF Offset and VF Stride change when ARI Capable Hierarchy or
>NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10).  We set NumVFs to zero in
>sriov_init() above.  We will change NumVFs to something different when a
>driver calls pci_enable_sriov():
>
>  pci_enable_sriov
>    sriov_enable
>      pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn)
>
>Now First VF Offset and VF Stride have changed from what they were when we
>called pci_iov_virtfn_bus() above.
>

Oops, I saw that ARI would affect those values, but missed that NumVFs
would as well.

Let's look at the problems one by one.

1. The ARI capability.
===============================================================================
The kernel initializes the capability like this:

pci_init_capabilities()
	pci_configure_ari()
	pci_iov_init()
		iov->offset = offset
		iov->stride = stride

When the offset/stride are retrieved at this point, the ARI capability has
already been taken into consideration.

2. The PF's NumVFs field
===============================================================================
2.1 Potential problem in current code
===============================================================================
First, does the current PCI code have a potential problem?

sriov_enable()
	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset);
	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride);
	iov->offset = offset;
	iov->stride = stride;
	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
	virtfn_add()
		...
		virtfn->devfn = pci_iov_virtfn_devfn(dev, id);

sriov_enable() retrieves the offset/stride and then writes NumVFs. According
to the spec, the offset/stride may change at that moment, but I don't see any
code that retrieves and stores those values again, and these fields will then
be used in virtfn_add().

If my understanding is correct, I suggest moving the retrieve-and-store
operation to after NumVFs is written.

2.2 The IOV bus range may not be correct in pci_scan_child_bus()?
===============================================================================
In the current PCI core, when enumerating the PCI tree, we do this:

pci_scan_child_bus()
	pci_scan_slot()
		pci_scan_single_device()
			pci_device_add()
				pci_init_capabilities()
					pci_iov_init()
	max += pci_iov_bus_range(bus);
		busnr = pci_iov_virtfn_bus(dev, dev->sriov->total_VFs - 1);
	max = pci_scan_bridge(bus, dev, max, pass);

From this point, we see the PCI core reserves some bus range for VFs. This
calculation is based on the offset/stride at that moment, and the
enumeration is done with the new bus numbers.

sriov_enable() could be called several times by the driver to enable SR-IOV,
each time with a different nr_virtfn. If the offset/stride change every time
NumVFs is written, we may try to address an extra bus we didn't reserve.
Or does this mean it is out of control?

Am I missing something?

2.3 How can I reserve bus range in FW?
===============================================================================
This question comes from the previous one.

Based on my understanding, the current PCI core relies on the bus numbers in
HW if pcibios_assign_all_busses() is not set. If we want to support VFs that
sit on a different bus from the PF, we need to reserve the bus range and
write the correct secondary/subordinate numbers in the bridge. Otherwise,
VFs on a different bus may not be addressable.

Currently I am writing code in FW to reserve the range with the same
mechanism as the PCI core. But since, as you mentioned, the offset/stride
may change after sriov_enable(), I am not sure whether this is the correct way.

2.4 The potential problem for [Patch 08/18]
===============================================================================
According to the spec, the offset/stride may change after each
sriov_enable(). This means the bus/devfn may change after each
sriov_enable().

My current thought is to fix it up in virtfn_add(). If the total VF number
will not change, we could create the pci_dn entries at the beginning and fix
up the bus/devfn each time a VF is truly created.

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V9 08/18] powrepc/pci: Refactor pci_dn
  2014-11-20  1:02     ` Gavin Shan
@ 2014-11-20  7:25       ` Wei Yang
  0 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-20  7:25 UTC (permalink / raw)
  To: Gavin Shan; +Cc: Bjorn Helgaas, linux-pci, Wei Yang, benh, linuxppc-dev

On Thu, Nov 20, 2014 at 12:02:13PM +1100, Gavin Shan wrote:
>On Wed, Nov 19, 2014 at 04:30:24PM -0700, Bjorn Helgaas wrote:
>>On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:
>>> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> 
>>> pci_dn is the extension of PCI device node and it's created from
>>> device node. Unfortunately, VFs that are enabled dynamically by
>>> PF's driver and they don't have corresponding device nodes, and
>>> pci_dn. The patch refactors pci_dn to support VFs:
>>> 
>>>    * pci_dn is organized as a hierarchy tree. VF's pci_dn is put
>>>      to the child list of pci_dn of PF's bridge. pci_dn of other
>>>      device put to the child list of pci_dn of its upstream bridge.
>>> 
>>>    * VF's pci_dn is expected to be created dynamically when applying
>>>      final fixup to PF. VF's pci_dn will be destroyed when releasing
>>>      PF's pci_dev instance. pci_dn of other device is still created
>>>      from device node as before.
>>> 
>>>    * For one particular PCI device (VF or not), its pci_dn can be
>>>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
>>>      or parent's list. The fast path (fetching pci_dn through PCI
>>>      device instance) is populated during early fixup time.
>>> 
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>> ...
>>
>>> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
>>> +{
>>> +#ifdef CONFIG_PCI_IOV
>>> +	struct pci_dn *parent, *pdn;
>>> +	int i;
>>> +
>>> +	/* Only support IOV for now */
>>> +	if (!pdev->is_physfn)
>>> +		return pci_get_pdn(pdev);
>>> +
>>> +	/* Check if VFs have been populated */
>>> +	pdn = pci_get_pdn(pdev);
>>> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
>>> +		return NULL;
>>> +
>>> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
>>> +	parent = pci_bus_to_pdn(pdev->bus);
>>> +	if (!parent)
>>> +		return NULL;
>>> +
>>> +	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
>>> +		pdn = add_one_dev_pci_info(parent, NULL,
>>> +					   pci_iov_virtfn_bus(pdev, i),
>>> +					   pci_iov_virtfn_devfn(pdev, i));
>>
>>I'm not sure this makes sense, but I certainly don't know this code, so
>>maybe I'm missing something.
>>
>
>For ARI, Richard had some patches to fix the issue from firmware side.
>
>>pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on
>>pdev->sriov->stride and pdev->sriov->offset.  These are read from VF Stride
>>and First VF Offset in the SR-IOV capability by sriov_init(), which is
>>called before add_dev_pci_info():
>>
>>  pci_scan_child_bus
>>    pci_scan_slot
>>      pci_scan_single_device
>>	pci_device_add
>>	  pci_init_capabilities
>>	    pci_iov_init(PF)
>>	      sriov_init(PF, pos)
>>		pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0)
>>		pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset)
>>		pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride)
>>		iov->offset = offset
>>		iov->stride = stride
>>
>>  pci_bus_add_devices
>>    pci_bus_add_device
>>      pci_fixup_device(pci_fixup_final)
>>	add_dev_pci_info
>>	  pci_iov_virtfn_bus
>>	    return ... + sriov->offset + (sriov->stride * id) ...
>>
>>But both First VF Offset and VF Stride change when ARI Capable Hierarchy or
>>NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10).  We set NumVFs to zero in
>>sriov_init() above.  We will change NumVFs to something different when a
>>driver calls pci_enable_sriov():
>>
>>  pci_enable_sriov
>>    sriov_enable
>>      pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn)
>>
>>Now First VF Offset and VF Stride have changed from what they were when we
>>called pci_iov_virtfn_bus() above.
>>
>
>It's the case we missed: First VF Offset and VF Stride can change when
>PF's number of VFs is changed. It means the BDFN (Bus/Device/Function
>number) for one VF can't be determined until PF's number of VFs is
>populated and updated to HW (before calling to virtfn_add()).
>
>The dynamically created pci_dn is used in PCI config accessors currently.
>That means we have to get it ready before first PCI config request to the
>VF in pci_setup_device(). In the code of old revision, we had some weak
>function called in pci_alloc_dev(), which gave platform chance to create
>pci_dn. I think we have to switch back to the old way in order to fix
>the problem you catched. However, the old way is implemented with cost
>of more weak function, which you're probably unhappy to see.
>
>  sriov_enable()
>    virtfn_add()
>      virtfn_add_bus()
>      pci_alloc_dev()
>      pci_setup_device()

OK, it sounds like my solution in the previous reply won't work. We need the
pci_dn ready before accessing the configuration space of the VFs.


-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V9 08/18] powrepc/pci: Refactor pci_dn
  2014-11-20  7:20     ` Wei Yang
@ 2014-11-20 19:05       ` Bjorn Helgaas
  2014-11-21  0:04         ` Gavin Shan
  2014-11-21  1:46         ` Wei Yang
  0 siblings, 2 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2014-11-20 19:05 UTC (permalink / raw)
  To: Wei Yang; +Cc: linux-pci, benh, linuxppc-dev, gwshan

On Thu, Nov 20, 2014 at 03:20:57PM +0800, Wei Yang wrote:
> On Wed, Nov 19, 2014 at 04:30:24PM -0700, Bjorn Helgaas wrote:
> >On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:
> >> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> 
> >> pci_dn is the extension of PCI device node and it's created from
> >> device node. Unfortunately, VFs that are enabled dynamically by
> >> PF's driver and they don't have corresponding device nodes, and
> >> pci_dn. The patch refactors pci_dn to support VFs:
> >> 
> >>    * pci_dn is organized as a hierarchy tree. VF's pci_dn is put
> >>      to the child list of pci_dn of PF's bridge. pci_dn of other
> >>      device put to the child list of pci_dn of its upstream bridge.
> >> 
> >>    * VF's pci_dn is expected to be created dynamically when applying
> >>      final fixup to PF. VF's pci_dn will be destroyed when releasing
> >>      PF's pci_dev instance. pci_dn of other device is still created
> >>      from device node as before.
> >> 
> >>    * For one particular PCI device (VF or not), its pci_dn can be
> >>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
> >>      or parent's list. The fast path (fetching pci_dn through PCI
> >>      device instance) is populated during early fixup time.
> >> 
> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> ---
> >> ...
> >
> >> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
> >> +{
> >> +#ifdef CONFIG_PCI_IOV
> >> +	struct pci_dn *parent, *pdn;
> >> +	int i;
> >> +
> >> +	/* Only support IOV for now */
> >> +	if (!pdev->is_physfn)
> >> +		return pci_get_pdn(pdev);
> >> +
> >> +	/* Check if VFs have been populated */
> >> +	pdn = pci_get_pdn(pdev);
> >> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
> >> +		return NULL;
> >> +
> >> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
> >> +	parent = pci_bus_to_pdn(pdev->bus);
> >> +	if (!parent)
> >> +		return NULL;
> >> +
> >> +	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
> >> +		pdn = add_one_dev_pci_info(parent, NULL,
> >> +					   pci_iov_virtfn_bus(pdev, i),
> >> +					   pci_iov_virtfn_devfn(pdev, i));
> >
> >I'm not sure this makes sense, but I certainly don't know this code, so
> >maybe I'm missing something.
> >
> >pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on
> >pdev->sriov->stride and pdev->sriov->offset.  These are read from VF Stride
> >and First VF Offset in the SR-IOV capability by sriov_init(), which is
> >called before add_dev_pci_info():
> >
> >  pci_scan_child_bus
> >    pci_scan_slot
> >      pci_scan_single_device
> >	pci_device_add
> >	  pci_init_capabilities
> >	    pci_iov_init(PF)
> >	      sriov_init(PF, pos)
> >		pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0)
> >		pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset)
> >		pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride)
> >		iov->offset = offset
> >		iov->stride = stride
> >
> >  pci_bus_add_devices
> >    pci_bus_add_device
> >      pci_fixup_device(pci_fixup_final)
> >	add_dev_pci_info
> >	  pci_iov_virtfn_bus
> >	    return ... + sriov->offset + (sriov->stride * id) ...
> >
> >But both First VF Offset and VF Stride change when ARI Capable Hierarchy or
> >NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10).  We set NumVFs to zero in
> >sriov_init() above.  We will change NumVFs to something different when a
> >driver calls pci_enable_sriov():
> >
> >  pci_enable_sriov
> >    sriov_enable
> >      pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn)
> >
> >Now First VF Offset and VF Stride have changed from what they were when we
> >called pci_iov_virtfn_bus() above.
> >
> 
> Oops, I see the ARI would affect those value, while missed the NumVFs also
> would.
> 
> Let's look at the problem one by one.
> 
> 1. The ARI capability.
> ===============================================================================
> The kernel initialize the capability like this:
> 
> pci_init_capabilities()
> 	pci_configure_ari()
> 	pci_iov_init()
> 		iov->offset = offset
> 		iov->stride = stride
> 
> When offset/stride is retrieved at this point, the ARI capability is taken
> into consideration.

PCI_SRIOV_CTRL_ARI is currently only changed at the time we enumerate the
PF, so I don't think this is really a problem.

> 2. The PF's NumVFs field
> ===============================================================================
> 2.1 Potential problem in current code
> ===============================================================================
> First, is current pci code has some potential problem?
> 
> sriov_enable()
> 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset);
> 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride);
> 	iov->offset = offset;
> 	iov->stride = stride;
> 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
> 	virtfn_add()
> 		...
> 		virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
> 
> The sriov_enable() retrieve the offset/stride then write the NumVFs. According
> to the SPEC, at this moment the offset/stride may change. While I don't see
> some code to retrieve and store those value again. And these fields will be
> used in virtfn_add().
> 
> If my understanding is correct, I suggest to move the retrieve and store
> operation after NumVFs is written.

Yep, it looks like the existing code has similar problems.  We might want
to add a simple function that writes PCI_SRIOV_NUM_VF, then reads
PCI_SRIOV_VF_OFFSET and PCI_SRIOV_VF_STRIDE and refreshes the cached values
in dev->sriov.

Then we'd at least know that virtfn_bus() and virtfn_devfn() return values
that are correct until the next NumVFs change.
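
Something like this, perhaps (untested sketch; the name is made up, and it
would live in drivers/pci/iov.c next to sriov_enable()):

	/* Program NumVFs, then refresh the cached First VF Offset and
	 * VF Stride, since the device may change them when NumVFs
	 * changes (SR-IOV spec secs 3.3.9, 3.3.10). */
	static void pci_iov_set_numvfs(struct pci_dev *dev, int nr_virtfn)
	{
		struct pci_sriov *iov = dev->sriov;

		pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
		pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &iov->offset);
		pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &iov->stride);
	}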

> 2.2 The IOV bus range may not be correct in pci_scan_child_bus()?
> ===============================================================================
> In current pci core, when enumerating the pci tree, we do like this:
> 
> pci_scan_child_bus()
> 	pci_scan_slot()
> 		pci_scan_single_device()
> 			pci_device_add()
> 				pci_init_capabilities()
> 					pci_iov_init()
> 	max += pci_iov_bus_range(bus);
> 		busnr = pci_iov_virtfn_bus(dev, dev->sriov->total_VFs - 1);
> 	max = pci_scan_bridge(bus, dev, max, pass);
> 
> From this point, we see pci core reserve some bus range for VFs. This
> calculation is based on the offset/stride at this moment. And do the
> enumeration with the new bus number.
> 
> sriov_enable() could be called several times from driver to enable SRIOV, and
> with different nr_virtfn. If each time the NumVFs written, the offset/stride
> will change. This means we may try to address an extra bus we didn't reserve?
> Or this means it is out of control?

This looks like a problem, too.  I don't have a good suggestion for fixing
it.

> 2.3 How can I reserve bus range in FW?
> ===============================================================================
> This question comes from the previous one.
> 
> Based on my understanding, current pci core will rely on the bus number in HW
> if pcibios_assign_all_busses() is not set. If we want to support those VFs
> sits on different bus with PF, we need to reserve bus range and write the
> correct secondary/subordinate in bridge. Otherwise, those VFs on different bus
> may not be addressed.
> 
> Currently I am writing the code in FW to reserve the range with the same
> mechanism in pci core. While as you mentioned the offset/stride may change
> after sriov_enable(), I am confused whether this is the correct way.

If your firmware knows something about the device and can compute the
number of buses it will likely need, it can set up bridges with appropriate
bus number ranges, and Linux will generally leave those alone.

I don't know the best way to figure out the number of buses, though.  It
seems like you almost need to experimentally set NumVFs and read the
resulting offset/stride, because I think it's really up to the device to
decide how to number the VFs.  Maybe pci_iov_bus_range() needs to do
something similar.
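
Roughly like this, maybe (untested sketch; it assumes VF Enable is still
clear, and reuses the virtfn_bus() arithmetic quoted earlier in the thread):

	/* Probe the highest bus number any VF could land on by temporarily
	 * programming NumVFs = TotalVFs and reading back offset/stride. */
	static u8 virtfn_max_buses(struct pci_dev *dev)
	{
		struct pci_sriov *iov = dev->sriov;
		u16 offset, stride;
		u8 busnr;

		pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF,
				      iov->total_VFs);
		pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset);
		pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride);

		busnr = dev->bus->number + ((dev->devfn + offset +
					     stride * (iov->total_VFs - 1)) >> 8);

		/* leave NumVFs = 0, as sriov_init() does */
		pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, 0);

		return busnr;
	}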

> 2.4 The potential problem for [Patch 08/18]
> ===============================================================================
> According to the SPEC, the offset/stride will change after each
> sriov_enable(). This means the bus/devfn will change after each
> sriov_enable().
> 
> My current thought is to fix it up in virtfn_add(). If the total VF number
> will not change, we could create those pci_dn at the beginning and fix the
> bus/devfn at each time the VF is truely created.

By "fix it up," I assume you mean call an arch function that does the
pci_pdn setup you need.

It sounds reasonable to do this either in virtfn_add()/virtfn_remove() or
at the points where we write PCI_SRIOV_CTRL_VFE, i.e., in sriov_init(),
sriov_enable(), sriov_disable(), and sriov_restore_state().  From a
hardware point of view, the VFs exist whenever PCI_SRIOV_CTRL_VFE is set,
so it might be nice to have this setup connected to that.

Bjorn

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V9 08/18] powrepc/pci: Refactor pci_dn
  2014-11-20 19:05       ` Bjorn Helgaas
@ 2014-11-21  0:04         ` Gavin Shan
  2014-11-25  9:28           ` Wei Yang
  2014-11-21  1:46         ` Wei Yang
  1 sibling, 1 reply; 39+ messages in thread
From: Gavin Shan @ 2014-11-21  0:04 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev, gwshan

On Thu, Nov 20, 2014 at 12:05:41PM -0700, Bjorn Helgaas wrote:
>On Thu, Nov 20, 2014 at 03:20:57PM +0800, Wei Yang wrote:
>> On Wed, Nov 19, 2014 at 04:30:24PM -0700, Bjorn Helgaas wrote:
>> >On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:
>> >> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> >> 
>> >> pci_dn is the extension of PCI device node and it's created from
>> >> device node. Unfortunately, VFs that are enabled dynamically by
>> >> PF's driver and they don't have corresponding device nodes, and
>> >> pci_dn. The patch refactors pci_dn to support VFs:
>> >> 
>> >>    * pci_dn is organized as a hierarchy tree. VF's pci_dn is put
>> >>      to the child list of pci_dn of PF's bridge. pci_dn of other
>> >>      device put to the child list of pci_dn of its upstream bridge.
>> >> 
>> >>    * VF's pci_dn is expected to be created dynamically when applying
>> >>      final fixup to PF. VF's pci_dn will be destroyed when releasing
>> >>      PF's pci_dev instance. pci_dn of other device is still created
>> >>      from device node as before.
>> >> 
>> >>    * For one particular PCI device (VF or not), its pci_dn can be
>> >>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
>> >>      or parent's list. The fast path (fetching pci_dn through PCI
>> >>      device instance) is populated during early fixup time.
>> >> 
>> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> >> ---
>> >> ...
>> >
>> >> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
>> >> +{
>> >> +#ifdef CONFIG_PCI_IOV
>> >> +	struct pci_dn *parent, *pdn;
>> >> +	int i;
>> >> +
>> >> +	/* Only support IOV for now */
>> >> +	if (!pdev->is_physfn)
>> >> +		return pci_get_pdn(pdev);
>> >> +
>> >> +	/* Check if VFs have been populated */
>> >> +	pdn = pci_get_pdn(pdev);
>> >> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
>> >> +		return NULL;
>> >> +
>> >> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
>> >> +	parent = pci_bus_to_pdn(pdev->bus);
>> >> +	if (!parent)
>> >> +		return NULL;
>> >> +
>> >> +	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
>> >> +		pdn = add_one_dev_pci_info(parent, NULL,
>> >> +					   pci_iov_virtfn_bus(pdev, i),
>> >> +					   pci_iov_virtfn_devfn(pdev, i));
>> >
>> >I'm not sure this makes sense, but I certainly don't know this code, so
>> >maybe I'm missing something.
>> >
>> >pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on
>> >pdev->sriov->stride and pdev->sriov->offset.  These are read from VF Stride
>> >and First VF Offset in the SR-IOV capability by sriov_init(), which is
>> >called before add_dev_pci_info():
>> >
>> >  pci_scan_child_bus
>> >    pci_scan_slot
>> >      pci_scan_single_device
>> >	pci_device_add
>> >	  pci_init_capabilities
>> >	    pci_iov_init(PF)
>> >	      sriov_init(PF, pos)
>> >		pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0)
>> >		pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset)
>> >		pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride)
>> >		iov->offset = offset
>> >		iov->stride = stride
>> >
>> >  pci_bus_add_devices
>> >    pci_bus_add_device
>> >      pci_fixup_device(pci_fixup_final)
>> >	add_dev_pci_info
>> >	  pci_iov_virtfn_bus
>> >	    return ... + sriov->offset + (sriov->stride * id) ...
>> >
>> >But both First VF Offset and VF Stride change when ARI Capable Hierarchy or
>> >NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10).  We set NumVFs to zero in
>> >sriov_init() above.  We will change NumVFs to something different when a
>> >driver calls pci_enable_sriov():
>> >
>> >  pci_enable_sriov
>> >    sriov_enable
>> >      pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn)
>> >
>> >Now First VF Offset and VF Stride have changed from what they were when we
>> >called pci_iov_virtfn_bus() above.
>> >
>> 
>> Oops, I see the ARI would affect those value, while missed the NumVFs also
>> would.
>> 
>> Let's look at the problem one by one.
>> 
>> 1. The ARI capability.
>> ===============================================================================
>> The kernel initialize the capability like this:
>> 
>> pci_init_capabilities()
>> 	pci_configure_ari()
>> 	pci_iov_init()
>> 		iov->offset = offset
>> 		iov->stride = stride
>> 
>> When offset/stride is retrieved at this point, the ARI capability is taken
>> into consideration.
>
>PCI_SRIOV_CTRL_ARI is currently only changed at the time we enumerate the
>PF, so I don't think this is really a problem.
>
>> 2. The PF's NumVFs field
>> ===============================================================================
>> 2.1 Potential problem in current code
>> ===============================================================================
>> First, is current pci code has some potential problem?
>> 
>> sriov_enable()
>> 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset);
>> 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride);
>> 	iov->offset = offset;
>> 	iov->stride = stride;
>> 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
>> 	virtfn_add()
>> 		...
>> 		virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
>> 
>> The sriov_enable() retrieve the offset/stride then write the NumVFs. According
>> to the SPEC, at this moment the offset/stride may change. While I don't see
>> some code to retrieve and store those value again. And these fields will be
>> used in virtfn_add().
>> 
>> If my understanding is correct, I suggest to move the retrieve and store
>> operation after NumVFs is written.
>
>Yep, it looks like the existing code has similar problems.  We might want
>to add a simple function that writes PCI_SRIOV_NUM_VF, then reads
>PCI_SRIOV_VF_OFFSET and PCI_SRIOV_VF_STRIDE and refreshes the cached values
>in dev->sriov.
>
>Then we'd at least know that virtfn_bus() and virtfn_devfn() return values
>that are correct until the next NumVFs change.
>
>> 2.2 The IOV bus range may not be correct in pci_scan_child_bus()?
>> ===============================================================================
>> In current pci core, when enumerating the pci tree, we do like this:
>> 
>> pci_scan_child_bus()
>> 	pci_scan_slot()
>> 		pci_scan_single_device()
>> 			pci_device_add()
>> 				pci_init_capabilities()
>> 					pci_iov_init()
>> 	max += pci_iov_bus_range(bus);
>> 		busnr = pci_iov_virtfn_bus(dev, dev->sriov->total_VFs - 1);
>> 	max = pci_scan_bridge(bus, dev, max, pass);
>> 
>> From this point, we see pci core reserve some bus range for VFs. This
>> calculation is based on the offset/stride at this moment. And do the
>> enumeration with the new bus number.
>> 
>> sriov_enable() could be called several times from driver to enable SRIOV, and
>> with different nr_virtfn. If each time the NumVFs written, the offset/stride
>> will change. This means we may try to address an extra bus we didn't reserve?
>> Or this means it is out of control?
>
>This looks like a problem, too.  I don't have a good suggestion for fixing
>it.
>
>> 2.3 How can I reserve bus range in FW?
>> ===============================================================================
>> This question comes from the previous one.
>> 
>> Based on my understanding, current pci core will rely on the bus number in HW
>> if pcibios_assign_all_busses() is not set. If we want to support those VFs
>> sits on different bus with PF, we need to reserve bus range and write the
>> correct secondary/subordinate in bridge. Otherwise, those VFs on different bus
>> may not be addressed.
>> 
>> Currently I am writing the code in FW to reserve the range with the same
>> mechanism in pci core. While as you mentioned the offset/stride may change
>> after sriov_enable(), I am confused whether this is the correct way.
>
>If your firmware knows something about the device and can compute the
>number of buses it will likely need, it can set up bridges with appropriate
>bus number ranges, and Linux will generally leave those alone.
>
>I don't know the best way to figure out the number of buses, though.  It
>seems like you almost need to experimentally set NumVFs and read the
>resulting offset/stride, because I think it's really up to the device to
>decide how to number the VFs.  Maybe pci_iov_bus_range() needs to do
>something similar.
>

Yep, that's a reasonable way to probe the maximal number of buses for the
upstream bridge of the PF. I guess Richard needs to implement something
similar in firmware.

>> 2.4 The potential problem for [Patch 08/18]
>> ===============================================================================
>> According to the SPEC, the offset/stride will change after each
>> sriov_enable(). This means the bus/devfn will change after each
>> sriov_enable().
>> 
>> My current thought is to fix it up in virtfn_add(). If the total VF number
>> will not change, we could create those pci_dn at the beginning and fix the
>> bus/devfn at each time the VF is truely created.
>
>By "fix it up," I assume you mean call an arch function that does the
>pci_pdn setup you need.
>
>It sounds reasonable to do this either in virtfn_add()/virtfn_remove() or
>at the points where we write PCI_SRIOV_CTRL_VFE, i.e., in sriov_init(),
>sriov_enable(), sriov_disable(), and sriov_restore_state().  From a
>hardware point of view, the VFs exist whenever PCI_SRIOV_CTRL_VFE is set,
>so it might be nice to have this setup connected to that.
>

Yes, both ways can fix the issue. For a couple of reasons, I would like to
add a weak pcibios_virtfn_add(), called from virtfn_add(), if you agree
(a rough sketch follows after the list below).

- PCI_SRIOV_CTRL_VFE might be set somewhere except the functions you pointed.
  Set/clear PCI_SRIOV_CTRL_VFE will invoke background work to check pci_dn
  and add/remove accordingly. It would be overhead which we can avoid.
- We plan to support EEH for VFs in future. virtfn_add() way matches with
  current EEH implementation better. EEH device and PE are created based
  on (struct pci_dev), and EEH devices and PE can be destroied in time in
  pcibios_release_device(), which is invoked by virtfn_remove(). So we only
  need one weak function. In contrast, we have to create EEH device and PE
  for VFs a bit early before any VFs are instantiated, and destroy them a
  bit late after all VFs are offline.
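
To make this concrete, here is a minimal sketch of the hook I have in mind,
reusing pci_bus_to_pdn() and add_one_dev_pci_info() from this patch. The
body of the override is illustrative only, not the final code:

	/* drivers/pci/iov.c: weak default, does nothing */
	void __weak pcibios_virtfn_add(struct pci_dev *virtfn)
	{
	}

	/*
	 * arch/powerpc override: create the VF's pci_dn.  Called from
	 * virtfn_add() after the VF's bus and devfn are finalized.
	 */
	void pcibios_virtfn_add(struct pci_dev *virtfn)
	{
		struct pci_dn *parent = pci_bus_to_pdn(virtfn->bus);

		if (parent)
			add_one_dev_pci_info(parent, virtfn,
					     virtfn->bus->number,
					     virtfn->devfn);
	}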

Thanks,
Gavin

>Bjorn
>


* Re: [PATCH V9 08/18] powerpc/pci: Refactor pci_dn
  2014-11-20 19:05       ` Bjorn Helgaas
  2014-11-21  0:04         ` Gavin Shan
@ 2014-11-21  1:46         ` Wei Yang
  1 sibling, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-21  1:46 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev, gwshan

On Thu, Nov 20, 2014 at 12:05:41PM -0700, Bjorn Helgaas wrote:
>On Thu, Nov 20, 2014 at 03:20:57PM +0800, Wei Yang wrote:
>> On Wed, Nov 19, 2014 at 04:30:24PM -0700, Bjorn Helgaas wrote:
>> >On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:
>> >> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> >> 
>> >> pci_dn is the extension of the PCI device node and is created from
>> >> the device node. Unfortunately, VFs are enabled dynamically by the
>> >> PF's driver and don't have corresponding device nodes or pci_dn.
>> >> This patch refactors pci_dn to support VFs:
>> >> 
>> >>    * pci_dn is organized as a hierarchy tree. A VF's pci_dn is put
>> >>      on the child list of the pci_dn of the PF's bridge. The pci_dn
>> >>      of any other device is put on the child list of the pci_dn of
>> >>      its upstream bridge.
>> >> 
>> >>    * A VF's pci_dn is expected to be created dynamically when the
>> >>      final fixups are applied to the PF, and destroyed when the PF's
>> >>      pci_dev instance is released. The pci_dn of any other device is
>> >>      still created from its device node as before.
>> >> 
>> >>    * For one particular PCI device (VF or not), its pci_dn can be
>> >>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
>> >>      or parent's list. The fast path (fetching pci_dn through PCI
>> >>      device instance) is populated during early fixup time.
>> >> 
>> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> >> ---
>> >> ...
>> >
>> >> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
>> >> +{
>> >> +#ifdef CONFIG_PCI_IOV
>> >> +	struct pci_dn *parent, *pdn;
>> >> +	int i;
>> >> +
>> >> +	/* Only support IOV for now */
>> >> +	if (!pdev->is_physfn)
>> >> +		return pci_get_pdn(pdev);
>> >> +
>> >> +	/* Check if VFs have been populated */
>> >> +	pdn = pci_get_pdn(pdev);
>> >> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
>> >> +		return NULL;
>> >> +
>> >> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
>> >> +	parent = pci_bus_to_pdn(pdev->bus);
>> >> +	if (!parent)
>> >> +		return NULL;
>> >> +
>> >> +	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
>> >> +		pdn = add_one_dev_pci_info(parent, NULL,
>> >> +					   pci_iov_virtfn_bus(pdev, i),
>> >> +					   pci_iov_virtfn_devfn(pdev, i));
>> >
>> >I'm not sure this makes sense, but I certainly don't know this code, so
>> >maybe I'm missing something.
>> >
>> >pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on
>> >pdev->sriov->stride and pdev->sriov->offset.  These are read from VF Stride
>> >and First VF Offset in the SR-IOV capability by sriov_init(), which is
>> >called before add_dev_pci_info():
>> >
>> >  pci_scan_child_bus
>> >    pci_scan_slot
>> >      pci_scan_single_device
>> >	pci_device_add
>> >	  pci_init_capabilities
>> >	    pci_iov_init(PF)
>> >	      sriov_init(PF, pos)
>> >		pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0)
>> >		pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset)
>> >		pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride)
>> >		iov->offset = offset
>> >		iov->stride = stride
>> >
>> >  pci_bus_add_devices
>> >    pci_bus_add_device
>> >      pci_fixup_device(pci_fixup_final)
>> >	add_dev_pci_info
>> >	  pci_iov_virtfn_bus
>> >	    return ... + sriov->offset + (sriov->stride * id) ...
>> >
>> >But both First VF Offset and VF Stride change when ARI Capable Hierarchy or
>> >NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10).  We set NumVFs to zero in
>> >sriov_init() above.  We will change NumVFs to something different when a
>> >driver calls pci_enable_sriov():
>> >
>> >  pci_enable_sriov
>> >    sriov_enable
>> >      pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn)
>> >
>> >Now First VF Offset and VF Stride have changed from what they were when we
>> >called pci_iov_virtfn_bus() above.
>> >
>> 
>> Oops, I saw that ARI would affect those values, but I missed that NumVFs
>> would as well.
>> 
>> Let's look at the problems one by one.
>> 
>> 1. The ARI capability.
>> ===============================================================================
>> The kernel initializes the capability like this:
>> 
>> pci_init_capabilities()
>> 	pci_configure_ari()
>> 	pci_iov_init()
>> 		iov->offset = offset
>> 		iov->stride = stride
>> 
>> When the offset/stride are retrieved at this point, the ARI capability has
>> already been taken into consideration.
>
>PCI_SRIOV_CTRL_ARI is currently only changed at the time we enumerate the
>PF, so I don't think this is really a problem.
>
>> 2. The PF's NumVFs field
>> ===============================================================================
>> 2.1 Potential problem in current code
>> ===============================================================================
>> First, does the current PCI code have a potential problem?
>> 
>> sriov_enable()
>> 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset);
>> 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride);
>> 	iov->offset = offset;
>> 	iov->stride = stride;
>> 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
>> 	virtfn_add()
>> 		...
>> 		virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
>> 
>> sriov_enable() retrieves the offset/stride and then writes NumVFs.
>> According to the spec, the offset/stride may change at that moment, but I
>> don't see any code that retrieves and stores those values again, and these
>> fields will be used in virtfn_add().
>> 
>> If my understanding is correct, I suggest moving the retrieve-and-store
>> operation to after NumVFs is written.
>
>Yep, it looks like the existing code has similar problems.  We might want
>to add a simple function that writes PCI_SRIOV_NUM_VF, then reads
>PCI_SRIOV_VF_OFFSET and PCI_SRIOV_VF_STRIDE and refreshes the cached values
>in dev->sriov.
>
>Then we'd at least know that virtfn_bus() and virtfn_devfn() return values
>that are correct until the next NumVFs change.
>

Ok, I will write a function to wrap it.
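
Roughly like this, I think, reusing the cached fields in struct pci_sriov
(the function name is provisional):

	static void sriov_set_numvfs(struct pci_dev *dev, int nr_virtfn)
	{
		struct pci_sriov *iov = dev->sriov;

		pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF,
				      nr_virtfn);

		/*
		 * Per SR-IOV spec sec 3.3.9 and 3.3.10, First VF Offset
		 * and VF Stride may change whenever NumVFs changes, so
		 * refresh the cached copies immediately.
		 */
		pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET,
				     &iov->offset);
		pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE,
				     &iov->stride);
	}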

>> 2.2 The IOV bus range may not be correct in pci_scan_child_bus()?
>> ===============================================================================
>> In the current PCI core, when enumerating the PCI tree, we do the following:
>> 
>> pci_scan_child_bus()
>> 	pci_scan_slot()
>> 		pci_scan_single_device()
>> 			pci_device_add()
>> 				pci_init_capabilities()
>> 					pci_iov_init()
>> 	max += pci_iov_bus_range(bus);
>> 		busnr = pci_iov_virtfn_bus(dev, dev->sriov->total_VFs - 1);
>> 	max = pci_scan_bridge(bus, dev, max, pass);
>> 
>> From this we see that the PCI core reserves some bus range for VFs. The
>> calculation is based on the offset/stride at that moment, and the
>> enumeration proceeds with the new bus numbers.
>> 
>> sriov_enable() could be called several times by the driver to enable SRIOV,
>> each time with a different nr_virtfn. Each time NumVFs is written, the
>> offset/stride may change. Does this mean we may try to address an extra bus
>> we didn't reserve? Or is it simply out of our control?
>
>This looks like a problem, too.  I don't have a good suggestion for fixing
>it.

How about enumerating all possible NumVFs values and selecting the maximum?

I am not sure what will happen if the FW reserves a different number of
buses, so the FW should follow the kernel's calculation, right?

I could write code with this logic and test it (a rough sketch is below),
though I am afraid it might break some platforms in some cases.
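
Concretely, patterned on the existing virtfn_bus() calculation (untested,
and clearly more config space traffic than we generate today):

	static int pci_iov_max_bus(struct pci_dev *dev)
	{
		struct pci_sriov *iov = dev->sriov;
		int nr_virtfn, busnr, max = dev->bus->number;
		u16 offset, stride;

		for (nr_virtfn = 1; nr_virtfn <= iov->total_VFs; nr_virtfn++) {
			pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF,
					      nr_virtfn);
			pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET,
					     &offset);
			pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE,
					     &stride);
			if (!offset || (nr_virtfn > 1 && !stride))
				continue;

			/* bus number of the last VF for this NumVFs value */
			busnr = dev->bus->number + ((dev->devfn + offset +
				stride * (nr_virtfn - 1)) >> 8);
			if (busnr > max)
				max = busnr;
		}

		/* leave NumVFs cleared, as sriov_init() does */
		pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, 0);
		return max;
	}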

>
>> 2.3 How can I reserve bus range in FW?
>> ===============================================================================
>> This question comes from the previous one.
>> 
>> Based on my understanding, the current PCI core relies on the bus numbers
>> in HW if pcibios_assign_all_busses() is not set. If we want to support VFs
>> that sit on a different bus from the PF, we need to reserve the bus range
>> and write the correct secondary/subordinate numbers in the bridge.
>> Otherwise, those VFs on a different bus may not be addressable.
>> 
>> Currently I am writing code in FW to reserve the range with the same
>> mechanism as the PCI core. But since, as you mentioned, the offset/stride
>> may change after sriov_enable(), I am not sure whether this is the correct
>> way.
>
>If your firmware knows something about the device and can compute the
>number of buses it will likely need, it can set up bridges with appropriate
>bus number ranges, and Linux will generally leave those alone.
>

Yep, this is what I am trying to do.

>I don't know the best way to figure out the number of buses, though.  It
>seems like you almost need to experimentally set NumVFs and read the
>resulting offset/stride, because I think it's really up to the device to
>decide how to number the VFs.  Maybe pci_iov_bus_range() needs to do
>something similar.

Got it, I will add this logic.

>
>> 2.4 The potential problem for [Patch 08/18]
>> ===============================================================================
>> According to the spec, the offset/stride may change after each
>> sriov_enable(), which means the bus/devfn may change after each
>> sriov_enable().
>> 
>> My current thought is to fix it up in virtfn_add(). If the total VF number
>> does not change, we could create those pci_dn entries at the beginning and
>> fix up the bus/devfn each time a VF is truly created.
>
>By "fix it up," I assume you mean call an arch function that does the
>pci_pdn setup you need.
>
>It sounds reasonable to do this either in virtfn_add()/virtfn_remove() or
>at the points where we write PCI_SRIOV_CTRL_VFE, i.e., in sriov_init(),
>sriov_enable(), sriov_disable(), and sriov_restore_state().  From a
>hardware point of view, the VFs exist whenever PCI_SRIOV_CTRL_VFE is set,
>so it might be nice to have this setup connected to that.

If my understanding is correct, we could wrap the config space writes to
PCI_SRIOV_CTRL, and whenever a write toggles PCI_SRIOV_CTRL_VFE, do the
fixup there?
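
Something like this rough sketch, where the two pcibios_* hooks are names I
am making up for illustration:

	static void sriov_write_ctrl(struct pci_dev *dev, u16 ctrl)
	{
		struct pci_sriov *iov = dev->sriov;
		u16 old;

		pci_read_config_word(dev, iov->pos + PCI_SRIOV_CTRL, &old);
		pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, ctrl);

		/* only notify the arch when VF Enable actually toggles */
		if ((old ^ ctrl) & PCI_SRIOV_CTRL_VFE) {
			if (ctrl & PCI_SRIOV_CTRL_VFE)
				pcibios_sriov_vf_enabled(dev);	/* hypothetical */
			else
				pcibios_sriov_vf_disabled(dev);	/* hypothetical */
		}
	}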

>
>Bjorn

-- 
Richard Yang
Help you, Help me


* Re: [PATCH V9 08/18] powerpc/pci: Refactor pci_dn
  2014-11-21  0:04         ` Gavin Shan
@ 2014-11-25  9:28           ` Wei Yang
  0 siblings, 0 replies; 39+ messages in thread
From: Wei Yang @ 2014-11-25  9:28 UTC (permalink / raw)
  To: Gavin Shan; +Cc: Bjorn Helgaas, linux-pci, Wei Yang, benh, linuxppc-dev

On Fri, Nov 21, 2014 at 11:04:11AM +1100, Gavin Shan wrote:
>>> 2.4 The potential problem for [Patch 08/18]
>>> ===============================================================================
>>> According to the spec, the offset/stride may change after each
>>> sriov_enable(), which means the bus/devfn may change after each
>>> sriov_enable().
>>> 
>>> My current thought is to fix it up in virtfn_add(). If the total VF number
>>> does not change, we could create those pci_dn entries at the beginning and
>>> fix up the bus/devfn each time a VF is truly created.
>>
>>By "fix it up," I assume you mean call an arch function that does the
>>pci_pdn setup you need.
>>
>>It sounds reasonable to do this either in virtfn_add()/virtfn_remove() or
>>at the points where we write PCI_SRIOV_CTRL_VFE, i.e., in sriov_init(),
>>sriov_enable(), sriov_disable(), and sriov_restore_state().  From a
>>hardware point of view, the VFs exist whenever PCI_SRIOV_CTRL_VFE is set,
>>so it might be nice to have this setup connected to that.
>>
>
>Yes, both ways can fix the issue. For a couple of reasons, I want to add a
>weak pcibios_virtfn_add(), called from virtfn_add(), if you agree.
>
>- PCI_SRIOV_CTRL_VFE might be set in places other than the functions you
>  pointed out. Hooking every set/clear of PCI_SRIOV_CTRL_VFE would invoke
>  background work to check pci_dn and add/remove entries accordingly, which
>  is overhead we can avoid.
>- We plan to support EEH for VFs in the future. The virtfn_add() approach
>  matches the current EEH implementation better: EEH devices and PEs are
>  created based on (struct pci_dev), and they can be destroyed promptly in
>  pcibios_release_device(), which is invoked by virtfn_remove(), so we only
>  need one weak function. With the other approach, we would have to create
>  the EEH devices and PEs for VFs a bit early, before any VFs are
>  instantiated, and destroy them a bit late, after all VFs are offline.
>

Bjorn,

Since my mailbox had some problems over the last few days, I am not sure
whether you agree with Gavin's proposal or not.

And if you agree with enumerating the maximum VF bus range, I will add that
logic in the next version.

>Thanks,
>Gavin
>
>>Bjorn
>>

-- 
Richard Yang
Help you, Help me


Thread overview: 39+ messages
2014-11-02 15:41 [PATCH V9 00/18] Enable SRIOV on PowerNV Wei Yang
2014-11-02 15:41 ` [PATCH V9 01/18] PCI/IOV: Export interface for retrieve VF's BDF Wei Yang
2014-11-19 23:35   ` Bjorn Helgaas
2014-11-02 15:41 ` [PATCH V9 02/18] PCI: Add weak pcibios_iov_resource_alignment() interface Wei Yang
2014-11-02 15:41 ` [PATCH V9 03/18] PCI: Add weak pcibios_iov_resource_size() interface Wei Yang
2014-11-19  1:12   ` Bjorn Helgaas
2014-11-19  2:15     ` Benjamin Herrenschmidt
2014-11-19  3:21       ` Wei Yang
2014-11-19  4:26         ` Bjorn Helgaas
2014-11-19  9:27           ` Wei Yang
2014-11-19 17:23             ` Bjorn Helgaas
2014-11-19 20:51               ` Benjamin Herrenschmidt
2014-11-20  5:40                 ` Wei Yang
2014-11-20  5:39               ` Wei Yang
2014-11-02 15:41 ` [PATCH V9 04/18] PCI: Take additional PF's IOV BAR alignment in sizing and assigning Wei Yang
2014-11-02 15:41 ` [PATCH V9 05/18] powerpc/pci: Add PCI resource alignment documentation Wei Yang
2014-11-02 15:41 ` [PATCH V9 06/18] powerpc/pci: Don't unset pci resources for VFs Wei Yang
2014-11-02 15:41 ` [PATCH V9 07/18] powerpc/pci: Define pcibios_disable_device() on powerpc Wei Yang
2014-11-02 15:41 ` [PATCH V9 08/18] powerpc/pci: Refactor pci_dn Wei Yang
2014-11-19 23:30   ` Bjorn Helgaas
2014-11-20  1:02     ` Gavin Shan
2014-11-20  7:25       ` Wei Yang
2014-11-20  7:20     ` Wei Yang
2014-11-20 19:05       ` Bjorn Helgaas
2014-11-21  0:04         ` Gavin Shan
2014-11-25  9:28           ` Wei Yang
2014-11-21  1:46         ` Wei Yang
2014-11-02 15:41 ` [PATCH V9 09/18] powerpc/pci: remove pci_dn->pcidev field Wei Yang
2014-11-02 15:41 ` [PATCH V9 10/18] powerpc/powernv: Use pci_dn in PCI config accessor Wei Yang
2014-11-02 15:41 ` [PATCH V9 11/18] powerpc/powernv: Allocate pe->iommu_table dynamically Wei Yang
2014-11-02 15:41 ` [PATCH V9 12/18] powerpc/powernv: Expand VF resources according to the number of total_pe Wei Yang
2014-11-02 15:41 ` [PATCH V9 13/18] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv Wei Yang
2014-11-02 15:41 ` [PATCH V9 14/18] powerpc/powernv: Implement pcibios_iov_resource_size() " Wei Yang
2014-11-02 15:41 ` [PATCH V9 15/18] powerpc/powernv: Shift VF resource with an offset Wei Yang
2014-11-02 15:41 ` [PATCH V9 16/18] powerpc/powernv: Allocate VF PE Wei Yang
2014-11-02 15:41 ` [PATCH V9 17/18] powerpc/powernv: Expanding IOV BAR, with m64_per_iov supported Wei Yang
2014-11-02 15:41 ` [PATCH V9 18/18] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3 Wei Yang
2014-11-18 23:11 ` [PATCH V9 00/18] Enable SRIOV on PowerNV Gavin Shan
2014-11-18 23:40   ` Bjorn Helgaas
