* [PATCH V10 00/17] Enable SRIOV on Power8
@ 2014-12-22  5:54 ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

This patchset enables SR-IOV on POWER8.

The general idea is to put each VF into its own PE and allocate the required
resources such as MMIO/DMA/MSI. The major difficulty comes from the MMIO
allocation and adjustment for the PF's IOV BAR.

On P8, we use the M64BT to cover a PF's IOV BAR, which lets each individual VF
sit in its own PE. This gives more flexibility, while at the same time it
imposes some restrictions on the PF's IOV BAR size and alignment.

To achieve this effect, we need some hacks on the PCI devices' resources.
1. Expand the IOV BAR properly.
   Done by pnv_pci_ioda_fixup_iov_resources().
2. Shift the IOV BAR properly.
   Done by pnv_pci_vf_resource_shift().
3. The IOV BAR alignment is calculated by an arch-dependent function instead
   of from an individual VF BAR size.
   Done by pnv_pcibios_sriov_resource_alignment().
4. Take the IOV BAR alignment into consideration in the sizing and assigning.
   This is achieved by commit: "PCI: Take additional IOV BAR alignment in
   sizing and assigning"
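As a rough sketch of the restriction the steps above deal with (illustrative
names, not the kernel's actual helpers): generically the PF's IOV BAR is
aligned to a single VF BAR's size, while with the M64BT the IOV BAR must be
reserved and aligned for total_pe equal segments, so that each VF can land in
its own PE:

```c
#include <assert.h>

typedef unsigned long long resource_size_t;

/* Generic PCI core behavior: the IOV BAR alignment is one VF BAR size. */
static resource_size_t generic_iov_bar_alignment(resource_size_t vf_bar_size)
{
	return vf_bar_size;
}

/* PowerNV (sketch): reserve total_pe segments so each VF gets its own
 * M64 segment, and hence its own PE. */
static resource_size_t pnv_iov_bar_alignment(resource_size_t vf_bar_size,
					     unsigned int total_pe)
{
	return vf_bar_size * total_pe;
}
```

With a 1 MiB VF BAR and 256 PEs, the PowerNV side reserves 256 MiB, which is
exactly the extra space pnv_pci_ioda_fixup_iov_resources() accounts for.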

Test Environment:
       The SR-IOV devices tested are an Emulex Lancer (10df:e220) and a
       Mellanox ConnectX-3 (15b3:1003) on POWER8.

Example of passing a VF through to a guest via vfio:
	1. unbind the original driver and bind to vfio-pci driver
	   echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
	   echo  1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
	   Note: this should be done for each device in the same iommu_group
	2. Start qemu and pass device through vfio
	   /home/ywywyang/git/qemu-impreza/ppc64-softmmu/qemu-system-ppc64 \
		   -M pseries -m 2048 -enable-kvm -nographic \
		   -drive file=/home/ywywyang/kvm/fc19.img \
		   -monitor telnet:localhost:5435,server,nowait -boot cd \
		   -device "spapr-pci-vfio-host-bridge,id=CXGB3,iommu=26,index=6"

Verify that the responding interface is the exact VF passed through:
	1. ping it from a machine in the same subnet (the broadcast domain)
	2. run arp -n on this machine
	   9.115.251.20             ether   00:00:c9:df:ed:bf   C eth0
	3. ifconfig in the guest
	   # ifconfig eth1
	   eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
	        inet 9.115.251.20  netmask 255.255.255.0  broadcast 9.115.251.255
		inet6 fe80::200:c9ff:fedf:edbf  prefixlen 64  scopeid 0x20<link>
	        ether 00:00:c9:df:ed:bf  txqueuelen 1000 (Ethernet)
	        RX packets 175  bytes 13278 (12.9 KiB)
	        RX errors 0  dropped 0  overruns 0  frame 0
		TX packets 58  bytes 9276 (9.0 KiB)
	        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
	4. confirm both show the same MAC address

	Note: make sure you shut down the other network interfaces in the guest.

---
v10:
   * remove the weak function pcibios_iov_resource_size();
     the VF BAR size is stored in the pci_sriov structure and retrieved via
     pci_iov_resource_size()
   * use "Reserve additional" instead of "Expand" to be more accurate in the
     change log
   * add a log message to show the PF's IOV BAR final size
   * add pcibios_sriov_enable/disable() weak functions in
     sriov_enable/disable() for arch setup before enabling VFs; for example,
     the arch could fix up the BDFs of the VFs, since a change of NumVFs
     affects the BDFs of the VFs
   * add some explanation of PEs on the Power arch in the documentation
v9:
   * make the change-log terminology consistent:
     PF's IOV BAR -> the SRIOV BAR in PF
     VF's BAR -> the normal BAR in the VF's view
   * rename all newly introduced functions from _sriov_ to _iov_
   * rename the document to Documentation/powerpc/pci_iov_resource_on_powernv.txt
   * add the vendor and device IDs of the tested devices
   * change the return value from EINVAL to ENOSYS for pci_iov_virtfn_bus() and
     pci_iov_virtfn_devfn() when called on a PF or when SRIOV is not configured
   * rebase on 3.18-rc2 and retest
v8:
   * use the weak function pcibios_sriov_resource_size() instead of a flag to
     retrieve the IOV BAR size
   * add a document, Documentation/powerpc/pci_resource.txt, to explain the
     design
   * make pci_iov_virtfn_bus()/pci_iov_virtfn_devfn() not inline
   * extract a function res_to_dev_res(), so that getting the additional size
     and alignment is more general
   * fix a deadlock introduced in "powrepc/pci: Refactor pci_dn";
     the root cause is that pci_get_slot() takes pci_bus_sem and leads to
     the deadlock
v7:
   * add an IORESOURCE_ARCH flag for the IOV BAR on the powernv platform
   * when the IOV BAR has the IORESOURCE_ARCH flag, retrieve the size from
     hardware directly; if not, calculate it as usual
   * reorder the patch set, grouping patches by subsystem:
     PCI, powerpc, powernv
   * rebase on 3.16-rc6
v6:
   * remove the pcibios_enable_sriov()/pcibios_disable_sriov() weak functions;
     similar functionality is moved to
     pnv_pci_enable_device_hook()/pnv_pci_disable_device_hook(). When the PF
     is enabled, the platform will try its best to allocate resources for VFs.
   * remove the pcibios_sriov_resource_size weak function
   * retrieve the VF BAR size directly from hardware in virtfn_add()
v5:
   * merge the SRIOV-related platform functions into machdep_calls and
     wrap them in one CONFIG_PCI_IOV macro
   * define IODA_INVALID_M64 to replace (-1);
     use this value to mark an m64_wins entry as unused
   * rename pnv_pci_release_dev_dma() to pnv_pci_ioda2_release_dma_pe();
     this function is the counterpart to pnv_pci_ioda2_setup_dma_pe()
   * change dev_info() to dev_dbg() in pnv_pci_ioda_fixup_iov_resources() to
     reduce kernel log noise
   * release the M64 window in pnv_pci_ioda2_release_dma_pe()
v4:
   * code format fixes, e.g. not exceeding 80 chars
   * in commit "ppc/pnv: Add function to deconfig a PE",
     check that the bus has a bridge before printing the name, and
     remove a PE from its own PELTV
   * change the function names for sriov resource size/alignment
   * rebase on 3.16-rc3
   * VFs no longer rely on a device node;
     per Grant Likely's comments, the kernel should be able to handle the
     lack of a device_node gracefully. Gavin restructured pci_dn so that a
     VF has a pci_dn even when the VF's device_node is not provided by
     firmware.
   * clean up all the patch titles to comply with one style
   * fix the return values of pci_iov_virtfn_bus/pci_iov_virtfn_devfn
v3:
   * change the return type of virtfn_bus/virtfn_devfn to int and
     rename these two functions to pci_iov_virtfn_bus/pci_iov_virtfn_devfn
   * remove the second parameter of pcibios_sriov_disable()
   * use data instead of pe in "ppc/pnv: allocate pe->iommu_table dynamically"
   * rename __pci_sriov_resource_size to pcibios_sriov_resource_size
   * rename __pci_sriov_resource_alignment to pcibios_sriov_resource_alignment
v2:
   * change the return value of virtfn_bus/virtfn_devfn to 0
   * move some TCE-related macro definitions to
     arch/powerpc/platforms/powernv/pci.h
   * fix __pci_sriov_resource_alignment on the powernv platform;
     during the sizing stage, the IOV BAR was truncated to 0, which would
     affect the allocation order. Fix this to make sure BARs are allocated
     in order of their alignment.
v1:
   * improve the change logs for
     "PCI: Add weak __pci_sriov_resource_size() interface"
     "PCI: Add weak __pci_sriov_resource_alignment() interface"
     "PCI: take additional IOV BAR alignment in sizing and assigning"
   * wrap the VF PE code in CONFIG_PCI_IOV
   * regression tested on P7
Gavin Shan (1):
  powrepc/pci: Refactor pci_dn

Wei Yang (16):
  PCI/IOV: Export interface for retrieve VF's BDF
  PCI/IOV: add VF enable/disable hook
  PCI: Add weak pcibios_iov_resource_alignment() interface
  PCI: Store VF BAR size in pci_sriov
  PCI: Take additional PF's IOV BAR alignment in sizing and assigning
  powerpc/pci: Add PCI resource alignment documentation
  powerpc/pci: Don't unset pci resources for VFs
  powerpc/pci: remove pci_dn->pcidev field
  powerpc/powernv: Use pci_dn in PCI config accessor
  powerpc/powernv: Allocate pe->iommu_table dynamically
  powerpc/powernv: Reserve additional space for IOV BAR according to
    the number of total_pe
  powerpc/powernv: Implement pcibios_iov_resource_alignment() on
    powernv
  powerpc/powernv: Shift VF resource with an offset
  powerpc/powernv: Allocate VF PE
  powerpc/powernv: Reserve additional space for IOV BAR, with
    m64_per_iov supported
  powerpc/powernv: Group VF PE when IOV BAR is big on PHB3

 .../powerpc/pci_iov_resource_on_powernv.txt        |  215 ++++++
 arch/powerpc/include/asm/device.h                  |    3 +
 arch/powerpc/include/asm/iommu.h                   |    3 +
 arch/powerpc/include/asm/machdep.h                 |    7 +
 arch/powerpc/include/asm/pci-bridge.h              |   24 +-
 arch/powerpc/kernel/pci-common.c                   |   23 +
 arch/powerpc/kernel/pci_dn.c                       |  251 ++++++-
 arch/powerpc/platforms/powernv/eeh-powernv.c       |   14 +-
 arch/powerpc/platforms/powernv/pci-ioda.c          |  739 +++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c               |   87 +--
 arch/powerpc/platforms/powernv/pci.h               |   13 +-
 drivers/pci/iov.c                                  |   80 ++-
 drivers/pci/pci.h                                  |    2 +
 drivers/pci/setup-bus.c                            |   85 ++-
 include/linux/pci.h                                |   17 +
 15 files changed, 1449 insertions(+), 114 deletions(-)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 168+ messages in thread

* [PATCH V10 01/17] PCI/IOV: Export interface for retrieve VF's BDF
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

When implementing SR-IOV on the PowerNV platform, some resource reservation is
needed for VFs which don't exist at boot time. To match resources to VFs, the
code needs to get the VFs' BDFs in advance.

This patch exports the interface to retrieve a VF's BDF:
   * make virtfn_bus an exported interface
   * make virtfn_devfn an exported interface
   * rename them with more specific names
   * clean up code in pci_sriov_resource_alignment()
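The routing-ID arithmetic behind these helpers can be modeled standalone as
below (a sketch mirroring the diff in this patch; the function names here are
illustrative, not the kernel API). Per SR-IOV, a VF's routing ID is the PF's
bus/devfn plus the VF Offset plus VF Stride times the VF index:

```c
#include <assert.h>

/* Bus number of VF 'id': high bits of the summed routing ID spill over
 * into the bus number. */
static int vf_bus(int pf_bus, int pf_devfn, int offset, int stride, int id)
{
	return pf_bus + ((pf_devfn + offset + stride * id) >> 8);
}

/* devfn of VF 'id': low 8 bits of the summed routing ID. */
static int vf_devfn(int pf_devfn, int offset, int stride, int id)
{
	return (pf_devfn + offset + stride * id) & 0xff;
}
```

For example, with a PF at 06:0d.0 (devfn 0x68) and assumed offset 0x80,
stride 1, VF 0 lands at 06:1d.0 (devfn 0xe8), while a large enough VF index
spills onto the next bus, which is why sriov_enable() must check
busn_res.end.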

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   22 +++++++++++++---------
 include/linux/pci.h |   11 +++++++++++
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ea3a82c..e76d1a0 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -19,14 +19,18 @@
 
 #define VIRTFN_ID_LEN	16
 
-static inline u8 virtfn_bus(struct pci_dev *dev, int id)
+int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
 				    dev->sriov->stride * id) >> 8);
 }
 
-static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * id) & 0xff;
 }
@@ -62,7 +66,7 @@ static inline void pci_iov_max_bus_range(struct pci_dev *dev)
 
 	for ( ; total >= 0; total--) {
 		pci_iov_set_numvfs(dev, total);
-		busnr = virtfn_bus(dev, iov->total_VFs - 1);
+		busnr = pci_iov_virtfn_bus(dev, iov->total_VFs - 1);
 		if (busnr > max)
 			max = busnr;
 	}
@@ -108,7 +112,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	struct pci_bus *bus;
 
 	mutex_lock(&iov->dev->sriov->lock);
-	bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
+	bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
 	if (!bus)
 		goto failed;
 
@@ -116,7 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	if (!virtfn)
 		goto failed0;
 
-	virtfn->devfn = virtfn_devfn(dev, id);
+	virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
 	virtfn->vendor = dev->vendor;
 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
 	pci_setup_device(virtfn);
@@ -179,8 +183,8 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 	struct pci_sriov *iov = dev->sriov;
 
 	virtfn = pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
-					     virtfn_bus(dev, id),
-					     virtfn_devfn(dev, id));
+					     pci_iov_virtfn_bus(dev, id),
+					     pci_iov_virtfn_devfn(dev, id));
 	if (!virtfn)
 		return;
 
@@ -255,7 +259,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	iov->offset = offset;
 	iov->stride = stride;
 
-	if (virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
+	if (pci_iov_virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
 		dev_err(&dev->dev, "SR-IOV: bus number out of range\n");
 		return -ENOMEM;
 	}
@@ -551,7 +555,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 	if (!reg)
 		return 0;
 
-	 __pci_read_base(dev, pci_bar_unknown, &tmp, reg);
+	__pci_read_base(dev, pci_bar_unknown, &tmp, reg);
 	return resource_alignment(&tmp);
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 360a966..74ef944 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1658,6 +1658,9 @@ int pci_ext_cfg_avail(void);
 void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
 
 #ifdef CONFIG_PCI_IOV
+int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
+
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 int pci_num_vf(struct pci_dev *dev);
@@ -1665,6 +1668,14 @@ int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
 #else
+static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
+static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
-- 
1.7.9.5



* [PATCH V10 02/17] PCI/IOV: add VF enable/disable hook
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

VFs are dynamically created/released when the driver enables/disables them. On
some platforms, such as PowerNV, special resources are necessary to enable VFs.

This patch adds two weak hooks: one for platform setup before the VFs are
created, and one for platform cleanup when they are disabled.
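A minimal demo of the __weak hook pattern the patch relies on: the PCI core
provides a default that does nothing, and an arch overrides it simply by
defining a strong symbol with the same name (the function name below is
hypothetical, to keep the demo self-contained):

```c
#include <assert.h>

/* Weak default, analogous to the pcibios_sriov_enable() stub added below:
 * a platform that needs no special setup gets this no-op for free. */
int __attribute__((weak)) demo_pcibios_sriov_enable(int vf_num)
{
	return 0;	/* default: no platform setup needed */
}
```

An arch such as powernv would supply its own non-weak
demo_pcibios_sriov_enable(), and the linker picks the strong definition; a
non-zero return from the override aborts sriov_enable() before any VF is
created.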

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index e76d1a0..5437fad0 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -213,6 +213,11 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 	pci_dev_put(dev);
 }
 
+int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+       return 0;
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
 	int rc;
@@ -223,6 +228,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	struct pci_dev *pdev;
 	struct pci_sriov *iov = dev->sriov;
 	int bars = 0;
+	int retval;
 
 	if (!nr_virtfn)
 		return 0;
@@ -297,6 +303,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	if (nr_virtfn < initial)
 		initial = nr_virtfn;
 
+	if ((retval = pcibios_sriov_enable(dev, initial))) {
+		dev_err(&dev->dev, "Failure %d from pcibios_sriov_setup()\n",
+			retval);
+		return retval;
+	}
+
 	for (i = 0; i < initial; i++) {
 		rc = virtfn_add(dev, i, 0);
 		if (rc)
@@ -325,6 +337,11 @@ failed:
 	return rc;
 }
 
+int __weak pcibios_sriov_disable(struct pci_dev *pdev)
+{
+       return 0;
+}
+
 static void sriov_disable(struct pci_dev *dev)
 {
 	int i;
@@ -336,6 +353,8 @@ static void sriov_disable(struct pci_dev *dev)
 	for (i = 0; i < iov->num_VFs; i++)
 		virtfn_remove(dev, i, 0);
 
+	pcibios_sriov_disable(dev);
+
 	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
 	pci_cfg_access_lock(dev);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
-- 
1.7.9.5



* [PATCH V10 03/17] PCI: Add weak pcibios_iov_resource_alignment() interface
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

The alignment of a PF's IOV BAR is designed to be the size of an individual
VF BAR. This works fine on most platforms, but the PowerNV platform needs a
different scheme.

The original alignment is sufficient because, at the sizing and assigning
stage, the requirement comes from an individual VF's BAR size rather than
from the PF's whole IOV BAR. This is why the original code simply uses the
individual VF BAR size as the alignment.

On the PowerNV platform, the whole PF IOV BAR must be aligned to a hardware
segment, so the alignment of the PF's IOV BAR has to be calculated
separately.

This patch introduces a weak pcibios_iov_resource_alignment() interface,
which gives platforms a chance to implement a specific method for
calculating the PF's IOV BAR alignment.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   11 ++++++++++-
 include/linux/pci.h |    3 +++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5437fad0..554dd64 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -556,6 +556,12 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
 		4 * (resno - PCI_IOV_RESOURCES);
 }
 
+resource_size_t __weak pcibios_iov_resource_alignment(struct pci_dev *dev,
+		int resno, resource_size_t align)
+{
+	return align;
+}
+
 /**
  * pci_sriov_resource_alignment - get resource alignment for VF BAR
  * @dev: the PCI device
@@ -570,12 +576,15 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 {
 	struct resource tmp;
 	int reg = pci_iov_resource_bar(dev, resno);
+	resource_size_t align;
 
 	if (!reg)
 		return 0;
 
 	__pci_read_base(dev, pci_bar_unknown, &tmp, reg);
-	return resource_alignment(&tmp);
+	align = resource_alignment(&tmp);
+
+	return pcibios_iov_resource_alignment(dev, resno, align);
 }
 
 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 74ef944..ae7a7ea 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1163,6 +1163,9 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 void pci_setup_bridge(struct pci_bus *bus);
 resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 					 unsigned long type);
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev,
+						 int resno,
+						 resource_size_t align);
 
 #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
 #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
-- 
1.7.9.5


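The contract of the new hook can be modelled in plain C: the generic code computes the per-VF BAR alignment, and the platform may replace it. The default (the `__weak` body) returns the value unchanged; a PowerNV-style platform would return something at least as large as its hardware (M64) segment. This is a sketch — the 64MB segment size and the function names below are illustrative assumptions, not values taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* What the __weak stub does: keep the per-VF BAR alignment as-is. */
static uint64_t default_iov_alignment(uint64_t align)
{
	return align;
}

/* A segment-based platform override: the PF's IOV BAR must be aligned
 * to at least one hardware segment, so the alignment is raised to the
 * segment size when the per-VF alignment is smaller. */
static uint64_t segmented_iov_alignment(uint64_t align, uint64_t seg)
{
	return align > seg ? align : seg;
}
```

For example, with an assumed 64MB segment, a 64KB per-VF alignment is raised to 64MB, while a 128MB per-VF alignment is kept unchanged.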


* [PATCH V10 04/17] PCI: Store VF BAR size in pci_sriov
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

Currently we don't store the VF BAR size; each time it is needed, we
calculate it by dividing the PF's IOV BAR size by total_VFs.

This patch stores the VF BAR size in pci_sriov and introduces a function to
retrieve it. It also adds a log message showing the PF's total IOV BAR size.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   28 ++++++++++++++++++++--------
 drivers/pci/pci.h   |    2 ++
 include/linux/pci.h |    3 +++
 3 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 554dd64..9a3e16c 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -100,6 +100,14 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
 		pci_remove_bus(virtbus);
 }
 
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{
+	if (!dev->is_physfn)
+		return 0;
+
+	return dev->sriov->res[resno - PCI_IOV_RESOURCES];
+}
+
 static int virtfn_add(struct pci_dev *dev, int id, int reset)
 {
 	int i;
@@ -135,8 +143,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 			continue;
 		virtfn->resource[i].name = pci_name(virtfn);
 		virtfn->resource[i].flags = res->flags;
-		size = resource_size(res);
-		do_div(size, iov->total_VFs);
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
 		virtfn->resource[i].start = res->start + size * id;
 		virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
 		rc = request_resource(res, &virtfn->resource[i]);
@@ -419,6 +426,12 @@ found:
 	pgsz &= ~(pgsz - 1);
 	pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
 
+	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+	if (!iov) {
+		rc = -ENOMEM;
+		goto failed;
+	}
+
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = dev->resource + PCI_IOV_RESOURCES + i;
@@ -430,16 +443,15 @@ found:
 			rc = -EIO;
 			goto failed;
 		}
+		iov->res[res - dev->resource - PCI_IOV_RESOURCES] =
+			resource_size(res);
 		res->end = res->start + resource_size(res) * total - 1;
+		dev_info(&dev->dev, "VF BAR%ld: %pR (for %d VFs)",
+				res - dev->resource - PCI_IOV_RESOURCES,
+				res, total);
 		nres++;
 	}
 
-	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
-	if (!iov) {
-		rc = -ENOMEM;
-		goto failed;
-	}
-
 	iov->pos = pos;
 	iov->nres = nres;
 	iov->ctrl = ctrl;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 94faf97..b1c9fdd 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -241,6 +241,8 @@ struct pci_sriov {
 	struct pci_dev *dev;	/* lowest numbered PF */
 	struct pci_dev *self;	/* this PF */
 	struct mutex lock;	/* lock for VF bus */
+	resource_size_t res[PCI_SRIOV_NUM_BARS];
+				/* VF BAR size */
 };
 
 #ifdef CONFIG_PCI_ATS
diff --git a/include/linux/pci.h b/include/linux/pci.h
index ae7a7ea..f0b5f87 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1670,6 +1670,7 @@ int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 #else
 static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
 {
@@ -1689,6 +1690,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
 { return 0; }
 static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 { return 0; }
+static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{ return 0; }
 #endif
 
 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
-- 
1.7.9.5


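The arithmetic in virtfn_add() after this patch is simple slicing: the PF's IOV BAR covers total_VFs equal slices, each one cached VF BAR size wide, and VF `id` gets slice number `id`. A minimal sketch (names illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* res->start + size * id, from the patched virtfn_add(). */
static uint64_t vf_bar_start(uint64_t iov_start, uint64_t vf_size, int id)
{
	return iov_start + vf_size * (uint64_t)id;
}

/* virtfn->resource[i].end = start + size - 1 */
static uint64_t vf_bar_end(uint64_t iov_start, uint64_t vf_size, int id)
{
	return vf_bar_start(iov_start, vf_size, id) + vf_size - 1;
}
```

With an IOV BAR at 0x80000000 and a 64KB VF BAR size, VF 3 occupies [0x80030000, 0x8003ffff].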


* [PATCH V10 05/17] PCI: Take additional PF's IOV BAR alignment in sizing and assigning
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

At the resource sizing/assigning stage, resources are divided into two lists,
the requested list and the additional list, but the alignment of the
additional IOV BAR is not taken into account in the sizing and assigning
procedure.

This was reasonable in the original implementation, since the IOV BAR's
alignment is usually the size of an individual VF BAR, which means the
alignment is already taken into consideration. However, this rule may be
violated on some platforms, e.g. PowerNV.

This patch takes the additional IOV BAR alignment into account explicitly at
the sizing and assigning stage. When system MMIO space is not enough, the
PF's IOV BAR alignment will not contribute to the bridge; when system MMIO
space is enough, the additional alignment will contribute to the bridge.

It also takes advantage of pci_dev_resource::min_align to store this
additional alignment.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/setup-bus.c |   85 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 71 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 0482235..05c7df0 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -99,8 +99,8 @@ static void remove_from_list(struct list_head *head,
 	}
 }
 
-static resource_size_t get_res_add_size(struct list_head *head,
-					struct resource *res)
+static struct pci_dev_resource *res_to_dev_res(struct list_head *head,
+					       struct resource *res)
 {
 	struct pci_dev_resource *dev_res;
 
@@ -109,17 +109,37 @@ static resource_size_t get_res_add_size(struct list_head *head,
 			int idx = res - &dev_res->dev->resource[0];
 
 			dev_printk(KERN_DEBUG, &dev_res->dev->dev,
-				 "res[%d]=%pR get_res_add_size add_size %llx\n",
+				 "res[%d]=%pR res_to_dev_res add_size %llx min_align %llx\n",
 				 idx, dev_res->res,
-				 (unsigned long long)dev_res->add_size);
+				 (unsigned long long)dev_res->add_size,
+				 (unsigned long long)dev_res->min_align);
 
-			return dev_res->add_size;
+			return dev_res;
 		}
 	}
 
-	return 0;
+	return NULL;
+}
+
+static resource_size_t get_res_add_size(struct list_head *head,
+					struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->add_size : 0;
+}
+
+static resource_size_t get_res_add_align(struct list_head *head,
+		struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->min_align : 0;
 }
 
+
 /* Sort resources by alignment */
 static void pdev_sort_resources(struct pci_dev *dev, struct list_head *head)
 {
@@ -368,8 +388,9 @@ static void __assign_resources_sorted(struct list_head *head,
 	LIST_HEAD(save_head);
 	LIST_HEAD(local_fail_head);
 	struct pci_dev_resource *save_res;
-	struct pci_dev_resource *dev_res, *tmp_res;
+	struct pci_dev_resource *dev_res, *tmp_res, *dev_res2;
 	unsigned long fail_type;
+	resource_size_t add_align, align;
 
 	/* Check if optional add_size is there */
 	if (!realloc_head || list_empty(realloc_head))
@@ -384,10 +405,38 @@ static void __assign_resources_sorted(struct list_head *head,
 	}
 
 	/* Update res in head list with add_size in realloc_head list */
-	list_for_each_entry(dev_res, head, list)
+	list_for_each_entry_safe(dev_res, tmp_res, head, list) {
 		dev_res->res->end += get_res_add_size(realloc_head,
 							dev_res->res);
 
+		/* 
+		 * There are two kinds additional resources in the list:
+		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
+		 * 2. SRIOV resource   -- IORESOURCE_SIZEALIGN
+		 * Here just fix the additional alignment for bridge
+		 */
+		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
+			continue;
+
+		add_align = get_res_add_align(realloc_head, dev_res->res);
+
+		/* Reorder the list by their alignment */
+		if (add_align > dev_res->res->start) {
+			dev_res->res->start = add_align;
+			dev_res->res->end = add_align +
+				            resource_size(dev_res->res);
+
+			list_for_each_entry(dev_res2, head, list) {
+				align = pci_resource_alignment(dev_res2->dev,
+							       dev_res2->res);
+				if (add_align > align)
+					list_move_tail(&dev_res->list,
+						       &dev_res2->list);
+			}
+               }
+
+	}
+
 	/* Try updated head list with add_size added */
 	assign_requested_resources_sorted(head, &local_fail_head);
 
@@ -930,6 +979,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	struct resource *b_res = find_free_bus_resource(bus,
 					mask | IORESOURCE_PREFETCH, type);
 	resource_size_t children_add_size = 0;
+	resource_size_t children_add_align = 0;
+	resource_size_t add_align = 0;
 
 	if (!b_res)
 		return -ENOSPC;
@@ -954,6 +1005,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			/* put SRIOV requested res to the optional list */
 			if (realloc_head && i >= PCI_IOV_RESOURCES &&
 					i <= PCI_IOV_RESOURCE_END) {
+				add_align = max(pci_resource_alignment(dev, r), add_align);
 				r->end = r->start - 1;
 				add_to_list(realloc_head, dev, r, r_size, 0/* don't care */);
 				children_add_size += r_size;
@@ -984,19 +1036,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			if (order > max_order)
 				max_order = order;
 
-			if (realloc_head)
+			if (realloc_head) {
 				children_add_size += get_res_add_size(realloc_head, r);
+				children_add_align = get_res_add_align(realloc_head, r);
+				add_align = max(add_align, children_add_align);
+			}
 		}
 	}
 
 	min_align = calculate_mem_align(aligns, max_order);
 	min_align = max(min_align, window_alignment(bus, b_res->flags));
 	size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), min_align);
+	add_align = max(min_align, add_align);
 	if (children_add_size > add_size)
 		add_size = children_add_size;
 	size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
 		calculate_memsize(size, min_size, add_size,
-				resource_size(b_res), min_align);
+				resource_size(b_res), add_align);
 	if (!size0 && !size1) {
 		if (b_res->start || b_res->end)
 			dev_info(&bus->self->dev, "disabling bridge window %pR to %pR (unused)\n",
@@ -1008,10 +1064,11 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	b_res->end = size0 + min_align - 1;
 	b_res->flags |= IORESOURCE_STARTALIGN;
 	if (size1 > size0 && realloc_head) {
-		add_to_list(realloc_head, bus->self, b_res, size1-size0, min_align);
-		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx\n",
-			   b_res, &bus->busn_res,
-			   (unsigned long long)size1-size0);
+		add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
+		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window "
+				 "%pR to %pR add_size %llx add_align %llx\n", b_res,
+				 &bus->busn_res, (unsigned long long)size1-size0,
+				 add_align);
 	}
 	return 0;
 }
-- 
1.7.9.5


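The sizing rule this patch changes can be sketched in userspace C: the optional (realloc) bridge window is now rounded up to add_align = max(min_align, largest IOV BAR alignment) instead of min_align alone. This is a simplified model of the calculate_memsize() call, not the full bridge-sizing logic; the names and values are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a power-of-two alignment. */
static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Model of the patched size1 computation: the optional window, which
 * includes the SRIOV BARs, is rounded to add_align rather than to
 * min_align. */
static uint64_t optional_window_size(uint64_t size, uint64_t min_align,
				     uint64_t iov_align)
{
	uint64_t add_align = iov_align > min_align ? iov_align : min_align;

	return align_up(size, add_align);
}
```

For instance, a 96MB requirement with a 1MB bridge alignment but a 64MB IOV BAR alignment is rounded up to 128MB; with a small IOV alignment it stays at 96MB.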


* [PATCH V10 06/17] powerpc/pci: Add PCI resource alignment documentation
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

In order to enable SRIOV on the PowerNV platform, the PF's IOV BAR needs to be
adjusted:
    1. its size is expanded
    2. it is aligned to the M64BT size

This patch documents the reason for this change and how it is done.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 .../powerpc/pci_iov_resource_on_powernv.txt        |  215 ++++++++++++++++++++
 1 file changed, 215 insertions(+)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
new file mode 100644
index 0000000..10d4ac2
--- /dev/null
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -0,0 +1,215 @@
+Wei Yang <weiyang@linux.vnet.ibm.com>
+Benjamin Herrenschmidt <benh@au1.ibm.com>
+26 Aug 2014
+
+This document describes the hardware requirements for PCI MMIO resource
+sizing and assignment on the PowerNV platform and how the generic PCI code
+handles them. The first two sections describe the concept of a PE and its
+implementation on P8 (IODA2).
+
+1. General Introduction on the Purpose of PE
+PE stands for Partitionable Endpoint.
+
+The concept of PE is a way to group the various resources associated
+with a device or a set of devices to provide isolation between partitions
+(i.e. filtering of DMA, MSIs etc...) and to provide a mechanism to freeze
+a device that is causing errors in order to limit the possibility of
+propagation of bad data.
+
+There is thus, in HW, a table of PE states that contains a pair of
+"frozen" state bits (one for MMIO and one for DMA, they get set together
+but can be cleared independently) for each PE.
+
+When a PE is frozen, all stores in any direction are dropped and all loads
+return all 1's. MSIs are also blocked. There's a bit more state that
+captures things like the details of the error that caused the freeze etc...
+but that's not critical.
+
+The interesting part is how the various types of PCIe transactions (MMIO,
+DMA,...) are matched to their corresponding PEs.
+
+The following section provides a rough description of what we have on P8 (IODA2).
+Keep in mind that this is all per PHB (host bridge). Each PHB is a completely
+separate HW entity which replicates the entire logic, so has its own set
+of PEs etc...
+
+2. Implementation of PE on P8 (IODA2)
+First, P8 has 256 PEs per PHB.
+
+ * Inbound
+
+For DMA, MSIs and inbound PCIe error messages, we have a table (in memory but
+accessed in HW by the chip) that provides a direct correspondence between
+a PCIe RID (bus/dev/fn) and a "PE" number. We call this the RTT.
+
+ - For DMA we then provide an entire address space for each PE that can contain
+two "windows", depending on the value of PCI address bit 59. Each window can then be
+configured to be remapped via a "TCE table" (iommu translation table), which has
+various configurable characteristics which we can describe another day.
+
+ - For MSIs, we have two windows in the address space (one at the top of the 32-bit
+space and one much higher) which, via a combination of the address and MSI value,
+will result in one of the 2048 interrupts per bridge being triggered. There's
+a PE value in the interrupt controller descriptor table as well which is compared
+with the PE obtained from the RTT to "authorize" the device to emit that specific
+interrupt.
+
+ - Error messages just use the RTT.
+
+ * Outbound. That's where the tricky part is.
+
+The PHB basically has a concept of "windows" from the CPU address space to the
+PCI address space. There is one M32 window and 16 M64 windows. They have
+different characteristics. First, what they have in common: they are configured
+to forward a configurable portion of the CPU address space to the PCIe bus, and
+they must be a naturally aligned power of two in size. The rest differs:
+
+  - The M32 window:
+
+    * It is limited to 4G in size
+
+    * It drops the top bits of the address (above the size) and replaces them
+with a configurable value. This is typically used to generate 32-bit PCIe
+accesses. We configure that window at boot from FW and don't touch it from
+Linux; it's usually set to forward a 2G portion of address space from the CPU
+to PCIe 0x8000_0000..0xffff_ffff. (Note: the top 64K are actually reserved for
+MSIs, but this is not a problem at this point; we just need to ensure Linux
+doesn't assign anything there. The M32 logic ignores that, however, and will
+forward in that space if we try.)
+
+    * It is divided into 256 segments of equal size. A table in the chip
+provides a PE# for each of these 256 segments. That allows us to assign
+portions of the MMIO space to PEs at segment granularity. For a 2G window,
+a segment is 8M.
+
+Now, this is the "main" window we use in Linux today (excluding SR-IOV). We
+basically use the trick of forcing the bridge MMIO windows onto a segment
+alignment/granularity so that the space behind a bridge can be assigned to a PE.
+
+Ideally we would like to be able to have individual functions in their own
+PEs, but that would mean using a completely different address allocation
+scheme where individual function BARs can be "grouped" to fit in one or more
+segments.
+
+ - The M64 windows.
+
+   * Their smallest size is 1M
+
+   * They do not translate addresses (the address on PCIe is the same as the
+address on the PowerBus; there is a way to also set the top 14 bits, which are
+not conveyed by the PowerBus, but we don't use it).
+
+   * They can be configured to be segmented or not. When segmented, they have
+256 segments, but they are not remapped: the segment number *is* the PE
+number. When not segmented, the PE number can be specified for the entire
+window.
+
+   * They support overlaps in which case there is a well defined ordering of
+matching (I don't remember off hand which of the lower or higher numbered
+window takes priority but basically it's well defined).
+
+We have code (fairly new compared to the M32 stuff) that exploits that for
+large BARs in 64-bit space:
+
+We create a single big M64 that covers the entire region of address space that
+has been assigned by FW for the PHB (about 64G, ignore the space for the M32,
+it comes out of a different "reserve"). We configure that window as segmented.
+
+Then we do the same thing as with M32, using the bridge alignment trick, to
+match to those giant segments.
+
+Since we cannot remap, we have two additional constraints:
+
+  - We do the PE# allocation *after* the 64-bit space has been assigned, since
+the segments used directly determine the PE#. We then "update" the M32 PE#
+for the devices that use both 32-bit and 64-bit spaces, or assign the
+remaining PE#s to 32-bit only devices.
+
+  - We cannot "group" segments in HW, so if a device ends up using more than
+one segment, we end up with more than one PE#. There is a HW mechanism to
+make the freeze state cascade to "companion" PEs, but that only works for PCIe
+error messages (typically used so that if you freeze a switch, it freezes all
+its children). So we do it in SW. We lose a bit of effectiveness of EEH in
+that case, but that's the best we have found. So when any of the PEs freezes,
+we freeze the other ones for that "domain". We thus introduce the concept of
+a "master PE", which is the one used for DMA, MSIs etc... and "secondary PEs"
+that are used for the remaining M64 segments.
+
+We would like to investigate using additional M64s in "single PE" mode,
+overlaid over specific BARs, to work around some of that; it would make sense
+for example for devices with very large BARs (some GPUs), but we haven't done
+it yet.
+
+Finally, we plan to use M64 for SR-IOV, which is described in more detail in
+the next two sections. So for a given IOV BAR, we need to effectively reserve
+the entire 256 segments (256 * IOV BAR size) and then "position" the BAR to
+start at the beginning of a free range of segments/PEs inside that M64.
+
+The goal is of course to be able to give a separate PE for each VF...
+
+3. Hardware requirements on the PowerNV platform for SRIOV
+The PowerNV platform (IODA2 version) has 16 M64 BARs, which are used to map
+MMIO ranges to PE#s. Each M64 BAR covers one MMIO range, and this range is
+divided evenly into *total_pe* pieces, with one piece corresponding to one PE.
+
+We decided to leverage the M64 BARs to map VFs to their individual PEs, since
+the BARs of all the VFs of an SRIOV device share the same size.
+
+Doing so introduces another problem: the *total_pe* number is usually bigger
+than total_VFs. If we map one IOV BAR directly to one M64 BAR, part of the
+M64 BAR will map to another device's MMIO range.
+
+     0      1                     total_VFs - 1
+     +------+------+-     -+------+------+
+     |      |      |  ...  |      |      |
+     +------+------+-     -+------+------+
+
+                           IOV BAR
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           M64 BAR
+
+		Figure 1.0 Direct map IOV BAR
+
+As Figure 1.0 indicates, the range [total_VFs, total_pe - 1] in the M64 BAR
+may map to an MMIO range belonging to some other device.
+
+The solution we currently have is to expand the IOV BAR to *total_pe* entries.
+
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           IOV BAR
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           M64 BAR
+
+		Figure 1.1 Map expanded IOV BAR
+
+Expanding the IOV BAR ensures that the whole M64 range does not affect other
+devices.
+
+4. How the generic PCI code handles it
+So far this looks workable, but another problem arises. The M64 BAR start
+address needs to be aligned to its size, while the generic PCI code originally
+assigns the IOV BAR aligned only to the individual VF BAR size.
+
+Since one SRIOV VF BAR is usually the same size as its PF BAR, the generic
+PCI code did not need to account for any extra IOV BAR alignment (the
+alignment was the same as the PF's). With the change on the PowerNV platform,
+the alignment of the IOV BAR becomes its total size, which we now need to
+take into account.
+
+From:
+	alignment(IOV BAR) = size(VF BAR) = size(PF BAR)
+To:
+	alignment(IOV BAR) = size(IOV BAR)
+
+The commit "PCI: Take additional IOV BAR alignment in sizing and assigning"
+adds add_align to track the alignment of the IOV BAR and uses it to meet this
+requirement.
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 07/17] powerpc/pci: Don't unset pci resources for VFs
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

If we're going to reassign resources with flag PCI_REASSIGN_ALL_RSRC, all
resources will be cleaned out during device header fixup time and then get
reassigned by PCI core. However, the VF resources won't be reassigned and
thus, we shouldn't clean them out.

This patch adds a check: if the pci_dev is a VF, skip the resource
unset process.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-common.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 37d512d..889f743 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
 		       pci_name(dev));
 		return;
 	}
+
+	if (dev->is_virtfn)
+		return;
+
 	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
 		struct resource *res = dev->resource + i;
 		struct pci_bus_region reg;
-- 
1.7.9.5



* [PATCH V10 08/17] powerpc/pci: Refactor pci_dn
  2014-12-22  5:54 ` Wei Yang
                   ` (7 preceding siblings ...)
  (?)
@ 2014-12-22  5:54 ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Gavin Shan <gwshan@linux.vnet.ibm.com>

pci_dn is the extension of a PCI device node and is created from the
device node. Unfortunately, VFs are enabled dynamically by the PF's
driver and don't have corresponding device nodes or pci_dn structures.
The patch refactors pci_dn to support VFs:

   * pci_dn is organized as a hierarchy tree. A VF's pci_dn is put
     in the child list of the pci_dn of the PF's bridge. The pci_dn
     of any other device is put in the child list of the pci_dn of
     its upstream bridge.

   * A VF's pci_dn is expected to be created dynamically when the PF
     enables VFs, and destroyed when the PF disables VFs. The pci_dn
     of any other device is still created from its device node as
     before.

   * For one particular PCI device (VF or not), its pci_dn can be
     found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
     or parent's list. The fast path (fetching pci_dn through PCI
     device instance) is populated during early fixup time.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/device.h         |    3 +
 arch/powerpc/include/asm/pci-bridge.h     |   14 +-
 arch/powerpc/kernel/pci_dn.c              |  240 ++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci-ioda.c |   15 ++
 4 files changed, 267 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h
index 38faede..29992cd 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -34,6 +34,9 @@ struct dev_archdata {
 #ifdef CONFIG_SWIOTLB
 	dma_addr_t		max_direct_dma_addr;
 #endif
+#ifdef CONFIG_PPC64
+	void			*firmware_data;
+#endif
 #ifdef CONFIG_EEH
 	struct eeh_dev		*edev;
 #endif
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 725247b..74efea9 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -89,6 +89,7 @@ struct pci_controller {
 
 #ifdef CONFIG_PPC64
 	unsigned long buid;
+	void *firmware_data;
 #endif	/* CONFIG_PPC64 */
 
 	void *private_data;
@@ -150,9 +151,13 @@ static inline int isa_vaddr_is_ioport(void __iomem *address)
 struct iommu_table;
 
 struct pci_dn {
+	int     flags;
+#define PCI_DN_FLAG_IOV_VF     0x01
+
 	int	busno;			/* pci bus number */
 	int	devfn;			/* pci device and function number */
 
+	struct  pci_dn *parent;
 	struct  pci_controller *phb;	/* for pci devices */
 	struct	iommu_table *iommu_table;	/* for phb's or bridges */
 	struct	device_node *node;	/* back-pointer to the device_node */
@@ -167,14 +172,19 @@ struct pci_dn {
 #ifdef CONFIG_PPC_POWERNV
 	int	pe_number;
 #endif
+	struct list_head child_list;
+	struct list_head list;
 };
 
 /* Get the pointer to a device_node's pci_dn */
 #define PCI_DN(dn)	((struct pci_dn *) (dn)->data)
 
+extern struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+					   int devfn);
 extern struct pci_dn *pci_get_pdn(struct pci_dev *pdev);
-
-extern void * update_dn_pci_info(struct device_node *dn, void *data);
+extern struct pci_dn *add_dev_pci_info(struct pci_dev *pdev);
+extern void remove_dev_pci_info(struct pci_dev *pdev);
+extern void *update_dn_pci_info(struct device_node *dn, void *data);
 
 static inline int pci_device_from_OF_node(struct device_node *np,
 					  u8 *bus, u8 *devfn)
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 1f61fab..c99fb80 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -32,12 +32,222 @@
 #include <asm/ppc-pci.h>
 #include <asm/firmware.h>
 
+/*
+ * This function is used to find the firmware data of one
+ * specific PCI device, which is attached to the indicated
+ * PCI bus. For VFs, the firmware data is linked to that of
+ * the PF's bridge. For other devices, the firmware data is
+ * linked to that of their bridge.
+ */
+static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus)
+{
+	struct pci_bus *pbus;
+	struct device_node *dn;
+	struct pci_dn *pdn;
+
+	/*
+	 * We probably have virtual bus which doesn't
+	 * have associated bridge.
+	 */
+	pbus = bus;
+	while (pbus) {
+		if (pci_is_root_bus(pbus) || pbus->self)
+			break;
+
+		pbus = pbus->parent;
+	}
+
+	/*
+	 * Except virtual bus, all PCI buses should
+	 * have device nodes.
+	 */
+	dn = pci_bus_to_OF_node(pbus);
+	pdn = dn ? PCI_DN(dn) : NULL;
+
+	return pdn;
+}
+
+struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+				    int devfn)
+{
+	struct device_node *dn = NULL;
+	struct pci_dn *parent, *pdn;
+	struct pci_dev *pdev = NULL;
+
+	/* Fast path: fetch from PCI device */
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		if (pdev->devfn == devfn) {
+			if (pdev->dev.archdata.firmware_data)
+				return pdev->dev.archdata.firmware_data;
+
+			dn = pci_device_to_OF_node(pdev);
+			break;
+		}
+	}
+
+	/* Fast path: fetch from device node */
+	pdn = dn ? PCI_DN(dn) : NULL;
+	if (pdn)
+		return pdn;
+
+	/* Slow path: fetch from firmware data hierarchy */
+	parent = pci_bus_to_pdn(bus);
+	if (!parent)
+		return NULL;
+
+	list_for_each_entry(pdn, &parent->child_list, list) {
+		if (pdn->busno == bus->number &&
+		    pdn->devfn == devfn)
+			return pdn;
+	}
+
+	return NULL;
+}
+
 struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
 {
-	struct device_node *dn = pci_device_to_OF_node(pdev);
-	if (!dn)
+	struct device_node *dn;
+	struct pci_dn *parent, *pdn;
+
+	/* Search device directly */
+	if (pdev->dev.archdata.firmware_data)
+		return pdev->dev.archdata.firmware_data;
+
+	/* Check device node */
+	dn = pci_device_to_OF_node(pdev);
+	pdn = dn ? PCI_DN(dn) : NULL;
+	if (pdn)
+		return pdn;
+
+	/*
+	 * VFs don't have device nodes. We hook their
+	 * firmware data to PF's bridge.
+	 */
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
+		return NULL;
+
+	list_for_each_entry(pdn, &parent->child_list, list) {
+		if (pdn->busno == pdev->bus->number &&
+		    pdn->devfn == pdev->devfn)
+			return pdn;
+	}
+
+	return NULL;
+}
+
+static struct pci_dn *add_one_dev_pci_info(struct pci_dn *parent,
+					   struct pci_dev *pdev,
+					   int busno, int devfn)
+{
+	struct pci_dn *pdn;
+
+	/* Except PHB, we always have parent firmware data */
+	if (!parent)
+		return NULL;
+
+	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
+	if (!pdn) {
+		pr_warn("%s: Out of memory !\n", __func__);
+		return NULL;
+	}
+
+	pdn->phb = parent->phb;
+	pdn->parent = parent;
+	pdn->busno = busno;
+	pdn->devfn = devfn;
+#ifdef CONFIG_PPC_POWERNV
+	pdn->pe_number = IODA_INVALID_PE;
+#endif
+	INIT_LIST_HEAD(&pdn->child_list);
+	INIT_LIST_HEAD(&pdn->list);
+	list_add_tail(&pdn->list, &parent->child_list);
+
+	/*
+	 * If we already have PCI device instance, lets
+	 * bind them.
+	 */
+	if (pdev)
+		pdev->dev.archdata.firmware_data = pdn;
+
+	return pdn;
+}
+
+struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
+{
+#ifdef CONFIG_PCI_IOV
+	struct pci_dn *parent, *pdn;
+	int i;
+
+	/* Only support IOV for now */
+	if (!pdev->is_physfn)
+		return pci_get_pdn(pdev);
+
+	/* Check if VFs have been populated */
+	pdn = pci_get_pdn(pdev);
+	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
+		return NULL;
+
+	pdn->flags |= PCI_DN_FLAG_IOV_VF;
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
 		return NULL;
-	return PCI_DN(dn);
+
+	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
+		pdn = add_one_dev_pci_info(parent, NULL,
+					   pci_iov_virtfn_bus(pdev, i),
+					   pci_iov_virtfn_devfn(pdev, i));
+		if (!pdn) {
+			pr_warn("%s: Cannot create firmware data "
+				"for VF#%d of %s\n",
+				__func__, i, pci_name(pdev));
+			return NULL;
+		}
+	}
+#endif
+
+	return pci_get_pdn(pdev);
+}
+
+void remove_dev_pci_info(struct pci_dev *pdev)
+{
+#ifdef CONFIG_PCI_IOV
+	struct pci_dn *parent;
+	struct pci_dn *pdn, *tmp;
+	int i;
+
+	/* Only support IOV PF for now */
+	if (!pdev->is_physfn)
+		return;
+
+	/* Check if VFs have been populated */
+	pdn = pci_get_pdn(pdev);
+	if (!pdn || !(pdn->flags & PCI_DN_FLAG_IOV_VF))
+		return;
+
+	pdn->flags &= ~PCI_DN_FLAG_IOV_VF;
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
+		return;
+
+	/*
+	 * We might introduce a flag to pci_dn in the future
+	 * so that we can release the VFs' firmware data in
+	 * batch mode.
+	 */
+	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
+		list_for_each_entry_safe(pdn, tmp,
+			&parent->child_list, list) {
+			if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
+			    pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
+				continue;
+
+			if (!list_empty(&pdn->list))
+				list_del(&pdn->list);
+			kfree(pdn);
+		}
+	}
+#endif
 }
 
 /*
@@ -49,6 +259,7 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	struct pci_controller *phb = data;
 	const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
 	const __be32 *regs;
+	struct device_node *parent;
 	struct pci_dn *pdn;
 
 	pdn = zalloc_maybe_bootmem(sizeof(*pdn), GFP_KERNEL);
@@ -70,6 +281,15 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	}
 
 	pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
+
+	/* Attach to parent node */
+	INIT_LIST_HEAD(&pdn->child_list);
+	INIT_LIST_HEAD(&pdn->list);
+	parent = of_get_parent(dn);
+	pdn->parent = parent ? PCI_DN(parent) : NULL;
+	if (pdn->parent)
+		list_add_tail(&pdn->list, &pdn->parent->child_list);
+
 	return NULL;
 }
 
@@ -150,6 +370,7 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	if (pdn) {
 		pdn->devfn = pdn->busno = -1;
 		pdn->phb = phb;
+		phb->firmware_data = pdn;
 	}
 
 	/* Update dn->phb ptrs for new phb and children devices */
@@ -173,3 +394,16 @@ void __init pci_devs_phb_init(void)
 	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
 		pci_devs_phb_init_dynamic(phb);
 }
+
+static void pci_dev_pdn_setup(struct pci_dev *pdev)
+{
+	struct pci_dn *pdn;
+
+	if (pdev->dev.archdata.firmware_data)
+		return;
+
+	/* Setup the fast path */
+	pdn = pci_get_pdn(pdev);
+	pdev->dev.archdata.firmware_data = pdn;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index fac88ed..9faa5ec 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -949,6 +949,21 @@ static void pnv_pci_ioda_setup_PEs(void)
 	}
 }
 
+#ifdef CONFIG_PCI_IOV
+int pcibios_sriov_disable(struct pci_dev *pdev)
+{
+	/* Release firmware data */
+	remove_dev_pci_info(pdev);
+	return 0;
+}
+
+int pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+	add_dev_pci_info(pdev);
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev)
 {
 	struct pci_dn *pdn = pci_get_pdn(pdev);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 09/17] powerpc/pci: remove pci_dn->pcidev field
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

The field pci_dn->pcidev is assigned but never used.

This patch removes it.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    1 -
 arch/powerpc/platforms/powernv/pci-ioda.c |    1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 74efea9..93126cb 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -164,7 +164,6 @@ struct pci_dn {
 
 	int	pci_ext_config_space;	/* for pci devices */
 
-	struct	pci_dev *pcidev;	/* back-pointer to the pci device */
 #ifdef CONFIG_EEH
 	struct eeh_dev *edev;		/* eeh device */
 #endif
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9faa5ec..42b168f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -832,7 +832,6 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 				pci_name(dev));
 			continue;
 		}
-		pdn->pcidev = dev;
 		pdn->pe_number = pe->pe_number;
 		pe->dma_weight += pnv_ioda_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 10/17] powerpc/powernv: Use pci_dn in PCI config accessor
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

The PCI config accessors rely on the device node. Unfortunately, VFs
don't have corresponding device nodes, so we have to switch to pci_dn
for PCI config access.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/eeh-powernv.c |   14 +++++-
 arch/powerpc/platforms/powernv/pci.c         |   69 ++++++++++----------------
 arch/powerpc/platforms/powernv/pci.h         |    4 +-
 3 files changed, 40 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 1d19e79..c63b6c1 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -419,21 +419,31 @@ static inline bool powernv_eeh_cfg_blocked(struct device_node *dn)
 static int powernv_eeh_read_config(struct device_node *dn,
 				   int where, int size, u32 *val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn)) {
 		*val = 0xFFFFFFFF;
 		return PCIBIOS_SET_FAILED;
 	}
 
-	return pnv_pci_cfg_read(dn, where, size, val);
+	return pnv_pci_cfg_read(pdn, where, size, val);
 }
 
 static int powernv_eeh_write_config(struct device_node *dn,
 				    int where, int size, u32 val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn))
 		return PCIBIOS_SET_FAILED;
 
-	return pnv_pci_cfg_write(dn, where, size, val);
+	return pnv_pci_cfg_write(pdn, where, size, val);
 }
 
 /**
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 4945e87..b7d4b9d 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -366,9 +366,9 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
 	spin_unlock_irqrestore(&phb->lock, flags);
 }
 
-static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
-				     struct device_node *dn)
+static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
 {
+	struct pnv_phb *phb = pdn->phb->private_data;
 	u8	fstate;
 	__be16	pcierr;
 	int	pe_no;
@@ -379,7 +379,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	 * setup that yet. So all ER errors should be mapped to
 	 * reserved PE.
 	 */
-	pe_no = PCI_DN(dn)->pe_number;
+	pe_no = pdn->pe_number;
 	if (pe_no == IODA_INVALID_PE) {
 		if (phb->type == PNV_PHB_P5IOC2)
 			pe_no = 0;
@@ -407,8 +407,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 
 	cfg_dbg(" -> EEH check, bdfn=%04x PE#%d fstate=%x\n",
-		(PCI_DN(dn)->busno << 8) | (PCI_DN(dn)->devfn),
-		pe_no, fstate);
+		(pdn->busno << 8) | (pdn->devfn), pe_no, fstate);
 
 	/* Clear the frozen state if applicable */
 	if (fstate == OPAL_EEH_STOPPED_MMIO_FREEZE ||
@@ -425,10 +424,9 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 }
 
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 	s64 rc;
@@ -462,10 +460,9 @@ int pnv_pci_cfg_read(struct device_node *dn,
 	return PCIBIOS_SUCCESSFUL;
 }
 
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 
@@ -489,18 +486,17 @@ int pnv_pci_cfg_write(struct device_node *dn,
 }
 
 #if CONFIG_EEH
-static bool pnv_pci_cfg_check(struct pci_controller *hose,
-			      struct device_node *dn)
+static bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	struct eeh_dev *edev = NULL;
-	struct pnv_phb *phb = hose->private_data;
+	struct pnv_phb *phb = pdn->phb->private_data;
 
 	/* EEH not enabled ? */
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
 		return true;
 
 	/* PE reset or device removed ? */
-	edev = of_node_to_eeh_dev(dn);
+	edev = pdn->edev;
 	if (edev) {
 		if (edev->pe &&
 		    (edev->pe->state & EEH_PE_CFG_BLOCKED))
@@ -513,8 +509,7 @@ static bool pnv_pci_cfg_check(struct pci_controller *hose,
 	return true;
 }
 #else
-static inline pnv_pci_cfg_check(struct pci_controller *hose,
-				struct device_node *dn)
+static inline pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	return true;
 }
@@ -524,32 +519,26 @@ static int pnv_pci_read_config(struct pci_bus *bus,
 			       unsigned int devfn,
 			       int where, int size, u32 *val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
 	*val = 0xFFFFFFFF;
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_read(dn, where, size, val);
-	if (phb->flags & PNV_PHB_FLAG_EEH) {
+	ret = pnv_pci_cfg_read(pdn, where, size, val);
+	phb = pdn->phb->private_data;
+	if (phb->flags & PNV_PHB_FLAG_EEH && pdn->edev) {
 		if (*val == EEH_IO_ERROR_VALUE(size) &&
-		    eeh_dev_check_failure(of_node_to_eeh_dev(dn)))
+		    eeh_dev_check_failure(pdn->edev))
                         return PCIBIOS_DEVICE_NOT_FOUND;
 	} else {
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 	}
 
 	return ret;
@@ -559,27 +548,21 @@ static int pnv_pci_write_config(struct pci_bus *bus,
 				unsigned int devfn,
 				int where, int size, u32 val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_write(dn, where, size, val);
+	ret = pnv_pci_cfg_write(pdn, where, size, val);
+	phb = pdn->phb->private_data;
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 
 	return ret;
 }
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 6c02ff8..e5b75b2 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -219,9 +219,9 @@ extern struct pnv_eeh_ops ioda_eeh_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val);
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val);
 extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 				      void *tce_mem, u64 tce_size,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 11/17] powerpc/powernv: Allocate pe->iommu_table dynamically
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

The iommu_table of a PE is currently a static field. This is a problem
when iommu_free_table is called.

This patch allocates the iommu_table dynamically.
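
A minimal user-space sketch of the pattern this patch introduces: once the
table is allocated separately from the PE, container_of() can no longer
recover the owning PE from the table pointer, so a data back-pointer is
stored instead. The structures below are simplified stand-ins for the
kernel's iommu_table and pnv_ioda_pe:

```c
#include <stdlib.h>

/* Simplified stand-in for struct iommu_table. */
struct iommu_table {
	void *data;			/* back-pointer, as added by this patch */
};

/* Simplified stand-in for struct pnv_ioda_pe. */
struct pnv_ioda_pe {
	int pe_number;
	struct iommu_table *tce32_table;	/* now a pointer, not embedded */
};

/*
 * Allocate the table dynamically and wire the back-pointer, mirroring the
 * kzalloc_node() + "pe->tce32_table->data = pe" hunk in this patch.
 */
static struct pnv_ioda_pe *pe_alloc(int pe_number)
{
	struct pnv_ioda_pe *pe = calloc(1, sizeof(*pe));

	pe->pe_number = pe_number;
	pe->tce32_table = calloc(1, sizeof(*pe->tce32_table));
	pe->tce32_table->data = pe;	/* replaces container_of() */
	return pe;
}

/* How callers such as the TCE invalidate path now find their PE. */
static struct pnv_ioda_pe *table_to_pe(struct iommu_table *tbl)
{
	return tbl->data;
}
```

With the embedded field, container_of(tbl, struct pnv_ioda_pe, tce32_table)
was valid; with a separate allocation it would compute a bogus address,
which is why the hunks above switch to tbl->data.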

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/iommu.h          |    3 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   26 ++++++++++++++------------
 arch/powerpc/platforms/powernv/pci.h      |    2 +-
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9cfa370..5574eeb 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,9 @@ struct iommu_table {
 	struct iommu_group *it_group;
 #endif
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
+#ifdef CONFIG_PPC_POWERNV
+	void           *data;
+#endif
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 42b168f..be0c31a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -890,6 +890,10 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 		return;
 	}
 
+	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+			GFP_KERNEL, hose->node);
+	pe->tce32_table->data = pe;
+
 	/* Associate it with all child devices */
 	pnv_ioda_setup_same_PE(bus, pe);
 
@@ -978,7 +982,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -1005,7 +1009,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, pe->tce32_table);
 	}
 	*pdev->dev.dma_mask = dma_mask;
 	return 0;
@@ -1042,9 +1046,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (add_to_iommu_group)
 			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       pe->tce32_table);
 		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, pe->tce32_table);
 
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1134,8 +1138,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
 				 __be64 *startp, __be64 *endp, bool rm)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	struct pnv_phb *phb = pe->phb;
 
 	if (phb->type == PNV_PHB_IODA1)
@@ -1201,7 +1204,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
 
@@ -1239,8 +1242,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -1285,10 +1287,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pe->tce_bypass_base = 1ull << 59;
 
 	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32_table->set_bypass = pnv_pci_ioda2_set_bypass;
 
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(pe->tce32_table, true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1336,7 +1338,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 			IOMMU_PAGE_SHIFT_4K);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index e5b75b2..7317777 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -53,7 +53,7 @@ struct pnv_ioda_pe {
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
 	int			tce32_seg;
 	int			tce32_segcount;
-	struct iommu_table	tce32_table;
+	struct iommu_table	*tce32_table;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 12/17] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

On PHB3, the PF's IOV BAR will be covered by an M64 BAR to provide better
PE isolation. In most cases total_pe differs from total_VFs, which leads
to a conflict between the MMIO space and the PE number.

For example, if total_VFs is 128 and total_pe is 256, the second half of
the M64 BAR space will be part of other PCI devices, which may already
belong to other PEs.

This patch reserves additional space for the PF's IOV BAR: total_pe times
the size of a single VF BAR. By doing so, it prevents the conflict.
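
The sizing rule described above can be sketched as simple arithmetic:
instead of reserving total_VFs * vf_bar_size, reserve total_pe *
vf_bar_size so the M64 segments past the last VF never cover another
device. This is an illustration of the rule only, not the fixup code
itself, and the numbers are made up:

```c
#include <stdint.h>

/*
 * Expanded IOV BAR size: one VF-BAR-sized segment per PE, so that every
 * M64 segment maps to a PE owned by this PF.
 */
static uint64_t iov_bar_size(uint64_t vf_bar_size, unsigned int total_vfs,
			     unsigned int total_pe)
{
	unsigned int n = total_pe > total_vfs ? total_pe : total_vfs;

	return (uint64_t)n * vf_bar_size;
}
```

For the example in the description (128 VFs, 256 PEs, a 1MB VF BAR), this
reserves 256MB rather than 128MB, so the second half of the M64 window
stays within this PF's resource.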

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h        |    4 ++
 arch/powerpc/include/asm/pci-bridge.h     |    3 ++
 arch/powerpc/kernel/pci-common.c          |    5 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   59 +++++++++++++++++++++++++++++
 4 files changed, 71 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index c8175a3..965547c 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -250,6 +250,10 @@ struct machdep_calls {
 	/* Reset the secondary bus of bridge */
 	void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
 
+#ifdef CONFIG_PCI_IOV
+	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
+#endif /* CONFIG_PCI_IOV */
+
 	/* Called to shutdown machine specific hardware not already controlled
 	 * by other drivers.
 	 */
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 93126cb..08c092e 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -170,6 +170,9 @@ struct pci_dn {
 #define IODA_INVALID_PE		(-1)
 #ifdef CONFIG_PPC_POWERNV
 	int	pe_number;
+#ifdef CONFIG_PCI_IOV
+	u16     max_vfs;		/* number of VFs IOV BAR expanded */
+#endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
 	struct list_head list;
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 889f743..832b7e1 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1636,6 +1636,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
 	if (ppc_md.pcibios_fixup_phb)
 		ppc_md.pcibios_fixup_phb(hose);
 
+#ifdef CONFIG_PCI_IOV
+	if (ppc_md.pcibios_fixup_sriov)
+		ppc_md.pcibios_fixup_sriov(bus);
+#endif /* CONFIG_PCI_IOV */
+
 	/* Configure PCI Express settings */
 	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
 		struct pci_bus *child;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index be0c31a..a9e61fa 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1720,6 +1720,62 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
 static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { }
 #endif /* CONFIG_PCI_MSI */
 
+#ifdef CONFIG_PCI_IOV
+static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
+{
+	struct pci_controller *hose;
+	struct pnv_phb *phb;
+	struct resource *res;
+	int i;
+	resource_size_t size;
+	struct pci_dn *pdn;
+
+	if (!pdev->is_physfn || pdev->is_added)
+		return;
+
+	hose = pci_bus_to_host(pdev->bus);
+	phb = hose->private_data;
+
+	pdn = pci_get_pdn(pdev);
+	pdn->max_vfs = 0;
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &pdev->resource[i];
+		if (!res->flags || res->parent)
+			continue;
+		if (!pnv_pci_is_mem_pref_64(res->flags)) {
+			dev_warn(&pdev->dev, " Skipping expanding IOV BAR %pR on %s\n",
+				 res, pci_name(pdev));
+			continue;
+		}
+
+		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
+		size = pci_iov_resource_size(pdev, i);
+		res->end = res->start + size * phb->ioda.total_pe - 1;
+		dev_dbg(&pdev->dev, "                       %pR\n", res);
+		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)\n",
+				i - PCI_IOV_RESOURCES,
+				res, phb->ioda.total_pe);
+	}
+	pdn->max_vfs = phb->ioda.total_pe;
+}
+
+static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
+{
+	struct pci_dev *pdev;
+	struct pci_bus *b;
+
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		b = pdev->subordinate;
+
+		if (b)
+			pnv_pci_ioda_fixup_sriov(b);
+
+		pnv_pci_ioda_fixup_iov_resources(pdev);
+	}
+}
+#endif /* CONFIG_PCI_IOV */
+
 /*
  * This function is supposed to be called on basis of PE from top
 * to bottom style. So the I/O or MMIO segment assigned to
@@ -2096,6 +2152,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
 	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
+#ifdef CONFIG_PCI_IOV
+	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
+#endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 
 	/* Reset IODA tables to a clean state */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 13/17] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

This patch implements pcibios_iov_resource_alignment() on the powernv
platform.

On the PowerNV platform, there are 3 cases for the IOV BAR:
1. initial state: the IOV BAR size is a multiple of the VF BAR size
2. after expansion: the IOV BAR size is expanded to match the M64 segment size
3. sizing stage: the IOV BAR is truncated to 0

pnv_pci_iov_resource_alignment() handles these three cases respectively.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h        |    3 +++
 arch/powerpc/kernel/pci-common.c          |   14 ++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c |   20 ++++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 965547c..12e8eb8 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -252,6 +252,9 @@ struct machdep_calls {
 
 #ifdef CONFIG_PCI_IOV
 	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
+	resource_size_t (*pcibios_iov_resource_alignment)(struct pci_dev *,
+			                                    int resno,
+							    resource_size_t align);
 #endif /* CONFIG_PCI_IOV */
 
 	/* Called to shutdown machine specific hardware not already controlled
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 832b7e1..8751dfb 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -130,6 +130,20 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
 	pci_reset_secondary_bus(dev);
 }
 
+#ifdef CONFIG_PCI_IOV
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev,
+						 int resno,
+						 resource_size_t align)
+{
+	if (ppc_md.pcibios_iov_resource_alignment)
+		return ppc_md.pcibios_iov_resource_alignment(pdev,
+							       resno,
+							       align);
+
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static resource_size_t pcibios_io_size(const struct pci_controller *hose)
 {
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index a9e61fa..512dc48 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1952,6 +1952,25 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
 	return phb->ioda.io_segsize;
 }
 
+#ifdef CONFIG_PCI_IOV
+static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
+							    int resno,
+							    resource_size_t align)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	resource_size_t iov_align;
+
+	iov_align = resource_size(&pdev->resource[resno]);
+	if (iov_align)
+		return iov_align;
+
+	if (pdn->max_vfs)
+		return pdn->max_vfs * align;
+
+	return align;
+}
+#endif /* CONFIG_PCI_IOV */
+
 /* Prevent enabling devices for which we couldn't properly
  * assign a PE
  */
@@ -2154,6 +2173,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
 #ifdef CONFIG_PCI_IOV
 	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
+	ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
 #endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 14/17] powerpc/powernv: Shift VF resource with an offset
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

On the PowerNV platform, a resource's position in M64 implies the PE# the
resource belongs to. In some cases, a resource must be adjusted to place it
at the correct position in M64.

This patch introduces a function to shift the 'real' PF IOV BAR address
by an offset.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |   31 +++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 512dc48..f5aa1ef 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -14,6 +14,7 @@
 #include <linux/kernel.h>
 #include <linux/pci.h>
 #include <linux/crash_dump.h>
+#include <linux/pci_regs.h>
 #include <linux/debugfs.h>
 #include <linux/delay.h>
 #include <linux/string.h>
@@ -749,6 +750,36 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
 	return 10;
 }
 
+#ifdef CONFIG_PCI_IOV
+static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
+{
+	struct pci_dn *pdn = pci_get_pdn(dev);
+	int i;
+	struct resource *res;
+	resource_size_t size;
+
+	if (!dev->is_physfn)
+		return;
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &dev->resource[i];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
+		size = pci_iov_resource_size(dev, i);
+		res->start += size*offset;
+
+		dev_info(&dev->dev, "                 %pR\n", res);
+		pci_update_resource(dev, i);
+	}
+	pdn->max_vfs -= offset;
+}
+#endif /* CONFIG_PCI_IOV */
+
 #if 0
 static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 {
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 15/17] powerpc/powernv: Allocate VF PE
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

VFs are created when the PCI device is enabled.

This patch tries its best to assign the maximum resources and PEs for VFs
when the PCI device is enabled: enough M64 windows are assigned to cover the
IOV BAR, and the IOV BAR is shifted to match the PE# indicated by M64. The
VF's pdn->pdev and pdn->pe_number are also fixed up.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    4 +
 arch/powerpc/kernel/pci_dn.c              |   11 +
 arch/powerpc/platforms/powernv/pci-ioda.c |  452 ++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c      |   18 ++
 arch/powerpc/platforms/powernv/pci.h      |    7 +
 5 files changed, 477 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 08c092e..0c9c260 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -172,6 +172,10 @@ struct pci_dn {
 	int	pe_number;
 #ifdef CONFIG_PCI_IOV
 	u16     max_vfs;		/* number of VFs IOV BAR expanded */
+	u16     vf_pes;			/* VF PE# under this PF */
+	int     offset;			/* PE# for the first VF PE */
+#define IODA_INVALID_M64        (-1)
+	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index c99fb80..84a8e07 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -216,6 +216,17 @@ void remove_dev_pci_info(struct pci_dev *pdev)
 	struct pci_dn *pdn, *tmp;
 	int i;
 
+	/*
+	 * VFs and VF PEs are created/released dynamically, so we need to
+	 * bind/unbind them. Otherwise, when SRIOV is re-enabled, the VF and
+	 * VF PE would be mismatched.
+	 */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		pdn->pe_number = IODA_INVALID_PE;
+		return;
+	}
+
 	/* Only support IOV PF for now */
 	if (!pdev->is_physfn)
 		return;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f5aa1ef..be0c43c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -45,6 +45,9 @@
 #include "powernv.h"
 #include "pci.h"
 
+/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
+#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
 {
@@ -57,11 +60,18 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 	vaf.fmt = fmt;
 	vaf.va = &args;
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV)
 		strlcpy(pfix, dev_name(&pe->pdev->dev), sizeof(pfix));
-	else
+	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
 		sprintf(pfix, "%04x:%02x     ",
 			pci_domain_nr(pe->pbus), pe->pbus->number);
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		sprintf(pfix, "%04x:%02x:%2x.%d",
+			pci_domain_nr(pe->parent_dev->bus),
+			(pe->rid & 0xff00) >> 8,
+			PCI_SLOT(pe->rid), PCI_FUNC(pe->rid));
+#endif /* CONFIG_PCI_IOV*/
 
 	printk("%spci %s: [PE# %.3d] %pV",
 	       level, pfix, pe->pe_number, &vaf);
@@ -567,7 +577,7 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 			      bool is_add)
 {
 	struct pnv_ioda_pe *slave;
-	struct pci_dev *pdev;
+	struct pci_dev *pdev = NULL;
 	int ret;
 
 	/*
@@ -606,8 +616,12 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 
 	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
 		pdev = pe->pbus->self;
-	else
+	else if (pe->flags & PNV_IODA_PE_DEV)
 		pdev = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		pdev = pe->parent_dev->bus->self;
+#endif /* CONFIG_PCI_IOV */
 	while (pdev) {
 		struct pci_dn *pdn = pci_get_pdn(pdev);
 		struct pnv_ioda_pe *parent;
@@ -625,6 +639,89 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 	return 0;
 }
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *parent;
+	uint8_t bcomp, dcomp, fcomp;
+	int64_t rc;
+	long rid_end, rid;
+
+	/* Currently, we just deconfigure VF PEs. Bus PEs will always be there. */
+	if (pe->pbus) {
+		int count;
+
+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
+		parent = pe->pbus->self;
+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
+			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
+		else
+			count = 1;
+
+		switch(count) {
+		case  1: bcomp = OpalPciBusAll;         break;
+		case  2: bcomp = OpalPciBus7Bits;       break;
+		case  4: bcomp = OpalPciBus6Bits;       break;
+		case  8: bcomp = OpalPciBus5Bits;       break;
+		case 16: bcomp = OpalPciBus4Bits;       break;
+		case 32: bcomp = OpalPciBus3Bits;       break;
+		default:
+			pr_err("%s: Number of subordinate busses %d"
+			       " unsupported\n",
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
+			/* Do an exact match only */
+			bcomp = OpalPciBusAll;
+		}
+		rid_end = pe->rid + (count << 8);
+	} else {
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+			parent = pe->pdev->bus->self;
+		bcomp = OpalPciBusAll;
+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
+		rid_end = pe->rid + 1;
+	}
+
+	/* Clear the reverse map */
+	for (rid = pe->rid; rid < rid_end; rid++)
+		phb->ioda.pe_rmap[rid] = 0;
+
+	/* Release from all parents PELT-V */
+	while (parent) {
+		struct pci_dn *pdn = pci_get_pdn(parent);
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
+						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+			/* XXX What to do in case of error ? */
+		}
+		parent = parent->bus->self;
+	}
+
+	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+
+	/* Dissociate PE in PELT */
+	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
+				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
+	if (rc)
+		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
+
+	pe->pbus = NULL;
+	pe->pdev = NULL;
+	pe->parent_dev = NULL;
+
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *parent;
@@ -653,13 +750,19 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 		default:
 			pr_err("%s: Number of subordinate busses %d"
 			       " unsupported\n",
-			       pci_name(pe->pbus->self), count);
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
 			/* Do an exact match only */
 			bcomp = OpalPciBusAll;
 		}
 		rid_end = pe->rid + (count << 8);
 	} else {
-		parent = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+#endif /* CONFIG_PCI_IOV */
+			parent = pe->pdev->bus->self;
 		bcomp = OpalPciBusAll;
 		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
 		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
@@ -984,8 +1087,311 @@ static void pnv_pci_ioda_setup_PEs(void)
 }
 
 #ifdef CONFIG_PCI_IOV
+static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    i;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		if (pdn->m64_wins[i] == IODA_INVALID_M64)
+			continue;
+		opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
+		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+	}
+
+	return 0;
+}
+
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	unsigned int           win;
+	struct resource       *res;
+	int                    i;
+	int64_t                rc;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	/* Initialize the m64_wins to IODA_INVALID_M64 */
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = pdev->resource + PCI_IOV_RESOURCES + i;
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		do {
+			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,

* [PATCH V10 15/17] powerpc/powernv: Allocate VF PE
@ 2014-12-22  5:54   ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, Wei Yang, linuxppc-dev

VFs are created when the PCI device is enabled.

This patch tries its best to assign the maximum resources and PEs for the
VFs when the PCI device is enabled: enough M64 windows are assigned to cover
the IOV BAR, and the IOV BAR is shifted to match the PE# indicated by the
M64 windows. The VF's pdn->pdev and pdn->pe_number are fixed up as well.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    4 +
 arch/powerpc/kernel/pci_dn.c              |   11 +
 arch/powerpc/platforms/powernv/pci-ioda.c |  452 ++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c      |   18 ++
 arch/powerpc/platforms/powernv/pci.h      |    7 +
 5 files changed, 477 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 08c092e..0c9c260 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -172,6 +172,10 @@ struct pci_dn {
 	int	pe_number;
 #ifdef CONFIG_PCI_IOV
 	u16     max_vfs;		/* number of VFs IOV BAR expended */
+	u16     vf_pes;			/* VF PE# under this PF */
+	int     offset;			/* PE# for the first VF PE */
+#define IODA_INVALID_M64        (-1)
+	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index c99fb80..84a8e07 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -216,6 +216,17 @@ void remove_dev_pci_info(struct pci_dev *pdev)
 	struct pci_dn *pdn, *tmp;
 	int i;
 
+	/*
+	 * VFs and VF PEs are created/released dynamically, so we need to
+	 * bind/unbind them. Otherwise, when SR-IOV is re-enabled, the VF
+	 * and VF PE would be mismatched.
+	 */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		pdn->pe_number = IODA_INVALID_PE;
+		return;
+	}
+
 	/* Only support IOV PF for now */
 	if (!pdev->is_physfn)
 		return;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f5aa1ef..be0c43c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -45,6 +45,9 @@
 #include "powernv.h"
 #include "pci.h"
 
+/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
+#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
 {
@@ -57,11 +60,18 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 	vaf.fmt = fmt;
 	vaf.va = &args;
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV)
 		strlcpy(pfix, dev_name(&pe->pdev->dev), sizeof(pfix));
-	else
+	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
 		sprintf(pfix, "%04x:%02x     ",
 			pci_domain_nr(pe->pbus), pe->pbus->number);
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		sprintf(pfix, "%04x:%02x:%2x.%d",
+			pci_domain_nr(pe->parent_dev->bus),
+			(pe->rid & 0xff00) >> 8,
+			PCI_SLOT(pe->rid), PCI_FUNC(pe->rid));
+#endif /* CONFIG_PCI_IOV*/
 
 	printk("%spci %s: [PE# %.3d] %pV",
 	       level, pfix, pe->pe_number, &vaf);
@@ -567,7 +577,7 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 			      bool is_add)
 {
 	struct pnv_ioda_pe *slave;
-	struct pci_dev *pdev;
+	struct pci_dev *pdev = NULL;
 	int ret;
 
 	/*
@@ -606,8 +616,12 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 
 	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
 		pdev = pe->pbus->self;
-	else
+	else if (pe->flags & PNV_IODA_PE_DEV)
 		pdev = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		pdev = pe->parent_dev->bus->self;
+#endif /* CONFIG_PCI_IOV */
 	while (pdev) {
 		struct pci_dn *pdn = pci_get_pdn(pdev);
 		struct pnv_ioda_pe *parent;
@@ -625,6 +639,89 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 	return 0;
 }
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *parent;
+	uint8_t bcomp, dcomp, fcomp;
+	int64_t rc;
+	long rid_end, rid;
+
+	/* Currently, we just deconfigure the VF PE. Bus PEs are always there. */
+	if (pe->pbus) {
+		int count;
+
+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
+		parent = pe->pbus->self;
+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
+			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
+		else
+			count = 1;
+
+		switch(count) {
+		case  1: bcomp = OpalPciBusAll;         break;
+		case  2: bcomp = OpalPciBus7Bits;       break;
+		case  4: bcomp = OpalPciBus6Bits;       break;
+		case  8: bcomp = OpalPciBus5Bits;       break;
+		case 16: bcomp = OpalPciBus4Bits;       break;
+		case 32: bcomp = OpalPciBus3Bits;       break;
+		default:
+			pr_err("%s: Number of subordinate busses %d"
+			       " unsupported\n",
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
+			/* Do an exact match only */
+			bcomp = OpalPciBusAll;
+		}
+		rid_end = pe->rid + (count << 8);
+	} else {
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+			parent = pe->pdev->bus->self;
+		bcomp = OpalPciBusAll;
+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
+		rid_end = pe->rid + 1;
+	}
+
+	/* Clear the reverse map */
+	for (rid = pe->rid; rid < rid_end; rid++)
+		phb->ioda.pe_rmap[rid] = 0;
+
+	/* Release from all parents PELT-V */
+	while (parent) {
+		struct pci_dn *pdn = pci_get_pdn(parent);
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
+						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+			/* XXX What to do in case of error ? */
+		}
+		parent = parent->bus->self;
+	}
+
+	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+
+	/* Dissociate PE in PELT */
+	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
+				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
+	if (rc)
+		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
+
+	pe->pbus = NULL;
+	pe->pdev = NULL;
+	pe->parent_dev = NULL;
+
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *parent;
@@ -653,13 +750,19 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 		default:
 			pr_err("%s: Number of subordinate busses %d"
 			       " unsupported\n",
-			       pci_name(pe->pbus->self), count);
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
 			/* Do an exact match only */
 			bcomp = OpalPciBusAll;
 		}
 		rid_end = pe->rid + (count << 8);
 	} else {
-		parent = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+#endif /* CONFIG_PCI_IOV */
+			parent = pe->pdev->bus->self;
 		bcomp = OpalPciBusAll;
 		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
 		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
@@ -984,8 +1087,311 @@ static void pnv_pci_ioda_setup_PEs(void)
 }
 
 #ifdef CONFIG_PCI_IOV
+static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    i;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		if (pdn->m64_wins[i] == IODA_INVALID_M64)
+			continue;
+		opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
+		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+	}
+
+	return 0;
+}
+
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	unsigned int           win;
+	struct resource       *res;
+	int                    i;
+	int64_t                rc;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	/* Initialize the m64_wins to IODA_INVALID_M64 */
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = pdev->resource + PCI_IOV_RESOURCES + i;
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		do {
+			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+					phb->ioda.m64_bar_idx + 1, 0);
+
+			if (win >= phb->ioda.m64_bar_idx + 1)
+				goto m64_failed;
+		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+		pdn->m64_wins[i] = win;
+
+		/* Map the M64 here */
+		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+						 OPAL_M64_WINDOW_TYPE,
+						 pdn->m64_wins[i],
+						 res->start,
+						 0, /* unused */
+						 resource_size(res));
+		if (rc != OPAL_SUCCESS) {
+			pr_err("Failed to map M64 BAR #%d: %lld\n", win, rc);
+			goto m64_failed;
+		}
+
+		rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
+		if (rc != OPAL_SUCCESS) {
+			pr_err("Failed to enable M64 BAR #%d: %llx\n", win, rc);
+			goto m64_failed;
+		}
+	}
+	return 0;
+
+m64_failed:
+	pnv_pci_vf_release_m64(pdev);
+	return -EBUSY;
+}
+
+static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct iommu_table    *tbl;
+	unsigned long         addr;
+	int64_t               rc;
+
+	bus = dev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	tbl = pe->tce32_table;
+	addr = tbl->it_base;
+
+	opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+				   pe->pe_number << 1, 1, __pa(addr),
+				   0, 0x1000);
+
+	rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+				        pe->pe_number,
+				        (pe->pe_number << 1) + 1,
+				        pe->tce_bypass_base,
+				        0);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
+
+	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	free_pages(addr, get_order(TCE32_TABLE_SIZE));
+	pe->tce32_table = NULL;
+}
+
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe, *pe_n;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+
+	if (!pdev->is_physfn)
+		return;
+
+	pdn = pci_get_pdn(pdev);
+	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
+		if (pe->parent_dev != pdev)
+			continue;
+
+		pnv_pci_ioda2_release_dma_pe(pdev, pe);
+
+		/* Remove from list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_del(&pe->list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_ioda_deconfigure_pe(phb, pe);
+
+		pnv_ioda_free_pe(phb, pe->pe_number);
+	}
+}
+
+void pnv_pci_sriov_disable(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	struct pci_sriov      *iov;
+	u16 vf_num;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+	iov = pdev->sriov;
+	vf_num = pdn->vf_pes;
+
+	/* Release VF PEs */
+	pnv_ioda_release_vf_PE(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+
+		/* Release M64 BARs */
+		pnv_pci_vf_release_m64(pdev);
+
+		/* Release PE numbers */
+		bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->offset = 0;
+	}
+
+	return;
+}
+
+static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
+				       struct pnv_ioda_pe *pe);
+static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe;
+	int                    pe_num;
+	u16                    vf_index;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (!pdev->is_physfn)
+		return;
+
+	/* Reserve PE for each VF */
+	for (vf_index = 0; vf_index < vf_num; vf_index++) {
+		pe_num = pdn->offset + vf_index;
+
+		pe = &phb->ioda.pe_array[pe_num];
+		pe->pe_number = pe_num;
+		pe->phb = phb;
+		pe->flags = PNV_IODA_PE_VF;
+		pe->pbus = NULL;
+		pe->parent_dev = pdev;
+		pe->tce32_seg = -1;
+		pe->mve_number = -1;
+		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
+			   pci_iov_virtfn_devfn(pdev, vf_index);
+
+		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%d\n",
+			hose->global_number, pdev->bus->number,
+			PCI_SLOT(pci_iov_virtfn_devfn(pdev, vf_index)),
+			PCI_FUNC(pci_iov_virtfn_devfn(pdev, vf_index)), pe_num);
+
+		if (pnv_ioda_configure_pe(phb, pe)) {
+			/* XXX What do we do here ? */
+			if (pe_num)
+				pnv_ioda_free_pe(phb, pe_num);
+			pe->pdev = NULL;
+			continue;
+		}
+
+		pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+				GFP_KERNEL, hose->node);
+		pe->tce32_table->data = pe;
+
+		/* Put PE to the list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_add_tail(&pe->list, &phb->ioda.pe_list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+
+	}
+}
+
+int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    ret;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		/* Calculate available PE for required VFs */
+		mutex_lock(&phb->ioda.pe_alloc_mutex);
+		pdn->offset = bitmap_find_next_zero_area(
+			phb->ioda.pe_alloc, phb->ioda.total_pe,
+			0, vf_num, 0);
+		if (pdn->offset >= phb->ioda.total_pe) {
+			mutex_unlock(&phb->ioda.pe_alloc_mutex);
+			pr_info("Failed to enable VF\n");
+			pdn->offset = 0;
+			return -EBUSY;
+		}
+		bitmap_set(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->vf_pes = vf_num;
+		mutex_unlock(&phb->ioda.pe_alloc_mutex);
+
+		/* Assign M64 BAR accordingly */
+		ret = pnv_pci_vf_assign_m64(pdev);
+		if (ret) {
+			pr_info("Not enough M64 resources\n");
+			goto m64_failed;
+		}
+
+		/* Do some magic shift */
+		pnv_pci_vf_resource_shift(pdev, pdn->offset);
+	}
+
+	/* Setup VF PEs */
+	pnv_ioda_setup_vf_PE(pdev, vf_num);
+
+	return 0;
+
+m64_failed:
+	bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+	pdn->offset = 0;
+
+	return ret;
+}
+
 int pcibios_sriov_disable(struct pci_dev *pdev)
 {
+	pnv_pci_sriov_disable(pdev);
+
 	/* Release firmware data */
 	remove_dev_pci_info(pdev);
 	return 0;
@@ -993,7 +1399,10 @@ int pcibios_sriov_disable(struct pci_dev *pdev)
 
 int pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
 {
+	/* Allocate firmware data */
 	add_dev_pci_info(pdev);
+
+	pnv_pci_sriov_enable(pdev, vf_num);
 	return 0;
 }
 #endif /* CONFIG_PCI_IOV */
@@ -1190,9 +1599,6 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	int64_t rc;
 	void *addr;
 
-	/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
-#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
-
 	/* XXX FIXME: Handle 64-bit only DMA devices */
 	/* XXX FIXME: Provide 64-bit DMA facilities & non-4K TCE tables etc.. */
 	/* XXX FIXME: Allocate multi-level tables on PHB3 */
@@ -1255,12 +1661,19 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_PAIR);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	return;
  fail:
@@ -1387,12 +1800,19 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	/* Also create a bypass window */
 	pnv_pci_ioda2_setup_bypass_pe(phb, pe);
@@ -2086,6 +2506,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	phb->hub_id = hub_id;
 	phb->opal_id = phb_id;
 	phb->type = ioda_type;
+	mutex_init(&phb->ioda.pe_alloc_mutex);
 
 	/* Detect specific models for error handling */
 	if (of_device_is_compatible(np, "ibm,p7ioc-pciex"))
@@ -2145,6 +2566,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	INIT_LIST_HEAD(&phb->ioda.pe_dma_list);
 	INIT_LIST_HEAD(&phb->ioda.pe_list);
+	mutex_init(&phb->ioda.pe_list_mutex);
 
 	/* Calculate how many 32-bit TCE segments we have */
 	phb->ioda.tce32_count = phb->ioda.m32_pci_base >> 28;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b7d4b9d..269f1dd 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -714,6 +714,24 @@ static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 {
 	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
 	struct pnv_phb *phb = hose->private_data;
+#ifdef CONFIG_PCI_IOV
+	struct pnv_ioda_pe *pe;
+	struct pci_dn *pdn;
+
+	/* Fix the VF pdn PE number */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		WARN_ON(pdn->pe_number != IODA_INVALID_PE);
+		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
+			if (pe->rid == ((pdev->bus->number << 8) |
+			    (pdev->devfn & 0xff))) {
+				pdn->pe_number = pe->pe_number;
+				pe->pdev = pdev;
+				break;
+			}
+		}
+	}
+#endif /* CONFIG_PCI_IOV */
 
 	/* If we have no phb structure, try to setup a fallback based on
 	 * the device-tree (RTAS PCI for example)
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 7317777..39d42f2 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -23,6 +23,7 @@ enum pnv_phb_model {
 #define PNV_IODA_PE_BUS_ALL	(1 << 2)	/* PE has subordinate buses	*/
 #define PNV_IODA_PE_MASTER	(1 << 3)	/* Master PE in compound case	*/
 #define PNV_IODA_PE_SLAVE	(1 << 4)	/* Slave PE in compound case	*/
+#define PNV_IODA_PE_VF		(1 << 5)	/* PE for one VF 		*/
 
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
@@ -34,6 +35,9 @@ struct pnv_ioda_pe {
 	 * entire bus (& children). In the former case, pdev
 	 * is populated, in the later case, pbus is.
 	 */
+#ifdef CONFIG_PCI_IOV
+	struct pci_dev          *parent_dev;
+#endif
 	struct pci_dev		*pdev;
 	struct pci_bus		*pbus;
 
@@ -165,6 +169,8 @@ struct pnv_phb {
 
 			/* PE allocation bitmap */
 			unsigned long		*pe_alloc;
+			/* PE allocation mutex */
+			struct mutex		pe_alloc_mutex;
 
 			/* M32 & IO segment maps */
 			unsigned int		*m32_segmap;
@@ -179,6 +185,7 @@ struct pnv_phb {
 			 * on the sequence of creation
 			 */
 			struct list_head	pe_list;
+			struct mutex            pe_list_mutex;
 
 			/* Reverse map of PEs, will have to extend if
 			 * we are to support more than 256 PEs, indexed
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 16/17] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

The M64 aperture size is limited on PHB3. When the IOV BAR is too big, it
exceeds this limit and fails to be assigned.

This patch introduces a different expansion mechanism based on the IOV BAR size:

If the IOV BAR size is smaller than 64M, expand it to total_pe VF BAR copies.
If the IOV BAR size is bigger than 64M, round the number of copies up to a
power of two of total_vfs.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   33 ++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 0c9c260..538dbb5 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -174,6 +174,8 @@ struct pci_dn {
 	u16     max_vfs;		/* number of VFs IOV BAR expended */
 	u16     vf_pes;			/* VF PE# under this PF */
 	int     offset;			/* PE# for the first VF PE */
+#define M64_PER_IOV 4
+	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
 	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index be0c43c..d738397 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2180,6 +2180,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	int i;
 	resource_size_t size;
 	struct pci_dn *pdn;
+	int mul, total_vfs;
 
 	if (!pdev->is_physfn || pdev->is_added)
 		return;
@@ -2190,6 +2191,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	pdn = pci_get_pdn(pdev);
 	pdn->max_vfs = 0;
 
+	total_vfs = pci_sriov_get_totalvfs(pdev);
+	pdn->m64_per_iov = 1;
+	mul = phb->ioda.total_pe;
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &pdev->resource[i];
+		if (!res->flags || res->parent)
+			continue;
+		if (!pnv_pci_is_mem_pref_64(res->flags)) {
+			dev_warn(&pdev->dev, " non M64 IOV BAR %pR on %s\n",
+					res, pci_name(pdev));
+			continue;
+		}
+
+		size = pci_iov_resource_size(pdev, i);
+
+		/* bigger than 64M */
+		if (size > (1 << 26)) {
+			dev_info(&pdev->dev, "PowerNV: VF BAR[%d] size "
+					"is bigger than 64M, roundup power2\n", i);
+			pdn->m64_per_iov = M64_PER_IOV;
+			mul = __roundup_pow_of_two(total_vfs);
+			break;
+		}
+	}
+
 	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
 		res = &pdev->resource[i];
 		if (!res->flags || res->parent)
@@ -2202,13 +2229,13 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 
 		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
 		size = pci_iov_resource_size(pdev, i);
-		res->end = res->start + size * phb->ioda.total_pe - 1;
+		res->end = res->start + size * mul - 1;
 		dev_dbg(&pdev->dev, "                       %pR\n", res);
 		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
 				i - PCI_IOV_RESOURCES,
-				res, phb->ioda.total_pe);
+				res, mul);
 	}
-	pdn->max_vfs = phb->ioda.total_pe;
+	pdn->max_vfs = mul;
 }
 
 static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V10 17/17] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  5:54   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  5:54 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

When the IOV BAR is big, each IOV BAR is covered by 4 M64 windows. This leads
to several VF PEs sharing one PE in terms of M64.

This patch groups VF PEs according to the M64 allocation.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  188 +++++++++++++++++++++++------
 2 files changed, 149 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 538dbb5..6eb9fb6 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -177,7 +177,7 @@ struct pci_dn {
 #define M64_PER_IOV 4
 	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
-	int     m64_wins[PCI_SRIOV_NUM_BARS];
+	int     m64_wins[PCI_SRIOV_NUM_BARS][M64_PER_IOV];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d738397..29d10ff 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1093,26 +1093,27 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pci_dn         *pdn;
-	int                    i;
+	int                    i, j;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
 
-	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		if (pdn->m64_wins[i] == IODA_INVALID_M64)
-			continue;
-		opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
-		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
-		pdn->m64_wins[i] = IODA_INVALID_M64;
-	}
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		for (j = 0; j < M64_PER_IOV; j++) {
+			if (pdn->m64_wins[i][j] == IODA_INVALID_M64)
+				continue;
+			opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 0);
+			clear_bit(pdn->m64_wins[i][j], &phb->ioda.m64_bar_alloc);
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+		}
 
 	return 0;
 }
 
-static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
@@ -1120,17 +1121,33 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 	struct pci_dn         *pdn;
 	unsigned int           win;
 	struct resource       *res;
-	int                    i;
+	int                    i, j;
 	int64_t                rc;
+	int                    total_vfs;
+	resource_size_t        size, start;
+	int                    pe_num;
+	int                    vf_groups;
+	int                    vf_per_group;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
+	total_vfs = pci_sriov_get_totalvfs(pdev);
 
 	/* Initialize the m64_wins to IODA_INVALID_M64 */
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-		pdn->m64_wins[i] = IODA_INVALID_M64;
+		for (j = 0; j < M64_PER_IOV; j++)
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+
+	if (pdn->m64_per_iov == M64_PER_IOV) {
+		vf_groups = (vf_num <= M64_PER_IOV) ? vf_num: M64_PER_IOV;
+		vf_per_group = (vf_num <= M64_PER_IOV)? 1:
+			__roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+	} else {
+		vf_groups = 1;
+		vf_per_group = 1;
+	}
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = pdev->resource + PCI_IOV_RESOURCES + i;
@@ -1140,33 +1157,61 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 		if (!pnv_pci_is_mem_pref_64(res->flags))
 			continue;
 
-		do {
-			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
-					phb->ioda.m64_bar_idx + 1, 0);
-
-			if (win >= phb->ioda.m64_bar_idx + 1)
-				goto m64_failed;
-		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+		for (j = 0; j < vf_groups; j++) {
+			do {
+				win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+						phb->ioda.m64_bar_idx + 1, 0);
+
+				if (win >= phb->ioda.m64_bar_idx + 1)
+					goto m64_failed;
+			} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+			pdn->m64_wins[i][j] = win;
+
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				size = pci_iov_resource_size(pdev,
+								   PCI_IOV_RESOURCES + i);
+				size = size * vf_per_group;
+				start = res->start + size * j;
+			} else {
+				size = resource_size(res);
+				start = res->start;
+			}
 
-		pdn->m64_wins[i] = win;
+			/* Map the M64 here */
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				pe_num = pdn->offset + j;
+				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+						pe_num, OPAL_M64_WINDOW_TYPE,
+						pdn->m64_wins[i][j], 0);
+			}
 
-		/* Map the M64 here */
-		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+			rc = opal_pci_set_phb_mem_window(phb->opal_id,
 						 OPAL_M64_WINDOW_TYPE,
-						 pdn->m64_wins[i],
-						 res->start,
+						 pdn->m64_wins[i][j],
+						 start,
 						 0, /* unused */
-						 resource_size(res));
-		if (rc != OPAL_SUCCESS) {
-			pr_err("Failed to map M64 BAR #%d: %lld\n", win, rc);
-			goto m64_failed;
-		}
+						 size);
 
-		rc = opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
-		if (rc != OPAL_SUCCESS) {
-			pr_err("Failed to enable M64 BAR #%d: %llx\n", win, rc);
-			goto m64_failed;
+
+			if (rc != OPAL_SUCCESS) {
+				pr_err("Failed to set M64 BAR #%d: %lld\n",
+						win, rc);
+				goto m64_failed;
+			}
+
+			if (pdn->m64_per_iov == M64_PER_IOV)
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 2);
+			else
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 1);
+
+			if (rc != OPAL_SUCCESS) {
+				pr_err("Failed to enable M64 BAR #%d: %llx\n",
+						win, rc);
+				goto m64_failed;
+			}
 		}
 	}
 	return 0;
@@ -1208,22 +1253,53 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 	pe->tce32_table = NULL;
 }
 
-static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pnv_ioda_pe    *pe, *pe_n;
 	struct pci_dn         *pdn;
+	u16                    vf_index;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
 
 	if (!pdev->is_physfn)
 		return;
 
-	pdn = pci_get_pdn(pdev);
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++)
+			for (vf_index = vf_group * vf_per_group;
+				vf_index < (vf_group + 1) * vf_per_group &&
+				vf_index < vf_num;
+				vf_index++)
+				for (vf_index1 = vf_group * vf_per_group;
+					vf_index1 < (vf_group + 1) * vf_per_group &&
+					vf_index1 < vf_num;
+					vf_index1++){
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_REMOVE_PE_FROM_DOMAIN);
+
+					if (rc)
+					    pr_warn("%s: Failed to unlink same"
+						" group PE#%d(%lld)\n", __func__,
+						pdn->offset + vf_index1, rc);
+				}
+	}
+
 	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
 		if (pe->parent_dev != pdev)
 			continue;
@@ -1258,10 +1334,11 @@ void pnv_pci_sriov_disable(struct pci_dev *pdev)
 	vf_num = pdn->vf_pes;
 
 	/* Release VF PEs */
-	pnv_ioda_release_vf_PE(pdev);
+	pnv_ioda_release_vf_PE(pdev, vf_num);
 
 	if (phb->type == PNV_PHB_IODA2) {
-		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+		if (pdn->m64_per_iov == 1)
+			pnv_pci_vf_resource_shift(pdev, -pdn->offset);
 
 		/* Release M64 BARs */
 		pnv_pci_vf_release_m64(pdev);
@@ -1285,6 +1362,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 	int                    pe_num;
 	u16                    vf_index;
 	struct pci_dn         *pdn;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
@@ -1332,7 +1410,36 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+	}
 
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++)
+			for (vf_index = vf_group * vf_per_group;
+				vf_index < (vf_group + 1) * vf_per_group &&
+				vf_index < vf_num;
+				vf_index++)
+				for (vf_index1 = vf_group * vf_per_group;
+					vf_index1 < (vf_group + 1) * vf_per_group &&
+					vf_index1 < vf_num;
+					vf_index1++) {
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_ADD_PE_TO_DOMAIN);
+
+					if (rc)
+					    pr_warn("%s: Failed to link same "
+						"group PE#%d(%lld)\n",
+						__func__,
+						pdn->offset + vf_index1, rc);
+			}
 	}
 }
 
@@ -1366,14 +1473,15 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
 		mutex_unlock(&phb->ioda.pe_alloc_mutex);
 
 		/* Assign M64 BAR accordingly */
-		ret = pnv_pci_vf_assign_m64(pdev);
+		ret = pnv_pci_vf_assign_m64(pdev, vf_num);
 		if (ret) {
 			pr_info("No enough M64 resource\n");
 			goto m64_failed;
 		}
 
 		/* Do some magic shift */
-		pnv_pci_vf_resource_shift(pdev, pdn->offset);
+		if (pdn->m64_per_iov == 1)
+			pnv_pci_vf_resource_shift(pdev, pdn->offset);
 	}
 
 	/* Setup VF PEs */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* Re: [PATCH V10 00/17] Enable SRIOV on Power8
  2014-12-22  5:54 ` Wei Yang
@ 2014-12-22  6:05   ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2014-12-22  6:05 UTC (permalink / raw)
  To: Wei Yang; +Cc: bhelgaas, benh, gwshan, linux-pci, linuxppc-dev

Bjorn,

This patch set is tested on 3.19-rc1, together with the offset/stride update
patch.

I saw your comment on the MEM64 issue; if that change is reverted, this patch
set will not work. Since I think we can work in parallel, I am sending it here
for more comments and to see whether I understood your previous comments
correctly.

I will work with Yinghai to find a way to fix bug 85491; hopefully the Linux
kernel can handle both cases.

Merry Christmas in advance ~

On Mon, Dec 22, 2014 at 01:54:20PM +0800, Wei Yang wrote:
>This patchset enables SRIOV on POWER8.
>
>The general idea is to put each VF into an individual PE and allocate the
>required resources like MMIO/DMA/MSI. The major difficulty comes from the
>MMIO allocation and adjustment for the PF's IOV BAR.
>
>On P8, we use the M64BT to cover a PF's IOV BAR, which could make an
>individual VF sit in its own PE. This gives more flexibility, while at the
>same time it brings some restrictions on the PF's IOV BAR size and alignment.
>
>To achieve this effect, we need to do some hacks on the PCI devices' resources.
>1. Expand the IOV BAR properly.
>   Done by pnv_pci_ioda_fixup_iov_resources().
>2. Shift the IOV BAR properly.
>   Done by pnv_pci_vf_resource_shift().
>3. IOV BAR alignment is calculated by arch dependent function instead of an
>   individual VF BAR size.
>   Done by pnv_pcibios_sriov_resource_alignment().
>4. Take the IOV BAR alignment into consideration in the sizing and assigning.
>   This is achieved by commit: "PCI: Take additional IOV BAR alignment in
>   sizing and assigning"
>
>Test Environment:
>       The SRIOV device tested is Emulex Lancer(10df:e220) and
>       Mellanox ConnectX-3(15b3:1003) on POWER8.
>
>Examples on pass through a VF to guest through vfio:
>	1. unbind the original driver and bind to vfio-pci driver
>	   echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
>	   echo  1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
>	   Note: this should be done for each device in the same iommu_group
>	2. Start qemu and pass device through vfio
>	   /home/ywywyang/git/qemu-impreza/ppc64-softmmu/qemu-system-ppc64 \
>		   -M pseries -m 2048 -enable-kvm -nographic \
>		   -drive file=/home/ywywyang/kvm/fc19.img \
>		   -monitor telnet:localhost:5435,server,nowait -boot cd \
>		   -device "spapr-pci-vfio-host-bridge,id=CXGB3,iommu=26,index=6"
>
>Verify this is the exact VF response:
>	1. ping from a machine in the same subnet(the broadcast domain)
>	2. run arp -n on this machine
>	   9.115.251.20             ether   00:00:c9:df:ed:bf   C eth0
>	3. ifconfig in the guest
>	   # ifconfig eth1
>	   eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>	        inet 9.115.251.20  netmask 255.255.255.0  broadcast 9.115.251.255
>		inet6 fe80::200:c9ff:fedf:edbf  prefixlen 64  scopeid 0x20<link>
>	        ether 00:00:c9:df:ed:bf  txqueuelen 1000 (Ethernet)
>	        RX packets 175  bytes 13278 (12.9 KiB)
>	        RX errors 0  dropped 0  overruns 0  frame 0
>		TX packets 58  bytes 9276 (9.0 KiB)
>	        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>	4. They have the same MAC address
>
>	Note: make sure you shutdown other network interfaces in guest.
>
>---
>v10:
>   * remove weak function pcibios_iov_resource_size()
>     the VF BAR size is stored in pci_sriov structure and retrieved from
>     pci_iov_resource_size()
>   * Use "Reserve additional" instead of "Expand" to be more accurate in the
>     change log
>   * add log message to show the PF's IOV BAR final size
>   * add pcibios_sriov_enable/disable() weak functions in sriov_enable/disable()
>     for arch setup before enabling VFs, e.g. the arch could fix up the BDF for
>     VFs, since a change of NumVFs would affect the BDF of VFs.
>   * Add some explanation of PE on Power arch in the documentation
>v9:
>   * make the change log consistent in the terminology
>     PF's IOV BAR -> the SRIOV BAR in PF
>     VF's BAR -> the normal BAR in VF's view
>   * rename all newly introduced function from _sriov_ to _iov_
>   * rename the document to Documentation/powerpc/pci_iov_resource_on_powernv.txt
>   * add the vendor id and device id of the tested devices
>   * change return value from EINVAL to ENOSYS for pci_iov_virtfn_bus() and
>     pci_iov_virtfn_devfn() when it is called on PF or SRIOV is not configured
>   * rebase on 3.18-rc2 and tested
>v8:
>   * use weak function pcibios_sriov_resource_size() instead of some flag to
>     retrieve the IOV BAR size.
>   * add a document Documentation/powerpc/pci_resource.txt to explain the
>     design.
>   * make pci_iov_virtfn_bus()/pci_iov_virtfn_devfn() not inline.
>   * extract a function res_to_dev_res(), so that it is more general to get
>     additional size and alignment
>   * fix one contention issue introduced in "powrepc/pci: Refactor pci_dn";
>     the root cause is that pci_get_slot() takes pci_bus_sem and leads to a
>     deadlock.
>v7:
>   * add IORESOURCE_ARCH flag for IOV BAR on powernv platform.
>   * when IOV BAR has IORESOURCE_ARCH flag, the size is retrieved from
>     hardware directly. If not, calculate as usual.
>   * reorder the patch set, group them by subsystem:
>     PCI, powerpc, powernv
>   * rebase it on 3.16-rc6
>v6:
>   * remove pcibios_enable_sriov()/pcibios_disable_sriov() weak functions;
>     similar functionality is moved to
>     pnv_pci_enable_device_hook()/pnv_pci_disable_device_hook(). When the PF is
>     enabled, the platform will try its best to allocate resources for VFs.
>   * remove pcibios_sriov_resource_size weak function
>   * VF BAR size is retrieved from hardware directly in virtfn_add()
>v5:
>   * merge those SRIOV related platform functions in machdep_calls
>     wrap them in one CONFIG_PCI_IOV macro
>   * define IODA_INVALID_M64 to replace (-1)
>     use this value to represent the m64_wins is not used
>   * rename pnv_pci_release_dev_dma() to pnv_pci_ioda2_release_dma_pe()
>     this function is a counterpart to pnv_pci_ioda2_setup_dma_pe()
>   * change dev_info() to dev_dbg() in pnv_pci_ioda_fixup_iov_resources()
>     reduce some log in kernel
>   * release M64 window in pnv_pci_ioda2_release_dma_pe()
>v4:
>   * code format fix, eg. not exceed 80 chars
>   * in commit "ppc/pnv: Add function to deconfig a PE"
>     check the bus has a bridge before print the name
>     remove a PE from its own PELTV
>   * change the function name for sriov resource size/alignment
>   * rebase on 3.16-rc3
>   * VFs will not rely on device node
>     As Grant Likely's comments, kernel should have the ability to handle the
>     lack of device_node gracefully. Gavin restructure the pci_dn, which
>     makes the VF will have pci_dn even when VF's device_node is not provided
>     by firmware.
>   * clean all the patch title to make them comply with one style
>   * fix return value for pci_iov_virtfn_bus/pci_iov_virtfn_devfn
>v3:
>   * change the return type of virtfn_bus/virtfn_devfn to int
>     change the name of these two functions to pci_iov_virtfn_bus/pci_iov_virtfn_devfn
>   * reduce the second parameter of pcibios_sriov_disable()
>   * use data instead of pe in "ppc/pnv: allocate pe->iommu_table dynamically"
>   * rename __pci_sriov_resource_size to pcibios_sriov_resource_size
>   * rename __pci_sriov_resource_alignment to pcibios_sriov_resource_alignment
>v2:
>   * change the return value of virtfn_bus/virtfn_devfn to 0
>   * move some TCE related marco definition to
>     arch/powerpc/platforms/powernv/pci.h
>   * fix __pci_sriov_resource_alignment on the powernv platform
>     During the sizing stage, the IOV BAR is truncated to 0, which would
>     affect the order of allocation. Fix this to make sure BARs are allocated
>     in order of their alignment.
>v1:
>   * improve the change log for
>     "PCI: Add weak __pci_sriov_resource_size() interface"
>     "PCI: Add weak __pci_sriov_resource_alignment() interface"
>     "PCI: take additional IOV BAR alignment in sizing and assigning"
>   * wrap VF PE code in CONFIG_PCI_IOV
>   * did regression test on P7.
>Gavin Shan (1):
>  powrepc/pci: Refactor pci_dn
>
>Wei Yang (16):
>  PCI/IOV: Export interface for retrieve VF's BDF
>  PCI/IOV: add VF enable/disable hook
>  PCI: Add weak pcibios_iov_resource_alignment() interface
>  PCI: Store VF BAR size in pci_sriov
>  PCI: Take additional PF's IOV BAR alignment in sizing and assigning
>  powerpc/pci: Add PCI resource alignment documentation
>  powerpc/pci: Don't unset pci resources for VFs
>  powerpc/pci: remove pci_dn->pcidev field
>  powerpc/powernv: Use pci_dn in PCI config accessor
>  powerpc/powernv: Allocate pe->iommu_table dynamically
>  powerpc/powernv: Reserve additional space for IOV BAR according to
>    the number of total_pe
>  powerpc/powernv: Implement pcibios_iov_resource_alignment() on
>    powernv
>  powerpc/powernv: Shift VF resource with an offset
>  powerpc/powernv: Allocate VF PE
>  powerpc/powernv: Reserve additional space for IOV BAR, with
>    m64_per_iov supported
>  powerpc/powernv: Group VF PE when IOV BAR is big on PHB3
>
> .../powerpc/pci_iov_resource_on_powernv.txt        |  215 ++++++
> arch/powerpc/include/asm/device.h                  |    3 +
> arch/powerpc/include/asm/iommu.h                   |    3 +
> arch/powerpc/include/asm/machdep.h                 |    7 +
> arch/powerpc/include/asm/pci-bridge.h              |   24 +-
> arch/powerpc/kernel/pci-common.c                   |   23 +
> arch/powerpc/kernel/pci_dn.c                       |  251 ++++++-
> arch/powerpc/platforms/powernv/eeh-powernv.c       |   14 +-
> arch/powerpc/platforms/powernv/pci-ioda.c          |  739 +++++++++++++++++++-
> arch/powerpc/platforms/powernv/pci.c               |   87 +--
> arch/powerpc/platforms/powernv/pci.h               |   13 +-
> drivers/pci/iov.c                                  |   80 ++-
> drivers/pci/pci.h                                  |    2 +
> drivers/pci/setup-bus.c                            |   85 ++-
> include/linux/pci.h                                |   17 +
> 15 files changed, 1449 insertions(+), 114 deletions(-)
> create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt
>
>-- 
>1.7.9.5

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread


* Re: [PATCH V10 00/17] Enable SRIOV on Power8
  2014-12-22  6:05   ` Wei Yang
@ 2015-01-13 18:05     ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-01-13 18:05 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Mon, Dec 22, 2014 at 02:05:22PM +0800, Wei Yang wrote:
> Bjorn,
> 
> This patch set is tested on 3.19-rc1 and with the offset/stride update patch.
> 
> I saw your comment on the MEM64 issue, so if that is reverted, this
> patch set will not work. Since I think we can work in parallel, I am sending
> it here for more comments and to see whether I understood your previous
> comments correctly.
> 
> I will work with Yinghai to find a way to fix bug 85491; hopefully the Linux
> kernel can handle both cases.

OK.  The autobuilder found some minor issues, so I'll look for a v11
posting that fixes those.  Please pick up the changelogs from my
pci/virtualization branch because I rewrapped them so they fit in 80
columns when shown by "git log".

Bjorn

^ permalink raw reply	[flat|nested] 168+ messages in thread


* [PATCH V11 00/17] Enable SRIOV on Power8
  2015-01-13 18:05     ` Bjorn Helgaas
@ 2015-01-15  2:27       ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

This patchset enables the SRIOV on POWER8.

The general idea is to put each VF into an individual PE and allocate the
required resources like MMIO/DMA/MSI. The major difficulty comes from the MMIO
allocation and adjustment for the PF's IOV BAR.

On P8, we use M64BT to cover a PF's IOV BAR, which lets an individual VF sit
in its own PE. This gives more flexibility, while at the same time bringing
some restrictions on the PF's IOV BAR size and alignment.

To achieve this, we need to apply some hacks to the PCI devices' resources.
1. Expand the IOV BAR properly.
   Done by pnv_pci_ioda_fixup_iov_resources().
2. Shift the IOV BAR properly.
   Done by pnv_pci_vf_resource_shift().
3. IOV BAR alignment is calculated by arch dependent function instead of an
   individual VF BAR size.
   Done by pnv_pcibios_sriov_resource_alignment().
4. Take the IOV BAR alignment into consideration in the sizing and assigning.
   This is achieved by commit: "PCI: Take additional IOV BAR alignment in
   sizing and assigning"
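
As a side note on the exported pci_iov_virtfn_bus()/pci_iov_virtfn_devfn()
interfaces: a VF's routing ID is derived from the PF's routing ID plus the
VF Offset and VF Stride fields of the PF's SR-IOV capability (which is why a
change of NumVFs can change the VFs' BDFs). A minimal sketch of that
arithmetic, with made-up offset/stride values that are examples only, not
taken from the patches:

```shell
# Routing ID (bus << 8 | devfn) of VF n, counted from 0, computed from
# the PF's RID plus the VF Offset/Stride capability fields.
pf_bus=6; pf_devfn=0            # PF at 0000:06:00.0 (assumed example)
offset=16; stride=1; n=3        # assumed capability contents
rid=$(( (pf_bus << 8) + pf_devfn + offset + n * stride ))
vf_bus=$(( rid >> 8 ))          # what pci_iov_virtfn_bus() would return
vf_devfn=$(( rid & 0xff ))      # what pci_iov_virtfn_devfn() would return
printf '%04x:%02x:%02x.%x\n' 0 "$vf_bus" $(( vf_devfn >> 3 )) $(( vf_devfn & 7 ))
# prints 0000:06:02.3
```

With a large offset or stride the RID can spill into the next bus number,
which is why the bus is computed from the full RID rather than copied from
the PF.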

Test Environment:
       The SRIOV device tested is Emulex Lancer(10df:e220) and
       Mellanox ConnectX-3(15b3:1003) on POWER8.

Example of passing a VF through to a guest via vfio:
	1. unbind the original driver and bind to vfio-pci driver
	   echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
	   echo  1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
	   Note: this should be done for each device in the same iommu_group
	2. Start qemu and pass device through vfio
	   /home/ywywyang/git/qemu-impreza/ppc64-softmmu/qemu-system-ppc64 \
		   -M pseries -m 2048 -enable-kvm -nographic \
		   -drive file=/home/ywywyang/kvm/fc19.img \
		   -monitor telnet:localhost:5435,server,nowait -boot cd \
		   -device "spapr-pci-vfio-host-bridge,id=CXGB3,iommu=26,index=6"
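
Step 1's note (every device in the iommu_group must be unbound) can be
scripted by walking the group under sysfs. A sketch, using the standard
sysfs layout; the sysfs root is a parameter only so the walk can be
exercised against a fake tree, on a real machine it is simply /sys:

```shell
# Unbind every device that shares a given device's iommu_group, as
# step 1's note requires before handing the group to vfio.
unbind_group() {                  # usage: unbind_group <sysfs-root> <dbdf>
	sysfs=$1 dev=$2
	# iommu_group is a symlink to /sys/kernel/iommu_groups/<n>
	group=$(basename "$(readlink "$sysfs/bus/pci/devices/$dev/iommu_group")")
	for d in "$sysfs/kernel/iommu_groups/$group/devices"/*; do
		addr=$(basename "$d")
		unbind="$sysfs/bus/pci/devices/$addr/driver/unbind"
		# devices with no driver bound have no 'driver' link; skip them
		if [ -e "$unbind" ]; then
			echo "$addr" > "$unbind"
		fi
	done
}
# on a real machine, as root: unbind_group /sys 0000:06:0d.0
```

Registering the vendor:device pair with vfio-pci via new_id, as shown in
step 1 above, is still required once afterwards.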

Verify that the responder is exactly the VF:
	1. ping from a machine in the same subnet (the broadcast domain)
	2. run arp -n on this machine
	   9.115.251.20             ether   00:00:c9:df:ed:bf   C eth0
	3. ifconfig in the guest
	   # ifconfig eth1
	   eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
	        inet 9.115.251.20  netmask 255.255.255.0  broadcast 9.115.251.255
		inet6 fe80::200:c9ff:fedf:edbf  prefixlen 64  scopeid 0x20<link>
	        ether 00:00:c9:df:ed:bf  txqueuelen 1000 (Ethernet)
	        RX packets 175  bytes 13278 (12.9 KiB)
	        RX errors 0  dropped 0  overruns 0  frame 0
		TX packets 58  bytes 9276 (9.0 KiB)
	        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
	4. They have the same MAC address

	Note: make sure you shut down other network interfaces in the guest.

---
v11:
   * fix some compile warnings
v10:
   * remove the weak function pcibios_iov_resource_size()
     the VF BAR size is stored in the pci_sriov structure and retrieved from
     pci_iov_resource_size()
   * use "Reserve additional" instead of "Expand" to be more accurate in the
     change log
   * add a log message to show the PF's IOV BAR final size
   * add pcibios_sriov_enable/disable() weak functions in sriov_enable/disable()
     for arch setup before enabling VFs, e.g. the arch could fix up the BDF for
     VFs, since a change of NumVFs affects the BDF of VFs.
   * add some explanation of PE on the Power arch in the documentation
v9:
   * make the change log consistent in the terminology
     PF's IOV BAR -> the SRIOV BAR in PF
     VF's BAR -> the normal BAR in VF's view
   * rename all newly introduced function from _sriov_ to _iov_
   * rename the document to Documentation/powerpc/pci_iov_resource_on_powernv.txt
   * add the vendor id and device id of the tested devices
   * change the return value from EINVAL to ENOSYS for pci_iov_virtfn_bus() and
     pci_iov_virtfn_devfn() when called on a PF or when SRIOV is not configured
   * rebase on 3.18-rc2 and tested
v8:
   * use the weak function pcibios_sriov_resource_size() instead of some flag
     to retrieve the IOV BAR size.
   * add a document Documentation/powerpc/pci_resource.txt to explain the
     design.
   * make pci_iov_virtfn_bus()/pci_iov_virtfn_devfn() not inline.
   * extract a function res_to_dev_res(), making it more general to get the
     additional size and alignment
   * fix a deadlock introduced in "powrepc/pci: Refactor pci_dn".
     the root cause is that pci_get_slot() takes pci_bus_sem, leading to a
     deadlock.
v7:
   * add IORESOURCE_ARCH flag for IOV BAR on powernv platform.
   * when IOV BAR has IORESOURCE_ARCH flag, the size is retrieved from
     hardware directly. If not, calculate as usual.
   * reorder the patch set, group them by subsystem:
     PCI, powerpc, powernv
   * rebase it on 3.16-rc6
v6:
   * remove the pcibios_enable_sriov()/pcibios_disable_sriov() weak functions
     similar functionality is moved to
     pnv_pci_enable_device_hook()/pnv_pci_disable_device_hook(). When the PF is
     enabled, the platform will try its best to allocate resources for VFs.
   * remove pcibios_sriov_resource_size weak function
   * VF BAR size is retrieved from hardware directly in virtfn_add()
v5:
   * merge the SRIOV related platform functions into machdep_calls
     wrap them in one CONFIG_PCI_IOV macro
   * define IODA_INVALID_M64 to replace (-1)
     use this value to represent an unused m64_wins entry
   * rename pnv_pci_release_dev_dma() to pnv_pci_ioda2_release_dma_pe()
     this function is a counterpart to pnv_pci_ioda2_setup_dma_pe()
   * change dev_info() to dev_dbg() in pnv_pci_ioda_fixup_iov_resources()
     to reduce kernel log noise
   * release the M64 window in pnv_pci_ioda2_release_dma_pe()
v4:
   * code format fix, eg. not exceed 80 chars
   * in commit "ppc/pnv: Add function to deconfig a PE"
     check the bus has a bridge before print the name
     remove a PE from its own PELTV
   * change the function name for sriov resource size/alignment
   * rebase on 3.16-rc3
   * VFs will not rely on device node
     As Grant Likely's comments, kernel should have the ability to handle the
     lack of device_node gracefully. Gavin restructure the pci_dn, which
     makes the VF will have pci_dn even when VF's device_node is not provided
     by firmware.
   * clean all the patch title to make them comply with one style
   * fix return value for pci_iov_virtfn_bus/pci_iov_virtfn_devfn
v3:
   * change the return type of virtfn_bus/virtfn_devfn to int
     change the name of these two functions to pci_iov_virtfn_bus/pci_iov_virtfn_devfn
   * reduce the second parameter of pcibios_sriov_disable()
   * use data instead of pe in "ppc/pnv: allocate pe->iommu_table dynamically"
   * rename __pci_sriov_resource_size to pcibios_sriov_resource_size
   * rename __pci_sriov_resource_alignment to pcibios_sriov_resource_alignment
v2:
   * change the return value of virtfn_bus/virtfn_devfn to 0
   * move some TCE related marco definition to
     arch/powerpc/platforms/powernv/pci.h
   * fix __pci_sriov_resource_alignment on the powernv platform
     During the sizing stage, the IOV BAR is truncated to 0, which would
     affect the order of allocation. Fix this to make sure BARs are allocated
     in order of their alignment.
v1:
   * improve the change log for
     "PCI: Add weak __pci_sriov_resource_size() interface"
     "PCI: Add weak __pci_sriov_resource_alignment() interface"
     "PCI: take additional IOV BAR alignment in sizing and assigning"
   * wrap VF PE code in CONFIG_PCI_IOV
   * did regression test on P7.

Gavin Shan (1):
  powrepc/pci: Refactor pci_dn

Wei Yang (16):
  PCI/IOV: Export interface for retrieve VF's BDF
  PCI/IOV: add VF enable/disable hook
  PCI: Add weak pcibios_iov_resource_alignment() interface
  PCI: Store VF BAR size in pci_sriov
  PCI: Take additional PF's IOV BAR alignment in sizing and assigning
  powerpc/pci: Add PCI resource alignment documentation
  powerpc/pci: Don't unset pci resources for VFs
  powerpc/pci: remove pci_dn->pcidev field
  powerpc/powernv: Use pci_dn in PCI config accessor
  powerpc/powernv: Allocate pe->iommu_table dynamically
  powerpc/powernv: Reserve additional space for IOV BAR according to
    the number of total_pe
  powerpc/powernv: Implement pcibios_iov_resource_alignment() on
    powernv
  powerpc/powernv: Shift VF resource with an offset
  powerpc/powernv: Allocate VF PE
  powerpc/powernv: Reserve additional space for IOV BAR, with
    m64_per_iov supported
  powerpc/powernv: Group VF PE when IOV BAR is big on PHB3

 .../powerpc/pci_iov_resource_on_powernv.txt        |  215 ++++++
 arch/powerpc/include/asm/device.h                  |    3 +
 arch/powerpc/include/asm/iommu.h                   |    3 +
 arch/powerpc/include/asm/machdep.h                 |    7 +
 arch/powerpc/include/asm/pci-bridge.h              |   24 +-
 arch/powerpc/kernel/pci-common.c                   |   23 +
 arch/powerpc/kernel/pci_dn.c                       |  253 ++++++-
 arch/powerpc/platforms/powernv/eeh-powernv.c       |   14 +-
 arch/powerpc/platforms/powernv/pci-ioda.c          |  739 +++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c               |   87 +--
 arch/powerpc/platforms/powernv/pci.h               |   13 +-
 drivers/pci/iov.c                                  |   80 ++-
 drivers/pci/pci.h                                  |    2 +
 drivers/pci/setup-bus.c                            |   85 ++-
 include/linux/pci.h                                |   17 +
 15 files changed, 1451 insertions(+), 114 deletions(-)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 168+ messages in thread

* [PATCH V11 00/17] Enable SRIOV on Power8
@ 2015-01-15  2:27       ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, Wei Yang, linuxppc-dev

This patchset enables the SRIOV on POWER8.

The gerneral idea is put each VF into one individual PE and allocate required
resources like MMIO/DMA/MSI. The major difficulty comes from the MMIO
allocation and adjustment for PF's IOV BAR.

On P8, we use M64BT to cover a PF's IOV BAR, which could make an individual VF
sit in its own PE. This gives more flexiblity, while at the mean time it
brings on some restrictions on the PF's IOV BAR size and alignment.

To achieve this effect, we need to do some hack on pci devices's resources.
1. Expand the IOV BAR properly.
   Done by pnv_pci_ioda_fixup_iov_resources().
2. Shift the IOV BAR properly.
   Done by pnv_pci_vf_resource_shift().
3. IOV BAR alignment is calculated by arch dependent function instead of an
   individual VF BAR size.
   Done by pnv_pcibios_sriov_resource_alignment().
4. Take the IOV BAR alignment into consideration in the sizing and assigning.
   This is achieved by commit: "PCI: Take additional IOV BAR alignment in
   sizing and assigning"

Test Environment:
       The SRIOV device tested is Emulex Lancer(10df:e220) and
       Mellanox ConnectX-3(15b3:1003) on POWER8.

Examples on pass through a VF to guest through vfio:
	1. unbind the original driver and bind to vfio-pci driver
	   echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
	   echo  1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
	   Note: this should be done for each device in the same iommu_group
	2. Start qemu and pass device through vfio
	   /home/ywywyang/git/qemu-impreza/ppc64-softmmu/qemu-system-ppc64 \
		   -M pseries -m 2048 -enable-kvm -nographic \
		   -drive file=/home/ywywyang/kvm/fc19.img \
		   -monitor telnet:localhost:5435,server,nowait -boot cd \
		   -device "spapr-pci-vfio-host-bridge,id=CXGB3,iommu=26,index=6"

Verify this is the exact VF response:
	1. ping from a machine in the same subnet(the broadcast domain)
	2. run arp -n on this machine
	   9.115.251.20             ether   00:00:c9:df:ed:bf   C eth0
	3. ifconfig in the guest
	   # ifconfig eth1
	   eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
	        inet 9.115.251.20  netmask 255.255.255.0  broadcast 9.115.251.255
		inet6 fe80::200:c9ff:fedf:edbf  prefixlen 64  scopeid 0x20<link>
	        ether 00:00:c9:df:ed:bf  txqueuelen 1000 (Ethernet)
	        RX packets 175  bytes 13278 (12.9 KiB)
	        RX errors 0  dropped 0  overruns 0  frame 0
		TX packets 58  bytes 9276 (9.0 KiB)
	        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
	4. They have the same MAC address

	Note: make sure you shutdown other network interfaces in guest.

---
v11:
   * fix some compile warning
v10:
   * remove weak function pcibios_iov_resource_size()
     the VF BAR size is stored in pci_sriov structure and retrieved from
     pci_iov_resource_size()
   * Use "Reserve additional" instead of "Expand" to be more acurate in the
     change log
   * add log message to show the PF's IOV BAR final size
   * add pcibios_sriov_enable/disable() weak funcion in sriov_enable/disable()
     for arch setup before enable VFs. Like the arch could fix up the BDF for
     VFs, since the change of NumVFs would affect the BDF of VFs.
   * Add some explanation of PE on Power arch in the documentation
v9:
   * make the change log consistent in the terminology
     PF's IOV BAR -> the SRIOV BAR in PF
     VF's BAR -> the normal BAR in VF's view
   * rename all newly introduced function from _sriov_ to _iov_
   * rename the document to Documentation/powerpc/pci_iov_resource_on_powernv.txt
   * add the vendor id and device id of the tested devices
   * change return value from EINVAL to ENOSYS for pci_iov_virtfn_bus() and
     pci_iov_virtfn_devfn() when it is called on PF or SRIOV is not configured
   * rebase on 3.18-rc2 and tested
v8:
   * use weak funcion pcibios_sriov_resource_size() instead of some flag to
     retrieve the IOV BAR size.
   * add a document Documentation/powerpc/pci_resource.txt to explain the
     design.
   * make pci_iov_virtfn_bus()/pci_iov_virtfn_devfn() not inline.
   * extract a function res_to_dev_res(), so that it is more general to get
     additional size and alignment
   * fix one contention which is introduced in "powrepc/pci: Refactor pci_dn".
     the root cause is pci_get_slot() takes pci_bus_sem and leads to dead
     lock.
v7:
   * add IORESOURCE_ARCH flag for IOV BAR on powernv platform.
   * when IOV BAR has IORESOURCE_ARCH flag, the size is retrieved from
     hardware directly. If not, calculate as usual.
   * reorder the patch set, group them by subsystem:
     PCI, powerpc, powernv
   * rebase it on 3.16-rc6
v6:
   * remove pcibios_enable_sriov()/pcibios_disable_sriov() weak function
     similar function is moved to
     pnv_pci_enable_device_hook()/pnv_pci_disable_device_hook(). When PF is
     enabled, platform will try best to allocate resources for VFs.
   * remove pcibios_sriov_resource_size weak function
   * VF BAR size is retrieved from hardware directly in virtfn_add()
v5:
   * merge those SRIOV related platform functions in machdep_calls
     wrap them in one CONFIG_PCI_IOV marco
   * define IODA_INVALID_M64 to replace (-1)
     use this value to represent the m64_wins is not used
   * rename pnv_pci_release_dev_dma() to pnv_pci_ioda2_release_dma_pe()
     this function is the counterpart of pnv_pci_ioda2_setup_dma_pe()
   * change dev_info() to dev_dbg() in pnv_pci_ioda_fixup_iov_resources()
     to reduce kernel log noise
   * release M64 window in pnv_pci_ioda2_release_dma_pe()
v4:
   * code format fixes, e.g. keeping lines within 80 characters
   * in commit "ppc/pnv: Add function to deconfig a PE":
     check that the bus has a bridge before printing its name;
     remove a PE from its own PELTV
   * change the function name for sriov resource size/alignment
   * rebase on 3.16-rc3
   * VFs will not rely on device node
     Per Grant Likely's comments, the kernel should handle the lack of a
     device_node gracefully. Gavin restructured pci_dn so that a VF has a
     pci_dn even when the VF's device_node is not provided by firmware.
   * clean all the patch title to make them comply with one style
   * fix return value for pci_iov_virtfn_bus/pci_iov_virtfn_devfn
v3:
   * change the return type of virtfn_bus/virtfn_devfn to int
     change the name of these two functions to pci_iov_virtfn_bus/pci_iov_virtfn_devfn
   * drop the second parameter of pcibios_sriov_disable()
   * use data instead of pe in "ppc/pnv: allocate pe->iommu_table dynamically"
   * rename __pci_sriov_resource_size to pcibios_sriov_resource_size
   * rename __pci_sriov_resource_alignment to pcibios_sriov_resource_alignment
v2:
   * change the return value of virtfn_bus/virtfn_devfn to 0
   * move some TCE related marco definition to
     arch/powerpc/platforms/powernv/pci.h
   * fix __pci_sriov_resource_alignment() on the powernv platform
     During the sizing stage, the IOV BAR was truncated to 0, which affects
     the allocation order. Fix this so that BARs are allocated in order of
     their alignment.
v1:
   * improve the change log for
     "PCI: Add weak __pci_sriov_resource_size() interface"
     "PCI: Add weak __pci_sriov_resource_alignment() interface"
     "PCI: take additional IOV BAR alignment in sizing and assigning"
   * wrap VF PE code in CONFIG_PCI_IOV
   * did regression test on P7.

Gavin Shan (1):
  powerpc/pci: Refactor pci_dn

Wei Yang (16):
  PCI/IOV: Export interface for retrieve VF's BDF
  PCI/IOV: add VF enable/disable hook
  PCI: Add weak pcibios_iov_resource_alignment() interface
  PCI: Store VF BAR size in pci_sriov
  PCI: Take additional PF's IOV BAR alignment in sizing and assigning
  powerpc/pci: Add PCI resource alignment documentation
  powerpc/pci: Don't unset pci resources for VFs
  powerpc/pci: remove pci_dn->pcidev field
  powerpc/powernv: Use pci_dn in PCI config accessor
  powerpc/powernv: Allocate pe->iommu_table dynamically
  powerpc/powernv: Reserve additional space for IOV BAR according to
    the number of total_pe
  powerpc/powernv: Implement pcibios_iov_resource_alignment() on
    powernv
  powerpc/powernv: Shift VF resource with an offset
  powerpc/powernv: Allocate VF PE
  powerpc/powernv: Reserve additional space for IOV BAR, with
    m64_per_iov supported
  powerpc/powernv: Group VF PE when IOV BAR is big on PHB3

 .../powerpc/pci_iov_resource_on_powernv.txt        |  215 ++++++
 arch/powerpc/include/asm/device.h                  |    3 +
 arch/powerpc/include/asm/iommu.h                   |    3 +
 arch/powerpc/include/asm/machdep.h                 |    7 +
 arch/powerpc/include/asm/pci-bridge.h              |   24 +-
 arch/powerpc/kernel/pci-common.c                   |   23 +
 arch/powerpc/kernel/pci_dn.c                       |  253 ++++++-
 arch/powerpc/platforms/powernv/eeh-powernv.c       |   14 +-
 arch/powerpc/platforms/powernv/pci-ioda.c          |  739 +++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c               |   87 +--
 arch/powerpc/platforms/powernv/pci.h               |   13 +-
 drivers/pci/iov.c                                  |   80 ++-
 drivers/pci/pci.h                                  |    2 +
 drivers/pci/setup-bus.c                            |   85 ++-
 include/linux/pci.h                                |   17 +
 15 files changed, 1451 insertions(+), 114 deletions(-)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 168+ messages in thread

* [PATCH V11 01/17] PCI/IOV: Export interface for retrieve VF's BDF
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:27         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

When implementing SR-IOV on the PowerNV platform, some resource reservation
is needed for VFs, which don't exist at boot time. To match resources with
VFs, the code needs to get a VF's BDF in advance.

This patch exports the interfaces to retrieve a VF's BDF:
   * Make virtfn_bus() an exported interface
   * Make virtfn_devfn() an exported interface
   * Rename them with more specific names
   * Code cleanup in pci_sriov_resource_alignment()

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   22 +++++++++++++---------
 include/linux/pci.h |   11 +++++++++++
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ea3a82c..e76d1a0 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -19,14 +19,18 @@
 
 #define VIRTFN_ID_LEN	16
 
-static inline u8 virtfn_bus(struct pci_dev *dev, int id)
+int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
 				    dev->sriov->stride * id) >> 8);
 }
 
-static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
 {
+	if (!dev->is_physfn)
+		return -EINVAL;
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * id) & 0xff;
 }
@@ -62,7 +66,7 @@ static inline void pci_iov_max_bus_range(struct pci_dev *dev)
 
 	for ( ; total >= 0; total--) {
 		pci_iov_set_numvfs(dev, total);
-		busnr = virtfn_bus(dev, iov->total_VFs - 1);
+		busnr = pci_iov_virtfn_bus(dev, iov->total_VFs - 1);
 		if (busnr > max)
 			max = busnr;
 	}
@@ -108,7 +112,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	struct pci_bus *bus;
 
 	mutex_lock(&iov->dev->sriov->lock);
-	bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
+	bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
 	if (!bus)
 		goto failed;
 
@@ -116,7 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	if (!virtfn)
 		goto failed0;
 
-	virtfn->devfn = virtfn_devfn(dev, id);
+	virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
 	virtfn->vendor = dev->vendor;
 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
 	pci_setup_device(virtfn);
@@ -179,8 +183,8 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 	struct pci_sriov *iov = dev->sriov;
 
 	virtfn = pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
-					     virtfn_bus(dev, id),
-					     virtfn_devfn(dev, id));
+					     pci_iov_virtfn_bus(dev, id),
+					     pci_iov_virtfn_devfn(dev, id));
 	if (!virtfn)
 		return;
 
@@ -255,7 +259,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	iov->offset = offset;
 	iov->stride = stride;
 
-	if (virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
+	if (pci_iov_virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
 		dev_err(&dev->dev, "SR-IOV: bus number out of range\n");
 		return -ENOMEM;
 	}
@@ -551,7 +555,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 	if (!reg)
 		return 0;
 
-	 __pci_read_base(dev, pci_bar_unknown, &tmp, reg);
+	__pci_read_base(dev, pci_bar_unknown, &tmp, reg);
 	return resource_alignment(&tmp);
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 360a966..74ef944 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1658,6 +1658,9 @@ int pci_ext_cfg_avail(void);
 void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
 
 #ifdef CONFIG_PCI_IOV
+int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
+int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
+
 int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 int pci_num_vf(struct pci_dev *dev);
@@ -1665,6 +1668,14 @@ int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
 #else
+static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
+static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
+{
+	return -ENOSYS;
+}
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
 static inline void pci_disable_sriov(struct pci_dev *dev) { }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread


* [PATCH V11 02/17] PCI/IOV: add VF enable/disable hook
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:27         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

VFs are dynamically created and released when the driver enables or disables
them. On some platforms, like PowerNV, special resources are necessary to
enable VFs.

This patch adds two hooks so the platform can do the necessary setup before
the VFs are created and clean up after they are released.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index e76d1a0..933d8cc 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -213,6 +213,11 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
 	pci_dev_put(dev);
 }
 
+int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+       return 0;
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
 	int rc;
@@ -223,6 +228,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	struct pci_dev *pdev;
 	struct pci_sriov *iov = dev->sriov;
 	int bars = 0;
+	int retval;
 
 	if (!nr_virtfn)
 		return 0;
@@ -297,6 +303,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	if (nr_virtfn < initial)
 		initial = nr_virtfn;
 
+	if ((retval = pcibios_sriov_enable(dev, initial))) {
+		dev_err(&dev->dev, "Failure %d from pcibios_sriov_enable()\n",
+			retval);
+		return retval;
+	}
+
 	for (i = 0; i < initial; i++) {
 		rc = virtfn_add(dev, i, 0);
 		if (rc)
@@ -325,6 +337,11 @@ failed:
 	return rc;
 }
 
+int __weak pcibios_sriov_disable(struct pci_dev *pdev, u16 vf_num)
+{
+       return 0;
+}
+
 static void sriov_disable(struct pci_dev *dev)
 {
 	int i;
@@ -336,6 +353,8 @@ static void sriov_disable(struct pci_dev *dev)
 	for (i = 0; i < iov->num_VFs; i++)
 		virtfn_remove(dev, i, 0);
 
+	pcibios_sriov_disable(dev, iov->num_VFs);
+
 	iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
 	pci_cfg_access_lock(dev);
 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread


* [PATCH V11 03/17] PCI: Add weak pcibios_iov_resource_alignment() interface
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:27         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

The alignment of a PF's IOV BAR is designed to be the size of an individual
VF BAR. This works fine for many platforms, but on the PowerNV platform it
needs some change.

The original alignment works because, at the sizing and assigning stage, the
requirement comes from an individual VF's BAR size rather than the PF's whole
IOV BAR. This is why the original code simply retrieves the individual VF BAR
size as the alignment.

On the PowerNV platform, the whole PF IOV BAR must be aligned to a hardware
segment. Because of this, the alignment of the PF's IOV BAR should be
calculated separately.

This patch introduces a weak pcibios_iov_resource_alignment() interface,
which gives the platform a chance to implement a specific method to calculate
the PF's IOV BAR alignment.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   11 ++++++++++-
 include/linux/pci.h |    3 +++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 933d8cc..5f48201 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -556,6 +556,12 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
 		4 * (resno - PCI_IOV_RESOURCES);
 }
 
+resource_size_t __weak pcibios_iov_resource_alignment(struct pci_dev *dev,
+		int resno, resource_size_t align)
+{
+	return align;
+}
+
 /**
  * pci_sriov_resource_alignment - get resource alignment for VF BAR
  * @dev: the PCI device
@@ -570,12 +576,15 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
 {
 	struct resource tmp;
 	int reg = pci_iov_resource_bar(dev, resno);
+	resource_size_t align;
 
 	if (!reg)
 		return 0;
 
 	__pci_read_base(dev, pci_bar_unknown, &tmp, reg);
-	return resource_alignment(&tmp);
+	align = resource_alignment(&tmp);
+
+	return pcibios_iov_resource_alignment(dev, resno, align);
 }
 
 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 74ef944..ae7a7ea 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1163,6 +1163,9 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 void pci_setup_bridge(struct pci_bus *bus);
 resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 					 unsigned long type);
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev,
+						 int resno,
+						 resource_size_t align);
 
 #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
 #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread


* [PATCH V11 04/17] PCI: Store VF BAR size in pci_sriov
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:27         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

Currently we don't store the VF BAR size; each time, we calculate it by
dividing the PF's IOV BAR size by total_VFs.

This patch stores the VF BAR size in pci_sriov and introduces a function to
retrieve it. It also adds a log message showing the total PF IOV BAR size.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   28 ++++++++++++++++++++--------
 drivers/pci/pci.h   |    2 ++
 include/linux/pci.h |    3 +++
 3 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5f48201..b43628b 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -100,6 +100,14 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
 		pci_remove_bus(virtbus);
 }
 
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{
+	if (!dev->is_physfn)
+		return 0;
+
+	return dev->sriov->res[resno - PCI_IOV_RESOURCES];
+}
+
 static int virtfn_add(struct pci_dev *dev, int id, int reset)
 {
 	int i;
@@ -135,8 +143,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 			continue;
 		virtfn->resource[i].name = pci_name(virtfn);
 		virtfn->resource[i].flags = res->flags;
-		size = resource_size(res);
-		do_div(size, iov->total_VFs);
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
 		virtfn->resource[i].start = res->start + size * id;
 		virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
 		rc = request_resource(res, &virtfn->resource[i]);
@@ -419,6 +426,12 @@ found:
 	pgsz &= ~(pgsz - 1);
 	pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
 
+	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+	if (!iov) {
+		rc = -ENOMEM;
+		goto failed;
+	}
+
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = dev->resource + PCI_IOV_RESOURCES + i;
@@ -430,16 +443,15 @@ found:
 			rc = -EIO;
 			goto failed;
 		}
+		iov->res[res - dev->resource - PCI_IOV_RESOURCES] =
+			resource_size(res);
 		res->end = res->start + resource_size(res) * total - 1;
+		dev_info(&dev->dev, "VF BAR%d: %pR (for %d VFs)",
+				(int)(res - dev->resource - PCI_IOV_RESOURCES),
+				res, total);
 		nres++;
 	}
 
-	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
-	if (!iov) {
-		rc = -ENOMEM;
-		goto failed;
-	}
-
 	iov->pos = pos;
 	iov->nres = nres;
 	iov->ctrl = ctrl;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 94faf97..b1c9fdd 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -241,6 +241,8 @@ struct pci_sriov {
 	struct pci_dev *dev;	/* lowest numbered PF */
 	struct pci_dev *self;	/* this PF */
 	struct mutex lock;	/* lock for VF bus */
+	resource_size_t res[PCI_SRIOV_NUM_BARS];
+				/* VF BAR size */
 };
 
 #ifdef CONFIG_PCI_ATS
diff --git a/include/linux/pci.h b/include/linux/pci.h
index ae7a7ea..f0b5f87 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1670,6 +1670,7 @@ int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 #else
 static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
 {
@@ -1689,6 +1690,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
 { return 0; }
 static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 { return 0; }
+static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{ return 0; }
 #endif
 
 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread


* [PATCH V11 05/17] PCI: Take additional PF's IOV BAR alignment in sizing and assigning
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:27         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

At the resource sizing/assigning stage, resources are divided into two lists,
the requested list and the additional list, but the alignment of the
additional IOV BAR is not taken into account in the sizing and assigning
procedure.

This is reasonable in the original implementation, since an IOV BAR's
alignment is usually the same as the PF BAR's alignment, which means the
alignment is already taken into consideration. However, this rule may be
violated on some platforms, e.g. the PowerNV platform.

This patch takes the additional IOV BAR alignment into account explicitly at
the sizing and assigning stage. When system MMIO space is not enough, the
PF's IOV BAR alignment does not contribute to the bridge; when it is enough,
the additional alignment does.

It also takes advantage of pci_dev_resource::min_align to store this
additional alignment.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/setup-bus.c |   85 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 71 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 0482235..08252c7 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -99,8 +99,8 @@ static void remove_from_list(struct list_head *head,
 	}
 }
 
-static resource_size_t get_res_add_size(struct list_head *head,
-					struct resource *res)
+static struct pci_dev_resource *res_to_dev_res(struct list_head *head,
+					       struct resource *res)
 {
 	struct pci_dev_resource *dev_res;
 
@@ -109,17 +109,37 @@ static resource_size_t get_res_add_size(struct list_head *head,
 			int idx = res - &dev_res->dev->resource[0];
 
 			dev_printk(KERN_DEBUG, &dev_res->dev->dev,
-				 "res[%d]=%pR get_res_add_size add_size %llx\n",
+				 "res[%d]=%pR res_to_dev_res add_size %llx min_align %llx\n",
 				 idx, dev_res->res,
-				 (unsigned long long)dev_res->add_size);
+				 (unsigned long long)dev_res->add_size,
+				 (unsigned long long)dev_res->min_align);
 
-			return dev_res->add_size;
+			return dev_res;
 		}
 	}
 
-	return 0;
+	return NULL;
+}
+
+static resource_size_t get_res_add_size(struct list_head *head,
+					struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->add_size : 0;
+}
+
+static resource_size_t get_res_add_align(struct list_head *head,
+		struct resource *res)
+{
+	struct pci_dev_resource *dev_res;
+
+	dev_res = res_to_dev_res(head, res);
+	return dev_res ? dev_res->min_align : 0;
 }
 
+
 /* Sort resources by alignment */
 static void pdev_sort_resources(struct pci_dev *dev, struct list_head *head)
 {
@@ -368,8 +388,9 @@ static void __assign_resources_sorted(struct list_head *head,
 	LIST_HEAD(save_head);
 	LIST_HEAD(local_fail_head);
 	struct pci_dev_resource *save_res;
-	struct pci_dev_resource *dev_res, *tmp_res;
+	struct pci_dev_resource *dev_res, *tmp_res, *dev_res2;
 	unsigned long fail_type;
+	resource_size_t add_align, align;
 
 	/* Check if optional add_size is there */
 	if (!realloc_head || list_empty(realloc_head))
@@ -384,10 +405,38 @@ static void __assign_resources_sorted(struct list_head *head,
 	}
 
 	/* Update res in head list with add_size in realloc_head list */
-	list_for_each_entry(dev_res, head, list)
+	list_for_each_entry_safe(dev_res, tmp_res, head, list) {
 		dev_res->res->end += get_res_add_size(realloc_head,
 							dev_res->res);
 
+		/*
+		 * There are two kinds of additional resources in the list:
+		 * 1. bridge resource  -- IORESOURCE_STARTALIGN
+		 * 2. SRIOV resource   -- IORESOURCE_SIZEALIGN
+		 * Here we only fix the additional alignment for bridges
+		 */
+		if (!(dev_res->res->flags & IORESOURCE_STARTALIGN))
+			continue;
+
+		add_align = get_res_add_align(realloc_head, dev_res->res);
+
+		/* Reorder the list by their alignment */
+		if (add_align > dev_res->res->start) {
+			resource_size_t r_size = resource_size(dev_res->res);
+			dev_res->res->start = add_align;
+			dev_res->res->end = add_align + r_size - 1;
+
+			list_for_each_entry(dev_res2, head, list) {
+				align = pci_resource_alignment(dev_res2->dev,
+							       dev_res2->res);
+				if (add_align > align)
+					list_move_tail(&dev_res->list,
+						       &dev_res2->list);
+			}
+		}
+
+	}
+
 	/* Try updated head list with add_size added */
 	assign_requested_resources_sorted(head, &local_fail_head);
 
@@ -930,6 +979,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	struct resource *b_res = find_free_bus_resource(bus,
 					mask | IORESOURCE_PREFETCH, type);
 	resource_size_t children_add_size = 0;
+	resource_size_t children_add_align = 0;
+	resource_size_t add_align = 0;
 
 	if (!b_res)
 		return -ENOSPC;
@@ -954,6 +1005,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			/* put SRIOV requested res to the optional list */
 			if (realloc_head && i >= PCI_IOV_RESOURCES &&
 					i <= PCI_IOV_RESOURCE_END) {
+				add_align = max(pci_resource_alignment(dev, r), add_align);
 				r->end = r->start - 1;
 				add_to_list(realloc_head, dev, r, r_size, 0/* don't care */);
 				children_add_size += r_size;
@@ -984,19 +1036,23 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			if (order > max_order)
 				max_order = order;
 
-			if (realloc_head)
+			if (realloc_head) {
 				children_add_size += get_res_add_size(realloc_head, r);
+				children_add_align = get_res_add_align(realloc_head, r);
+				add_align = max(add_align, children_add_align);
+			}
 		}
 	}
 
 	min_align = calculate_mem_align(aligns, max_order);
 	min_align = max(min_align, window_alignment(bus, b_res->flags));
 	size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), min_align);
+	add_align = max(min_align, add_align);
 	if (children_add_size > add_size)
 		add_size = children_add_size;
 	size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
 		calculate_memsize(size, min_size, add_size,
-				resource_size(b_res), min_align);
+				resource_size(b_res), add_align);
 	if (!size0 && !size1) {
 		if (b_res->start || b_res->end)
 			dev_info(&bus->self->dev, "disabling bridge window %pR to %pR (unused)\n",
@@ -1008,10 +1064,11 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 	b_res->end = size0 + min_align - 1;
 	b_res->flags |= IORESOURCE_STARTALIGN;
 	if (size1 > size0 && realloc_head) {
-		add_to_list(realloc_head, bus->self, b_res, size1-size0, min_align);
-		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window %pR to %pR add_size %llx\n",
-			   b_res, &bus->busn_res,
-			   (unsigned long long)size1-size0);
+		add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
+		dev_printk(KERN_DEBUG, &bus->self->dev, "bridge window "
+				 "%pR to %pR add_size %llx add_align %llx\n", b_res,
+				 &bus->busn_res, (unsigned long long)size1-size0,
+				 (unsigned long long)add_align);
 	}
 	return 0;
 }
-- 
1.7.9.5
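The sizing change in the patch above boils down to this: the mandatory window
(size0) keeps the natural min_align, while the optional window (size1) is
rounded up to add_align = max(min_align, largest child IOV BAR alignment).
A small user-space sketch of that arithmetic (names are illustrative and the
real pbus_size_mem() does more bookkeeping; alignments are assumed to be
powers of two):

```c
#include <assert.h>

typedef unsigned long long resource_size_t;

/* round sz up to the next multiple of align (align must be a power of two) */
static resource_size_t align_up(resource_size_t sz, resource_size_t align)
{
	return (sz + align - 1) & ~(align - 1);
}

/*
 * Sketch of the final sizing step: the required window uses min_align,
 * the optional (realloc) window is rounded to the stronger add_align.
 */
static void size_bridge_window(resource_size_t size, resource_size_t add_size,
			       resource_size_t min_align,
			       resource_size_t add_align,
			       resource_size_t *size0, resource_size_t *size1)
{
	if (add_align < min_align)	/* add_align = max(min_align, add_align) */
		add_align = min_align;
	*size0 = align_up(size, min_align);
	*size1 = align_up(size + add_size, add_align);
}
```

With 3M required, 5M optional, 1M natural alignment and an 8M IOV BAR
alignment, size0 stays at 3M while size1 becomes 8M, so only the optional
allocation pays for the stricter alignment.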


* [PATCH V11 06/17] powerpc/pci: Add PCI resource alignment documentation
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:27         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

In order to enable SRIOV on the PowerNV platform, the PF's IOV BAR needs to be
adjusted:
    1. size expanded
    2. aligned to M64BT size

This patch documents the reason for this change and how it is done.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 .../powerpc/pci_iov_resource_on_powernv.txt        |  215 ++++++++++++++++++++
 1 file changed, 215 insertions(+)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
new file mode 100644
index 0000000..10d4ac2
--- /dev/null
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -0,0 +1,215 @@
+Wei Yang <weiyang@linux.vnet.ibm.com>
+Benjamin Herrenschmidt <benh@au1.ibm.com>
+26 Aug 2014
+
+This document describes the hardware requirements for PCI MMIO resource
+sizing and assignment on the PowerNV platform, and how the generic PCI code
+handles them. The first two sections describe the concept of a PE and its
+implementation on P8 (IODA2).
+
+1. General Introduction on the Purpose of PE
+PE stands for Partitionable Endpoint.
+
+The concept of PE is a way to group the various resources associated
+with a device or a set of devices, to provide isolation between partitions
+(i.e. filtering of DMA, MSIs, etc.) and to provide a mechanism to freeze
+a device that is causing errors, in order to limit the possibility of
+propagation of bad data.
+
+There is thus, in HW, a table of PE states that contains a pair of
+"frozen" state bits (one for MMIO and one for DMA, they get set together
+but can be cleared independently) for each PE.
+
+When a PE is frozen, all stores in any direction are dropped and all loads
+return all 1's value. MSIs are also blocked. There's a bit more state that
+captures things like the details of the error that caused the freeze etc...
+but that's not critical.
+
+The interesting part is how the various types of PCIe transactions (MMIO,
+DMA, ...) are matched to their corresponding PEs.
+
+The following section provides a rough description of what we have on P8 (IODA2).
+Keep in mind that this is all per PHB (host bridge). Each PHB is a completely
+separate HW entity which replicates the entire logic, so has its own set
+of PEs etc...
+
+2. Implementation of PE on P8 (IODA2)
+First, P8 has 256 PEs per PHB.
+
+ * Inbound
+
+For DMA, MSIs and inbound PCIe error messages, we have a table (in memory but
+accessed in HW by the chip) that provides a direct correspondence between
+a PCIe RID (bus/dev/fn) and a "PE" number. We call this the RTT.
+
+ - For DMA we then provide an entire address space for each PE that can contain
+two "windows", depending on the value of PCI bit 59. Each window can then be
+configured to be remapped via a "TCE table" (iommu translation table), which has
+various configurable characteristics which we can describe another day.
+
+ - For MSIs, we have two windows in the address space (one at the top of the 32-bit
+space and one much higher) which, via a combination of the address and MSI value,
+will result in one of the 2048 interrupts per bridge being triggered. There's
+a PE value in the interrupt controller descriptor table as well which is compared
+with the PE obtained from the RTT to "authorize" the device to emit that specific
+interrupt.
+
+ - Error messages just use the RTT.
+
+ * Outbound. That's where the tricky part is.
+
+The PHB basically has a concept of "windows" from the CPU address space to the
+PCI address space. There is one M32 window and 16 M64 windows. They have different
+characteristics. First what they have in common: they are configured to forward a
+configurable portion of the CPU address space to the PCIe bus and must be naturally
+aligned power of two in size. The rest is different:
+
+  - The M32 window:
+
+    * It is limited to 4G in size
+
+    * It drops the top bits of the address (above the size) and replaces them with
+a configurable value. This is typically used to generate 32-bit PCIe accesses. We
+configure that window at boot from FW and don't touch it from Linux, it's usually
+set to forward a 2G portion of address space from the CPU to PCIe
+0x8000_0000..0xffff_ffff. (Note: The top 64K are actually reserved for MSIs but
+this is not a problem at this point, we just need to ensure Linux doesn't assign
+anything there, the M32 logic ignores that however and will forward in that space
+if we try).
+
+    * It is divided into 256 segments of equal size. A table in the chip provides
+for each of these 256 segments a PE#. That allows us to essentially assign
+portions of the MMIO space to PEs at a segment granularity. For a 2G window,
+this is 8M per segment.
+
+Now, this is the "main" window we use in Linux today (excluding SR-IOV). We
+basically use the trick of forcing the bridge MMIO windows onto a segment
+alignment/granularity so that the space behind a bridge can be assigned to a PE.
+
+Ideally we would like to be able to have individual functions in PE's but that
+would mean using a completely different address allocation scheme where individual
+function BARs can be "grouped" to fit in one or more segments....
+
+ - The M64 windows.
+
+   * Their smallest size is 1M
+
+   * They do not translate addresses (the address on PCIe is the same as the
+address on the PowerBus; there is a way to also set the top 14 bits which are
+not conveyed by the PowerBus, but we don't use this).
+
+   * They can be configured to be segmented or not. When segmented, they have
+256 segments, however they are not remapped. The segment number *is* the PE
+number. When not segmented, the PE number can be specified for the entire
+window.
+
+   * They support overlaps in which case there is a well defined ordering of
+matching (I don't remember off hand which of the lower or higher numbered
+window takes priority but basically it's well defined).
+
+We have code (fairly new compared to the M32 stuff) that exploits that for
+large BARs in 64-bit space:
+
+We create a single big M64 that covers the entire region of address space that
+has been assigned by FW for the PHB (about 64G, ignore the space for the M32,
+it comes out of a different "reserve"). We configure that window as segmented.
+
+Then we do the same thing as with M32, using the bridge alignment trick, to
+match to those giant segments.
+
+Since we cannot remap, we have two additional constraints:
+
+  - We do the PE# allocation *after* the 64-bit space has been assigned, since
+the segments used directly determine the PE#s. We then "update" the M32 PE#
+for the devices that use both 32-bit and 64-bit spaces, or assign the remaining
+PE#s to 32-bit only devices.
+
+  - We cannot "group" segments in HW so if a device ends up using more than
+one segment, we end up with more than one PE#. There is a HW mechanism to
+make the freeze state cascade to "companion" PEs but that only works for PCIe
+error messages (typically used so that if you freeze a switch, it freezes all
+its children). So we do it in SW. We lose a bit of effectiveness of EEH in that
+case, but that's the best we found. So when any of the PEs freezes, we freeze
+the other ones for that "domain". We thus introduce the concept of "master PE"
+which is the one used for DMA, MSIs etc... and "secondary PEs" that are used
+for the remaining M64 segments.
+
+We would like to investigate using additional M64's in "single PE" mode to
+overlay over specific BARs to work around some of that, for example for devices
+with very large BARs (some GPUs), it would make sense, but we haven't done it
+yet.
+
+Finally, we plan to use the M64 windows for SR-IOV, which is described in more
+detail in the next two sections. For a given IOV BAR, we need to effectively
+reserve the entire 256 segments (256 * IOV BAR size) and then "position" the
+BAR to start at the beginning of a free range of segments/PEs inside that M64.
+
+The goal is of course to be able to give a separate PE for each VF...
+
+3. Hardware requirement on PowerNV platform for SRIOV
+The PowerNV platform (IODA2 version) has 16 M64 BARs, which are used to map
+MMIO ranges to PE#s. Each M64 BAR covers one MMIO range, and this range is
+divided evenly into *total_pe* pieces, with one piece corresponding to one PE.
+
+We decided to leverage the M64 BARs to map VFs to their individual PEs, since
+all the SRIOV VF BARs share the same size.
+
+Doing so introduces another problem: the *total_pe* number is usually bigger
+than total_VFs. If we map one IOV BAR directly to one M64 BAR, part of the
+M64 BAR will map to another device's MMIO range.
+
+     0      1                     total_VFs - 1
+     +------+------+-     -+------+------+
+     |      |      |  ...  |      |      |
+     +------+------+-     -+------+------+
+
+                           IOV BAR
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           M64 BAR
+
+		Figure 1.0 Direct map IOV BAR
+
+As Figure 1.0 indicates, the range [total_VFs, total_pe - 1] of the M64 BAR
+may map to the MMIO range of some other device.
+
+The solution we currently have is to expand the IOV BAR to *total_pe* segments.
+
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           IOV BAR
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           M64 BAR
+
+		Figure 1.1 Map expanded IOV BAR
+
+Expanding the IOV BAR ensures that the whole M64 range does not affect
+other devices.
+
+4. How the generic PCI code handles it
+So far this looks workable, but another problem arises: the M64 BAR start
+address needs to be size-aligned, while the original generic PCI code assigns
+the IOV BAR aligned only to the individual VF BAR size.
+
+Since an SRIOV VF BAR is usually the same size as its PF BAR, the original
+generic PCI code did not need to count the IOV BAR alignment separately (the
+alignment is the same as its PF's). With the PowerNV requirement this changes:
+the alignment of the IOV BAR is now its total size, so we need to count it in.
+
+From:
+	alignment(IOV BAR) = size(VF BAR) = size(PF BAR)
+To:
+	alignment(IOV BAR) = size(IOV BAR)
+
+The commit "PCI: Take additional IOV BAR alignment in sizing and assigning"
+introduces add_align to track the alignment required by the IOV BAR and uses
+it to meet this requirement.
-- 
1.7.9.5
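The expansion in Figure 1.1 and the alignment change in section 4 reduce to
simple arithmetic: the IOV BAR grows from total_VFs segments to total_pe
segments, and its required alignment becomes its own total size. A small
sketch of that arithmetic (256 matches the per-PHB PE count described in
section 2; the function names are illustrative, not kernel APIs):

```c
#include <assert.h>

typedef unsigned long long resource_size_t;

#define TOTAL_PE 256	/* PEs per PHB on P8/IODA2, per section 2 */

/* expanded IOV BAR: one M64 segment per PE, not just one per VF */
static resource_size_t expanded_iov_bar_size(resource_size_t vf_bar_size)
{
	return vf_bar_size * TOTAL_PE;
}

/*
 * New rule from section 4: alignment(IOV BAR) = size(IOV BAR),
 * instead of the generic alignment(IOV BAR) = size(VF BAR).
 */
static resource_size_t iov_bar_alignment(resource_size_t vf_bar_size)
{
	return expanded_iov_bar_size(vf_bar_size);
}
```

For a 1M per-VF BAR this reserves a 256M IOV BAR aligned on 256M, so each of
the 256 M64 segments covers exactly one VF-sized slice.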


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 06/17] powerpc/pci: Add PCI resource alignment documentation
@ 2015-01-15  2:27         ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, Wei Yang, linuxppc-dev

In order to enable SRIOV on PowerNV platform, the PF's IOV BAR needs to be
adjusted:
    1. size expaned
    2. aligned to M64BT size

This patch documents this change on the reason and how.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 .../powerpc/pci_iov_resource_on_powernv.txt        |  215 ++++++++++++++++++++
 1 file changed, 215 insertions(+)
 create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt

diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
new file mode 100644
index 0000000..10d4ac2
--- /dev/null
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -0,0 +1,215 @@
+Wei Yang <weiyang@linux.vnet.ibm.com>
+Benjamin Herrenschmidt <benh@au1.ibm.com>
+26 Aug 2014
+
+This document describes the requirement from hardware for PCI MMIO resource
+sizing and assignment on PowerNV platform and how generic PCI code handle this
+requirement. The first two sections describes the concept to PE and the
+implementation on P8 (IODA2)
+
+1. General Introduction on the Purpose of PE
+PE stands for Partitionable Endpoint.
+
+The concept of PE is a way to group the various resources associated
+with a device or a set of device to provide isolation between partitions
+(ie. filtering of DMA, MSIs etc...) and to provide a mechanism to freeze
+a device that is causing errors in order to limit the possibility of
+propagation of bad data.
+
+There is thus, in HW, a table of PE states that contains a pair of
+"frozen" state bits (one for MMIO and one for DMA, they get set together
+but can be cleared independently) for each PE.
+
+When a PE is frozen, all stores in any direction are dropped and all loads
+return all 1's value. MSIs are also blocked. There's a bit more state that
+captures things like the details of the error that caused the freeze etc...
+but that's not critical.
+
+The interesting part is how the various type of PCIe transactions (MMIO,
+DMA,...) are matched to their corresponding PEs.
+
+Following section provides a rough description of what we have on P8 (IODA2).
+Keep in mind that this is all per PHB (host bridge). Each PHB is a completely
+separate HW entity which replicates the entire logic, so has its own set
+of PEs etc...
+
+2. Implementation of PE on P8 (IODA2)
+First, P8 has 256 PEs per PHB.
+
+ * Inbound
+
+For DMA, MSIs and inbound PCIe error messages, we have a table (in memory but
+accessed in HW by the chip) that provides a direct correspondence between
+a PCIe RID (bus/dev/fn) with a "PE" number. We call this the RTT.
+
+ - For DMA we then provide an entire address space for each PE that can contains
+two "windows", depending on the value of PCI bit 59. Each window can then be
+configured to be remapped via a "TCE table" (iommu translation table), which has
+various configurable characteristics which we can describe another day.
+
+ - For MSIs, we have two windows in the address space (one at the top of the 32-bit
+space and one much higher) which, via a combination of the address and MSI value,
+will result in one of the 2048 interrupts per bridge being triggered. There's
+a PE value in the interrupt controller descriptor table as well which is compared
+with the PE obtained from the RTT to "authorize" the device to emit that specific
+interrupt.
+
+ - Error messages just use the RTT.
+
+ * Outbound. That's where the tricky part is.
+
+The PHB basically has a concept of "windows" from the CPU address space to the
+PCI address space. There is one M32 window and 16 M64 windows. They have different
+characteristics. First what they have in common: they are configured to forward a
+configurable portion of the CPU address space to the PCIe bus and must be naturally
+aligned power of two in size. The rest is different:
+
+  - The M32 window:
+
+    * It is limited to 4G in size
+
+    * It drops the top bits of the address (above the size) and replaces them with
+a configurable value. This is typically used to generate 32-bit PCIe accesses. We
+configure that window at boot from FW and don't touch it from Linux, it's usually
+set to forward a 2G portion of address space from the CPU to PCIe
+0x8000_0000..0xffff_ffff. (Note: The top 64K are actually reserved for MSIs but
+this is not a problem at this point, we just need to ensure Linux doesn't assign
+anything there, the M32 logic ignores that however and will forward in that space
+if we try).
+
+    * It is divided into 256 segments of equal size. A table in the chip provides
+for each of these 256 segments a PE#. That allows to essentially assign portions
+of the MMIO space to PEs on a segment granularity. For a 2G window, this is 8M.
+
+Now, this is the "main" window we use in Linux today (excluding SR-IOV). We
+basically use the trick of forcing the bridge MMIO windows onto a segment
+alignment/granularity so that the space behind a bridge can be assigned to a PE.
+
+Ideally we would like to be able to have individual functions in PE's but that
+would mean using a completely different address allocation scheme where individual
+function BARs can be "grouped" to fit in one or more segments....
+
+ - The M64 windows.
+
+   * Their smallest size is 1M
+
+   * They do not translate addresses (the address on PCIe is the same as the
+address on the PowerBus. There is a way to also set the top 14 bits which are
+not conveyed by PowerBus but we don't use this).
+
+   * They can be configured to be segmented or not. When segmented, they have
+256 segments, however they are not remapped. The segment number *is* the PE
+number. When no segmented, the PE number can be specified for the entire
+window.
+
+   * They support overlaps in which case there is a well defined ordering of
+matching (I don't remember off hand which of the lower or higher numbered
+window takes priority but basically it's well defined).
+
+We have code (fairly new compared to the M32 stuff) that exploits that for
+large BARs in 64-bit space:
+
+We create a single big M64 that covers the entire region of address space that
+has been assigned by FW for the PHB (about 64G, ignore the space for the M32,
+it comes out of a different "reserve"). We configure that window as segmented.
+
+Then we do the same thing as with M32, using the bridge alignment trick, to
+match bridge windows to those giant segments.
+
+Since we cannot remap, we have two additional constraints:
+
+  - We do the PE# allocation *after* the 64-bit space has been assigned,
+since the segments used directly determine the PE#s. We then "update" the
+M32 PE# for the devices that use both 32-bit and 64-bit spaces, or assign
+the remaining PE#s to 32-bit only devices.
+
+  - We cannot "group" segments in HW, so if a device ends up using more than
+one segment, we end up with more than one PE#. There is a HW mechanism to
+make the freeze state cascade to "companion" PEs, but it only works for PCIe
+error messages (typically used so that if you freeze a switch, it freezes
+all its children). So we do it in SW: when any of the PEs freezes, we freeze
+the other ones in that "domain". We lose a bit of the effectiveness of EEH
+in that case, but that's the best we found. We thus introduce the concept of
+a "master PE", which is the one used for DMA, MSIs etc., and "secondary PEs"
+that are used for the remaining M64 segments.
+
+We would like to investigate using additional M64s in "single PE" mode to
+overlay specific BARs and work around some of that. For devices with very
+large BARs (some GPUs) it would make sense, but we haven't done it yet.
+
+Finally, we plan to use M64 for SR-IOV, which will be described in more
+detail in the next two sections. For a given IOV BAR, we need to effectively
+reserve the entire 256 segments (256 * IOV BAR size) and then "position" the
+BAR to start at the beginning of a free range of segments/PEs inside that
+M64.
+
+The goal is of course to be able to give a separate PE for each VF...
+
+3. Hardware requirement on PowerNV platform for SRIOV
+The PowerNV platform (IODA2 version) has 16 M64 BARs, which are used to map
+MMIO ranges to PE#s. Each M64 BAR covers one MMIO range, and this range is
+divided evenly into *total_pe* pieces, with one piece corresponding to one
+PE.
+
+We decided to leverage these M64 BARs to map VFs to individual PEs, since
+the VF BARs of an SRIOV device all share the same size.
+
+Doing so introduces another problem: *total_pe* is usually bigger than
+total_VFs, so if we map one IOV BAR directly to one M64 BAR, some part of
+the M64 BAR will map to another device's MMIO range.
+
+     0      1                     total_VFs - 1
+     +------+------+-     -+------+------+
+     |      |      |  ...  |      |      |
+     +------+------+-     -+------+------+
+
+                           IOV BAR
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           M64 BAR
+
+		Figure 1.0 Direct map IOV BAR
+
+As Figure 1.0 indicates, the range [total_VFs, total_pe - 1] in the M64 BAR
+may map to some MMIO range of another device.
+
+The solution we currently have is to expand the IOV BAR to *total_pe*
+VF-sized pieces.
+
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           IOV BAR
+     0      1                     total_VFs - 1          total_pe - 1
+     +------+------+-     -+------+------+-      -+------+------+
+     |      |      |  ...  |      |      |   ...  |      |      |
+     +------+------+-     -+------+------+-      -+------+------+
+
+                           M64 BAR
+
+		Figure 1.1 Map expanded IOV BAR
+
+Expanding the IOV BAR this way ensures the whole M64 range will not affect
+other devices.
+
+4. How the generic PCI code handles it
+So far so good, but another problem arises: the M64 BAR start address needs
+to be aligned to its size, while the generic PCI code originally assigns
+the IOV BAR aligned only to the individual VF BAR size.
+
+Since an SRIOV VF BAR is usually the same size as the corresponding PF BAR,
+the original generic PCI code did not need to take the IOV BAR alignment
+into account (the alignment is the same as the PF's). With the change on the
+PowerNV platform, the alignment of the IOV BAR becomes its total size, so we
+need to account for it.
+
+From:
+	alignment(IOV BAR) = size(VF BAR) = size(PF BAR)
+To:
+	alignment(IOV BAR) = size(IOV BAR)
+
+The commit "PCI: Take additional IOV BAR alignment in sizing and assigning"
+introduces add_align to track the alignment required by the IOV BAR and
+uses it to meet this requirement.
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 07/17] powerpc/pci: Don't unset pci resources for VFs
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:27         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

If we're going to reassign resources with the PCI_REASSIGN_ALL_RSRC flag,
all resources will be cleaned out during device header fixup time and then
reassigned by the PCI core. However, the VF resources won't be reassigned,
so we shouldn't clean them out.

This patch adds a condition. If the pci_dev is a VF, skip the resource
unset process.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-common.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 37d512d..889f743 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
 		       pci_name(dev));
 		return;
 	}
+
+	if (dev->is_virtfn)
+		return;
+
 	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
 		struct resource *res = dev->resource + i;
 		struct pci_bus_region reg;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 08/17] powerpc/pci: Refactor pci_dn
  2015-01-15  2:27       ` Wei Yang
                         ` (7 preceding siblings ...)
  (?)
@ 2015-01-15  2:27       ` Wei Yang
  2015-02-20 23:19           ` Bjorn Helgaas
  -1 siblings, 1 reply; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev

From: Gavin Shan <gwshan@linux.vnet.ibm.com>

pci_dn is the extension of a PCI device node and is created from the
device node. Unfortunately, VFs are enabled dynamically by the PF's
driver and don't have corresponding device nodes or pci_dn. This
patch refactors pci_dn to support VFs:

   * pci_dn is organized as a hierarchy tree. A VF's pci_dn is put
     on the child list of the pci_dn of the PF's bridge. The pci_dn
     of any other device is put on the child list of the pci_dn of
     its upstream bridge.

   * A VF's pci_dn is created dynamically when the PF enables VFs
     and destroyed when the PF disables them. The pci_dn of other
     devices is still created from the device node as before.

   * For any particular PCI device (VF or not), its pci_dn can be
     found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
     or the parent's child list. The fast path (fetching pci_dn
     through the PCI device instance) is populated at early fixup
     time.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/device.h         |    3 +
 arch/powerpc/include/asm/pci-bridge.h     |   14 +-
 arch/powerpc/kernel/pci_dn.c              |  242 ++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci-ioda.c |   16 ++
 4 files changed, 270 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h
index 38faede..29992cd 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -34,6 +34,9 @@ struct dev_archdata {
 #ifdef CONFIG_SWIOTLB
 	dma_addr_t		max_direct_dma_addr;
 #endif
+#ifdef CONFIG_PPC64
+	void			*firmware_data;
+#endif
 #ifdef CONFIG_EEH
 	struct eeh_dev		*edev;
 #endif
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 725247b..c1b7dd5 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -89,6 +89,7 @@ struct pci_controller {
 
 #ifdef CONFIG_PPC64
 	unsigned long buid;
+	void *firmware_data;
 #endif	/* CONFIG_PPC64 */
 
 	void *private_data;
@@ -150,9 +151,13 @@ static inline int isa_vaddr_is_ioport(void __iomem *address)
 struct iommu_table;
 
 struct pci_dn {
+	int     flags;
+#define PCI_DN_FLAG_IOV_VF     0x01
+
 	int	busno;			/* pci bus number */
 	int	devfn;			/* pci device and function number */
 
+	struct  pci_dn *parent;
 	struct  pci_controller *phb;	/* for pci devices */
 	struct	iommu_table *iommu_table;	/* for phb's or bridges */
 	struct	device_node *node;	/* back-pointer to the device_node */
@@ -167,14 +172,19 @@ struct pci_dn {
 #ifdef CONFIG_PPC_POWERNV
 	int	pe_number;
 #endif
+	struct list_head child_list;
+	struct list_head list;
 };
 
 /* Get the pointer to a device_node's pci_dn */
 #define PCI_DN(dn)	((struct pci_dn *) (dn)->data)
 
+extern struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+					   int devfn);
 extern struct pci_dn *pci_get_pdn(struct pci_dev *pdev);
-
-extern void * update_dn_pci_info(struct device_node *dn, void *data);
+extern struct pci_dn *add_dev_pci_info(struct pci_dev *pdev, u16 vf_num);
+extern void remove_dev_pci_info(struct pci_dev *pdev, u16 vf_num);
+extern void *update_dn_pci_info(struct device_node *dn, void *data);
 
 static inline int pci_device_from_OF_node(struct device_node *np,
 					  u8 *bus, u8 *devfn)
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 1f61fab..6536573 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -32,12 +32,224 @@
 #include <asm/ppc-pci.h>
 #include <asm/firmware.h>
 
+/*
+ * The function is used to find the firmware data of one
+ * specific PCI device, which is attached to the indicated
+ * PCI bus. For VFs, their firmware data is linked to that
+ * one of PF's bridge. For other devices, their firmware
+ * data is linked to that of their bridge.
+ */
+static struct pci_dn *pci_bus_to_pdn(struct pci_bus *bus)
+{
+	struct pci_bus *pbus;
+	struct device_node *dn;
+	struct pci_dn *pdn;
+
+	/*
+	 * We probably have virtual bus which doesn't
+	 * have associated bridge.
+	 */
+	pbus = bus;
+	while (pbus) {
+		if (pci_is_root_bus(pbus) || pbus->self)
+			break;
+
+		pbus = pbus->parent;
+	}
+
+	/*
+	 * Except virtual bus, all PCI buses should
+	 * have device nodes.
+	 */
+	dn = pci_bus_to_OF_node(pbus);
+	pdn = dn ? PCI_DN(dn) : NULL;
+
+	return pdn;
+}
+
+struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
+				    int devfn)
+{
+	struct device_node *dn = NULL;
+	struct pci_dn *parent, *pdn;
+	struct pci_dev *pdev = NULL;
+
+	/* Fast path: fetch from PCI device */
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		if (pdev->devfn == devfn) {
+			if (pdev->dev.archdata.firmware_data)
+				return pdev->dev.archdata.firmware_data;
+
+			dn = pci_device_to_OF_node(pdev);
+			break;
+		}
+	}
+
+	/* Fast path: fetch from device node */
+	pdn = dn ? PCI_DN(dn) : NULL;
+	if (pdn)
+		return pdn;
+
+	/* Slow path: fetch from firmware data hierarchy */
+	parent = pci_bus_to_pdn(bus);
+	if (!parent)
+		return NULL;
+
+	list_for_each_entry(pdn, &parent->child_list, list) {
+		if (pdn->busno == bus->number &&
+                    pdn->devfn == devfn)
+                        return pdn;
+        }
+
+	return NULL;
+}
+
 struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
 {
-	struct device_node *dn = pci_device_to_OF_node(pdev);
-	if (!dn)
+	struct device_node *dn;
+	struct pci_dn *parent, *pdn;
+
+	/* Search device directly */
+	if (pdev->dev.archdata.firmware_data)
+		return pdev->dev.archdata.firmware_data;
+
+	/* Check device node */
+	dn = pci_device_to_OF_node(pdev);
+	pdn = dn ? PCI_DN(dn) : NULL;
+	if (pdn)
+		return pdn;
+
+	/*
+	 * VFs don't have device nodes. We hook their
+	 * firmware data to PF's bridge.
+	 */
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
+		return NULL;
+
+	list_for_each_entry(pdn, &parent->child_list, list) {
+		if (pdn->busno == pdev->bus->number &&
+		    pdn->devfn == pdev->devfn)
+			return pdn;
+	}
+
+	return NULL;
+}
+
+#ifdef CONFIG_PCI_IOV
+static struct pci_dn *add_one_dev_pci_info(struct pci_dn *parent,
+					   struct pci_dev *pdev,
+					   int busno, int devfn)
+{
+	struct pci_dn *pdn;
+
+	/* Except PHB, we always have parent firmware data */
+	if (!parent)
+		return NULL;
+
+	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
+	if (!pdn) {
+		pr_warn("%s: Out of memory !\n", __func__);
+		return NULL;
+	}
+
+	pdn->phb = parent->phb;
+	pdn->parent = parent;
+	pdn->busno = busno;
+	pdn->devfn = devfn;
+#ifdef CONFIG_PPC_POWERNV
+	pdn->pe_number = IODA_INVALID_PE;
+#endif
+	INIT_LIST_HEAD(&pdn->child_list);
+	INIT_LIST_HEAD(&pdn->list);
+	list_add_tail(&pdn->list, &parent->child_list);
+
+	/*
+	 * If we already have PCI device instance, lets
+	 * bind them.
+	 */
+	if (pdev)
+		pdev->dev.archdata.firmware_data = pdn;
+
+	return pdn;
+}
+#endif // CONFIG_PCI_IOV
+
+struct pci_dn *add_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
+{
+#ifdef CONFIG_PCI_IOV
+	struct pci_dn *parent, *pdn;
+	int i;
+
+	/* Only support IOV for now */
+	if (!pdev->is_physfn)
+		return pci_get_pdn(pdev);
+
+	/* Check if VFs have been populated */
+	pdn = pci_get_pdn(pdev);
+	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
+		return NULL;
+
+	pdn->flags |= PCI_DN_FLAG_IOV_VF;
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
 		return NULL;
-	return PCI_DN(dn);
+
+	for (i = 0; i < vf_num; i++) {
+		pdn = add_one_dev_pci_info(parent, NULL,
+					   pci_iov_virtfn_bus(pdev, i),
+					   pci_iov_virtfn_devfn(pdev, i));
+		if (!pdn) {
+			pr_warn("%s: Cannot create firmware data "
+				"for VF#%d of %s\n",
+				__func__, i, pci_name(pdev));
+			return NULL;
+		}
+	}
+#endif
+
+	return pci_get_pdn(pdev);
+}
+
+void remove_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
+{
+#ifdef CONFIG_PCI_IOV
+	struct pci_dn *parent;
+	struct pci_dn *pdn, *tmp;
+	int i;
+
+	/* Only support IOV PF for now */
+	if (!pdev->is_physfn)
+		return;
+
+	/* Check if VFs have been populated */
+	pdn = pci_get_pdn(pdev);
+	if (!pdn || !(pdn->flags & PCI_DN_FLAG_IOV_VF))
+		return;
+
+	pdn->flags &= ~PCI_DN_FLAG_IOV_VF;
+	parent = pci_bus_to_pdn(pdev->bus);
+	if (!parent)
+		return;
+
+	/*
+	 * We might introduce flag to pci_dn in future
+	 * so that we can release VF's firmware data in
+	 * a batch mode.
+	 */
+	for (i = 0; i < vf_num; i++) {
+		list_for_each_entry_safe(pdn, tmp,
+			&parent->child_list, list) {
+			if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
+			    pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
+				continue;
+
+			if (!list_empty(&pdn->list))
+				list_del(&pdn->list);
+			kfree(pdn);
+		}
+	}
+#endif
 }
 
 /*
@@ -49,6 +261,7 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	struct pci_controller *phb = data;
 	const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
 	const __be32 *regs;
+	struct device_node *parent;
 	struct pci_dn *pdn;
 
 	pdn = zalloc_maybe_bootmem(sizeof(*pdn), GFP_KERNEL);
@@ -70,6 +283,15 @@ void *update_dn_pci_info(struct device_node *dn, void *data)
 	}
 
 	pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
+
+	/* Attach to parent node */
+	INIT_LIST_HEAD(&pdn->child_list);
+	INIT_LIST_HEAD(&pdn->list);
+	parent = of_get_parent(dn);
+	pdn->parent = parent ? PCI_DN(parent) : NULL;
+	if (pdn->parent)
+		list_add_tail(&pdn->list, &pdn->parent->child_list);
+
 	return NULL;
 }
 
@@ -150,6 +372,7 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb)
 	if (pdn) {
 		pdn->devfn = pdn->busno = -1;
 		pdn->phb = phb;
+		phb->firmware_data = pdn;
 	}
 
 	/* Update dn->phb ptrs for new phb and children devices */
@@ -173,3 +396,16 @@ void __init pci_devs_phb_init(void)
 	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
 		pci_devs_phb_init_dynamic(phb);
 }
+
+static void pci_dev_pdn_setup(struct pci_dev *pdev)
+{
+	struct pci_dn *pdn;
+
+	if (pdev->dev.archdata.firmware_data)
+		return;
+
+	/* Setup the fast path */
+	pdn = pci_get_pdn(pdev);
+	pdev->dev.archdata.firmware_data = pdn;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index fac88ed..5a8e6b1 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -949,6 +949,22 @@ static void pnv_pci_ioda_setup_PEs(void)
 	}
 }
 
+#ifdef CONFIG_PCI_IOV
+int pcibios_sriov_disable(struct pci_dev *pdev, u16 vf_num)
+{
+	/* Release firmware data */
+	remove_dev_pci_info(pdev, vf_num);
+	return 0;
+}
+
+int pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+	/* Allocate firmware data */
+	add_dev_pci_info(pdev, vf_num);
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev)
 {
 	struct pci_dn *pdn = pci_get_pdn(pdev);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 09/17] powerpc/pci: remove pci_dn->pcidev field
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:27         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:27 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

The field pci_dn->pcidev is assigned but not used.

This patch removes this field.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    1 -
 arch/powerpc/platforms/powernv/pci-ioda.c |    1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index c1b7dd5..334e745 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -164,7 +164,6 @@ struct pci_dn {
 
 	int	pci_ext_config_space;	/* for pci devices */
 
-	struct	pci_dev *pcidev;	/* back-pointer to the pci device */
 #ifdef CONFIG_EEH
 	struct eeh_dev *edev;		/* eeh device */
 #endif
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5a8e6b1..665f57c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -832,7 +832,6 @@ static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
 				pci_name(dev));
 			continue;
 		}
-		pdn->pcidev = dev;
 		pdn->pe_number = pe->pe_number;
 		pe->dma_weight += pnv_ioda_dma_weight(dev);
 		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 10/17] powerpc/powernv: Use pci_dn in PCI config accessor
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:28         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

The PCI config accessors rely on the device node. Unfortunately, VFs
don't have corresponding device nodes, so we have to switch to pci_dn
for PCI config access.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/eeh-powernv.c |   14 +++++-
 arch/powerpc/platforms/powernv/pci.c         |   69 ++++++++++----------------
 arch/powerpc/platforms/powernv/pci.h         |    4 +-
 3 files changed, 40 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 1d19e79..c63b6c1 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -419,21 +419,31 @@ static inline bool powernv_eeh_cfg_blocked(struct device_node *dn)
 static int powernv_eeh_read_config(struct device_node *dn,
 				   int where, int size, u32 *val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn)) {
 		*val = 0xFFFFFFFF;
 		return PCIBIOS_SET_FAILED;
 	}
 
-	return pnv_pci_cfg_read(dn, where, size, val);
+	return pnv_pci_cfg_read(pdn, where, size, val);
 }
 
 static int powernv_eeh_write_config(struct device_node *dn,
 				    int where, int size, u32 val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn))
 		return PCIBIOS_SET_FAILED;
 
-	return pnv_pci_cfg_write(dn, where, size, val);
+	return pnv_pci_cfg_write(pdn, where, size, val);
 }
 
 /**
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 4945e87..b7d4b9d 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -366,9 +366,9 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
 	spin_unlock_irqrestore(&phb->lock, flags);
 }
 
-static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
-				     struct device_node *dn)
+static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
 {
+	struct pnv_phb *phb = pdn->phb->private_data;
 	u8	fstate;
 	__be16	pcierr;
 	int	pe_no;
@@ -379,7 +379,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	 * setup that yet. So all ER errors should be mapped to
 	 * reserved PE.
 	 */
-	pe_no = PCI_DN(dn)->pe_number;
+	pe_no = pdn->pe_number;
 	if (pe_no == IODA_INVALID_PE) {
 		if (phb->type == PNV_PHB_P5IOC2)
 			pe_no = 0;
@@ -407,8 +407,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 
 	cfg_dbg(" -> EEH check, bdfn=%04x PE#%d fstate=%x\n",
-		(PCI_DN(dn)->busno << 8) | (PCI_DN(dn)->devfn),
-		pe_no, fstate);
+		(pdn->busno << 8) | (pdn->devfn), pe_no, fstate);
 
 	/* Clear the frozen state if applicable */
 	if (fstate == OPAL_EEH_STOPPED_MMIO_FREEZE ||
@@ -425,10 +424,9 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 }
 
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 	s64 rc;
@@ -462,10 +460,9 @@ int pnv_pci_cfg_read(struct device_node *dn,
 	return PCIBIOS_SUCCESSFUL;
 }
 
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 
@@ -489,18 +486,17 @@ int pnv_pci_cfg_write(struct device_node *dn,
 }
 
 #if CONFIG_EEH
-static bool pnv_pci_cfg_check(struct pci_controller *hose,
-			      struct device_node *dn)
+static bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	struct eeh_dev *edev = NULL;
-	struct pnv_phb *phb = hose->private_data;
+	struct pnv_phb *phb = pdn->phb->private_data;
 
 	/* EEH not enabled ? */
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
 		return true;
 
 	/* PE reset or device removed ? */
-	edev = of_node_to_eeh_dev(dn);
+	edev = pdn->edev;
 	if (edev) {
 		if (edev->pe &&
 		    (edev->pe->state & EEH_PE_CFG_BLOCKED))
@@ -513,8 +509,7 @@ static bool pnv_pci_cfg_check(struct pci_controller *hose,
 	return true;
 }
 #else
-static inline pnv_pci_cfg_check(struct pci_controller *hose,
-				struct device_node *dn)
+static inline pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	return true;
 }
@@ -524,32 +519,26 @@ static int pnv_pci_read_config(struct pci_bus *bus,
 			       unsigned int devfn,
 			       int where, int size, u32 *val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
 	*val = 0xFFFFFFFF;
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_read(dn, where, size, val);
-	if (phb->flags & PNV_PHB_FLAG_EEH) {
+	ret = pnv_pci_cfg_read(pdn, where, size, val);
+	phb = pdn->phb->private_data;
+	if (phb->flags & PNV_PHB_FLAG_EEH && pdn->edev) {
 		if (*val == EEH_IO_ERROR_VALUE(size) &&
-		    eeh_dev_check_failure(of_node_to_eeh_dev(dn)))
+		    eeh_dev_check_failure(pdn->edev))
                         return PCIBIOS_DEVICE_NOT_FOUND;
 	} else {
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 	}
 
 	return ret;
@@ -559,27 +548,21 @@ static int pnv_pci_write_config(struct pci_bus *bus,
 				unsigned int devfn,
 				int where, int size, u32 val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_write(dn, where, size, val);
+	ret = pnv_pci_cfg_write(pdn, where, size, val);
+	phb = pdn->phb->private_data;
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 
 	return ret;
 }
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 6c02ff8..e5b75b2 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -219,9 +219,9 @@ extern struct pnv_eeh_ops ioda_eeh_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val);
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val);
 extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 				      void *tce_mem, u64 tce_size,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 10/17] powerpc/powernv: Use pci_dn in PCI config accessor
@ 2015-01-15  2:28         ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, Wei Yang, linuxppc-dev

The PCI config accessors rely on the device node. Unfortunately, VFs
don't have corresponding device nodes, so we have to switch to pci_dn
for PCI config access.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/eeh-powernv.c |   14 +++++-
 arch/powerpc/platforms/powernv/pci.c         |   69 ++++++++++----------------
 arch/powerpc/platforms/powernv/pci.h         |    4 +-
 3 files changed, 40 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 1d19e79..c63b6c1 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -419,21 +419,31 @@ static inline bool powernv_eeh_cfg_blocked(struct device_node *dn)
 static int powernv_eeh_read_config(struct device_node *dn,
 				   int where, int size, u32 *val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn)) {
 		*val = 0xFFFFFFFF;
 		return PCIBIOS_SET_FAILED;
 	}
 
-	return pnv_pci_cfg_read(dn, where, size, val);
+	return pnv_pci_cfg_read(pdn, where, size, val);
 }
 
 static int powernv_eeh_write_config(struct device_node *dn,
 				    int where, int size, u32 val)
 {
+	struct pci_dn *pdn = PCI_DN(dn);
+
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	if (powernv_eeh_cfg_blocked(dn))
 		return PCIBIOS_SET_FAILED;
 
-	return pnv_pci_cfg_write(dn, where, size, val);
+	return pnv_pci_cfg_write(pdn, where, size, val);
 }
 
 /**
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 4945e87..b7d4b9d 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -366,9 +366,9 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
 	spin_unlock_irqrestore(&phb->lock, flags);
 }
 
-static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
-				     struct device_node *dn)
+static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
 {
+	struct pnv_phb *phb = pdn->phb->private_data;
 	u8	fstate;
 	__be16	pcierr;
 	int	pe_no;
@@ -379,7 +379,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	 * setup that yet. So all ER errors should be mapped to
 	 * reserved PE.
 	 */
-	pe_no = PCI_DN(dn)->pe_number;
+	pe_no = pdn->pe_number;
 	if (pe_no == IODA_INVALID_PE) {
 		if (phb->type == PNV_PHB_P5IOC2)
 			pe_no = 0;
@@ -407,8 +407,7 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 
 	cfg_dbg(" -> EEH check, bdfn=%04x PE#%d fstate=%x\n",
-		(PCI_DN(dn)->busno << 8) | (PCI_DN(dn)->devfn),
-		pe_no, fstate);
+		(pdn->busno << 8) | (pdn->devfn), pe_no, fstate);
 
 	/* Clear the frozen state if applicable */
 	if (fstate == OPAL_EEH_STOPPED_MMIO_FREEZE ||
@@ -425,10 +424,9 @@ static void pnv_pci_config_check_eeh(struct pnv_phb *phb,
 	}
 }
 
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 	s64 rc;
@@ -462,10 +460,9 @@ int pnv_pci_cfg_read(struct device_node *dn,
 	return PCIBIOS_SUCCESSFUL;
 }
 
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val)
 {
-	struct pci_dn *pdn = PCI_DN(dn);
 	struct pnv_phb *phb = pdn->phb->private_data;
 	u32 bdfn = (pdn->busno << 8) | pdn->devfn;
 
@@ -489,18 +486,17 @@ int pnv_pci_cfg_write(struct device_node *dn,
 }
 
 #if CONFIG_EEH
-static bool pnv_pci_cfg_check(struct pci_controller *hose,
-			      struct device_node *dn)
+static bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	struct eeh_dev *edev = NULL;
-	struct pnv_phb *phb = hose->private_data;
+	struct pnv_phb *phb = pdn->phb->private_data;
 
 	/* EEH not enabled ? */
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
 		return true;
 
 	/* PE reset or device removed ? */
-	edev = of_node_to_eeh_dev(dn);
+	edev = pdn->edev;
 	if (edev) {
 		if (edev->pe &&
 		    (edev->pe->state & EEH_PE_CFG_BLOCKED))
@@ -513,8 +509,7 @@ static bool pnv_pci_cfg_check(struct pci_controller *hose,
 	return true;
 }
 #else
-static inline pnv_pci_cfg_check(struct pci_controller *hose,
-				struct device_node *dn)
+static inline bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
 	return true;
 }
@@ -524,32 +519,26 @@ static int pnv_pci_read_config(struct pci_bus *bus,
 			       unsigned int devfn,
 			       int where, int size, u32 *val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
 	*val = 0xFFFFFFFF;
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_read(dn, where, size, val);
-	if (phb->flags & PNV_PHB_FLAG_EEH) {
+	ret = pnv_pci_cfg_read(pdn, where, size, val);
+	phb = pdn->phb->private_data;
+	if (phb->flags & PNV_PHB_FLAG_EEH && pdn->edev) {
 		if (*val == EEH_IO_ERROR_VALUE(size) &&
-		    eeh_dev_check_failure(of_node_to_eeh_dev(dn)))
+		    eeh_dev_check_failure(pdn->edev))
                         return PCIBIOS_DEVICE_NOT_FOUND;
 	} else {
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 	}
 
 	return ret;
@@ -559,27 +548,21 @@ static int pnv_pci_write_config(struct pci_bus *bus,
 				unsigned int devfn,
 				int where, int size, u32 val)
 {
-	struct device_node *dn, *busdn = pci_bus_to_OF_node(bus);
 	struct pci_dn *pdn;
 	struct pnv_phb *phb;
-	bool found = false;
 	int ret;
 
-	for (dn = busdn->child; dn; dn = dn->sibling) {
-		pdn = PCI_DN(dn);
-		if (pdn && pdn->devfn == devfn) {
-			phb = pdn->phb->private_data;
-			found = true;
-			break;
-		}
-	}
+	pdn = pci_get_pdn_by_devfn(bus, devfn);
+	if (!pdn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	if (!found || !pnv_pci_cfg_check(pdn->phb, dn))
+	if (!pnv_pci_cfg_check(pdn))
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
-	ret = pnv_pci_cfg_write(dn, where, size, val);
+	ret = pnv_pci_cfg_write(pdn, where, size, val);
+	phb = pdn->phb->private_data;
 	if (!(phb->flags & PNV_PHB_FLAG_EEH))
-		pnv_pci_config_check_eeh(phb, dn);
+		pnv_pci_config_check_eeh(pdn);
 
 	return ret;
 }
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 6c02ff8..e5b75b2 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -219,9 +219,9 @@ extern struct pnv_eeh_ops ioda_eeh_ops;
 
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
-int pnv_pci_cfg_read(struct device_node *dn,
+int pnv_pci_cfg_read(struct pci_dn *pdn,
 		     int where, int size, u32 *val);
-int pnv_pci_cfg_write(struct device_node *dn,
+int pnv_pci_cfg_write(struct pci_dn *pdn,
 		      int where, int size, u32 val);
 extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
 				      void *tce_mem, u64 tce_size,
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 11/17] powerpc/powernv: Allocate pe->iommu_table dynamically
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:28         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

Currently, the iommu_table of a PE is an embedded (static) field. This is
a problem when iommu_free_table() is called on it.

This patch allocates the iommu_table dynamically instead.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/iommu.h          |    3 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   26 ++++++++++++++------------
 arch/powerpc/platforms/powernv/pci.h      |    2 +-
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9cfa370..5574eeb 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,9 @@ struct iommu_table {
 	struct iommu_group *it_group;
 #endif
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
+#ifdef CONFIG_PPC_POWERNV
+	void           *data;
+#endif
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 665f57c..31335a7 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -890,6 +890,10 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all)
 		return;
 	}
 
+	pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+			GFP_KERNEL, hose->node);
+	pe->tce32_table->data = pe;
+
 	/* Associate it with all child devices */
 	pnv_ioda_setup_same_PE(bus, pe);
 
@@ -979,7 +983,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
 	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+	set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -1006,7 +1010,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
 	} else {
 		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
 		set_dma_ops(&pdev->dev, &dma_iommu_ops);
-		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+		set_iommu_table_base(&pdev->dev, pe->tce32_table);
 	}
 	*pdev->dev.dma_mask = dma_mask;
 	return 0;
@@ -1043,9 +1047,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		if (add_to_iommu_group)
 			set_iommu_table_base_and_group(&dev->dev,
-						       &pe->tce32_table);
+						       pe->tce32_table);
 		else
-			set_iommu_table_base(&dev->dev, &pe->tce32_table);
+			set_iommu_table_base(&dev->dev, pe->tce32_table);
 
 		if (dev->subordinate)
 			pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -1135,8 +1139,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
 				 __be64 *startp, __be64 *endp, bool rm)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	struct pnv_phb *phb = pe->phb;
 
 	if (phb->type == PNV_PHB_IODA1)
@@ -1202,7 +1205,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
 				  base << 28, IOMMU_PAGE_SHIFT_4K);
 
@@ -1240,8 +1243,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
-	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
-					      tce32_table);
+	struct pnv_ioda_pe *pe = tbl->data;
 	uint16_t window_id = (pe->pe_number << 1 ) + 1;
 	int64_t rc;
 
@@ -1286,10 +1288,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 	pe->tce_bypass_base = 1ull << 59;
 
 	/* Install set_bypass callback for VFIO */
-	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+	pe->tce32_table->set_bypass = pnv_pci_ioda2_set_bypass;
 
 	/* Enable bypass by default */
-	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+	pnv_pci_ioda2_set_bypass(pe->tce32_table, true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1337,7 +1339,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	}
 
 	/* Setup linux iommu table */
-	tbl = &pe->tce32_table;
+	tbl = pe->tce32_table;
 	pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
 			IOMMU_PAGE_SHIFT_4K);
 
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index e5b75b2..7317777 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -53,7 +53,7 @@ struct pnv_ioda_pe {
 	/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
 	int			tce32_seg;
 	int			tce32_segcount;
-	struct iommu_table	tce32_table;
+	struct iommu_table	*tce32_table;
 	phys_addr_t		tce_inval_reg_phys;
 
 	/* 64-bit TCE bypass region */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 12/17] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:28         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

On PHB3, the PF's IOV BAR will be covered by an M64 BAR to provide better
PE isolation. In most cases total_pe differs from total_VFs, which leads
to a conflict between the MMIO space and the PE numbers.

For example, if total_VFs is 128 and total_pe is 256, the second half of
the M64 BAR space will be occupied by other PCI devices, which may already
belong to other PEs.

This patch reserves additional space for the PF's IOV BAR, namely total_pe
times the VF BAR size. Doing so prevents the conflict.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h        |    4 ++
 arch/powerpc/include/asm/pci-bridge.h     |    3 ++
 arch/powerpc/kernel/pci-common.c          |    5 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |   59 +++++++++++++++++++++++++++++
 4 files changed, 71 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index c8175a3..965547c 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -250,6 +250,10 @@ struct machdep_calls {
 	/* Reset the secondary bus of bridge */
 	void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
 
+#ifdef CONFIG_PCI_IOV
+	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
+#endif /* CONFIG_PCI_IOV */
+
 	/* Called to shutdown machine specific hardware not already controlled
 	 * by other drivers.
 	 */
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 334e745..b857ec4 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -170,6 +170,9 @@ struct pci_dn {
 #define IODA_INVALID_PE		(-1)
 #ifdef CONFIG_PPC_POWERNV
 	int	pe_number;
+#ifdef CONFIG_PCI_IOV
+	u16     max_vfs;		/* number of VFs IOV BAR expanded */
+#endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
 	struct list_head list;
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 889f743..832b7e1 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1636,6 +1636,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
 	if (ppc_md.pcibios_fixup_phb)
 		ppc_md.pcibios_fixup_phb(hose);
 
+#ifdef CONFIG_PCI_IOV
+	if (ppc_md.pcibios_fixup_sriov)
+		ppc_md.pcibios_fixup_sriov(bus);
+#endif /* CONFIG_PCI_IOV */
+
 	/* Configure PCI Express settings */
 	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
 		struct pci_bus *child;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 31335a7..6704fdf 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1721,6 +1721,62 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
 static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { }
 #endif /* CONFIG_PCI_MSI */
 
+#ifdef CONFIG_PCI_IOV
+static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
+{
+	struct pci_controller *hose;
+	struct pnv_phb *phb;
+	struct resource *res;
+	int i;
+	resource_size_t size;
+	struct pci_dn *pdn;
+
+	if (!pdev->is_physfn || pdev->is_added)
+		return;
+
+	hose = pci_bus_to_host(pdev->bus);
+	phb = hose->private_data;
+
+	pdn = pci_get_pdn(pdev);
+	pdn->max_vfs = 0;
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &pdev->resource[i];
+		if (!res->flags || res->parent)
+			continue;
+		if (!pnv_pci_is_mem_pref_64(res->flags)) {
+			dev_warn(&pdev->dev, " Skipping expanding IOV BAR %pR on %s\n",
+				 res, pci_name(pdev));
+			continue;
+		}
+
+		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
+		size = pci_iov_resource_size(pdev, i);
+		res->end = res->start + size * phb->ioda.total_pe - 1;
+		dev_dbg(&pdev->dev, "                       %pR\n", res);
+		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
+				i - PCI_IOV_RESOURCES,
+				res, phb->ioda.total_pe);
+	}
+	pdn->max_vfs = phb->ioda.total_pe;
+}
+
+static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
+{
+	struct pci_dev *pdev;
+	struct pci_bus *b;
+
+	list_for_each_entry(pdev, &bus->devices, bus_list) {
+		b = pdev->subordinate;
+
+		if (b)
+			pnv_pci_ioda_fixup_sriov(b);
+
+		pnv_pci_ioda_fixup_iov_resources(pdev);
+	}
+}
+#endif /* CONFIG_PCI_IOV */
+
 /*
  * This function is supposed to be called on basis of PE from top
  * to bottom style. So the the I/O or MMIO segment assigned to
@@ -2097,6 +2153,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
 	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
+#ifdef CONFIG_PCI_IOV
+	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
+#endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 
 	/* Reset IODA tables to a clean state */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 13/17] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:28         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

This patch implements pcibios_iov_resource_alignment() on the powernv
platform.

On the PowerNV platform, there are 3 cases for the IOV BAR:
1. initial state: the IOV BAR size is a multiple of the VF BAR size
2. after expansion: the IOV BAR size has been expanded to cover the M64
   segment size
3. sizing stage: the IOV BAR has been truncated to 0

pnv_pci_iov_resource_alignment() handles these three cases respectively.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h        |    3 +++
 arch/powerpc/kernel/pci-common.c          |   14 ++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c |   20 ++++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 965547c..12e8eb8 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -252,6 +252,9 @@ struct machdep_calls {
 
 #ifdef CONFIG_PCI_IOV
 	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
+	resource_size_t (*pcibios_iov_resource_alignment)(struct pci_dev *,
+			                                    int resno,
+							    resource_size_t align);
 #endif /* CONFIG_PCI_IOV */
 
 	/* Called to shutdown machine specific hardware not already controlled
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 832b7e1..8751dfb 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -130,6 +130,20 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
 	pci_reset_secondary_bus(dev);
 }
 
+#ifdef CONFIG_PCI_IOV
+resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev,
+						 int resno,
+						 resource_size_t align)
+{
+	if (ppc_md.pcibios_iov_resource_alignment)
+		return ppc_md.pcibios_iov_resource_alignment(pdev,
+							       resno,
+							       align);
+
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static resource_size_t pcibios_io_size(const struct pci_controller *hose)
 {
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6704fdf..8bad2b0 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1953,6 +1953,25 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
 	return phb->ioda.io_segsize;
 }
 
+#ifdef CONFIG_PCI_IOV
+static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
+							    int resno,
+							    resource_size_t align)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	resource_size_t iov_align;
+
+	iov_align = resource_size(&pdev->resource[resno]);
+	if (iov_align)
+		return iov_align;
+
+	if (pdn->max_vfs)
+		return pdn->max_vfs * align;
+
+	return align;
+}
+#endif /* CONFIG_PCI_IOV */
+
 /* Prevent enabling devices for which we couldn't properly
  * assign a PE
  */
@@ -2155,6 +2174,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
 #ifdef CONFIG_PCI_IOV
 	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
+	ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
 #endif /* CONFIG_PCI_IOV */
 	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 14/17] powerpc/powernv: Shift VF resource with an offset
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:28         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

On the PowerNV platform, a resource's position within the M64 window implies
the PE# the resource belongs to. In some cases, a resource must be adjusted
so that it lands at the correct position in M64.

This patch introduces a function that shifts the 'real' PF IOV BAR address
by a given offset.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |   31 +++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8bad2b0..62bb2eb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -14,6 +14,7 @@
 #include <linux/kernel.h>
 #include <linux/pci.h>
 #include <linux/crash_dump.h>
+#include <linux/pci_regs.h>
 #include <linux/debugfs.h>
 #include <linux/delay.h>
 #include <linux/string.h>
@@ -749,6 +750,36 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
 	return 10;
 }
 
+#ifdef CONFIG_PCI_IOV
+static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
+{
+	struct pci_dn *pdn = pci_get_pdn(dev);
+	int i;
+	struct resource *res;
+	resource_size_t size;
+
+	if (!dev->is_physfn)
+		return;
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &dev->resource[i];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
+		size = pci_iov_resource_size(dev, i);
+		res->start += size*offset;
+
+		dev_info(&dev->dev, "                 %pR\n", res);
+		pci_update_resource(dev, i);
+	}
+	pdn->max_vfs -= offset;
+}
+#endif /* CONFIG_PCI_IOV */
+
 #if 0
 static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 {
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread


* [PATCH V11 15/17] powerpc/powernv: Allocate VF PE
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:28         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

VFs are created when the PCI device is enabled.

This patch tries its best to assign the maximum resources and PEs for VFs
when the PCI device is enabled: enough M64 windows are assigned to cover the
IOV BAR, the IOV BAR is shifted to match the PE# indicated by M64, and the
VF's pdn->pdev and pdn->pe_number are fixed up.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    4 +
 arch/powerpc/kernel/pci_dn.c              |   11 +
 arch/powerpc/platforms/powernv/pci-ioda.c |  451 ++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c      |   18 ++
 arch/powerpc/platforms/powernv/pci.h      |    7 +
 5 files changed, 476 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index b857ec4..d61c384 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -172,6 +172,10 @@ struct pci_dn {
 	int	pe_number;
 #ifdef CONFIG_PCI_IOV
 	u16     max_vfs;		/* number of VFs IOV BAR expended */
+	u16     vf_pes;			/* VF PE# under this PF */
+	int     offset;			/* PE# for the first VF PE */
+#define IODA_INVALID_M64        (-1)
+	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 6536573..36aaa8e 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -218,6 +218,17 @@ void remove_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
 	struct pci_dn *pdn, *tmp;
 	int i;
 
+	/*
+	 * VFs and VF PEs are created/released dynamically, so we need to
+	 * bind/unbind them here. Otherwise, when SRIOV is re-enabled, the
+	 * VF and VF PE would be mismatched.
+	 */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		pdn->pe_number = IODA_INVALID_PE;
+		return;
+	}
+
 	/* Only support IOV PF for now */
 	if (!pdev->is_physfn)
 		return;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 62bb2eb..94fe6e1 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -45,6 +45,9 @@
 #include "powernv.h"
 #include "pci.h"
 
+/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
+#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
 {
@@ -57,11 +60,18 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 	vaf.fmt = fmt;
 	vaf.va = &args;
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV)
 		strlcpy(pfix, dev_name(&pe->pdev->dev), sizeof(pfix));
-	else
+	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
 		sprintf(pfix, "%04x:%02x     ",
 			pci_domain_nr(pe->pbus), pe->pbus->number);
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		sprintf(pfix, "%04x:%02x:%2x.%d",
+			pci_domain_nr(pe->parent_dev->bus),
+			(pe->rid & 0xff00) >> 8,
+			PCI_SLOT(pe->rid), PCI_FUNC(pe->rid));
+#endif /* CONFIG_PCI_IOV*/
 
 	printk("%spci %s: [PE# %.3d] %pV",
 	       level, pfix, pe->pe_number, &vaf);
@@ -567,7 +577,7 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 			      bool is_add)
 {
 	struct pnv_ioda_pe *slave;
-	struct pci_dev *pdev;
+	struct pci_dev *pdev = NULL;
 	int ret;
 
 	/*
@@ -606,8 +616,12 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 
 	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
 		pdev = pe->pbus->self;
-	else
+	else if (pe->flags & PNV_IODA_PE_DEV)
 		pdev = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		pdev = pe->parent_dev->bus->self;
+#endif /* CONFIG_PCI_IOV */
 	while (pdev) {
 		struct pci_dn *pdn = pci_get_pdn(pdev);
 		struct pnv_ioda_pe *parent;
@@ -625,6 +639,89 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 	return 0;
 }
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *parent;
+	uint8_t bcomp, dcomp, fcomp;
+	int64_t rc;
+	long rid_end, rid;
+
+	/* Currently, we only deconfigure VF PEs. Bus PEs will always be there. */
+	if (pe->pbus) {
+		int count;
+
+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
+		parent = pe->pbus->self;
+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
+			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
+		else
+			count = 1;
+
+		switch(count) {
+		case  1: bcomp = OpalPciBusAll;         break;
+		case  2: bcomp = OpalPciBus7Bits;       break;
+		case  4: bcomp = OpalPciBus6Bits;       break;
+		case  8: bcomp = OpalPciBus5Bits;       break;
+		case 16: bcomp = OpalPciBus4Bits;       break;
+		case 32: bcomp = OpalPciBus3Bits;       break;
+		default:
+			pr_err("%s: Number of subordinate busses %d"
+			       " unsupported\n",
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
+			/* Do an exact match only */
+			bcomp = OpalPciBusAll;
+		}
+		rid_end = pe->rid + (count << 8);
+	} else {
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+			parent = pe->pdev->bus->self;
+		bcomp = OpalPciBusAll;
+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
+		rid_end = pe->rid + 1;
+	}
+
+	/* Clear the reverse map */
+	for (rid = pe->rid; rid < rid_end; rid++)
+		phb->ioda.pe_rmap[rid] = 0;
+
+	/* Release from all parents PELT-V */
+	while (parent) {
+		struct pci_dn *pdn = pci_get_pdn(parent);
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
+						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+			/* XXX What to do in case of error ? */
+		}
+		parent = parent->bus->self;
+	}
+
+	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+
+	/* Dissociate PE in PELT */
+	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
+				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld remove self from PELTV\n", rc);
+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
+	if (rc)
+		pe_err(pe, "OPAL error %ld trying to setup PELT table\n", rc);
+
+	pe->pbus = NULL;
+	pe->pdev = NULL;
+	pe->parent_dev = NULL;
+
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *parent;
@@ -653,13 +750,19 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 		default:
 			pr_err("%s: Number of subordinate busses %d"
 			       " unsupported\n",
-			       pci_name(pe->pbus->self), count);
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
 			/* Do an exact match only */
 			bcomp = OpalPciBusAll;
 		}
 		rid_end = pe->rid + (count << 8);
 	} else {
-		parent = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+#endif /* CONFIG_PCI_IOV */
+			parent = pe->pdev->bus->self;
 		bcomp = OpalPciBusAll;
 		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
 		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
@@ -984,8 +1087,311 @@ static void pnv_pci_ioda_setup_PEs(void)
 }
 
 #ifdef CONFIG_PCI_IOV
+static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    i;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		if (pdn->m64_wins[i] == IODA_INVALID_M64)
+			continue;
+		opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
+		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+	}
+
+	return 0;
+}
+
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	unsigned int           win;
+	struct resource       *res;
+	int                    i;
+	int64_t                rc;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	/* Initialize the m64_wins to IODA_INVALID_M64 */
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = pdev->resource + PCI_IOV_RESOURCES + i;
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		do {
+			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+					phb->ioda.m64_bar_idx + 1, 0);
+
+			if (win >= phb->ioda.m64_bar_idx + 1)
+				goto m64_failed;
+		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+		pdn->m64_wins[i] = win;
+
+		/* Map the M64 here */
+		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+						 OPAL_M64_WINDOW_TYPE,
+						 pdn->m64_wins[i],
+						 res->start,
+						 0, /* unused */
+						 resource_size(res));
+		if (rc != OPAL_SUCCESS) {
+			pr_err("Failed to map M64 BAR #%d: %lld\n", win, rc);
+			goto m64_failed;
+		}
+
+		rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
+		if (rc != OPAL_SUCCESS) {
+			pr_err("Failed to enable M64 BAR #%d: %llx\n", win, rc);
+			goto m64_failed;
+		}
+	}
+	return 0;
+
+m64_failed:
+	pnv_pci_vf_release_m64(pdev);
+	return -EBUSY;
+}
+
+static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct iommu_table    *tbl;
+	unsigned long         addr;
+	int64_t               rc;
+
+	bus = dev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	tbl = pe->tce32_table;
+	addr = tbl->it_base;
+
+	opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+				   pe->pe_number << 1, 1, __pa(addr),
+				   0, 0x1000);
+
+	rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+				        pe->pe_number,
+				        (pe->pe_number << 1) + 1,
+				        pe->tce_bypass_base,
+				        0);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
+
+	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	free_pages(addr, get_order(TCE32_TABLE_SIZE));
+	pe->tce32_table = NULL;
+}
+
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe, *pe_n;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+
+	if (!pdev->is_physfn)
+		return;
+
+	pdn = pci_get_pdn(pdev);
+	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
+		if (pe->parent_dev != pdev)
+			continue;
+
+		pnv_pci_ioda2_release_dma_pe(pdev, pe);
+
+		/* Remove from list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_del(&pe->list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_ioda_deconfigure_pe(phb, pe);
+
+		pnv_ioda_free_pe(phb, pe->pe_number);
+	}
+}
+
+void pnv_pci_sriov_disable(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	struct pci_sriov      *iov;
+	u16 vf_num;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+	iov = pdev->sriov;
+	vf_num = pdn->vf_pes;
+
+	/* Release VF PEs */
+	pnv_ioda_release_vf_PE(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+
+		/* Release M64 BARs */
+		pnv_pci_vf_release_m64(pdev);
+
+		/* Release PE numbers */
+		bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->offset = 0;
+	}
+
+	return;
+}
+
+static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
+				       struct pnv_ioda_pe *pe);
+static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe;
+	int                    pe_num;
+	u16                    vf_index;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (!pdev->is_physfn)
+		return;
+
+	/* Reserve PE for each VF */
+	for (vf_index = 0; vf_index < vf_num; vf_index++) {
+		pe_num = pdn->offset + vf_index;
+
+		pe = &phb->ioda.pe_array[pe_num];
+		pe->pe_number = pe_num;
+		pe->phb = phb;
+		pe->flags = PNV_IODA_PE_VF;
+		pe->pbus = NULL;
+		pe->parent_dev = pdev;
+		pe->tce32_seg = -1;
+		pe->mve_number = -1;
+		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
+			   pci_iov_virtfn_devfn(pdev, vf_index);
+
+		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%d\n",
+			hose->global_number, pdev->bus->number,
+			PCI_SLOT(pci_iov_virtfn_devfn(pdev, vf_index)),
+			PCI_FUNC(pci_iov_virtfn_devfn(pdev, vf_index)), pe_num);
+
+		if (pnv_ioda_configure_pe(phb, pe)) {
+			/* XXX What do we do here ? */
+			if (pe_num)
+				pnv_ioda_free_pe(phb, pe_num);
+			pe->pdev = NULL;
+			continue;
+		}
+
+		pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+				GFP_KERNEL, hose->node);
+		pe->tce32_table->data = pe;
+
+		/* Put PE to the list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_add_tail(&pe->list, &phb->ioda.pe_list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+
+	}
+}
+
+int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    ret;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		/* Calculate available PE for required VFs */
+		mutex_lock(&phb->ioda.pe_alloc_mutex);
+		pdn->offset = bitmap_find_next_zero_area(
+			phb->ioda.pe_alloc, phb->ioda.total_pe,
+			0, vf_num, 0);
+		if (pdn->offset >= phb->ioda.total_pe) {
+			mutex_unlock(&phb->ioda.pe_alloc_mutex);
+			pr_info("Failed to enable VF\n");
+			pdn->offset = 0;
+			return -EBUSY;
+		}
+		bitmap_set(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->vf_pes = vf_num;
+		mutex_unlock(&phb->ioda.pe_alloc_mutex);
+
+		/* Assign M64 BAR accordingly */
+		ret = pnv_pci_vf_assign_m64(pdev);
+		if (ret) {
+			pr_info("Not enough M64 resources\n");
+			goto m64_failed;
+		}
+
+		/* Do some magic shift */
+		pnv_pci_vf_resource_shift(pdev, pdn->offset);
+	}
+
+	/* Setup VF PEs */
+	pnv_ioda_setup_vf_PE(pdev, vf_num);
+
+	return 0;
+
+m64_failed:
+	bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+	pdn->offset = 0;
+
+	return ret;
+}
+
 int pcibios_sriov_disable(struct pci_dev *pdev, u16 vf_num)
 {
+	pnv_pci_sriov_disable(pdev);
+
 	/* Release firmware data */
 	remove_dev_pci_info(pdev, vf_num);
 	return 0;
@@ -995,6 +1401,8 @@ int pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
 {
 	/* Allocate firmware data */
 	add_dev_pci_info(pdev, vf_num);
+
+	pnv_pci_sriov_enable(pdev, vf_num);
 	return 0;
 }
 #endif /* CONFIG_PCI_IOV */
@@ -1191,9 +1599,6 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	int64_t rc;
 	void *addr;
 
-	/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
-#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
-
 	/* XXX FIXME: Handle 64-bit only DMA devices */
 	/* XXX FIXME: Provide 64-bit DMA facilities & non-4K TCE tables etc.. */
 	/* XXX FIXME: Allocate multi-level tables on PHB3 */
@@ -1256,12 +1661,19 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_PAIR);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	return;
  fail:
@@ -1388,12 +1800,19 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	/* Also create a bypass window */
 	pnv_pci_ioda2_setup_bypass_pe(phb, pe);
@@ -2087,6 +2506,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	phb->hub_id = hub_id;
 	phb->opal_id = phb_id;
 	phb->type = ioda_type;
+	mutex_init(&phb->ioda.pe_alloc_mutex);
 
 	/* Detect specific models for error handling */
 	if (of_device_is_compatible(np, "ibm,p7ioc-pciex"))
@@ -2146,6 +2566,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	INIT_LIST_HEAD(&phb->ioda.pe_dma_list);
 	INIT_LIST_HEAD(&phb->ioda.pe_list);
+	mutex_init(&phb->ioda.pe_list_mutex);
 
 	/* Calculate how many 32-bit TCE segments we have */
 	phb->ioda.tce32_count = phb->ioda.m32_pci_base >> 28;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b7d4b9d..269f1dd 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -714,6 +714,24 @@ static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 {
 	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
 	struct pnv_phb *phb = hose->private_data;
+#ifdef CONFIG_PCI_IOV
+	struct pnv_ioda_pe *pe;
+	struct pci_dn *pdn;
+
+	/* Fix the VF pdn PE number */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		WARN_ON(pdn->pe_number != IODA_INVALID_PE);
+		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
+			if (pe->rid == ((pdev->bus->number << 8) |
+			    (pdev->devfn & 0xff))) {
+				pdn->pe_number = pe->pe_number;
+				pe->pdev = pdev;
+				break;
+			}
+		}
+	}
+#endif /* CONFIG_PCI_IOV */
 
 	/* If we have no phb structure, try to setup a fallback based on
 	 * the device-tree (RTAS PCI for example)
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 7317777..39d42f2 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -23,6 +23,7 @@ enum pnv_phb_model {
 #define PNV_IODA_PE_BUS_ALL	(1 << 2)	/* PE has subordinate buses	*/
 #define PNV_IODA_PE_MASTER	(1 << 3)	/* Master PE in compound case	*/
 #define PNV_IODA_PE_SLAVE	(1 << 4)	/* Slave PE in compound case	*/
+#define PNV_IODA_PE_VF		(1 << 5)	/* PE for one VF 		*/
 
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
@@ -34,6 +35,9 @@ struct pnv_ioda_pe {
 	 * entire bus (& children). In the former case, pdev
 	 * is populated, in the later case, pbus is.
 	 */
+#ifdef CONFIG_PCI_IOV
+	struct pci_dev          *parent_dev;
+#endif
 	struct pci_dev		*pdev;
 	struct pci_bus		*pbus;
 
@@ -165,6 +169,8 @@ struct pnv_phb {
 
 			/* PE allocation bitmap */
 			unsigned long		*pe_alloc;
+			/* PE allocation mutex */
+			struct mutex		pe_alloc_mutex;
 
 			/* M32 & IO segment maps */
 			unsigned int		*m32_segmap;
@@ -179,6 +185,7 @@ struct pnv_phb {
 			 * on the sequence of creation
 			 */
 			struct list_head	pe_list;
+			struct mutex            pe_list_mutex;
 
 			/* Reverse map of PEs, will have to extend if
 			 * we are to support more than 256 PEs, indexed
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 15/17] powerpc/powernv: Allocate VF PE
@ 2015-01-15  2:28         ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, Wei Yang, linuxppc-dev

VFs are created, when pci device is enabled.

This patch tries best to assign maximum resources and PEs for VF when pci
device is enabled. Enough M64 assigned to cover the IOV BAR, IOV BAR is
shifted to meet the PE# indicated by M64. VF's pdn->pdev and
pdn->pe_number are fixed.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    4 +
 arch/powerpc/kernel/pci_dn.c              |   11 +
 arch/powerpc/platforms/powernv/pci-ioda.c |  451 ++++++++++++++++++++++++++++-
 arch/powerpc/platforms/powernv/pci.c      |   18 ++
 arch/powerpc/platforms/powernv/pci.h      |    7 +
 5 files changed, 476 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index b857ec4..d61c384 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -172,6 +172,10 @@ struct pci_dn {
 	int	pe_number;
 #ifdef CONFIG_PCI_IOV
 	u16     max_vfs;		/* number of VFs IOV BAR expended */
+	u16     vf_pes;			/* VF PE# under this PF */
+	int     offset;			/* PE# for the first VF PE */
+#define IODA_INVALID_M64        (-1)
+	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 6536573..36aaa8e 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -218,6 +218,17 @@ void remove_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
 	struct pci_dn *pdn, *tmp;
 	int i;
 
+	/*
+	 * VF and VF PE is create/released dynamicly, which we need to
+	 * bind/unbind them. Otherwise when re-enable SRIOV, the VF and VF PE
+	 * would be mismatched.
+	 */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		pdn->pe_number = IODA_INVALID_PE;
+		return;
+	}
+
 	/* Only support IOV PF for now */
 	if (!pdev->is_physfn)
 		return;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 62bb2eb..94fe6e1 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -45,6 +45,9 @@
 #include "powernv.h"
 #include "pci.h"
 
+/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
+#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 			    const char *fmt, ...)
 {
@@ -57,11 +60,18 @@ static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 	vaf.fmt = fmt;
 	vaf.va = &args;
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV)
 		strlcpy(pfix, dev_name(&pe->pdev->dev), sizeof(pfix));
-	else
+	else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
 		sprintf(pfix, "%04x:%02x     ",
 			pci_domain_nr(pe->pbus), pe->pbus->number);
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		sprintf(pfix, "%04x:%02x:%2x.%d",
+			pci_domain_nr(pe->parent_dev->bus),
+			(pe->rid & 0xff00) >> 8,
+			PCI_SLOT(pe->rid), PCI_FUNC(pe->rid));
+#endif /* CONFIG_PCI_IOV*/
 
 	printk("%spci %s: [PE# %.3d] %pV",
 	       level, pfix, pe->pe_number, &vaf);
@@ -567,7 +577,7 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 			      bool is_add)
 {
 	struct pnv_ioda_pe *slave;
-	struct pci_dev *pdev;
+	struct pci_dev *pdev = NULL;
 	int ret;
 
 	/*
@@ -606,8 +616,12 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 
 	if (pe->flags & (PNV_IODA_PE_BUS_ALL | PNV_IODA_PE_BUS))
 		pdev = pe->pbus->self;
-	else
+	else if (pe->flags & PNV_IODA_PE_DEV)
 		pdev = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+	else if (pe->flags & PNV_IODA_PE_VF)
+		pdev = pe->parent_dev->bus->self;
+#endif /* CONFIG_PCI_IOV */
 	while (pdev) {
 		struct pci_dn *pdn = pci_get_pdn(pdev);
 		struct pnv_ioda_pe *parent;
@@ -625,6 +639,89 @@ static int pnv_ioda_set_peltv(struct pnv_phb *phb,
 	return 0;
 }
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+{
+	struct pci_dev *parent;
+	uint8_t bcomp, dcomp, fcomp;
+	int64_t rc;
+	long rid_end, rid;
+
+	/* Currently, we just deconfigure VF PE. Bus PE will always there.*/
+	if (pe->pbus) {
+		int count;
+
+		dcomp = OPAL_IGNORE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_IGNORE_RID_FUNCTION_NUMBER;
+		parent = pe->pbus->self;
+		if (pe->flags & PNV_IODA_PE_BUS_ALL)
+			count = pe->pbus->busn_res.end - pe->pbus->busn_res.start + 1;
+		else
+			count = 1;
+
+		switch(count) {
+		case  1: bcomp = OpalPciBusAll;         break;
+		case  2: bcomp = OpalPciBus7Bits;       break;
+		case  4: bcomp = OpalPciBus6Bits;       break;
+		case  8: bcomp = OpalPciBus5Bits;       break;
+		case 16: bcomp = OpalPciBus4Bits;       break;
+		case 32: bcomp = OpalPciBus3Bits;       break;
+		default:
+			pr_err("%s: Number of subordinate busses %d"
+			       " unsupported\n",
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
+			/* Do an exact match only */
+			bcomp = OpalPciBusAll;
+		}
+		rid_end = pe->rid + (count << 8);
+	} else {
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+			parent = pe->pdev->bus->self;
+		bcomp = OpalPciBusAll;
+		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
+		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
+		rid_end = pe->rid + 1;
+	}
+
+	/* Clear the reverse map */
+	for (rid = pe->rid; rid < rid_end; rid++)
+		phb->ioda.pe_rmap[rid] = 0;
+
+	/* Release from all parents PELT-V */
+	while (parent) {
+		struct pci_dn *pdn = pci_get_pdn(parent);
+		if (pdn && pdn->pe_number != IODA_INVALID_PE) {
+			rc = opal_pci_set_peltv(phb->opal_id, pdn->pe_number,
+						pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+			/* XXX What to do in case of error ? */
+		}
+		parent = parent->bus->self;
+	}
+
+	opal_pci_eeh_freeze_set(phb->opal_id, pe->pe_number,
+				  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+
+	/* Dissociate PE in PELT */
+	rc = opal_pci_set_peltv(phb->opal_id, pe->pe_number,
+				pe->pe_number, OPAL_REMOVE_PE_FROM_DOMAIN);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld removing self from PELTV\n", rc);
+	rc = opal_pci_set_pe(phb->opal_id, pe->pe_number, pe->rid,
+			     bcomp, dcomp, fcomp, OPAL_UNMAP_PE);
+	if (rc)
+		pe_err(pe, "OPAL error %ld trying to unmap PELT table\n", rc);
+
+	pe->pbus = NULL;
+	pe->pdev = NULL;
+	pe->parent_dev = NULL;
+
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
 	struct pci_dev *parent;
@@ -653,13 +750,19 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 		default:
 			pr_err("%s: Number of subordinate busses %d"
 			       " unsupported\n",
-			       pci_name(pe->pbus->self), count);
+			       pci_is_root_bus(pe->pbus)?"root bus":pci_name(pe->pbus->self),
+			       count);
 			/* Do an exact match only */
 			bcomp = OpalPciBusAll;
 		}
 		rid_end = pe->rid + (count << 8);
 	} else {
-		parent = pe->pdev->bus->self;
+#ifdef CONFIG_PCI_IOV
+		if (pe->flags & PNV_IODA_PE_VF)
+			parent = pe->parent_dev;
+		else
+#endif /* CONFIG_PCI_IOV */
+			parent = pe->pdev->bus->self;
 		bcomp = OpalPciBusAll;
 		dcomp = OPAL_COMPARE_RID_DEVICE_NUMBER;
 		fcomp = OPAL_COMPARE_RID_FUNCTION_NUMBER;
@@ -984,8 +1087,311 @@ static void pnv_pci_ioda_setup_PEs(void)
 }
 
 #ifdef CONFIG_PCI_IOV
+static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    i;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		if (pdn->m64_wins[i] == IODA_INVALID_M64)
+			continue;
+		opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
+		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+	}
+
+	return 0;
+}
+
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	unsigned int           win;
+	struct resource       *res;
+	int                    i;
+	int64_t                rc;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	/* Initialize the m64_wins to IODA_INVALID_M64 */
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		pdn->m64_wins[i] = IODA_INVALID_M64;
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = pdev->resource + PCI_IOV_RESOURCES + i;
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		do {
+			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+					phb->ioda.m64_bar_idx + 1, 0);
+
+			if (win >= phb->ioda.m64_bar_idx + 1)
+				goto m64_failed;
+		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+		pdn->m64_wins[i] = win;
+
+		/* Map the M64 here */
+		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+						 OPAL_M64_WINDOW_TYPE,
+						 pdn->m64_wins[i],
+						 res->start,
+						 0, /* unused */
+						 resource_size(res));
+		if (rc != OPAL_SUCCESS) {
+			pr_err("Failed to map M64 BAR #%d: %lld\n", win, rc);
+			goto m64_failed;
+		}
+
+		rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
+		if (rc != OPAL_SUCCESS) {
+			pr_err("Failed to enable M64 BAR #%d: %llx\n", win, rc);
+			goto m64_failed;
+		}
+	}
+	return 0;
+
+m64_failed:
+	pnv_pci_vf_release_m64(pdev);
+	return -EBUSY;
+}
+
+static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct iommu_table    *tbl;
+	unsigned long         addr;
+	int64_t               rc;
+
+	bus = dev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	tbl = pe->tce32_table;
+	addr = tbl->it_base;
+
+	opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
+				   pe->pe_number << 1, 1, __pa(addr),
+				   0, 0x1000);
+
+	rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+				        pe->pe_number,
+				        (pe->pe_number << 1) + 1,
+				        pe->tce_bypass_base,
+				        0);
+	if (rc)
+		pe_warn(pe, "OPAL error %ld releasing DMA window\n", rc);
+
+	iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+	free_pages(addr, get_order(TCE32_TABLE_SIZE));
+	pe->tce32_table = NULL;
+}
+
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe, *pe_n;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+
+	if (!pdev->is_physfn)
+		return;
+
+	pdn = pci_get_pdn(pdev);
+	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
+		if (pe->parent_dev != pdev)
+			continue;
+
+		pnv_pci_ioda2_release_dma_pe(pdev, pe);
+
+		/* Remove from list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_del(&pe->list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_ioda_deconfigure_pe(phb, pe);
+
+		pnv_ioda_free_pe(phb, pe->pe_number);
+	}
+}
+
+void pnv_pci_sriov_disable(struct pci_dev *pdev)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	struct pci_sriov      *iov;
+	u16 vf_num;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+	iov = pdev->sriov;
+	vf_num = pdn->vf_pes;
+
+	/* Release VF PEs */
+	pnv_ioda_release_vf_PE(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+
+		/* Release M64 BARs */
+		pnv_pci_vf_release_m64(pdev);
+
+		/* Release PE numbers */
+		bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->offset = 0;
+	}
+
+	return;
+}
+
+static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
+				       struct pnv_ioda_pe *pe);
+static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pnv_ioda_pe    *pe;
+	int                    pe_num;
+	u16                    vf_index;
+	struct pci_dn         *pdn;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (!pdev->is_physfn)
+		return;
+
+	/* Reserve PE for each VF */
+	for (vf_index = 0; vf_index < vf_num; vf_index++) {
+		pe_num = pdn->offset + vf_index;
+
+		pe = &phb->ioda.pe_array[pe_num];
+		pe->pe_number = pe_num;
+		pe->phb = phb;
+		pe->flags = PNV_IODA_PE_VF;
+		pe->pbus = NULL;
+		pe->parent_dev = pdev;
+		pe->tce32_seg = -1;
+		pe->mve_number = -1;
+		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
+			   pci_iov_virtfn_devfn(pdev, vf_index);
+
+		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%d\n",
+			hose->global_number, pdev->bus->number,
+			PCI_SLOT(pci_iov_virtfn_devfn(pdev, vf_index)),
+			PCI_FUNC(pci_iov_virtfn_devfn(pdev, vf_index)), pe_num);
+
+		if (pnv_ioda_configure_pe(phb, pe)) {
+			/* XXX What do we do here ? */
+			if (pe_num)
+				pnv_ioda_free_pe(phb, pe_num);
+			pe->pdev = NULL;
+			continue;
+		}
+
+		pe->tce32_table = kzalloc_node(sizeof(struct iommu_table),
+				GFP_KERNEL, hose->node);
+		pe->tce32_table->data = pe;
+
+		/* Put PE to the list */
+		mutex_lock(&phb->ioda.pe_list_mutex);
+		list_add_tail(&pe->list, &phb->ioda.pe_list);
+		mutex_unlock(&phb->ioda.pe_list_mutex);
+
+		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+
+	}
+}
+
+int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
+{
+	struct pci_bus        *bus;
+	struct pci_controller *hose;
+	struct pnv_phb        *phb;
+	struct pci_dn         *pdn;
+	int                    ret;
+
+	bus = pdev->bus;
+	hose = pci_bus_to_host(bus);
+	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
+
+	if (phb->type == PNV_PHB_IODA2) {
+		/* Calculate available PE for required VFs */
+		mutex_lock(&phb->ioda.pe_alloc_mutex);
+		pdn->offset = bitmap_find_next_zero_area(
+			phb->ioda.pe_alloc, phb->ioda.total_pe,
+			0, vf_num, 0);
+		if (pdn->offset >= phb->ioda.total_pe) {
+			mutex_unlock(&phb->ioda.pe_alloc_mutex);
+			pr_info("Failed to enable VF\n");
+			pdn->offset = 0;
+			return -EBUSY;
+		}
+		bitmap_set(phb->ioda.pe_alloc, pdn->offset, vf_num);
+		pdn->vf_pes = vf_num;
+		mutex_unlock(&phb->ioda.pe_alloc_mutex);
+
+		/* Assign M64 BAR accordingly */
+		ret = pnv_pci_vf_assign_m64(pdev);
+		if (ret) {
+			pr_info("Not enough M64 resources\n");
+			goto m64_failed;
+		}
+
+		/* Do some magic shift */
+		pnv_pci_vf_resource_shift(pdev, pdn->offset);
+	}
+
+	/* Setup VF PEs */
+	pnv_ioda_setup_vf_PE(pdev, vf_num);
+
+	return 0;
+
+m64_failed:
+	bitmap_clear(phb->ioda.pe_alloc, pdn->offset, vf_num);
+	pdn->offset = 0;
+
+	return ret;
+}
+
 int pcibios_sriov_disable(struct pci_dev *pdev, u16 vf_num)
 {
+	pnv_pci_sriov_disable(pdev);
+
 	/* Release firmware data */
 	remove_dev_pci_info(pdev, vf_num);
 	return 0;
@@ -995,6 +1401,8 @@ int pcibios_sriov_enable(struct pci_dev *pdev, u16 vf_num)
 {
 	/* Allocate firmware data */
 	add_dev_pci_info(pdev, vf_num);
+
+	pnv_pci_sriov_enable(pdev, vf_num);
 	return 0;
 }
 #endif /* CONFIG_PCI_IOV */
@@ -1191,9 +1599,6 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 	int64_t rc;
 	void *addr;
 
-	/* 256M DMA window, 4K TCE pages, 8 bytes TCE */
-#define TCE32_TABLE_SIZE	((0x10000000 / 0x1000) * 8)
-
 	/* XXX FIXME: Handle 64-bit only DMA devices */
 	/* XXX FIXME: Provide 64-bit DMA facilities & non-4K TCE tables etc.. */
 	/* XXX FIXME: Allocate multi-level tables on PHB3 */
@@ -1256,12 +1661,19 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 				 TCE_PCI_SWINV_PAIR);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	return;
  fail:
@@ -1388,12 +1800,19 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 		tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 	}
 	iommu_init_table(tbl, phb->hose->node);
-	iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
 
-	if (pe->pdev)
+	if (pe->flags & PNV_IODA_PE_DEV) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
-	else
+	} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
 		pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+	} else if (pe->flags & PNV_IODA_PE_VF) {
+		iommu_register_group(tbl, phb->hose->global_number,
+				     pe->pe_number);
+	}
 
 	/* Also create a bypass window */
 	pnv_pci_ioda2_setup_bypass_pe(phb, pe);
@@ -2087,6 +2506,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 	phb->hub_id = hub_id;
 	phb->opal_id = phb_id;
 	phb->type = ioda_type;
+	mutex_init(&phb->ioda.pe_alloc_mutex);
 
 	/* Detect specific models for error handling */
 	if (of_device_is_compatible(np, "ibm,p7ioc-pciex"))
@@ -2146,6 +2566,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	INIT_LIST_HEAD(&phb->ioda.pe_dma_list);
 	INIT_LIST_HEAD(&phb->ioda.pe_list);
+	mutex_init(&phb->ioda.pe_list_mutex);
 
 	/* Calculate how many 32-bit TCE segments we have */
 	phb->ioda.tce32_count = phb->ioda.m32_pci_base >> 28;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b7d4b9d..269f1dd 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -714,6 +714,24 @@ static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 {
 	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
 	struct pnv_phb *phb = hose->private_data;
+#ifdef CONFIG_PCI_IOV
+	struct pnv_ioda_pe *pe;
+	struct pci_dn *pdn;
+
+	/* Fix the VF pdn PE number */
+	if (pdev->is_virtfn) {
+		pdn = pci_get_pdn(pdev);
+		WARN_ON(pdn->pe_number != IODA_INVALID_PE);
+		list_for_each_entry(pe, &phb->ioda.pe_list, list) {
+			if (pe->rid == ((pdev->bus->number << 8) |
+			    (pdev->devfn & 0xff))) {
+				pdn->pe_number = pe->pe_number;
+				pe->pdev = pdev;
+				break;
+			}
+		}
+	}
+#endif /* CONFIG_PCI_IOV */
 
 	/* If we have no phb structure, try to setup a fallback based on
 	 * the device-tree (RTAS PCI for example)
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 7317777..39d42f2 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -23,6 +23,7 @@ enum pnv_phb_model {
 #define PNV_IODA_PE_BUS_ALL	(1 << 2)	/* PE has subordinate buses	*/
 #define PNV_IODA_PE_MASTER	(1 << 3)	/* Master PE in compound case	*/
 #define PNV_IODA_PE_SLAVE	(1 << 4)	/* Slave PE in compound case	*/
+#define PNV_IODA_PE_VF		(1 << 5)	/* PE for one VF 		*/
 
 /* Data associated with a PE, including IOMMU tracking etc.. */
 struct pnv_phb;
@@ -34,6 +35,9 @@ struct pnv_ioda_pe {
 	 * entire bus (& children). In the former case, pdev
 	 * is populated, in the later case, pbus is.
 	 */
+#ifdef CONFIG_PCI_IOV
+	struct pci_dev          *parent_dev;
+#endif
 	struct pci_dev		*pdev;
 	struct pci_bus		*pbus;
 
@@ -165,6 +169,8 @@ struct pnv_phb {
 
 			/* PE allocation bitmap */
 			unsigned long		*pe_alloc;
+			/* PE allocation mutex */
+			struct mutex		pe_alloc_mutex;
 
 			/* M32 & IO segment maps */
 			unsigned int		*m32_segmap;
@@ -179,6 +185,7 @@ struct pnv_phb {
 			 * on the sequence of creation
 			 */
 			struct list_head	pe_list;
+			struct mutex            pe_list_mutex;
 
 			/* Reverse map of PEs, will have to extend if
 			 * we are to support more than 256 PEs, indexed
-- 
1.7.9.5


* [PATCH V11 16/17] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:28         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

The M64 aperture size is limited on PHB3. When the IOV BAR is too big, it
exceeds this limit and fails to be assigned.

This patch introduces a different expansion mechanism based on the IOV BAR
size:

If the VF BAR size is smaller than 64MB, expand the IOV BAR to total_pe
copies of the VF BAR. If it is bigger than 64MB, expand it to
roundup_pow_of_two(total_vfs) copies instead.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 ++
 arch/powerpc/platforms/powernv/pci-ioda.c |   33 ++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index d61c384..7156486 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -174,6 +174,8 @@ struct pci_dn {
 	u16     max_vfs;		/* number of VFs IOV BAR expanded */
 	u16     vf_pes;			/* VF PE# under this PF */
 	int     offset;			/* PE# for the first VF PE */
+#define M64_PER_IOV 4
+	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
 	int     m64_wins[PCI_SRIOV_NUM_BARS];
 #endif /* CONFIG_PCI_IOV */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 94fe6e1..23ea873 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2180,6 +2180,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	int i;
 	resource_size_t size;
 	struct pci_dn *pdn;
+	int mul, total_vfs;
 
 	if (!pdev->is_physfn || pdev->is_added)
 		return;
@@ -2190,6 +2191,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 	pdn = pci_get_pdn(pdev);
 	pdn->max_vfs = 0;
 
+	total_vfs = pci_sriov_get_totalvfs(pdev);
+	pdn->m64_per_iov = 1;
+	mul = phb->ioda.total_pe;
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &pdev->resource[i];
+		if (!res->flags || res->parent)
+			continue;
+		if (!pnv_pci_is_mem_pref_64(res->flags)) {
+			dev_warn(&pdev->dev, " non M64 IOV BAR %pR on %s\n",
+					res, pci_name(pdev));
+			continue;
+		}
+
+		size = pci_iov_resource_size(pdev, i);
+
+		/* bigger than 64M */
+		if (size > (1 << 26)) {
+			dev_info(&pdev->dev, "PowerNV: VF BAR[%d] size "
+					"is bigger than 64M, roundup power2\n", i);
+			pdn->m64_per_iov = M64_PER_IOV;
+			mul = __roundup_pow_of_two(total_vfs);
+			break;
+		}
+	}
+
 	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
 		res = &pdev->resource[i];
 		if (!res->flags || res->parent)
@@ -2202,13 +2229,13 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 
 		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
 		size = pci_iov_resource_size(pdev, i);
-		res->end = res->start + size * phb->ioda.total_pe - 1;
+		res->end = res->start + size * mul - 1;
 		dev_dbg(&pdev->dev, "                       %pR\n", res);
 		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
 				i - PCI_IOV_RESOURCES,
-				res, phb->ioda.total_pe);
+				res, mul);
 	}
-	pdn->max_vfs = phb->ioda.total_pe;
+	pdn->max_vfs = mul;
 }
 
 static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
-- 
1.7.9.5



* [PATCH V11 17/17] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3
  2015-01-15  2:27       ` Wei Yang
@ 2015-01-15  2:28         ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

When the IOV BAR is big, each of its BARs is covered by 4 M64 windows, so
several VF PEs end up sitting in one M64 window.

This patch groups VF PEs according to the M64 allocation.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  188 +++++++++++++++++++++++------
 2 files changed, 149 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 7156486..ad39a42 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -177,7 +177,7 @@ struct pci_dn {
 #define M64_PER_IOV 4
 	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
-	int     m64_wins[PCI_SRIOV_NUM_BARS];
+	int     m64_wins[PCI_SRIOV_NUM_BARS][M64_PER_IOV];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 23ea873..8456ae8 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1093,26 +1093,27 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pci_dn         *pdn;
-	int                    i;
+	int                    i, j;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
 
-	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		if (pdn->m64_wins[i] == IODA_INVALID_M64)
-			continue;
-		opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
-		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
-		pdn->m64_wins[i] = IODA_INVALID_M64;
-	}
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		for (j = 0; j < M64_PER_IOV; j++) {
+			if (pdn->m64_wins[i][j] == IODA_INVALID_M64)
+				continue;
+			opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 0);
+			clear_bit(pdn->m64_wins[i][j], &phb->ioda.m64_bar_alloc);
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+		}
 
 	return 0;
 }
 
-static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
@@ -1120,17 +1121,33 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 	struct pci_dn         *pdn;
 	unsigned int           win;
 	struct resource       *res;
-	int                    i;
+	int                    i, j;
 	int64_t                rc;
+	int                    total_vfs;
+	resource_size_t        size, start;
+	int                    pe_num;
+	int                    vf_groups;
+	int                    vf_per_group;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
+	total_vfs = pci_sriov_get_totalvfs(pdev);
 
 	/* Initialize the m64_wins to IODA_INVALID_M64 */
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-		pdn->m64_wins[i] = IODA_INVALID_M64;
+		for (j = 0; j < M64_PER_IOV; j++)
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+
+	if (pdn->m64_per_iov == M64_PER_IOV) {
+		vf_groups = (vf_num <= M64_PER_IOV) ? vf_num: M64_PER_IOV;
+		vf_per_group = (vf_num <= M64_PER_IOV)? 1:
+			__roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+	} else {
+		vf_groups = 1;
+		vf_per_group = 1;
+	}
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = pdev->resource + PCI_IOV_RESOURCES + i;
@@ -1140,33 +1157,61 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 		if (!pnv_pci_is_mem_pref_64(res->flags))
 			continue;
 
-		do {
-			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
-					phb->ioda.m64_bar_idx + 1, 0);
-
-			if (win >= phb->ioda.m64_bar_idx + 1)
-				goto m64_failed;
-		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+		for (j = 0; j < vf_groups; j++) {
+			do {
+				win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+						phb->ioda.m64_bar_idx + 1, 0);
+
+				if (win >= phb->ioda.m64_bar_idx + 1)
+					goto m64_failed;
+			} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+			pdn->m64_wins[i][j] = win;
+
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				size = pci_iov_resource_size(pdev,
+								   PCI_IOV_RESOURCES + i);
+				size = size * vf_per_group;
+				start = res->start + size * j;
+			} else {
+				size = resource_size(res);
+				start = res->start;
+			}
 
-		pdn->m64_wins[i] = win;
+			/* Map the M64 here */
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				pe_num = pdn->offset + j;
+				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+						pe_num, OPAL_M64_WINDOW_TYPE,
+						pdn->m64_wins[i][j], 0);
+			}
 
-		/* Map the M64 here */
-		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+			rc = opal_pci_set_phb_mem_window(phb->opal_id,
 						 OPAL_M64_WINDOW_TYPE,
-						 pdn->m64_wins[i],
-						 res->start,
+						 pdn->m64_wins[i][j],
+						 start,
 						 0, /* unused */
-						 resource_size(res));
-		if (rc != OPAL_SUCCESS) {
-			pr_err("Failed to map M64 BAR #%d: %lld\n", win, rc);
-			goto m64_failed;
-		}
+						 size);
 
-		rc = opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
-		if (rc != OPAL_SUCCESS) {
-			pr_err("Failed to enable M64 BAR #%d: %llx\n", win, rc);
-			goto m64_failed;
+
+			if (rc != OPAL_SUCCESS) {
+				pr_err("Failed to set M64 BAR #%d: %lld\n",
+						win, rc);
+				goto m64_failed;
+			}
+
+			if (pdn->m64_per_iov == M64_PER_IOV)
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 2);
+			else
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 1);
+
+			if (rc != OPAL_SUCCESS) {
+				pr_err("Failed to enable M64 BAR #%d: %llx\n",
+						win, rc);
+				goto m64_failed;
+			}
 		}
 	}
 	return 0;
@@ -1208,22 +1253,53 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 	pe->tce32_table = NULL;
 }
 
-static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pnv_ioda_pe    *pe, *pe_n;
 	struct pci_dn         *pdn;
+	u16                    vf_index;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
 
 	if (!pdev->is_physfn)
 		return;
 
-	pdn = pci_get_pdn(pdev);
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++)
+			for (vf_index = vf_group * vf_per_group;
+				vf_index < (vf_group + 1) * vf_per_group &&
+				vf_index < vf_num;
+				vf_index++)
+				for (vf_index1 = vf_group * vf_per_group;
+					vf_index1 < (vf_group + 1) * vf_per_group &&
+					vf_index1 < vf_num;
+					vf_index1++){
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_REMOVE_PE_FROM_DOMAIN);
+
+					if (rc)
+					    pr_warn("%s: Failed to unlink same"
+						" group PE#%d(%lld)\n", __func__,
+						pdn->offset + vf_index1, rc);
+				}
+	}
+
 	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
 		if (pe->parent_dev != pdev)
 			continue;
@@ -1258,10 +1334,11 @@ void pnv_pci_sriov_disable(struct pci_dev *pdev)
 	vf_num = pdn->vf_pes;
 
 	/* Release VF PEs */
-	pnv_ioda_release_vf_PE(pdev);
+	pnv_ioda_release_vf_PE(pdev, vf_num);
 
 	if (phb->type == PNV_PHB_IODA2) {
-		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+		if (pdn->m64_per_iov == 1)
+			pnv_pci_vf_resource_shift(pdev, -pdn->offset);
 
 		/* Release M64 BARs */
 		pnv_pci_vf_release_m64(pdev);
@@ -1285,6 +1362,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 	int                    pe_num;
 	u16                    vf_index;
 	struct pci_dn         *pdn;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
@@ -1332,7 +1410,36 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+	}
 
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++)
+			for (vf_index = vf_group * vf_per_group;
+				vf_index < (vf_group + 1) * vf_per_group &&
+				vf_index < vf_num;
+				vf_index++)
+				for (vf_index1 = vf_group * vf_per_group;
+					vf_index1 < (vf_group + 1) * vf_per_group &&
+					vf_index1 < vf_num;
+					vf_index1++) {
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_ADD_PE_TO_DOMAIN);
+
+					if (rc)
+					    pr_warn("%s: Failed to link same "
+						"group PE#%d(%lld)\n",
+						__func__,
+						pdn->offset + vf_index1, rc);
+			}
 	}
 }
 
@@ -1366,14 +1473,15 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
 		mutex_unlock(&phb->ioda.pe_alloc_mutex);
 
 		/* Assign M64 BAR accordingly */
-		ret = pnv_pci_vf_assign_m64(pdev);
+		ret = pnv_pci_vf_assign_m64(pdev, vf_num);
 		if (ret) {
 			pr_info("No enough M64 resource\n");
 			goto m64_failed;
 		}
 
 		/* Do some magic shift */
-		pnv_pci_vf_resource_shift(pdev, pdn->offset);
+		if (pdn->m64_per_iov == 1)
+			pnv_pci_vf_resource_shift(pdev, pdn->offset);
 	}
 
 	/* Setup VF PEs */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH V11 17/17] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3
@ 2015-01-15  2:28         ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-01-15  2:28 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, Wei Yang, linuxppc-dev

When the IOV BAR is big, each one is covered by 4 M64 windows. This leads to
several VF PEs sitting in one PE in terms of M64.

This patch groups VF PEs according to the M64 allocation.
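
The grouping math used below can be sketched as a standalone model (a
hypothetical mock-up, assuming M64_PER_IOV == 4 and open-coding the kernel's
__roundup_pow_of_two() from <linux/log2.h>):

```c
#include <assert.h>

/* M64_PER_IOV M64 windows are available per IOV BAR on PHB3 */
#define M64_PER_IOV 4

/* open-coded stand-in for the kernel's __roundup_pow_of_two() */
static unsigned int roundup_pow_of_two_u(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Mirror of the grouping logic in pnv_pci_vf_assign_m64(): up to
 * M64_PER_IOV groups, each holding vf_per_group consecutive VFs.
 */
static void vf_grouping(unsigned int vf_num, unsigned int m64_per_iov,
			unsigned int *vf_groups, unsigned int *vf_per_group)
{
	if (m64_per_iov == M64_PER_IOV) {
		*vf_groups = (vf_num <= M64_PER_IOV) ? vf_num : M64_PER_IOV;
		*vf_per_group = (vf_num <= M64_PER_IOV) ? 1 :
			roundup_pow_of_two_u(vf_num) / m64_per_iov;
	} else {
		*vf_groups = 1;
		*vf_per_group = 1;
	}
}
```

For example, 7 VFs round up to 8 and split into 4 groups of 2, while 3 VFs
get 3 groups of 1 VF each.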

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pci-bridge.h     |    2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  188 +++++++++++++++++++++++------
 2 files changed, 149 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 7156486..ad39a42 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -177,7 +177,7 @@ struct pci_dn {
 #define M64_PER_IOV 4
 	int     m64_per_iov;
 #define IODA_INVALID_M64        (-1)
-	int     m64_wins[PCI_SRIOV_NUM_BARS];
+	int     m64_wins[PCI_SRIOV_NUM_BARS][M64_PER_IOV];
 #endif /* CONFIG_PCI_IOV */
 #endif
 	struct list_head child_list;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 23ea873..8456ae8 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1093,26 +1093,27 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev)
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pci_dn         *pdn;
-	int                    i;
+	int                    i, j;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
 
-	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-		if (pdn->m64_wins[i] == IODA_INVALID_M64)
-			continue;
-		opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 0);
-		clear_bit(pdn->m64_wins[i], &phb->ioda.m64_bar_alloc);
-		pdn->m64_wins[i] = IODA_INVALID_M64;
-	}
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
+		for (j = 0; j < M64_PER_IOV; j++) {
+			if (pdn->m64_wins[i][j] == IODA_INVALID_M64)
+				continue;
+			opal_pci_phb_mmio_enable(phb->opal_id,
+				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 0);
+			clear_bit(pdn->m64_wins[i][j], &phb->ioda.m64_bar_alloc);
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+		}
 
 	return 0;
 }
 
-static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
+static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
@@ -1120,17 +1121,33 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 	struct pci_dn         *pdn;
 	unsigned int           win;
 	struct resource       *res;
-	int                    i;
+	int                    i, j;
 	int64_t                rc;
+	int                    total_vfs;
+	resource_size_t        size, start;
+	int                    pe_num;
+	int                    vf_groups;
+	int                    vf_per_group;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
 	pdn = pci_get_pdn(pdev);
+	total_vfs = pci_sriov_get_totalvfs(pdev);
 
 	/* Initialize the m64_wins to IODA_INVALID_M64 */
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-		pdn->m64_wins[i] = IODA_INVALID_M64;
+		for (j = 0; j < M64_PER_IOV; j++)
+			pdn->m64_wins[i][j] = IODA_INVALID_M64;
+
+	if (pdn->m64_per_iov == M64_PER_IOV) {
+		vf_groups = (vf_num <= M64_PER_IOV) ? vf_num: M64_PER_IOV;
+		vf_per_group = (vf_num <= M64_PER_IOV)? 1:
+			__roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+	} else {
+		vf_groups = 1;
+		vf_per_group = 1;
+	}
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = pdev->resource + PCI_IOV_RESOURCES + i;
@@ -1140,33 +1157,61 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev)
 		if (!pnv_pci_is_mem_pref_64(res->flags))
 			continue;
 
-		do {
-			win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
-					phb->ioda.m64_bar_idx + 1, 0);
-
-			if (win >= phb->ioda.m64_bar_idx + 1)
-				goto m64_failed;
-		} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+		for (j = 0; j < vf_groups; j++) {
+			do {
+				win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+						phb->ioda.m64_bar_idx + 1, 0);
+
+				if (win >= phb->ioda.m64_bar_idx + 1)
+					goto m64_failed;
+			} while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+			pdn->m64_wins[i][j] = win;
+
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				size = pci_iov_resource_size(pdev,
+								   PCI_IOV_RESOURCES + i);
+				size = size * vf_per_group;
+				start = res->start + size * j;
+			} else {
+				size = resource_size(res);
+				start = res->start;
+			}
 
-		pdn->m64_wins[i] = win;
+			/* Map the M64 here */
+			if (pdn->m64_per_iov == M64_PER_IOV) {
+				pe_num = pdn->offset + j;
+				rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+						pe_num, OPAL_M64_WINDOW_TYPE,
+						pdn->m64_wins[i][j], 0);
+			}
 
-		/* Map the M64 here */
-		rc = opal_pci_set_phb_mem_window(phb->opal_id,
+			rc = opal_pci_set_phb_mem_window(phb->opal_id,
 						 OPAL_M64_WINDOW_TYPE,
-						 pdn->m64_wins[i],
-						 res->start,
+						 pdn->m64_wins[i][j],
+						 start,
 						 0, /* unused */
-						 resource_size(res));
-		if (rc != OPAL_SUCCESS) {
-			pr_err("Failed to map M64 BAR #%d: %lld\n", win, rc);
-			goto m64_failed;
-		}
+						 size);
 
-		rc = opal_pci_phb_mmio_enable(phb->opal_id,
-				OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i], 1);
-		if (rc != OPAL_SUCCESS) {
-			pr_err("Failed to enable M64 BAR #%d: %llx\n", win, rc);
-			goto m64_failed;
+
+			if (rc != OPAL_SUCCESS) {
+				pr_err("Failed to set M64 BAR #%d: %lld\n",
+						win, rc);
+				goto m64_failed;
+			}
+
+			if (pdn->m64_per_iov == M64_PER_IOV)
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 2);
+			else
+				rc = opal_pci_phb_mmio_enable(phb->opal_id,
+				     OPAL_M64_WINDOW_TYPE, pdn->m64_wins[i][j], 1);
+
+			if (rc != OPAL_SUCCESS) {
+				pr_err("Failed to enable M64 BAR #%d: %llx\n",
+						win, rc);
+				goto m64_failed;
+			}
 		}
 	}
 	return 0;
@@ -1208,22 +1253,53 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
 	pe->tce32_table = NULL;
 }
 
-static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
+static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 vf_num)
 {
 	struct pci_bus        *bus;
 	struct pci_controller *hose;
 	struct pnv_phb        *phb;
 	struct pnv_ioda_pe    *pe, *pe_n;
 	struct pci_dn         *pdn;
+	u16                    vf_index;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
 	phb = hose->private_data;
+	pdn = pci_get_pdn(pdev);
 
 	if (!pdev->is_physfn)
 		return;
 
-	pdn = pci_get_pdn(pdev);
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++)
+			for (vf_index = vf_group * vf_per_group;
+				vf_index < (vf_group + 1) * vf_per_group &&
+				vf_index < vf_num;
+				vf_index++)
+				for (vf_index1 = vf_group * vf_per_group;
+					vf_index1 < (vf_group + 1) * vf_per_group &&
+					vf_index1 < vf_num;
+					vf_index1++){
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_REMOVE_PE_FROM_DOMAIN);
+
+					if (rc)
+					    pr_warn("%s: Failed to unlink same"
+						" group PE#%d(%lld)\n", __func__,
+						pdn->offset + vf_index1, rc);
+				}
+	}
+
 	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
 		if (pe->parent_dev != pdev)
 			continue;
@@ -1258,10 +1334,11 @@ void pnv_pci_sriov_disable(struct pci_dev *pdev)
 	vf_num = pdn->vf_pes;
 
 	/* Release VF PEs */
-	pnv_ioda_release_vf_PE(pdev);
+	pnv_ioda_release_vf_PE(pdev, vf_num);
 
 	if (phb->type == PNV_PHB_IODA2) {
-		pnv_pci_vf_resource_shift(pdev, -pdn->offset);
+		if (pdn->m64_per_iov == 1)
+			pnv_pci_vf_resource_shift(pdev, -pdn->offset);
 
 		/* Release M64 BARs */
 		pnv_pci_vf_release_m64(pdev);
@@ -1285,6 +1362,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 	int                    pe_num;
 	u16                    vf_index;
 	struct pci_dn         *pdn;
+	int64_t                rc;
 
 	bus = pdev->bus;
 	hose = pci_bus_to_host(bus);
@@ -1332,7 +1410,36 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 vf_num)
 		mutex_unlock(&phb->ioda.pe_list_mutex);
 
 		pnv_pci_ioda2_setup_dma_pe(phb, pe);
+	}
 
+	if (pdn->m64_per_iov == M64_PER_IOV && vf_num > M64_PER_IOV) {
+		int   vf_group;
+		int   vf_per_group;
+		int   vf_index1;
+
+		vf_per_group = __roundup_pow_of_two(vf_num) / pdn->m64_per_iov;
+
+		for (vf_group = 0; vf_group < M64_PER_IOV; vf_group++)
+			for (vf_index = vf_group * vf_per_group;
+				vf_index < (vf_group + 1) * vf_per_group &&
+				vf_index < vf_num;
+				vf_index++)
+				for (vf_index1 = vf_group * vf_per_group;
+					vf_index1 < (vf_group + 1) * vf_per_group &&
+					vf_index1 < vf_num;
+					vf_index1++) {
+
+					rc = opal_pci_set_peltv(phb->opal_id,
+						pdn->offset + vf_index,
+						pdn->offset + vf_index1,
+						OPAL_ADD_PE_TO_DOMAIN);
+
+					if (rc)
+					    pr_warn("%s: Failed to link same "
+						"group PE#%d(%lld)\n",
+						__func__,
+						pdn->offset + vf_index1, rc);
+			}
 	}
 }
 
@@ -1366,14 +1473,15 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
 		mutex_unlock(&phb->ioda.pe_alloc_mutex);
 
 		/* Assign M64 BAR accordingly */
-		ret = pnv_pci_vf_assign_m64(pdev);
+		ret = pnv_pci_vf_assign_m64(pdev, vf_num);
 		if (ret) {
 			pr_info("No enough M64 resource\n");
 			goto m64_failed;
 		}
 
 		/* Do some magic shift */
-		pnv_pci_vf_resource_shift(pdev, pdn->offset);
+		if (pdn->m64_per_iov == 1)
+			pnv_pci_vf_resource_shift(pdev, pdn->offset);
 	}
 
 	/* Setup VF PEs */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 14/17] powerpc/powernv: Shift VF resource with an offset
  2015-01-15  2:28         ` Wei Yang
@ 2015-01-30 23:08           ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-01-30 23:08 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Thu, Jan 15, 2015 at 10:28:04AM +0800, Wei Yang wrote:
> On the PowerNV platform, a resource's position in M64 implies the PE# the
> resource belongs to. In some particular cases, a resource must be adjusted
> to place it at a correct position in M64.
> 
> This patch introduces a function to shift the 'real' PF IOV BAR address
> according to an offset.
> 
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c |   31 +++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 8bad2b0..62bb2eb 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -14,6 +14,7 @@
>  #include <linux/kernel.h>
>  #include <linux/pci.h>
>  #include <linux/crash_dump.h>
> +#include <linux/pci_regs.h>
>  #include <linux/debugfs.h>
>  #include <linux/delay.h>
>  #include <linux/string.h>
> @@ -749,6 +750,36 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
>  	return 10;
>  }
>  
> +#ifdef CONFIG_PCI_IOV
> +static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> +{
> +	struct pci_dn *pdn = pci_get_pdn(dev);
> +	int i;
> +	struct resource *res;
> +	resource_size_t size;
> +
> +	if (!dev->is_physfn)
> +		return;
> +
> +	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
> +		res = &dev->resource[i];
> +		if (!res->flags || !res->parent)
> +			continue;
> +
> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> +			continue;
> +
> +		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
> +		size = pci_iov_resource_size(dev, i);
> +		res->start += size*offset;

It seems like you should adjust res->end, too.  Am I missing something?

And I'm not sure it's safe to move the resource here, because if we move it
outside the bounds of the parent, we'll corrupt the resource tree.  Maybe
we're safe for some reason here, but it requires more analysis than I've
done to prove it.

> +
> +		dev_info(&dev->dev, "                 %pR\n", res);
> +		pci_update_resource(dev, i);
> +	}
> +	pdn->max_vfs -= offset;
> +}
> +#endif /* CONFIG_PCI_IOV */
> +
>  #if 0
>  static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
>  {
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 14/17] powerpc/powernv: Shift VF resource with an offset
  2015-01-30 23:08           ` Bjorn Helgaas
@ 2015-02-03  1:30             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-03  1:30 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Fri, Jan 30, 2015 at 05:08:03PM -0600, Bjorn Helgaas wrote:
>On Thu, Jan 15, 2015 at 10:28:04AM +0800, Wei Yang wrote:
>> On the PowerNV platform, a resource's position in M64 implies the PE# the
>> resource belongs to. In some particular cases, a resource must be adjusted
>> to place it at a correct position in M64.
>> 
>> This patch introduces a function to shift the 'real' PF IOV BAR address
>> according to an offset.
>> 
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   31 +++++++++++++++++++++++++++++
>>  1 file changed, 31 insertions(+)
>> 
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 8bad2b0..62bb2eb 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -14,6 +14,7 @@
>>  #include <linux/kernel.h>
>>  #include <linux/pci.h>
>>  #include <linux/crash_dump.h>
>> +#include <linux/pci_regs.h>
>>  #include <linux/debugfs.h>
>>  #include <linux/delay.h>
>>  #include <linux/string.h>
>> @@ -749,6 +750,36 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
>>  	return 10;
>>  }
>>  
>> +#ifdef CONFIG_PCI_IOV
>> +static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> +{
>> +	struct pci_dn *pdn = pci_get_pdn(dev);
>> +	int i;
>> +	struct resource *res;
>> +	resource_size_t size;
>> +
>> +	if (!dev->is_physfn)
>> +		return;
>> +
>> +	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>> +		res = &dev->resource[i];
>> +		if (!res->flags || !res->parent)
>> +			continue;
>> +
>> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> +			continue;
>> +
>> +		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
>> +		size = pci_iov_resource_size(dev, i);
>> +		res->start += size*offset;
>
>It seems like you should adjust res->end, too.  Am I missing something?
>

Bjorn, below is the mail I didn't manage to send yesterday.
------------------------------------------------------------------------------
No, what I am doing here is a little tricky.

After reserving extra space for the IOV BAR, this range is recorded in the
resource tree. This function is called from pnv_pci_sriov_enable(), which
runs when the driver wants to enable VFs. How far this function shifts the
BAR depends on the PE number range assigned this time. (The start of that
PE number range is recorded in pdn->offset.)

So we dare not extend the end address, otherwise it would overlap other
devices' space. Perhaps "shrink" is more accurate than "shift": we only move
the start address, not the end address. The consequence is that the IOV BAR
is reported to sit at a new place.

Previously I thought this was fine, but on closer inspection there is a
potential problem. The real size of the IOV BAR is fixed in the hardware
itself. This means that when we "shift" the start address of the IOV BAR,
the end address shifts too. If the end address falls outside the space we
have reserved, MMIO transactions in that range will be seen by two devices.
(If my understanding of this is not correct, please let me know.)

So what I need to fix is to make sure that after the shift, the real end
address (start + total_vf * size) is still within the extended IOV BAR
range. Because we have reserved more space for the IOV BAR, we have some
extra room to shift it.

One thing I am not sure of: could the real end address be
(start + num_vf * size)?
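
The shift-and-check arithmetic above can be put in a minimal standalone
model (hypothetical types and names, not the kernel code): the reserved end
stays fixed, only the start moves up by offset per-VF sizes, and the move is
legal only if the last enabled VF BAR still ends inside the reserved window.

```c
#include <assert.h>
#include <stdint.h>

struct res_window {
	uint64_t start;
	uint64_t end;	/* inclusive, as in struct resource */
};

/* Shift start by offset VF-sized slots; fail if vf_num VFs no longer fit. */
static int shift_vf_bar(struct res_window *res, uint64_t vf_size,
			unsigned int vf_num, unsigned int offset)
{
	uint64_t new_start = res->start + vf_size * offset;

	/* vf_num VFs of vf_size each must still end inside the window */
	if (new_start + vf_size * vf_num > res->end + 1)
		return -1;	/* would overlap the next device */

	res->start = new_start;
	return 0;
}
```

E.g. with a 16 MiB window reserved for 16 VFs of 1 MiB, 8 enabled VFs can be
shifted by up to 8 slots; a shift of 12 would push the real end past the
reservation and must be rejected.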


>And I'm not sure it's safe to move the resource here, because if we move it
>outside the bounds of the parent, we'll corrupt the resource tree.  Maybe
>we're safe for some reason here, but it requires more analysis than I've
>done to prove it.
>
>> +
>> +		dev_info(&dev->dev, "                 %pR\n", res);
>> +		pci_update_resource(dev, i);
>> +	}
>> +	pdn->max_vfs -= offset;
>> +}
>> +#endif /* CONFIG_PCI_IOV */
>> +
>>  #if 0
>>  static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
>>  {
>> -- 
>> 1.7.9.5
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread

* [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
  2015-01-30 23:08           ` Bjorn Helgaas
@ 2015-02-03  7:01             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-03  7:01 UTC (permalink / raw)
  To: bhelgaas, gwshan, benh; +Cc: linux-pci, linuxppc-dev, Wei Yang

The actual IOV BAR range is determined by the start address and the actual
size needed for vf_num VF BARs. After shifting the IOV BAR, there is a
chance that the actual end address exceeds the limit and overlaps with
other devices.

This patch adds a check to make sure that, after shifting, the range will
not overlap with other devices.
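
The patch's shift-then-check-then-unwind structure can be modeled in
miniature (a hypothetical standalone sketch; the struct and names are
illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

struct bar {
	uint64_t start;
	uint64_t end;	/* inclusive end of the reserved window */
	uint64_t size;	/* per-VF size */
};

/*
 * Shift each BAR's start by offset * size; if any shifted BAR can no
 * longer hold vf_num VFs, undo the shifts already applied and fail,
 * mirroring the patch's "goto failed" unwind loop.
 */
static int shift_all(struct bar *bars, int n, unsigned int vf_num,
		     unsigned int offset)
{
	int i;

	for (i = 0; i < n; i++) {
		bars[i].start += bars[i].size * offset;
		if (bars[i].start + bars[i].size * vf_num > bars[i].end + 1)
			goto failed;
	}
	return 0;

failed:
	/* walk back, including the BAR that just failed the check */
	for (; i >= 0; i--)
		bars[i].start -= bars[i].size * offset;
	return -1;
}
```

On failure every BAR is restored to its pre-shift start, so the caller can
bail out cleanly (the real code then takes the m64_failed path).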

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |   53 ++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8456ae8..1a1e74b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -854,16 +854,18 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
 }
 
 #ifdef CONFIG_PCI_IOV
-static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
+static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
 {
 	struct pci_dn *pdn = pci_get_pdn(dev);
 	int i;
 	struct resource *res;
 	resource_size_t size;
+	u16 vf_num;
 
 	if (!dev->is_physfn)
-		return;
+		return -EINVAL;
 
+	vf_num = pdn->vf_pes;
 	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
 		res = &dev->resource[i];
 		if (!res->flags || !res->parent)
@@ -875,11 +877,49 @@ static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
 		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
 		size = pci_iov_resource_size(dev, i);
 		res->start += size*offset;
-
 		dev_info(&dev->dev, "                 %pR\n", res);
+
+		/*
+		 * The actual IOV BAR range is determined by the start address
+		 * and the actual size for vf_num VFs BAR. The check here is
+		 * to make sure after shifting, the range will not overlap
+		 * with other device.
+		 */
+		if ((res->start + (size * vf_num)) > res->end) {
+			dev_err(&dev->dev, "VF BAR%d: %pR will conflict with"
+					" other device after shift\n",
+					i - PCI_IOV_RESOURCES, res);
+			goto failed;
+		}
+	}
+
+	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
+		res = &dev->resource[i];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
 		pci_update_resource(dev, i);
 	}
 	pdn->max_vfs -= offset;
+	return 0;
+
+failed:
+	for (; i >= PCI_IOV_RESOURCES; i--) {
+		res = &dev->resource[i];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
+		size = pci_iov_resource_size(dev, i);
+		res->start += size*(-offset);
+		dev_info(&dev->dev, "                 %pR\n", res);
+	}
+	return -EBUSY;
 }
 #endif /* CONFIG_PCI_IOV */
 
@@ -1480,8 +1520,11 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
 		}
 
 		/* Do some magic shift */
-		if (pdn->m64_per_iov == 1)
-			pnv_pci_vf_resource_shift(pdev, pdn->offset);
+		if (pdn->m64_per_iov == 1) {
+			ret = pnv_pci_vf_resource_shift(pdev, pdn->offset);
+			if (ret)
+				goto m64_failed;
+		}
 	}
 
 	/* Setup VF PEs */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* Re: [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
  2015-02-03  7:01             ` Wei Yang
@ 2015-02-04  0:19               ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04  0:19 UTC (permalink / raw)
  To: Wei Yang; +Cc: gwshan, benh, linux-pci, linuxppc-dev

On Tue, Feb 03, 2015 at 03:01:43PM +0800, Wei Yang wrote:
> The actual IOV BAR range is determined by the start address and the actual
> size for vf_num VFs BAR. After shifting the IOV BAR, there would be a
> chance the actual end address exceed the limit and overlap with other
> devices.
> 
> This patch adds a check to make sure after shifting, the range will not
> overlap with other devices.

I folded this into the previous patch (the one that adds
pnv_pci_vf_resource_shift()).  And I think that needs to be folded together
with the following one ("powerpc/powernv: Allocate VF PE") because this one
references pdn->vf_pes, which is added by "Allocate VF PE".

> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c |   53 ++++++++++++++++++++++++++---
>  1 file changed, 48 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 8456ae8..1a1e74b 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -854,16 +854,18 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
>  }
>  
>  #ifdef CONFIG_PCI_IOV
> -static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>  {
>  	struct pci_dn *pdn = pci_get_pdn(dev);
>  	int i;
>  	struct resource *res;
>  	resource_size_t size;
> +	u16 vf_num;
>  
>  	if (!dev->is_physfn)
> -		return;
> +		return -EINVAL;
>  
> +	vf_num = pdn->vf_pes;

I can't actually build this, but I don't think pdn->vf_pes is defined yet.

>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>  		res = &dev->resource[i];
>  		if (!res->flags || !res->parent)
> @@ -875,11 +877,49 @@ static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>  		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
>  		size = pci_iov_resource_size(dev, i);
>  		res->start += size*offset;
> -
>  		dev_info(&dev->dev, "                 %pR\n", res);
> +
> +		/*
> +		 * The actual IOV BAR range is determined by the start address
> +		 * and the actual size for vf_num VFs BAR. The check here is
> +		 * to make sure after shifting, the range will not overlap
> +		 * with other device.
> +		 */
> +		if ((res->start + (size * vf_num)) > res->end) {
> +			dev_err(&dev->dev, "VF BAR%d: %pR will conflict with"
> +					" other device after shift\n");

sriov_init() sets up "res" with enough space to contain TotalVF copies
of the VF BAR.  By the time we get here, that "res" is in the resource
tree, and you should be able to see it in /proc/iomem.

For example, if TotalVFs is 128 and VF BAR0 is 1MB in size, the
resource size would be 128 * 1MB = 0x800_0000.  If the VF BAR0 in the
SR-IOV Capability contains a base address of 0x8000_0000, the resource
would be:

  [mem 0x8000_0000-0x87ff_ffff]

We have to assume there's another resource starting immediately after
this one, i.e., at 0x8800_0000, and we have to make sure that when we
change this resource and turn on SR-IOV, we don't overlap with it.

The shifted resource will start at 0x8000_0000 + 1MB * "offset".  The
hardware will respond to a range whose size is 1MB * NumVFs (NumVFs
may be smaller than TotalVFs).

If we enable 16 VFs and shift by 23, we set VF BAR0 to 0x8000_0000 +
1MB * 23 = 0x8170_0000, and the size is 1MB * 16 = 0x100_0000, so the
new resource will be:

  [mem 0x8170_0000-0x826f_ffff]

That's fine; it doesn't extend past the original end of 0x87ff_ffff.
But if we enable those same 16 VFs with a shift of 120, we set VF BAR0
to 0x8000_0000 + 1MB * 120 = 0x8780_0000, and the size stays the same,
so the new resource will be:

  [mem 0x8780_0000-0x887f_ffff]

and that's a problem because we have two devices responding at
0x8800_0000.

Your test of "res->start + (size * vf_num)) > res->end" is not strict
enough to catch this problem.

I think we need something like the patch below.  I restructured it so
we don't have to back out any resource changes if we fail.

This shifting strategy seems to imply that the closer NumVFs is to
TotalVFs, the less flexibility you have to assign PEs, e.g., if NumVFs
== TotalVFs, you wouldn't be able to shift at all.  In this example,
you could shift by anything from 0 to 128 - 16 = 112, but if you
wanted NumVFs = 64, you could only shift by 0 to 64.  Is that true?

I think your M64 BAR gets split into 256 segments, regardless of what
TotalVFs is, so if you expanded the resource to 256 * 1MB for this
example, you would be able to shift by up to 256 - NumVFs.  Do you
actually do this somewhere?

I pushed an updated pci/virtualization branch with these updates.  I
think there's also a leak that needs to be fixed that Dan Carpenter
pointed out.

Bjorn

> +			goto failed;
> +		}
> +	}
> +
> +	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
> +		res = &dev->resource[i];
> +		if (!res->flags || !res->parent)
> +			continue;
> +
> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> +			continue;
> +
>  		pci_update_resource(dev, i);
>  	}
>  	pdn->max_vfs -= offset;
> +	return 0;
> +
> +failed:
> +	for (; i >= PCI_IOV_RESOURCES; i--) {
> +		res = &dev->resource[i];
> +		if (!res->flags || !res->parent)
> +			continue;
> +
> +		if (!pnv_pci_is_mem_pref_64(res->flags))
> +			continue;
> +
> +		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
> +		size = pci_iov_resource_size(dev, i);
> +		res->start += size*(-offset);
> +		dev_info(&dev->dev, "                 %pR\n", res);
> +	}
> +	return -EBUSY;
>  }
>  #endif /* CONFIG_PCI_IOV */
>  
> @@ -1480,8 +1520,11 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
>  		}
>  
>  		/* Do some magic shift */
> -		if (pdn->m64_per_iov == 1)
> -			pnv_pci_vf_resource_shift(pdev, pdn->offset);
> +		if (pdn->m64_per_iov == 1) {
> +			ret = pnv_pci_vf_resource_shift(pdev, pdn->offset);
> +			if (ret)
> +				goto m64_failed;
> +		}
>  	}
>  
>  	/* Setup VF PEs */

commit 9849fc0c807d3544dbd682354325b2454f139ca4
Author: Wei Yang <weiyang@linux.vnet.ibm.com>
Date:   Tue Feb 3 13:09:30 2015 -0600

    powerpc/powernv: Shift VF resource with an offset
    
    On PowerNV platform, resource position in M64 implies the PE# the resource
    belongs to.  In some cases, adjustment of a resource is necessary to locate
    it to a correct position in M64.
    
    Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
    according to an offset.
    
    [bhelgaas: rework loops, rework overlap check, index resource[]
    conventionally, remove pci_regs.h include]
    Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 74de05e58e1d..6ffedcc291a8 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -749,6 +749,74 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
 	return 10;
 }
 
+#ifdef CONFIG_PCI_IOV
+static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
+{
+	struct pci_dn *pdn = pci_get_pdn(dev);
+	int i;
+	struct resource *res, res2;
+	resource_size_t size, end;
+	u16 vf_num;
+
+	if (!dev->is_physfn)
+		return -EINVAL;
+
+	/*
+	 * "offset" is in VFs.  The M64 BARs are sized so that when they
+	 * are segmented, each segment is the same size as the IOV BAR.
+	 * Each segment is in a separate PE, and the high order bits of the
+	 * address are the PE number.  Therefore, each VF's BAR is in a
+	 * separate PE, and changing the IOV BAR start address changes the
+	 * range of PEs the VFs are in.
+	 */
+	vf_num = pdn->vf_pes;		// FIXME not defined yet
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		/*
+		 * The actual IOV BAR range is determined by the start address
+		 * and the actual size for vf_num VFs BAR.  This check is to
+		 * make sure that after shifting, the range will not overlap
+		 * with another device.
+		 */
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
+		res2.flags = res->flags;
+		res2.start = res->start + (size * offset);
+		res2.end = res2.start + (size * vf_num) - 1;
+
+		if (res2.end > res->end) {
+			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
+				i, &res2, res, vf_num, offset);
+			return -EBUSY;
+		}
+	}
+
+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+		res = &dev->resource[i + PCI_IOV_RESOURCES];
+		if (!res->flags || !res->parent)
+			continue;
+
+		if (!pnv_pci_is_mem_pref_64(res->flags))
+			continue;
+
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
+		res2 = *res;
+		res->start += size * offset;
+
+		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
+			 i, &res2, res, vf_num, offset);
+		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
+	}
+	pdn->max_vfs -= offset;
+	return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
 #if 0
 static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 {

* Re: [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
  2015-02-04  0:19               ` Bjorn Helgaas
@ 2015-02-04  3:34                 ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-04  3:34 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, gwshan, benh, linux-pci, linuxppc-dev

On Tue, Feb 03, 2015 at 06:19:26PM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 03, 2015 at 03:01:43PM +0800, Wei Yang wrote:
>> The actual IOV BAR range is determined by the start address and the actual
>> size for vf_num VFs BAR. After shifting the IOV BAR, there would be a
>> chance the actual end address exceed the limit and overlap with other
>> devices.
>> 
>> This patch adds a check to make sure after shifting, the range will not
>> overlap with other devices.
>
>I folded this into the previous patch (the one that adds
>pnv_pci_vf_resource_shift()).  And I think that needs to be folded together
>with the following one ("powerpc/powernv: Allocate VF PE") because this one
>references pdn->vf_pes, which is added by "Allocate VF PE".
>

Yes. Both need this.

>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   53 ++++++++++++++++++++++++++---
>>  1 file changed, 48 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 8456ae8..1a1e74b 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -854,16 +854,18 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
>>  }
>>  
>>  #ifdef CONFIG_PCI_IOV
>> -static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>>  {
>>  	struct pci_dn *pdn = pci_get_pdn(dev);
>>  	int i;
>>  	struct resource *res;
>>  	resource_size_t size;
>> +	u16 vf_num;
>>  
>>  	if (!dev->is_physfn)
>> -		return;
>> +		return -EINVAL;
>>  
>> +	vf_num = pdn->vf_pes;
>
>I can't actually build this, but I don't think pdn->vf_pes is defined yet.
>

pdn->vf_pes is defined in the next patch, so it is not defined yet at this
point.

I thought an incremental patch meant a patch on top of the current patch set,
so it is defined in the last patch.

>>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>>  		res = &dev->resource[i];
>>  		if (!res->flags || !res->parent)
>> @@ -875,11 +877,49 @@ static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>>  		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
>>  		size = pci_iov_resource_size(dev, i);
>>  		res->start += size*offset;
>> -
>>  		dev_info(&dev->dev, "                 %pR\n", res);
>> +
>> +		/*
>> +		 * The actual IOV BAR range is determined by the start address
>> +		 * and the actual size for vf_num VFs BAR. The check here is
>> +		 * to make sure after shifting, the range will not overlap
>> +		 * with other device.
>> +		 */
>> +		if ((res->start + (size * vf_num)) > res->end) {
>> +			dev_err(&dev->dev, "VF BAR%d: %pR will conflict with"
>> +					" other device after shift\n");
>
>sriov_init() sets up "res" with enough space to contain TotalVF copies
>of the VF BAR.  By the time we get here, that "res" is in the resource
>tree, and you should be able to see it in /proc/iomem.
>
>For example, if TotalVFs is 128 and VF BAR0 is 1MB in size, the
>resource size would be 128 * 1MB = 0x800_0000.  If the VF BAR0 in the
>SR-IOV Capability contains a base address of 0x8000_0000, the resource
>would be:
>
>  [mem 0x8000_0000-0x87ff_ffff]
>
>We have to assume there's another resource starting immediately after
>this one, i.e., at 0x8800_0000, and we have to make sure that when we
>change this resource and turn on SR-IOV, we don't overlap with it.
>
>The shifted resource will start at 0x8000_0000 + 1MB * "offset".  The
>hardware will respond to a range whose size is 1MB * NumVFs (NumVFs
>may be smaller than TotalVFs).
>
>If we enable 16 VFs and shift by 23, we set VF BAR0 to 0x8000_0000 +
>1MB * 23 = 0x8170_0000, and the size is 1MB * 16 = 0x100_0000, so the
>new resource will be:
>
>  [mem 0x8170_0000-0x826f_ffff]
>
>That's fine; it doesn't extend past the original end of 0x87ff_ffff.
>But if we enable those same 16 VFs with a shift of 120, we set VF BAR0
>to 0x8000_0000 + 1MB * 120 = 0x8780_0000, and the size stays the same,
>so the new resource will be:
>
>  [mem 0x8780_0000-0x887f_ffff]
>
>and that's a problem because we have two devices responding at
>0x8800_0000.
>
>Your test of "res->start + (size * vf_num)) > res->end" is not strict
>enough to catch this problem.
>

Yep, you are right.

>I think we need something like the patch below.  I restructured it so
>we don't have to back out any resource changes if we fail.
>
>This shifting strategy seems to imply that the closer NumVFs is to
>TotalVFs, the less flexibility you have to assign PEs, e.g., if NumVFs
>== TotalVFs, you wouldn't be able to shift at all.  In this example,
>you could shift by anything from 0 to 128 - 16 = 112, but if you
>wanted NumVFs = 64, you could only shift by 0 to 64.  Is that true?
>
>I think your M64 BAR gets split into 256 segments, regardless of what
>TotalVFs is, so if you expanded the resource to 256 * 1MB for this
>example, you would be able to shift by up to 256 - NumVFs.  Do you
>actually do this somewhere?
>

Yes, after expanding the resource to 256 * 1MB, it is possible to shift by up
to 256 - NumVFs. But currently, on my system, I don't see a case that really
does this.

On my system, there is an Emulex card with 4 PFs.

0006:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0006:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0006:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0006:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)

The max VFs for them are 80, 80, 20, 20, for a total of 200 VFs.

be2net 0006:01:00.0:  Shifting VF BAR [mem 0x3d40 1000 0000 - 0x3d40 10ff ffff 64bit pref] to 256 segs
be2net 0006:01:00.0:                  [mem 0x3d40 1003 0000 - 0x3d40 10ff ffff 64bit pref]    253 segs offset 3
PE range [3 - 82]
be2net 0006:01:00.1:  Shifting VF BAR [mem 0x3d40 1100 0000 - 0x3d40 11ff ffff 64bit pref] to 256 segs
be2net 0006:01:00.1:                  [mem 0x3d40 1153 0000 - 0x3d40 11ff ffff 64bit pref]    173 segs offset 83
PE range [83 - 162]
be2net 0006:01:00.2:  Shifting VF BAR [mem 0x3d40 1200 0000 - 0x3d40 12ff ffff 64bit pref] to 256 segs
be2net 0006:01:00.2:                  [mem 0x3d40 12a3 0000 - 0x3d40 12ff ffff 64bit pref]    93  segs offset 163
PE range [163 - 182]
be2net 0006:01:00.3:  Shifting VF BAR [mem 0x3d40 1300 0000 - 0x3d40 13ff ffff 64bit pref] to 256 segs
be2net 0006:01:00.3:                  [mem 0x3d40 13b7 0000 - 0x3d40 13ff ffff 64bit pref]    73  segs offset 183
PE range [183 - 202]

After enabling the max number of VFs, the last PF still has 73 VF BAR
segments left. So this does not trigger the limit, but it proves the shift
offset can be larger than (TotalVFs - NumVFs).

>I pushed an updated pci/virtualization branch with these updates.  I
>think there's also a leak that needs to be fixed that Dan Carpenter
>pointed out.
>
>Bjorn
>
>> +			goto failed;
>> +		}
>> +	}
>> +
>> +	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>> +		res = &dev->resource[i];
>> +		if (!res->flags || !res->parent)
>> +			continue;
>> +
>> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> +			continue;
>> +
>>  		pci_update_resource(dev, i);
>>  	}
>>  	pdn->max_vfs -= offset;
>> +	return 0;
>> +
>> +failed:
>> +	for (; i >= PCI_IOV_RESOURCES; i--) {
>> +		res = &dev->resource[i];
>> +		if (!res->flags || !res->parent)
>> +			continue;
>> +
>> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> +			continue;
>> +
>> +		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
>> +		size = pci_iov_resource_size(dev, i);
>> +		res->start += size*(-offset);
>> +		dev_info(&dev->dev, "                 %pR\n", res);
>> +	}
>> +	return -EBUSY;
>>  }
>>  #endif /* CONFIG_PCI_IOV */
>>  
>> @@ -1480,8 +1520,11 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
>>  		}
>>  
>>  		/* Do some magic shift */
>> -		if (pdn->m64_per_iov == 1)
>> -			pnv_pci_vf_resource_shift(pdev, pdn->offset);
>> +		if (pdn->m64_per_iov == 1) {
>> +			ret = pnv_pci_vf_resource_shift(pdev, pdn->offset);
>> +			if (ret)
>> +				goto m64_failed;
>> +		}
>>  	}
>>  
>>  	/* Setup VF PEs */
>
>commit 9849fc0c807d3544dbd682354325b2454f139ca4
>Author: Wei Yang <weiyang@linux.vnet.ibm.com>
>Date:   Tue Feb 3 13:09:30 2015 -0600
>
>    powerpc/powernv: Shift VF resource with an offset
>    
>    On PowerNV platform, resource position in M64 implies the PE# the resource
>    belongs to.  In some cases, adjustment of a resource is necessary to locate
>    it to a correct position in M64.
>    
>    Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
>    according to an offset.
>    
>    [bhelgaas: rework loops, rework overlap check, index resource[]
>    conventionally, remove pci_regs.h include]
>    Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>
>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>index 74de05e58e1d..6ffedcc291a8 100644
>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>@@ -749,6 +749,74 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
> 	return 10;
> }
>
>+#ifdef CONFIG_PCI_IOV
>+static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>+{
>+	struct pci_dn *pdn = pci_get_pdn(dev);
>+	int i;
>+	struct resource *res, res2;
>+	resource_size_t size, end;
>+	u16 vf_num;
>+
>+	if (!dev->is_physfn)
>+		return -EINVAL;
>+
>+	/*
>+	 * "offset" is in VFs.  The M64 BARs are sized so that when they
>+	 * are segmented, each segment is the same size as the IOV BAR.
>+	 * Each segment is in a separate PE, and the high order bits of the
>+	 * address are the PE number.  Therefore, each VF's BAR is in a
>+	 * separate PE, and changing the IOV BAR start address changes the
>+	 * range of PEs the VFs are in.
>+	 */
>+	vf_num = pdn->vf_pes;		// FIXME not defined yet

Do you want me to pull your pci/virtualization branch and fix this?

>+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>+		res = &dev->resource[i + PCI_IOV_RESOURCES];
>+		if (!res->flags || !res->parent)
>+			continue;
>+
>+		if (!pnv_pci_is_mem_pref_64(res->flags))
>+			continue;
>+
>+		/*
>+		 * The actual IOV BAR range is determined by the start address
>+		 * and the actual size for vf_num VFs BAR.  This check is to
>+		 * make sure that after shifting, the range will not overlap
>+		 * with another device.
>+		 */
>+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>+		res2.flags = res->flags;
>+		res2.start = res->start + (size * offset);
>+		res2.end = res2.start + (size * vf_num) - 1;
>+
>+		if (res2.end > res->end) {
>+			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
>+				i, &res2, res, vf_num, offset);
>+			return -EBUSY;
>+		}
>+	}
>+
>+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>+		res = &dev->resource[i + PCI_IOV_RESOURCES];
>+		if (!res->flags || !res->parent)
>+			continue;
>+
>+		if (!pnv_pci_is_mem_pref_64(res->flags))
>+			continue;
>+
>+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>+		res2 = *res;
>+		res->start += size * offset;
>+
>+		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
>+			 i, &res2, res, vf_num, offset);
>+		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
>+	}
>+	pdn->max_vfs -= offset;
>+	return 0;
>+}
>+#endif /* CONFIG_PCI_IOV */
>+
> #if 0
> static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
> {

-- 
Richard Yang
Help you, Help me


* Re: [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
@ 2015-02-04  3:34                 ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-04  3:34 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev, gwshan

On Tue, Feb 03, 2015 at 06:19:26PM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 03, 2015 at 03:01:43PM +0800, Wei Yang wrote:
>> The actual IOV BAR range is determined by the start address and the actual
>> size for vf_num VFs BAR. After shifting the IOV BAR, there would be a
>> chance the actual end address exceed the limit and overlap with other
>> devices.
>> 
>> This patch adds a check to make sure after shifting, the range will not
>> overlap with other devices.
>
>I folded this into the previous patch (the one that adds
>pnv_pci_vf_resource_shift()).  And I think that needs to be folded together
>with the following one ("powerpc/powernv: Allocate VF PE") because this one
>references pdn->vf_pes, which is added by "Allocate VF PE".
>

Yes. Both need this.

>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   53 ++++++++++++++++++++++++++---
>>  1 file changed, 48 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 8456ae8..1a1e74b 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -854,16 +854,18 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
>>  }
>>  
>>  #ifdef CONFIG_PCI_IOV
>> -static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>>  {
>>  	struct pci_dn *pdn = pci_get_pdn(dev);
>>  	int i;
>>  	struct resource *res;
>>  	resource_size_t size;
>> +	u16 vf_num;
>>  
>>  	if (!dev->is_physfn)
>> -		return;
>> +		return -EINVAL;
>>  
>> +	vf_num = pdn->vf_pes;
>
>I can't actually build this, but I don't think pdn->vf_pes is defined yet.
>

pdn->vf_pes is defined in the next patch, so it is not defined yet at this point.

I thought an incremental patch meant a patch on top of the current patch set,
so it is defined in the last patch.

>>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>>  		res = &dev->resource[i];
>>  		if (!res->flags || !res->parent)
>> @@ -875,11 +877,49 @@ static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>>  		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
>>  		size = pci_iov_resource_size(dev, i);
>>  		res->start += size*offset;
>> -
>>  		dev_info(&dev->dev, "                 %pR\n", res);
>> +
>> +		/*
>> +		 * The actual IOV BAR range is determined by the start address
>> +		 * and the actual size for vf_num VFs BAR. The check here is
>> +		 * to make sure after shifting, the range will not overlap
>> +		 * with other device.
>> +		 */
>> +		if ((res->start + (size * vf_num)) > res->end) {
>> +			dev_err(&dev->dev, "VF BAR%d: %pR will conflict with"
>> +					" other device after shift\n");
>
>sriov_init() sets up "res" with enough space to contain TotalVF copies
>of the VF BAR.  By the time we get here, that "res" is in the resource
>tree, and you should be able to see it in /proc/iomem.
>
>For example, if TotalVFs is 128 and VF BAR0 is 1MB in size, the
>resource size would be 128 * 1MB = 0x800_0000.  If the VF BAR0 in the
>SR-IOV Capability contains a base address of 0x8000_0000, the resource
>would be:
>
>  [mem 0x8000_0000-0x87ff_ffff]
>
>We have to assume there's another resource starting immediately after
>this one, i.e., at 0x8800_0000, and we have to make sure that when we
>change this resource and turn on SR-IOV, we don't overlap with it.
>
>The shifted resource will start at 0x8000_0000 + 1MB * "offset".  The
>hardware will respond to a range whose size is 1MB * NumVFs (NumVFs
>may be smaller than TotalVFs).
>
>If we enable 16 VFs and shift by 23, we set VF BAR0 to 0x8000_0000 +
>1MB * 23 = 0x8170_0000, and the size is 1MB * 16 = 0x100_0000, so the
>new resource will be:
>
>  [mem 0x8170_0000-0x826f_ffff]
>
>That's fine; it doesn't extend past the original end of 0x87ff_ffff.
>But if we enable those same 16 VFs with a shift of 120, we set VF BAR0
>to 0x8000_0000 + 1MB * 120 = 0x8780_0000, and the size stays the same,
>so the new resource will be:
>
>  [mem 0x8780_0000-0x887f_ffff]
>
>and that's a problem because we have two devices responding at
>0x8800_0000.
>
>Your test of "res->start + (size * vf_num)) > res->end" is not strict
>enough to catch this problem.
>

Yep, you are right.
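
Bjorn's overlap arithmetic above can be sketched as a small standalone check
(names like `shift_overlaps` are illustrative, not from the kernel source):

```c
#include <assert.h>
#include <stdint.h>

struct res { uint64_t start, end; };

/* The shifted window starts at start + size * offset, and the hardware
 * responds to size * num_vfs bytes; that window must not extend past the
 * end of the resource the resource tree knows about. */
static int shift_overlaps(struct res r, uint64_t size, int num_vfs, int offset)
{
	uint64_t new_start = r.start + size * (uint64_t)offset;
	uint64_t new_end = new_start + size * (uint64_t)num_vfs - 1;

	return new_end > r.end;	/* nonzero: would overlap the next device */
}
```

With Bjorn's numbers (1MB VF BAR, TotalVFs 128, base 0x8000_0000), 16 VFs
shifted by 23 fit, while 16 VFs shifted by 120 spill past 0x87ff_ffff.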

>I think we need something like the patch below.  I restructured it so
>we don't have to back out any resource changes if we fail.
>
>This shifting strategy seems to imply that the closer NumVFs is to
>TotalVFs, the less flexibility you have to assign PEs, e.g., if NumVFs
>== TotalVFs, you wouldn't be able to shift at all.  In this example,
>you could shift by anything from 0 to 128 - 16 = 112, but if you
>wanted NumVFs = 64, you could only shift by 0 to 64.  Is that true?
>
>I think your M64 BAR gets split into 256 segments, regardless of what
>TotalVFs is, so if you expanded the resource to 256 * 1MB for this
>example, you would be able to shift by up to 256 - NumVFs.  Do you
>actually do this somewhere?
>

Yes, after expanding the resource to 256 * 1MB, it is possible to shift by up
to 256 - NumVFs. But currently, on my system, I don't see a case that really
does this.

On my system, there is an Emulex card with 4 PFs.

0006:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0006:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0006:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0006:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)

The max VFs for them are 80, 80, 20, 20, with total number of 200 VFs.

be2net 0006:01:00.0:  Shifting VF BAR [mem 0x3d40 1000 0000 - 0x3d40 10ff ffff 64bit pref] to 256 segs
be2net 0006:01:00.0:                  [mem 0x3d40 1003 0000 - 0x3d40 10ff ffff 64bit pref]    253 segs offset 3
PE range [3 - 82]
be2net 0006:01:00.1:  Shifting VF BAR [mem 0x3d40 1100 0000 - 0x3d40 11ff ffff 64bit pref] to 256 segs
be2net 0006:01:00.1:                  [mem 0x3d40 1153 0000 - 0x3d40 11ff ffff 64bit pref]    173 segs offset 83
PE range [83 - 162]
be2net 0006:01:00.2:  Shifting VF BAR [mem 0x3d40 1200 0000 - 0x3d40 12ff ffff 64bit pref] to 256 segs
be2net 0006:01:00.2:                  [mem 0x3d40 12a3 0000 - 0x3d40 12ff ffff 64bit pref]    93  segs offset 163
PE range [163 - 182]
be2net 0006:01:00.3:  Shifting VF BAR [mem 0x3d40 1300 0000 - 0x3d40 13ff ffff 64bit pref] to 256 segs
be2net 0006:01:00.3:                  [mem 0x3d40 13b7 0000 - 0x3d40 13ff ffff 64bit pref]    73  segs offset 183
PE range [183 - 202]

After enabling the max number of VFs, even the last PF still has room for 73
VF BAR segments. So this does not trigger the limit, but it proves the shift
offset can be larger than (TotalVFs - NumVFs).
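
The PE arithmetic implied by the log above can be modeled as follows (a
sketch; `vf_pe_range` and the fixed 256-segment count come from this mail
thread, not from any exported kernel API):

```c
#include <assert.h>

struct pe_range { int first, last, segs_left; };

/* The M64 BAR is split into 256 segments regardless of TotalVFs, so a PF
 * whose IOV BAR is shifted by `offset` segments has 256 - offset segments
 * left, and its NumVFs VFs land in PEs [offset, offset + num_vfs - 1]. */
static struct pe_range vf_pe_range(int offset, int num_vfs)
{
	struct pe_range r = { offset, offset + num_vfs - 1, 256 - offset };
	return r;
}
```

This reproduces the log: PF 0006:01:00.0 (offset 3, 80 VFs) keeps 253
segments and spans PEs [3 - 82]; PF 0006:01:00.3 (offset 183, 20 VFs) keeps
73 segments and spans PEs [183 - 202].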

>I pushed an updated pci/virtualization branch with these updates.  I
>think there's also a leak that needs to be fixed that Dan Carpenter
>pointed out.
>
>Bjorn
>
>> +			goto failed;
>> +		}
>> +	}
>> +
>> +	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>> +		res = &dev->resource[i];
>> +		if (!res->flags || !res->parent)
>> +			continue;
>> +
>> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> +			continue;
>> +
>>  		pci_update_resource(dev, i);
>>  	}
>>  	pdn->max_vfs -= offset;
>> +	return 0;
>> +
>> +failed:
>> +	for (; i >= PCI_IOV_RESOURCES; i--) {
>> +		res = &dev->resource[i];
>> +		if (!res->flags || !res->parent)
>> +			continue;
>> +
>> +		if (!pnv_pci_is_mem_pref_64(res->flags))
>> +			continue;
>> +
>> +		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
>> +		size = pci_iov_resource_size(dev, i);
>> +		res->start += size*(-offset);
>> +		dev_info(&dev->dev, "                 %pR\n", res);
>> +	}
>> +	return -EBUSY;
>>  }
>>  #endif /* CONFIG_PCI_IOV */
>>  
>> @@ -1480,8 +1520,11 @@ int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 vf_num)
>>  		}
>>  
>>  		/* Do some magic shift */
>> -		if (pdn->m64_per_iov == 1)
>> -			pnv_pci_vf_resource_shift(pdev, pdn->offset);
>> +		if (pdn->m64_per_iov == 1) {
>> +			ret = pnv_pci_vf_resource_shift(pdev, pdn->offset);
>> +			if (ret)
>> +				goto m64_failed;
>> +		}
>>  	}
>>  
>>  	/* Setup VF PEs */
>
>commit 9849fc0c807d3544dbd682354325b2454f139ca4
>Author: Wei Yang <weiyang@linux.vnet.ibm.com>
>Date:   Tue Feb 3 13:09:30 2015 -0600
>
>    powerpc/powernv: Shift VF resource with an offset
>    
>    On PowerNV platform, resource position in M64 implies the PE# the resource
>    belongs to.  In some cases, adjustment of a resource is necessary to locate
>    it to a correct position in M64.
>    
>    Add pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address
>    according to an offset.
>    
>    [bhelgaas: rework loops, rework overlap check, index resource[]
>    conventionally, remove pci_regs.h include]
>    Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>
>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>index 74de05e58e1d..6ffedcc291a8 100644
>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>@@ -749,6 +749,74 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
> 	return 10;
> }
>
>+#ifdef CONFIG_PCI_IOV
>+static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>+{
>+	struct pci_dn *pdn = pci_get_pdn(dev);
>+	int i;
>+	struct resource *res, res2;
>+	resource_size_t size, end;
>+	u16 vf_num;
>+
>+	if (!dev->is_physfn)
>+		return -EINVAL;
>+
>+	/*
>+	 * "offset" is in VFs.  The M64 BARs are sized so that when they
>+	 * are segmented, each segment is the same size as the IOV BAR.
>+	 * Each segment is in a separate PE, and the high order bits of the
>+	 * address are the PE number.  Therefore, each VF's BAR is in a
>+	 * separate PE, and changing the IOV BAR start address changes the
>+	 * range of PEs the VFs are in.
>+	 */
>+	vf_num = pdn->vf_pes;		// FIXME not defined yet

Do you want me to pull your pci/virtualization branch and fix this?

>+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>+		res = &dev->resource[i + PCI_IOV_RESOURCES];
>+		if (!res->flags || !res->parent)
>+			continue;
>+
>+		if (!pnv_pci_is_mem_pref_64(res->flags))
>+			continue;
>+
>+		/*
>+		 * The actual IOV BAR range is determined by the start address
>+		 * and the actual size for vf_num VFs BAR.  This check is to
>+		 * make sure that after shifting, the range will not overlap
>+		 * with another device.
>+		 */
>+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>+		res2.flags = res->flags;
>+		res2.start = res->start + (size * offset);
>+		res2.end = res2.start + (size * vf_num) - 1;
>+
>+		if (res2.end > res->end) {
>+			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
>+				i, &res2, res, vf_num, offset);
>+			return -EBUSY;
>+		}
>+	}
>+
>+	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>+		res = &dev->resource[i + PCI_IOV_RESOURCES];
>+		if (!res->flags || !res->parent)
>+			continue;
>+
>+		if (!pnv_pci_is_mem_pref_64(res->flags))
>+			continue;
>+
>+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>+		res2 = *res;
>+		res->start += size * offset;
>+
>+		dev_info(&dev->dev, "VF BAR%d: %pR shifted to %pR (enabling %d VFs shifted by %d)\n",
>+			 i, &res2, res, vf_num, offset);
>+		pci_update_resource(dev, i + PCI_IOV_RESOURCES);
>+	}
>+	pdn->max_vfs -= offset;
>+	return 0;
>+}
>+#endif /* CONFIG_PCI_IOV */
>+
> #if 0
> static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
> {

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
  2015-02-04  3:34                 ` Wei Yang
@ 2015-02-04 14:19                   ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04 14:19 UTC (permalink / raw)
  To: Wei Yang; +Cc: Gavin Shan, Benjamin Herrenschmidt, linux-pci, linuxppc-dev

On Tue, Feb 3, 2015 at 9:34 PM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote:
> On Tue, Feb 03, 2015 at 06:19:26PM -0600, Bjorn Helgaas wrote:
>>On Tue, Feb 03, 2015 at 03:01:43PM +0800, Wei Yang wrote:

>>> +    vf_num = pdn->vf_pes;
>>
>>I can't actually build this, but I don't think pdn->vf_pes is defined yet.
>>
>
> pdn->vf_pes is defined in the next patch, so it is not defined yet at this point.
>
> I thought an incremental patch meant a patch on top of the current patch set,
> so it is defined in the last patch.

Yes, that's fine.  I want to keep the series bisectable, so I'll fold
these patches together.

>>I pushed an updated pci/virtualization branch with these updates.  I
>>think there's also a leak that needs to be fixed that Dan Carpenter
>>pointed out.

>>+      vf_num = pdn->vf_pes;           // FIXME not defined yet
>
> Do you want me to pull your pci/virtualization branch and fix this?

Don't worry about this FIXME; I can fix that by squashing two patches.
But please do pull my pci/virtualization branch and fix this one (this
is the one that I thought Dan Carpenter pointed out):

  drivers/pci/iov.c:488 sriov_init() warn: possible memory leak of 'iov'

^ permalink raw reply	[flat|nested] 168+ messages in thread


* Re: [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
  2015-02-04 14:19                   ` Bjorn Helgaas
@ 2015-02-04 15:20                     ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-04 15:20 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Wei Yang, Gavin Shan, Benjamin Herrenschmidt, linux-pci, linuxppc-dev

On Wed, Feb 04, 2015 at 08:19:14AM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 3, 2015 at 9:34 PM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote:
>> On Tue, Feb 03, 2015 at 06:19:26PM -0600, Bjorn Helgaas wrote:
>>>On Tue, Feb 03, 2015 at 03:01:43PM +0800, Wei Yang wrote:
>
>>>> +    vf_num = pdn->vf_pes;
>>>
>>>I can't actually build this, but I don't think pdn->vf_pes is defined yet.
>>>
>>
>> pdn->vf_pes is defined in the next patch, so it is not defined yet at this point.
>>
>> I thought an incremental patch meant a patch on top of the current patch set,
>> so it is defined in the last patch.
>
>Yes, that's fine.  I want to keep the series bisectable, so I'll fold
>these patches together.
>
>>>I pushed an updated pci/virtualization branch with these updates.  I
>>>think there's also a leak that needs to be fixed that Dan Carpenter
>>>pointed out.
>
>>>+      vf_num = pdn->vf_pes;           // FIXME not defined yet
>>
>> Do you want me to pull your pci/virtualization branch and fix this?
>
>Don't worry about this FIXME; I can fix that by squashing two patches.
>But please do pull my pci/virtualization branch and fix this one (this
>is the one that I thought Dan Carpenter pointed out):
>
>  drivers/pci/iov.c:488 sriov_init() warn: possible memory leak of 'iov'

Sure, let me take a look.

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread


* [PATCH] pci/iov: fix memory leak introduced in "PCI: Store individual VF BAR size in struct pci_sriov"
  2015-02-04 14:19                   ` Bjorn Helgaas
@ 2015-02-04 16:08                     ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-04 16:08 UTC (permalink / raw)
  To: gwshan, benh; +Cc: linux-pci, linuxppc-dev, Wei Yang

Bjorn, this fixes an error introduced in the patch "PCI: Store individual VF BAR
size in struct pci_sriov".

This patch is based on the pci/virtualization branch; I have verified that it
merges cleanly with the offending patch.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index d64b9df..721987b 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -430,10 +430,8 @@ found:
 	pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
 
 	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
-	if (!iov) {
-		rc = -ENOMEM;
-		goto failed;
-	}
+	if (!iov)
+		return -ENOMEM;
 
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
@@ -485,6 +483,8 @@ failed:
 		res->flags = 0;
 	}
 
+	kfree(iov);
+
 	return rc;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 168+ messages in thread


* Re: [PATCH] pci/iov: fix memory leak introduced in "PCI: Store individual VF BAR size in struct pci_sriov"
  2015-02-04 16:08                     ` Wei Yang
@ 2015-02-04 16:28                       ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04 16:28 UTC (permalink / raw)
  To: Wei Yang; +Cc: gwshan, benh, linux-pci, linuxppc-dev

On Thu, Feb 05, 2015 at 12:08:50AM +0800, Wei Yang wrote:
> Bjorn, this fixes an error introduced in the patch "PCI: Store individual VF BAR
> size in struct pci_sriov".
> 
> This patch is based on the pci/virtualization branch; I have verified that it
> merges cleanly with the offending patch.
> 
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>

Great, thanks.  I folded this into "PCI: Store individual VF BAR size in
struct pci_sriov".

> ---
>  drivers/pci/iov.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index d64b9df..721987b 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -430,10 +430,8 @@ found:
>  	pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
>  
>  	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
> -	if (!iov) {
> -		rc = -ENOMEM;
> -		goto failed;
> -	}
> +	if (!iov)
> +		return -ENOMEM;
>  
>  	nres = 0;
>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> @@ -485,6 +483,8 @@ failed:
>  		res->flags = 0;
>  	}
>  
> +	kfree(iov);
> +
>  	return rc;
>  }
>  
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 168+ messages in thread


* Re: [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
  2015-02-04  3:34                 ` Wei Yang
@ 2015-02-04 20:53                   ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04 20:53 UTC (permalink / raw)
  To: Wei Yang; +Cc: gwshan, benh, linux-pci, linuxppc-dev

On Wed, Feb 04, 2015 at 11:34:09AM +0800, Wei Yang wrote:
> On Tue, Feb 03, 2015 at 06:19:26PM -0600, Bjorn Helgaas wrote:
> >On Tue, Feb 03, 2015 at 03:01:43PM +0800, Wei Yang wrote:
> >> The actual IOV BAR range is determined by the start address and the actual
> >> size for vf_num VFs BAR. After shifting the IOV BAR, there would be a
> >> chance the actual end address exceed the limit and overlap with other
> >> devices.
> >> 
> >> This patch adds a check to make sure after shifting, the range will not
> >> overlap with other devices.
> >
> >I folded this into the previous patch (the one that adds
> >pnv_pci_vf_resource_shift()).  And I think that needs to be folded together
> >with the following one ("powerpc/powernv: Allocate VF PE") because this one
> >references pdn->vf_pes, which is added by "Allocate VF PE".
> >
> 
> Yes. Both need this.
> 
> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/platforms/powernv/pci-ioda.c |   53 ++++++++++++++++++++++++++---
> >>  1 file changed, 48 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> index 8456ae8..1a1e74b 100644
> >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> @@ -854,16 +854,18 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
> >>  }
> >>  
> >>  #ifdef CONFIG_PCI_IOV
> >> -static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> >> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> >>  {
> >>  	struct pci_dn *pdn = pci_get_pdn(dev);
> >>  	int i;
> >>  	struct resource *res;
> >>  	resource_size_t size;
> >> +	u16 vf_num;
> >>  
> >>  	if (!dev->is_physfn)
> >> -		return;
> >> +		return -EINVAL;
> >>  
> >> +	vf_num = pdn->vf_pes;
> >
> >I can't actually build this, but I don't think pdn->vf_pes is defined yet.
> >
> 
> pdn->vf_pes is defined in the next patch, so it is not defined yet at this point.
> 
> I thought an incremental patch meant a patch on top of the current patch set,
> so it is defined in the last patch.
> 
> >>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
> >>  		res = &dev->resource[i];
> >>  		if (!res->flags || !res->parent)
> >> @@ -875,11 +877,49 @@ static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> >>  		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
> >>  		size = pci_iov_resource_size(dev, i);
> >>  		res->start += size*offset;
> >> -
> >>  		dev_info(&dev->dev, "                 %pR\n", res);
> >> +
> >> +		/*
> >> +		 * The actual IOV BAR range is determined by the start address
> >> +		 * and the actual size for vf_num VFs BAR. The check here is
> >> +		 * to make sure after shifting, the range will not overlap
> >> +		 * with other device.
> >> +		 */
> >> +		if ((res->start + (size * vf_num)) > res->end) {
> >> +			dev_err(&dev->dev, "VF BAR%d: %pR will conflict with"
> >> +					" other device after shift\n");
> >
> >sriov_init() sets up "res" with enough space to contain TotalVF copies
> >of the VF BAR.  By the time we get here, that "res" is in the resource
> >tree, and you should be able to see it in /proc/iomem.
> >
> >For example, if TotalVFs is 128 and VF BAR0 is 1MB in size, the
> >resource size would be 128 * 1MB = 0x800_0000.  If the VF BAR0 in the
> >SR-IOV Capability contains a base address of 0x8000_0000, the resource
> >would be:
> >
> >  [mem 0x8000_0000-0x87ff_ffff]
> >
> >We have to assume there's another resource starting immediately after
> >this one, i.e., at 0x8800_0000, and we have to make sure that when we
> >change this resource and turn on SR-IOV, we don't overlap with it.
> >
> >The shifted resource will start at 0x8000_0000 + 1MB * "offset".  The
> >hardware will respond to a range whose size is 1MB * NumVFs (NumVFs
> >may be smaller than TotalVFs).
> >
> >If we enable 16 VFs and shift by 23, we set VF BAR0 to 0x8000_0000 +
> >1MB * 23 = 0x8170_0000, and the size is 1MB * 16 = 0x100_0000, so the
> >new resource will be:
> >
> >  [mem 0x8170_0000-0x826f_ffff]
> >
> >That's fine; it doesn't extend past the original end of 0x87ff_ffff.
> >But if we enable those same 16 VFs with a shift of 120, we set VF BAR0
> >to 0x8000_0000 + 1MB * 120 = 0x8780_0000, and the size stays the same,
> >so the new resource will be:
> >
> >  [mem 0x8780_0000-0x887f_ffff]
> >
> >and that's a problem because we have two devices responding at
> >0x8800_0000.
> >
> >Your test of "res->start + (size * vf_num)) > res->end" is not strict
> >enough to catch this problem.
> >
> 
> Yep, you are right.
> 
> >I think we need something like the patch below.  I restructured it so
> >we don't have to back out any resource changes if we fail.
> >
> >This shifting strategy seems to imply that the closer NumVFs is to
> >TotalVFs, the less flexibility you have to assign PEs, e.g., if NumVFs
> >== TotalVFs, you wouldn't be able to shift at all.  In this example,
> >you could shift by anything from 0 to 128 - 16 = 112, but if you
> >wanted NumVFs = 64, you could only shift by 0 to 64.  Is that true?
> >
> >I think your M64 BAR gets split into 256 segments, regardless of what
> >TotalVFs is, so if you expanded the resource to 256 * 1MB for this
> >example, you would be able to shift by up to 256 - NumVFs.  Do you
> >actually do this somewhere?
> >
> 
> Yes, after expanding the resource to 256 * 1MB, it is possible to shift by up
> to 256 - NumVFs. 

Oh, I see where the expansion happens.  We started in sriov_init() with:

  res->end = res->start + resource_size(res) * total - 1;

where "total" is TotalVFs, and you expand it to the maximum number of PEs
in pnv_pci_ioda_fixup_iov_resources():

  res->end = res->start + size * phb->ioda.total_pe - 1;

in this path:

  pcibios_scan_phb
    pci_create_root_bus
    pci_scan_child_bus
      ...
        sriov_init
	  res->end = res->start + ...	# as above
    ppc_md.pcibios_fixup_sriov		# pnv_pci_ioda_fixup_sriov
    pnv_pci_ioda_fixup_sriov(bus)
      list_for_each_entry(dev, &bus->devices, ...)
        if (dev->subordinate)
	  pnv_pci_ioda_fixup_sriov(dev->subordinate)	# recurse
        pnv_pci_ioda_fixup_iov_resources(dev)
	  res->end = res->start + ...	# fixup

I think this will be cleaner if you add an arch interface for use by
sriov_init(), e.g.,

  resource_size_t __weak pcibios_iov_size(struct pci_dev *dev, int resno)
  {
    struct resource *res = &dev->resource[resno + PCI_IOV_RESOURCES];

    return resource_size(res) * dev->iov->total_VFs;
  }

>   static int sriov_init(...)
  {
    ...
    res->end = res->start + pcibios_iov_size(dev, i) - 1;
    ...
  }

and powerpc could override this.  That way we would set the size once and
we wouldn't need a fixup pass, which will keep the pcibios_scan_phb() code
similar to the common path in pci_scan_root_bus().
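
Outside the kernel, the hook suggested above can be sketched as a weak default
plus an arch override (the struct and function names below are simplified
stand-ins for the real pci_dev/pci_sriov types, for illustration only):

```c
#include <assert.h>
#include <stdint.h>

struct fake_dev { uint64_t vf_bar_size; int total_vfs; };

/* generic default: room for TotalVFs copies of the VF BAR */
__attribute__((weak)) uint64_t pcibios_iov_size(struct fake_dev *dev)
{
	return dev->vf_bar_size * (uint64_t)dev->total_vfs;
}

/* what a powernv override could do: one segment per PE, so the IOV BAR
 * covers all of the phb's total_pe (256) segments up front and no later
 * fixup pass is needed */
uint64_t powernv_iov_size(struct fake_dev *dev, int total_pe)
{
	return dev->vf_bar_size * (uint64_t)total_pe;
}
```

With a 1MB VF BAR and TotalVFs 128, the default sizes the resource to 128MB,
while the powernv-style override expands it to 256MB (256 segments).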

> But currently, on my system, I don't see a case that really does this.
> 
> On my system, there is an Emulex card with 4 PFs.
> 
> 0006:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
> 0006:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
> 0006:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
> 0006:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
> 
> The max VFs for them are 80, 80, 20, 20, with total number of 200 VFs.
> 
> be2net 0006:01:00.0:  Shifting VF BAR [mem 0x3d40 1000 0000 - 0x3d40 10ff ffff 64bit pref] to 256 segs
> be2net 0006:01:00.0:                  [mem 0x3d40 1003 0000 - 0x3d40 10ff ffff 64bit pref]    253 segs offset 3
> PE range [3 - 82]
> be2net 0006:01:00.1:  Shifting VF BAR [mem 0x3d40 1100 0000 - 0x3d40 11ff ffff 64bit pref] to 256 segs
> be2net 0006:01:00.1:                  [mem 0x3d40 1153 0000 - 0x3d40 11ff ffff 64bit pref]    173 segs offset 83
> PE range [83 - 162]
> be2net 0006:01:00.2:  Shifting VF BAR [mem 0x3d40 1200 0000 - 0x3d40 12ff ffff 64bit pref] to 256 segs
> be2net 0006:01:00.2:                  [mem 0x3d40 12a3 0000 - 0x3d40 12ff ffff 64bit pref]    93  segs offset 163
> PE range [163 - 182]
> be2net 0006:01:00.3:  Shifting VF BAR [mem 0x3d40 1300 0000 - 0x3d40 13ff ffff 64bit pref] to 256 segs
> be2net 0006:01:00.3:                  [mem 0x3d40 13b7 0000 - 0x3d40 13ff ffff 64bit pref]    73  segs offset 183
> PE range [183 - 202]
> 
> After enable the max number of VFs, even the last VF still has 73 number VF
> BAR size. So this not trigger the limit, but proves the shift offset could be
> larger than (TotalVFs - NumVFs).

You expanded the overall resource from "TotalVFs * size" to "256 * size".
So the offset can be larger than "TotalVFs - NumVFs" but it still cannot be
larger than "256 - NumVFs".  The point is that the range claimed by the
hardware cannot extend past the range we told the resource tree about.
That's what the "if (res2.end > res->end)" test is checking.

Normally we compute res->end based on TotalVFs.  For PHB3, you compute
res->end based on 256.  Either way, we need to make sure we don't program
the BAR with an address that causes the hardware to respond to addresses
past res->end.

Bjorn

> >+		/*
> >+		 * The actual IOV BAR range is determined by the start address
> >+		 * and the actual size for vf_num VFs BAR.  This check is to
> >+		 * make sure that after shifting, the range will not overlap
> >+		 * with another device.
> >+		 */
> >+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> >+		res2.flags = res->flags;
> >+		res2.start = res->start + (size * offset);
> >+		res2.end = res2.start + (size * vf_num) - 1;
> >+
> >+		if (res2.end > res->end) {
> >+			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
> >+				i, &res2, res, vf_num, offset);
> >+			return -EBUSY;
> >+		}

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
@ 2015-02-04 20:53                   ` Bjorn Helgaas
  0 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04 20:53 UTC (permalink / raw)
  To: Wei Yang; +Cc: linux-pci, benh, linuxppc-dev, gwshan

On Wed, Feb 04, 2015 at 11:34:09AM +0800, Wei Yang wrote:
> On Tue, Feb 03, 2015 at 06:19:26PM -0600, Bjorn Helgaas wrote:
> >On Tue, Feb 03, 2015 at 03:01:43PM +0800, Wei Yang wrote:
> >> The actual IOV BAR range is determined by the start address and the actual
> >> size for vf_num VFs BAR. After shifting the IOV BAR, there would be a
> >> chance the actual end address exceed the limit and overlap with other
> >> devices.
> >> 
> >> This patch adds a check to make sure after shifting, the range will not
> >> overlap with other devices.
> >
> >I folded this into the previous patch (the one that adds
> >pnv_pci_vf_resource_shift()).  And I think that needs to be folded together
> >with the following one ("powerpc/powernv: Allocate VF PE") because this one
> >references pdn->vf_pes, which is added by "Allocate VF PE".
> >
> 
> Yes. Both need this.
> 
> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/platforms/powernv/pci-ioda.c |   53 ++++++++++++++++++++++++++---
> >>  1 file changed, 48 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> index 8456ae8..1a1e74b 100644
> >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> @@ -854,16 +854,18 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
> >>  }
> >>  
> >>  #ifdef CONFIG_PCI_IOV
> >> -static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> >> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> >>  {
> >>  	struct pci_dn *pdn = pci_get_pdn(dev);
> >>  	int i;
> >>  	struct resource *res;
> >>  	resource_size_t size;
> >> +	u16 vf_num;
> >>  
> >>  	if (!dev->is_physfn)
> >> -		return;
> >> +		return -EINVAL;
> >>  
> >> +	vf_num = pdn->vf_pes;
> >
> >I can't actually build this, but I don't think pdn->vf_pes is defined yet.
> >
> 
> pdn->vf_pes is defined in the next patch; it is not defined yet at this point.
> 
> I thought an incremental patch meant a patch on top of the current patch set,
> so it is defined by a later patch.
> 
> >>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
> >>  		res = &dev->resource[i];
> >>  		if (!res->flags || !res->parent)
> >> @@ -875,11 +877,49 @@ static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
> >>  		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
> >>  		size = pci_iov_resource_size(dev, i);
> >>  		res->start += size*offset;
> >> -
> >>  		dev_info(&dev->dev, "                 %pR\n", res);
> >> +
> >> +		/*
> >> +		 * The actual IOV BAR range is determined by the start address
> >> +		 * and the actual size for vf_num VFs BAR. The check here is
> >> +		 * to make sure after shifting, the range will not overlap
> >> +		 * with other device.
> >> +		 */
> >> +		if ((res->start + (size * vf_num)) > res->end) {
> >> +			dev_err(&dev->dev, "VF BAR%d: %pR will conflict with"
> >> +					" other device after shift\n");
> >
> >sriov_init() sets up "res" with enough space to contain TotalVF copies
> >of the VF BAR.  By the time we get here, that "res" is in the resource
> >tree, and you should be able to see it in /proc/iomem.
> >
> >For example, if TotalVFs is 128 and VF BAR0 is 1MB in size, the
> >resource size would be 128 * 1MB = 0x800_0000.  If the VF BAR0 in the
> >SR-IOV Capability contains a base address of 0x8000_0000, the resource
> >would be:
> >
> >  [mem 0x8000_0000-0x87ff_ffff]
> >
> >We have to assume there's another resource starting immediately after
> >this one, i.e., at 0x8800_0000, and we have to make sure that when we
> >change this resource and turn on SR-IOV, we don't overlap with it.
> >
> >The shifted resource will start at 0x8000_0000 + 1MB * "offset".  The
> >hardware will respond to a range whose size is 1MB * NumVFs (NumVFs
> >may be smaller than TotalVFs).
> >
> >If we enable 16 VFs and shift by 23, we set VF BAR0 to 0x8000_0000 +
> >1MB * 23 = 0x8170_0000, and the size is 1MB * 16 = 0x100_0000, so the
> >new resource will be:
> >
> >  [mem 0x8170_0000-0x826f_ffff]
> >
> >That's fine; it doesn't extend past the original end of 0x87ff_ffff.
> >But if we enable those same 16 VFs with a shift of 120, we set VF BAR0
> >to 0x8000_0000 + 1MB * 120 = 0x8780_0000, and the size stays the same,
> >so the new resource will be:
> >
> >  [mem 0x8780_0000-0x887f_ffff]
> >
> >and that's a problem because we have two devices responding at
> >0x8800_0000.
> >
> >Your test of "res->start + (size * vf_num)) > res->end" is not strict
> >enough to catch this problem.
> >
> 
> Yep, you are right.
> 
> >I think we need something like the patch below.  I restructured it so
> >we don't have to back out any resource changes if we fail.
> >
> >This shifting strategy seems to imply that the closer NumVFs is to
> >TotalVFs, the less flexibility you have to assign PEs, e.g., if NumVFs
> >== TotalVFs, you wouldn't be able to shift at all.  In this example,
> >you could shift by anything from 0 to 128 - 16 = 112, but if you
> >wanted NumVFs = 64, you could only shift by 0 to 64.  Is that true?
> >
> >I think your M64 BAR gets split into 256 segments, regardless of what
> >TotalVFs is, so if you expanded the resource to 256 * 1MB for this
> >example, you would be able to shift by up to 256 - NumVFs.  Do you
> >actually do this somewhere?
> >
> 
> Yes, after expanding the resource to 256 * 1MB, it is able to shift up to 
> 256 - NumVFs. 

Oh, I see where the expansion happens.  We started in sriov_init() with:

  res->end = res->start + resource_size(res) * total - 1;

where "total" is TotalVFs, and you expand it to the maximum number of PEs
in pnv_pci_ioda_fixup_iov_resources():

  res->end = res->start + size * phb->ioda.total_pe - 1;

in this path:

  pcibios_scan_phb
    pci_create_root_bus
    pci_scan_child_bus
      ...
        sriov_init
	  res->end = res->start + ...	# as above
    ppc_md.pcibios_fixup_sriov		# pnv_pci_ioda_fixup_sriov
    pnv_pci_ioda_fixup_sriov(bus)
      list_for_each_entry(dev, &bus->devices, ...)
        if (dev->subordinate)
	  pnv_pci_ioda_fixup_sriov(dev->subordinate)	# recurse
        pnv_pci_ioda_fixup_iov_resources(dev)
	  res->end = res->start + ...	# fixup

I think this will be cleaner if you add an arch interface for use by
sriov_init(), e.g.,

  resource_size_t __weak pcibios_iov_size(struct pci_dev *dev, int resno)
  {
    struct resource *res = &dev->resource[resno + PCI_IOV_RESOURCES];

    return resource_size(res) * dev->iov->total_VFs;
  }

  static int sriov_init(...)
  {
    ...
    res->end = res->start + pcibios_iov_size(dev, i) - 1;
    ...
  }

and powerpc could override this.  That way we would set the size once and
we wouldn't need a fixup pass, which will keep the pcibios_scan_phb() code
similar to the common path in pci_scan_root_bus().
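The weak-default/arch-override pattern Bjorn is proposing can be sketched outside the kernel like this (the types, field names, and the dispatch on `total_pe` are illustrative stand-ins, not the real kernel interfaces):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t resource_size_t;

/* Illustrative stand-in for the pieces of struct pci_dev involved. */
struct iov_dev {
	resource_size_t vf_bar_size;	/* size of one VF BAR copy */
	uint16_t total_VFs;		/* TotalVFs from the SR-IOV capability */
	uint16_t total_pe;		/* PHB3 PE count; 0 on other platforms */
};

/* Weak default: reserve TotalVFs copies of the VF BAR, as sriov_init() does. */
static resource_size_t default_iov_size(const struct iov_dev *dev)
{
	return dev->vf_bar_size * dev->total_VFs;
}

/* powerpc override: reserve one VF BAR copy per PE so each VF gets its own PE. */
static resource_size_t pnv_iov_size(const struct iov_dev *dev)
{
	return dev->vf_bar_size * dev->total_pe;
}

/* sriov_init() would call one hook and size the resource exactly once. */
static resource_size_t iov_resource_size(const struct iov_dev *dev)
{
	return dev->total_pe ? pnv_iov_size(dev) : default_iov_size(dev);
}
```

With this shape, the resource is sized correctly the first time and the recursive fixup pass over the bus tree is no longer needed.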

> But currently, on my system, I don't see a case that really does this.
> 
> On my system, there is an Emulex card with 4 PFs.
> 
> 0006:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
> 0006:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
> 0006:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
> 0006:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
> 
> The max VFs for them are 80, 80, 20, 20, with total number of 200 VFs.
> 
> be2net 0006:01:00.0:  Shifting VF BAR [mem 0x3d40 1000 0000 - 0x3d40 10ff ffff 64bit pref] to 256 segs
> be2net 0006:01:00.0:                  [mem 0x3d40 1003 0000 - 0x3d40 10ff ffff 64bit pref]    253 segs offset 3
> PE range [3 - 82]
> be2net 0006:01:00.1:  Shifting VF BAR [mem 0x3d40 1100 0000 - 0x3d40 11ff ffff 64bit pref] to 256 segs
> be2net 0006:01:00.1:                  [mem 0x3d40 1153 0000 - 0x3d40 11ff ffff 64bit pref]    173 segs offset 83
> PE range [83 - 162]
> be2net 0006:01:00.2:  Shifting VF BAR [mem 0x3d40 1200 0000 - 0x3d40 12ff ffff 64bit pref] to 256 segs
> be2net 0006:01:00.2:                  [mem 0x3d40 12a3 0000 - 0x3d40 12ff ffff 64bit pref]    93  segs offset 163
> PE range [163 - 182]
> be2net 0006:01:00.3:  Shifting VF BAR [mem 0x3d40 1300 0000 - 0x3d40 13ff ffff 64bit pref] to 256 segs
> be2net 0006:01:00.3:                  [mem 0x3d40 13b7 0000 - 0x3d40 13ff ffff 64bit pref]    73  segs offset 183
> PE range [183 - 202]
> 
> After enabling the max number of VFs, even the last PF still has 73 VF BAR
> segments left. So this does not trigger the limit, but it proves the shift
> offset can be larger than (TotalVFs - NumVFs).

You expanded the overall resource from "TotalVFs * size" to "256 * size".
So the offset can be larger than "TotalVFs - NumVFs" but it still cannot be
larger than "256 - NumVFs".  The point is that the range claimed by the
hardware cannot extend past the range we told the resource tree about.
That's what the "if (res2.end > res->end)" test is checking.

Normally we compute res->end based on TotalVFs.  For PHB3, you compute
res->end based on 256.  Either way, we need to make sure we don't program
the BAR with an address that causes the hardware to respond to addresses
past res->end.
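The range check under discussion reduces to a few lines of arithmetic. A minimal sketch, using Bjorn's numbers (the function name and parameters are illustrative; `segments` is whatever res->end was computed from: TotalVFs normally, 256 on PHB3):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t resource_size_t;

/*
 * Check that a shifted VF BAR stays inside the resource the kernel
 * claimed.  Mirrors the "if (res2.end > res->end)" test in
 * pnv_pci_vf_resource_shift(): the hardware responds to
 * num_vfs segments starting at start + size * offset, and that
 * range must not extend past start + size * segments - 1.
 */
static int shift_fits(resource_size_t start, resource_size_t size,
		      unsigned int segments, unsigned int offset,
		      unsigned int num_vfs)
{
	resource_size_t res_end = start + size * segments - 1;
	resource_size_t res2_start = start + size * offset;
	resource_size_t res2_end = res2_start + size * num_vfs - 1;

	return res2_end <= res_end;
}
```

For the 128-segment example above, 16 VFs shifted by 23 fit (ends at 0x826f_ffff), while 16 VFs shifted by 120 do not (ends at 0x887f_ffff, past 0x87ff_ffff); expanding to 256 segments makes the shift of 120 legal again.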

Bjorn

> >+		/*
> >+		 * The actual IOV BAR range is determined by the start address
> >+		 * and the actual size for vf_num VFs BAR.  This check is to
> >+		 * make sure that after shifting, the range will not overlap
> >+		 * with another device.
> >+		 */
> >+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> >+		res2.flags = res->flags;
> >+		res2.start = res->start + (size * offset);
> >+		res2.end = res2.start + (size * vf_num) - 1;
> >+
> >+		if (res2.end > res->end) {
> >+			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
> >+				i, &res2, res, vf_num, offset);
> >+			return -EBUSY;
> >+		}

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 12/17] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2015-01-15  2:28         ` Wei Yang
@ 2015-02-04 21:26           ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04 21:26 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Thu, Jan 15, 2015 at 10:28:02AM +0800, Wei Yang wrote:
> On PHB3, the PF IOV BAR will be covered by an M64 BAR for better PE isolation.
> Usually the total_pe number differs from total_VFs, which leads to a
> mismatch between the MMIO space and the PE numbering.
> 
> For example, if total_VFs is 128 and total_pe is 256, the second half of the
> M64 BAR space will cover other PCI devices, which may already belong
> to other PEs.
> 
> This patch reserves additional space for the PF IOV BAR: total_pe
> copies of the VF BAR size. Doing so prevents the conflict.
> 
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/machdep.h        |    4 ++
>  arch/powerpc/include/asm/pci-bridge.h     |    3 ++
>  arch/powerpc/kernel/pci-common.c          |    5 +++
>  arch/powerpc/platforms/powernv/pci-ioda.c |   59 +++++++++++++++++++++++++++++
>  4 files changed, 71 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index c8175a3..965547c 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -250,6 +250,10 @@ struct machdep_calls {
>  	/* Reset the secondary bus of bridge */
>  	void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
>  
> +#ifdef CONFIG_PCI_IOV
> +	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
> +#endif /* CONFIG_PCI_IOV */
> +
>  	/* Called to shutdown machine specific hardware not already controlled
>  	 * by other drivers.
>  	 */
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index 334e745..b857ec4 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -170,6 +170,9 @@ struct pci_dn {
>  #define IODA_INVALID_PE		(-1)
>  #ifdef CONFIG_PPC_POWERNV
>  	int	pe_number;
> +#ifdef CONFIG_PCI_IOV
> +	u16     max_vfs;		/* number of VFs IOV BAR expended */
> +#endif /* CONFIG_PCI_IOV */
>  #endif
>  	struct list_head child_list;
>  	struct list_head list;
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index 889f743..832b7e1 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -1636,6 +1636,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
>  	if (ppc_md.pcibios_fixup_phb)
>  		ppc_md.pcibios_fixup_phb(hose);
>  
> +#ifdef CONFIG_PCI_IOV
> +	if (ppc_md.pcibios_fixup_sriov)
> +		ppc_md.pcibios_fixup_sriov(bus);
> +#endif /* CONFIG_PCI_IOV */
> +
>  	/* Configure PCI Express settings */
>  	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
>  		struct pci_bus *child;
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 31335a7..6704fdf 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1721,6 +1721,62 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
>  static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { }
>  #endif /* CONFIG_PCI_MSI */
>  
> +#ifdef CONFIG_PCI_IOV
> +static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
> +{
> +	struct pci_controller *hose;
> +	struct pnv_phb *phb;
> +	struct resource *res;
> +	int i;
> +	resource_size_t size;
> +	struct pci_dn *pdn;
> +
> +	if (!pdev->is_physfn || pdev->is_added)
> +		return;
> +
> +	hose = pci_bus_to_host(pdev->bus);
> +	phb = hose->private_data;
> +
> +	pdn = pci_get_pdn(pdev);
> +	pdn->max_vfs = 0;

^^^ point A

> +
> +	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
> +		res = &pdev->resource[i];
> +		if (!res->flags || res->parent)
> +			continue;
> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
> +			dev_warn(&pdev->dev, " Skipping expanding IOV BAR %pR on %s\n",
> +				 res, pci_name(pdev));
> +			continue;
> +		}
> +
> +		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
> +		size = pci_iov_resource_size(pdev, i);
> +		res->end = res->start + size * phb->ioda.total_pe - 1;
> +		dev_dbg(&pdev->dev, "                       %pR\n", res);
> +		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
> +				i - PCI_IOV_RESOURCES,
> +				res, phb->ioda.total_pe);
> +	}
> +	pdn->max_vfs = phb->ioda.total_pe;

I don't think you change phb->ioda.total_pe between point A (above) and
here.  Does that mean "max_vfs == 0" is some kind of flag, maybe related to
the three cases in pnv_pci_iov_resource_alignment()?  Ewww!

If it doesn't change, just set "pdn->max_vfs = phb->ioda.total_pe" at
point A and be done with it.

But hopefully you can just remove this fixup path altogether.
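The expansion done by this fixup boils down to recomputing res->end from the PE count instead of TotalVFs. As an isolated sketch (pure helper functions, names illustrative; the kernel versions mutate the resource in place):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t resource_size_t;

/* sriov_init(): the IOV BAR covers TotalVFs copies of the VF BAR. */
static resource_size_t end_for_total_vfs(resource_size_t start,
					 resource_size_t vf_size,
					 unsigned int total_vfs)
{
	return start + vf_size * total_vfs - 1;
}

/*
 * Fixup: expand the same resource to one VF BAR copy per PE, so the
 * M64 BAR's fixed segmentation cannot spill into a neighbouring device.
 */
static resource_size_t end_for_total_pe(resource_size_t start,
					resource_size_t vf_size,
					unsigned int total_pe)
{
	return start + vf_size * total_pe - 1;
}
```

With a 1MB VF BAR at 0x8000_0000, TotalVFs = 128 sizes the resource to end at 0x87ff_ffff, and expanding to total_pe = 256 moves the end to 0x8fff_ffff.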

> +}
> +
> +static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
> +{
> +	struct pci_dev *pdev;
> +	struct pci_bus *b;
> +
> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
> +		b = pdev->subordinate;
> +
> +		if (b)
> +			pnv_pci_ioda_fixup_sriov(b);
> +
> +		pnv_pci_ioda_fixup_iov_resources(pdev);
> +	}
> +}
> +#endif /* CONFIG_PCI_IOV */
> +
>  /*
>   * This function is supposed to be called on basis of PE from top
>   * to bottom style. So the the I/O or MMIO segment assigned to
> @@ -2097,6 +2153,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
>  	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
>  	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
> +#ifdef CONFIG_PCI_IOV
> +	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
> +#endif /* CONFIG_PCI_IOV */
>  	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
>  
>  	/* Reset IODA tables to a clean state */
> -- 
> 1.7.9.5
> 

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 13/17] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv
  2015-01-15  2:28         ` Wei Yang
@ 2015-02-04 21:26           ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04 21:26 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Thu, Jan 15, 2015 at 10:28:03AM +0800, Wei Yang wrote:
> This patch implements the pcibios_iov_resource_alignment() on powernv
> platform.
> 
> On PowerNV platform, there are 3 cases for the IOV BAR:
> 1. initial state, the IOV BAR size is multiple times of VF BAR size
> 2. after expanded, the IOV BAR size is expanded to meet the M64 segment size
> 3. sizing stage, the IOV BAR is truncated to 0
> 
> pnv_pci_iov_resource_alignment() handle these three cases respectively.
>
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/machdep.h        |    3 +++
>  arch/powerpc/kernel/pci-common.c          |   14 ++++++++++++++
>  arch/powerpc/platforms/powernv/pci-ioda.c |   20 ++++++++++++++++++++
>  3 files changed, 37 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index 965547c..12e8eb8 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -252,6 +252,9 @@ struct machdep_calls {
>  
>  #ifdef CONFIG_PCI_IOV
>  	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
> +	resource_size_t (*pcibios_iov_resource_alignment)(struct pci_dev *,
> +			                                    int resno,
> +							    resource_size_t align);
>  #endif /* CONFIG_PCI_IOV */
>  
>  	/* Called to shutdown machine specific hardware not already controlled
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index 832b7e1..8751dfb 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -130,6 +130,20 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
>  	pci_reset_secondary_bus(dev);
>  }
>  
> +#ifdef CONFIG_PCI_IOV
> +resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev,
> +						 int resno,
> +						 resource_size_t align)
> +{
> +	if (ppc_md.pcibios_iov_resource_alignment)
> +		return ppc_md.pcibios_iov_resource_alignment(pdev,
> +							       resno,
> +							       align);
> +
> +	return 0;

This isn't right, is it?  The default (weak) version returns
pci_iov_resource_size(dev, resno).  When you don't have a
ppc_md.pcibios_iov_resource_alignment pointer, don't you
want to do that, too?
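A sketch of the fallback Bjorn is suggesting: when no platform hook is registered, return the per-VF resource size rather than 0. The types and the size lookup below are simplified stand-ins for `struct pci_dev` and `pci_iov_resource_size()`, not the real kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t resource_size_t;

struct pdev;	/* opaque stand-in for struct pci_dev */

/* Simplified stand-in for pci_iov_resource_size(). */
static resource_size_t pci_iov_resource_size_stub(struct pdev *pdev, int resno)
{
	(void)pdev; (void)resno;
	return 0x100000;	/* pretend each VF BAR is 1MB */
}

/* Optional platform hook, as in ppc_md; NULL when the arch sets none. */
static resource_size_t (*md_iov_align)(struct pdev *, int, resource_size_t);

static resource_size_t iov_resource_alignment(struct pdev *pdev, int resno,
					      resource_size_t align)
{
	if (md_iov_align)
		return md_iov_align(pdev, resno, align);

	/* Fall back to the per-VF size, matching the weak default. */
	return pci_iov_resource_size_stub(pdev, resno);
}
```

Returning 0 from the no-hook path would give the IOV resource no alignment at all, which is why the weak default's behavior is the right fallback.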

> +}
> +#endif /* CONFIG_PCI_IOV */
> +
>  static resource_size_t pcibios_io_size(const struct pci_controller *hose)
>  {
>  #ifdef CONFIG_PPC64
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 6704fdf..8bad2b0 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1953,6 +1953,25 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
>  	return phb->ioda.io_segsize;
>  }
>  
> +#ifdef CONFIG_PCI_IOV
> +static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
> +							    int resno,
> +							    resource_size_t align)
> +{
> +	struct pci_dn *pdn = pci_get_pdn(pdev);
> +	resource_size_t iov_align;
> +
> +	iov_align = resource_size(&pdev->resource[resno]);
> +	if (iov_align)
> +		return iov_align;
> +
> +	if (pdn->max_vfs)
> +		return pdn->max_vfs * align;
> +
> +	return align;

pcibios_iov_resource_alignment() returns different things depending on when
you call it?  That doesn't sound good.

Is this related to my questions about sriov_init() and
pnv_pci_ioda_fixup_iov_resources()?  If you adopted my suggestion and set
the size once in sriov_init(), would that get rid of one of these cases?

Maybe it would help me understand if you explained the three cases a bit
more.

> +}
> +#endif /* CONFIG_PCI_IOV */
> +
>  /* Prevent enabling devices for which we couldn't properly
>   * assign a PE
>   */
> @@ -2155,6 +2174,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
>  #ifdef CONFIG_PCI_IOV
>  	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
> +	ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
>  #endif /* CONFIG_PCI_IOV */
>  	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
>  
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 168+ messages in thread

>   * assign a PE
>   */
> @@ -2155,6 +2174,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
>  #ifdef CONFIG_PCI_IOV
>  	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
> +	ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
>  #endif /* CONFIG_PCI_IOV */
>  	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
>  
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 16/17] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported
  2015-01-15  2:28         ` Wei Yang
@ 2015-02-04 22:05           ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04 22:05 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Thu, Jan 15, 2015 at 10:28:06AM +0800, Wei Yang wrote:
> M64 aperture size is limited on PHB3. When the IOV BAR is too big, this
> will exceed the limitation and failed to be assigned.
> 
> This patch introduce a different mechanism based on the IOV BAR size:
> 
> IOV BAR size is smaller than 64M, expand to total_pe.
> IOV BAR size is bigger than 64M, roundup power2.

Can you elaborate on this a little more?  I assume this is talking about an
M64 BAR.  What is the size limit for this?

If I understand correctly, hardware always splits an M64 BAR into 256
segments.  If you wanted each 128MB VF BAR in a separate PE, the M64 BAR
would have to be 256 * 128MB = 32GB.  So maybe the idea is that instead of
consuming 32GB of address space, you let each VF BAR span several segments.
If each 128MB VF BAR spanned 4 segments, each segment would be 32MB and the
whole M64 BAR would be 256 * 32MB = 8GB.  But you would have to use
"companion" PEs so all 4 segments would be in the same "domain," and that
would reduce the number of VFs you could support from 256 to 256/4 = 64.

If that were the intent, I would think TotalVFs would be involved, but it's
not.  For a device with TotalVFs=8 and a 128MB VF BAR, it would make sense
to dedicate 4 segments to each of those BARs because you could only use 8*4
= 32 segments total.  If the device had TotalVFs=128, your only choices
would be 1 segment per BAR (256 segments * 128MB/segment = 32GB) or 2
segments per BAR (256 segments * 64MB/segment = 16GB).

If you use 2 segments per BAR and you want NumVFs=128, you can't shift
anything, so the PE assignments are completely fixed (VF0 in PE0, VF1 in
PE1, etc.)

It seems like you'd make different choices about pdn->max_vfs for these two
devices, but currently you only look at the VF BAR size, not at TotalVFs.

> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/pci-bridge.h     |    2 ++
>  arch/powerpc/platforms/powernv/pci-ioda.c |   33 ++++++++++++++++++++++++++---
>  2 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index d61c384..7156486 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -174,6 +174,8 @@ struct pci_dn {
>  	u16     max_vfs;		/* number of VFs IOV BAR expended */
>  	u16     vf_pes;			/* VF PE# under this PF */
>  	int     offset;			/* PE# for the first VF PE */
> +#define M64_PER_IOV 4
> +	int     m64_per_iov;
>  #define IODA_INVALID_M64        (-1)
>  	int     m64_wins[PCI_SRIOV_NUM_BARS];
>  #endif /* CONFIG_PCI_IOV */
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 94fe6e1..23ea873 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2180,6 +2180,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  	int i;
>  	resource_size_t size;
>  	struct pci_dn *pdn;
> +	int mul, total_vfs;
>  
>  	if (!pdev->is_physfn || pdev->is_added)
>  		return;
> @@ -2190,6 +2191,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  	pdn = pci_get_pdn(pdev);
>  	pdn->max_vfs = 0;
>  
> +	total_vfs = pci_sriov_get_totalvfs(pdev);
> +	pdn->m64_per_iov = 1;
> +	mul = phb->ioda.total_pe;
> +
> +	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
> +		res = &pdev->resource[i];
> +		if (!res->flags || res->parent)
> +			continue;
> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
> +			dev_warn(&pdev->dev, " non M64 IOV BAR %pR on %s\n",
> +					res, pci_name(pdev));
> +			continue;
> +		}
> +
> +		size = pci_iov_resource_size(pdev, i);
> +
> +		/* bigger than 64M */
> +		if (size > (1 << 26)) {
> +			dev_info(&pdev->dev, "PowerNV: VF BAR[%d] size "
> +					"is bigger than 64M, roundup power2\n", i);
> +			pdn->m64_per_iov = M64_PER_IOV;
> +			mul = __roundup_pow_of_two(total_vfs);

I think this might deserve more comment in dmesg.  "roundup power2" doesn't
really tell me anything, especially since you mention a BAR, but you're
actually rounding up total_vfs, not the BAR size.

Does this reduce the number of possible VFs?  We already can't set NumVFs
higher than TotalVFs.  Does this make it so we can't set NumVFs higher than
pdn->max_vfs?


> +			break;
> +		}
> +	}
> +
>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>  		res = &pdev->resource[i];
>  		if (!res->flags || res->parent)
> @@ -2202,13 +2229,13 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  
>  		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
>  		size = pci_iov_resource_size(pdev, i);
> -		res->end = res->start + size * phb->ioda.total_pe - 1;
> +		res->end = res->start + size * mul - 1;
>  		dev_dbg(&pdev->dev, "                       %pR\n", res);
>  		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
>  				i - PCI_IOV_RESOURCES,
> -				res, phb->ioda.total_pe);
> +				res, mul);
>  	}
> -	pdn->max_vfs = phb->ioda.total_pe;
> +	pdn->max_vfs = mul;

Maybe this is the part that makes it hard to compute the size in
sriov_init() -- you reduce pdn->max_vfs in some cases, and maybe that can
affect the space you reserve for IOV BARs?  E.g., maybe you reduce
pdn->max_vfs to 128 because VF BAR 3 is larger than 64MB, but you've
already expanded the IOV space for VF BAR 1 based on 256 PEs?

But then I'm still confused because the loop here in
pnv_pci_ioda_fixup_iov_resources() always expands the resource based on
phb->ioda.total_pe; that part doesn't depend on "mul" or pdn->max_vfs at
all.

Or maybe this is just a bug in the fixup loop, and it *should* depend on
"mul"?

>  }
>  
>  static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
> -- 
> 1.7.9.5
> 

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 13/17] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv
  2015-02-04 21:26           ` Bjorn Helgaas
@ 2015-02-04 22:45             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-04 22:45 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Wed, Feb 04, 2015 at 03:26:14PM -0600, Bjorn Helgaas wrote:
>On Thu, Jan 15, 2015 at 10:28:03AM +0800, Wei Yang wrote:
>> This patch implements the pcibios_iov_resource_alignment() on powernv
>> platform.
>> 
>> On PowerNV platform, there are 3 cases for the IOV BAR:
>> 1. initial state, the IOV BAR size is multiple times of VF BAR size
>> 2. after expanded, the IOV BAR size is expanded to meet the M64 segment size
>> 3. sizing stage, the IOV BAR is truncated to 0
>> 
>> pnv_pci_iov_resource_alignment() handle these three cases respectively.
>>
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/machdep.h        |    3 +++
>>  arch/powerpc/kernel/pci-common.c          |   14 ++++++++++++++
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   20 ++++++++++++++++++++
>>  3 files changed, 37 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
>> index 965547c..12e8eb8 100644
>> --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -252,6 +252,9 @@ struct machdep_calls {
>>  
>>  #ifdef CONFIG_PCI_IOV
>>  	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
>> +	resource_size_t (*pcibios_iov_resource_alignment)(struct pci_dev *,
>> +			                                    int resno,
>> +							    resource_size_t align);
>>  #endif /* CONFIG_PCI_IOV */
>>  
>>  	/* Called to shutdown machine specific hardware not already controlled
>> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
>> index 832b7e1..8751dfb 100644
>> --- a/arch/powerpc/kernel/pci-common.c
>> +++ b/arch/powerpc/kernel/pci-common.c
>> @@ -130,6 +130,20 @@ void pcibios_reset_secondary_bus(struct pci_dev *dev)
>>  	pci_reset_secondary_bus(dev);
>>  }
>>  
>> +#ifdef CONFIG_PCI_IOV
>> +resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev,
>> +						 int resno,
>> +						 resource_size_t align)
>> +{
>> +	if (ppc_md.pcibios_iov_resource_alignment)
>> +		return ppc_md.pcibios_iov_resource_alignment(pdev,
>> +							       resno,
>> +							       align);
>> +
>> +	return 0;
>
>This isn't right, is it?  The default (weak) version returns
>pci_iov_resource_size(dev, resno).  When you don't have a
>ppc_md.pcibios_iov_resource_alignment pointer, don't you
>want to do that, too?
>

You are right, this isn't correct.

It should return align here.

>> +}
>> +#endif /* CONFIG_PCI_IOV */
>> +
>>  static resource_size_t pcibios_io_size(const struct pci_controller *hose)
>>  {
>>  #ifdef CONFIG_PPC64
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 6704fdf..8bad2b0 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1953,6 +1953,25 @@ static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
>>  	return phb->ioda.io_segsize;
>>  }
>>  
>> +#ifdef CONFIG_PCI_IOV
>> +static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
>> +							    int resno,
>> +							    resource_size_t align)
>> +{
>> +	struct pci_dn *pdn = pci_get_pdn(pdev);
>> +	resource_size_t iov_align;
>> +
>> +	iov_align = resource_size(&pdev->resource[resno]);
>> +	if (iov_align)
>> +		return iov_align;
>> +
>> +	if (pdn->max_vfs)
>> +		return pdn->max_vfs * align;
>> +
>> +	return align;
>
>pcibios_iov_resource_alignment() returns different things depending on when
>you call it?  That doesn't sound good.
>

Agree, this is not a good way to address this problem.

>Is this related to my questions about sriov_init() and
>pnv_pci_ioda_fixup_iov_resources()?  If you adopted my suggestion and set
>the size once in sriov_init(), would that get rid of one of these cases?
>
>Maybe it would help me understand if you explained the three cases a bit
>more.

Sure, and this helps me too :)

First, pci_sriov_resource_alignment() returned the single VF BAR size in the
original version, and the purpose of introducing
pcibios_iov_resource_alignment() is to give the arch a chance to return a
different value. On the powernv platform, we want to return 256 * the single
VF BAR size.

This size is used in pbus_size_mem() for the sizing stage and in
pci_assign_unassigned_root_bus_resources() for the assigning stage. Normally,
on the powernv platform, it just needs to return the resource_size() of this
IOV BAR, but there is a problem in pci_assign_unassigned_root_bus_resources().

Since the IOV BAR is an "additional" resource, in pbus_size_mem() the resource
will be truncated to 0, so the first case will fail, and the size needs to be
calculated as max_vfs * the VF BAR size.

max_vfs is a field in pci_dn, which may not be set before the fixup is called.
Even though I currently don't see anyone asking for the IOV BAR alignment
before the fixup, I am not sure no one will do so in the future. I believe the
method you mentioned in another mail will solve this problem: when
sriov_init() returns, we know the exact multiple of the VF BAR size it
reserved.

The last case is there to make the logic tight: if both of the above cases
fail, it returns the original value.

I believe that after adopting your proposed method and fixing up the IOV BAR
in sriov_init(), we could just return max_vfs * the VF BAR size.

>
>> +}
>> +#endif /* CONFIG_PCI_IOV */
>> +
>>  /* Prevent enabling devices for which we couldn't properly
>>   * assign a PE
>>   */
>> @@ -2155,6 +2174,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>  	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
>>  #ifdef CONFIG_PCI_IOV
>>  	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
>> +	ppc_md.pcibios_iov_resource_alignment = pnv_pci_iov_resource_alignment;
>>  #endif /* CONFIG_PCI_IOV */
>>  	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
>>  
>> -- 
>> 1.7.9.5
>> 

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 12/17] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
  2015-02-04 21:26           ` Bjorn Helgaas
@ 2015-02-04 23:08             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-04 23:08 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Wed, Feb 04, 2015 at 03:26:07PM -0600, Bjorn Helgaas wrote:
>On Thu, Jan 15, 2015 at 10:28:02AM +0800, Wei Yang wrote:
>> On PHB3, PF IOV BAR will be covered by M64 BAR to have better PE isolation.
>> Mostly the total_pe number is different from the total_VFs, which will lead
>> to a conflict between MMIO space and the PE number.
>> 
>> For example, total_VFs is 128 and total_pe is 256, then the second half of
>> M64 BAR space will be part of other PCI device, which may already belongs
>> to other PEs.
>> 
>> This patch reserve additional space for the PF IOV BAR, which is total_pe
>> number of VF's BAR size. By doing so, it prevents the conflict.
>> 
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/machdep.h        |    4 ++
>>  arch/powerpc/include/asm/pci-bridge.h     |    3 ++
>>  arch/powerpc/kernel/pci-common.c          |    5 +++
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   59 +++++++++++++++++++++++++++++
>>  4 files changed, 71 insertions(+)
>> 
* Re: [PATCH V11 12/17] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe
@ 2015-02-04 23:08             ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-04 23:08 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev, gwshan

On Wed, Feb 04, 2015 at 03:26:07PM -0600, Bjorn Helgaas wrote:
>On Thu, Jan 15, 2015 at 10:28:02AM +0800, Wei Yang wrote:
>> On PHB3, a PF's IOV BAR will be covered by an M64 BAR to get better PE
>> isolation. Usually the total_pe number is different from total_VFs, which
>> leads to a conflict between the MMIO space and the PE number.
>> 
>> For example, if total_VFs is 128 and total_pe is 256, the second half of
>> the M64 BAR space will cover other PCI devices, which may already belong
>> to other PEs.
>> 
>> This patch reserves additional space for the PF's IOV BAR: total_pe times
>> the VF BAR size. By doing so, it prevents the conflict.
>> 
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/machdep.h        |    4 ++
>>  arch/powerpc/include/asm/pci-bridge.h     |    3 ++
>>  arch/powerpc/kernel/pci-common.c          |    5 +++
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   59 +++++++++++++++++++++++++++++
>>  4 files changed, 71 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
>> index c8175a3..965547c 100644
>> --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -250,6 +250,10 @@ struct machdep_calls {
>>  	/* Reset the secondary bus of bridge */
>>  	void  (*pcibios_reset_secondary_bus)(struct pci_dev *dev);
>>  
>> +#ifdef CONFIG_PCI_IOV
>> +	void (*pcibios_fixup_sriov)(struct pci_bus *bus);
>> +#endif /* CONFIG_PCI_IOV */
>> +
>>  	/* Called to shutdown machine specific hardware not already controlled
>>  	 * by other drivers.
>>  	 */
>> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>> index 334e745..b857ec4 100644
>> --- a/arch/powerpc/include/asm/pci-bridge.h
>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>> @@ -170,6 +170,9 @@ struct pci_dn {
>>  #define IODA_INVALID_PE		(-1)
>>  #ifdef CONFIG_PPC_POWERNV
>>  	int	pe_number;
>> +#ifdef CONFIG_PCI_IOV
>> +	u16     max_vfs;		/* number of VFs IOV BAR expanded */
>> +#endif /* CONFIG_PCI_IOV */
>>  #endif
>>  	struct list_head child_list;
>>  	struct list_head list;
>> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
>> index 889f743..832b7e1 100644
>> --- a/arch/powerpc/kernel/pci-common.c
>> +++ b/arch/powerpc/kernel/pci-common.c
>> @@ -1636,6 +1636,11 @@ void pcibios_scan_phb(struct pci_controller *hose)
>>  	if (ppc_md.pcibios_fixup_phb)
>>  		ppc_md.pcibios_fixup_phb(hose);
>>  
>> +#ifdef CONFIG_PCI_IOV
>> +	if (ppc_md.pcibios_fixup_sriov)
>> +		ppc_md.pcibios_fixup_sriov(bus);
>> +#endif /* CONFIG_PCI_IOV */
>> +
>>  	/* Configure PCI Express settings */
>>  	if (bus && !pci_has_flag(PCI_PROBE_ONLY)) {
>>  		struct pci_bus *child;
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 31335a7..6704fdf 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -1721,6 +1721,62 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
>>  static void pnv_pci_init_ioda_msis(struct pnv_phb *phb) { }
>>  #endif /* CONFIG_PCI_MSI */
>>  
>> +#ifdef CONFIG_PCI_IOV
>> +static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>> +{
>> +	struct pci_controller *hose;
>> +	struct pnv_phb *phb;
>> +	struct resource *res;
>> +	int i;
>> +	resource_size_t size;
>> +	struct pci_dn *pdn;
>> +
>> +	if (!pdev->is_physfn || pdev->is_added)
>> +		return;
>> +
>> +	hose = pci_bus_to_host(pdev->bus);
>> +	phb = hose->private_data;
>> +
>> +	pdn = pci_get_pdn(pdev);
>> +	pdn->max_vfs = 0;
>
>^^^ point A
>
>> +
>> +	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>> +		res = &pdev->resource[i];
>> +		if (!res->flags || res->parent)
>> +			continue;
>> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
>> +			dev_warn(&pdev->dev, " Skipping expanding IOV BAR %pR on %s\n",
>> +				 res, pci_name(pdev));
>> +			continue;
>> +		}
>> +
>> +		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
>> +		size = pci_iov_resource_size(pdev, i);
>> +		res->end = res->start + size * phb->ioda.total_pe - 1;
>> +		dev_dbg(&pdev->dev, "                       %pR\n", res);
>> +		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
>> +				i - PCI_IOV_RESOURCES,
>> +				res, phb->ioda.total_pe);
>> +	}
>> +	pdn->max_vfs = phb->ioda.total_pe;
>
>I don't think you change phb->ioda.total_pe between point A (above) and
>here.  Does that mean "max_vfs == 0" is some kind of flag, maybe related to
>the three cases in pnv_pci_iov_resource_alignment()?  Ewww!
>

Not a flag exactly.

The pci_dn is initialized to 0; I reset it to 0 explicitly because in the
hotplug case we need to redo all of this again.

Putting this in a function like pcibios_release_device() seems more
reasonable. I will think about it later.


>If it doesn't change, just set "pdn->max_vfs = phb->ioda.total_pe" at
>point A and be done with it.

This will change, as you can see in another patch.
I will reply to that mail with the details.

>
>But hopefully you can just remove this fixup path altogether.
>

Agree.

Using a weak function, as you proposed, is a better way.

>> +}
>> +
>> +static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
>> +{
>> +	struct pci_dev *pdev;
>> +	struct pci_bus *b;
>> +
>> +	list_for_each_entry(pdev, &bus->devices, bus_list) {
>> +		b = pdev->subordinate;
>> +
>> +		if (b)
>> +			pnv_pci_ioda_fixup_sriov(b);
>> +
>> +		pnv_pci_ioda_fixup_iov_resources(pdev);
>> +	}
>> +}
>> +#endif /* CONFIG_PCI_IOV */
>> +
>>  /*
>>   * This function is supposed to be called on basis of PE from top
>>   * to bottom style. So the I/O or MMIO segment assigned to
>> @@ -2097,6 +2153,9 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>>  	ppc_md.pcibios_enable_device_hook = pnv_pci_enable_device_hook;
>>  	ppc_md.pcibios_window_alignment = pnv_pci_window_alignment;
>>  	ppc_md.pcibios_reset_secondary_bus = pnv_pci_reset_secondary_bus;
>> +#ifdef CONFIG_PCI_IOV
>> +	ppc_md.pcibios_fixup_sriov = pnv_pci_ioda_fixup_sriov;
>> +#endif /* CONFIG_PCI_IOV */
>>  	pci_add_flags(PCI_REASSIGN_ALL_RSRC);
>>  
>>  	/* Reset IODA tables to a clean state */
>> -- 
>> 1.7.9.5
>> 

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 168+ messages in thread
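[Editorial note: the IOV BAR expansion discussed in the patch above boils down to growing the resource from total_VFs segments to total_pe segments, so each of the PHB's PEs owns exactly one VF-BAR-sized M64 segment. A minimal userspace sketch of that arithmetic; the function names are illustrative, not kernel API:]

```c
#include <stdint.h>

/* Model of the expansion done in pnv_pci_ioda_fixup_iov_resources():
 * the IOV BAR is sized for total_pe segments (typically 256 on IODA2)
 * instead of total_VFs, so that the segment number can be the PE number. */
uint64_t expanded_iov_bar_size(uint64_t vf_bar_size, unsigned total_pe)
{
	return vf_bar_size * total_pe;
}

/* New resource end, mirroring "res->end = res->start + size * total_pe - 1". */
uint64_t expanded_iov_bar_end(uint64_t start, uint64_t vf_bar_size,
			      unsigned total_pe)
{
	return start + expanded_iov_bar_size(vf_bar_size, total_pe) - 1;
}
```

For a 1MB VF BAR and 256 PEs this reserves 256MB, which is why the patch warns that the expansion can be wasteful when total_VFs is much smaller than total_pe.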

* Re: [PATCH V11 06/17] powerpc/pci: Add PCI resource alignment documentation
  2015-01-15  2:27         ` Wei Yang
@ 2015-02-04 23:44           ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04 23:44 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Thu, Jan 15, 2015 at 10:27:56AM +0800, Wei Yang wrote:
> In order to enable SRIOV on PowerNV platform, the PF's IOV BAR needs to be
> adjusted:
>     1. size expanded
>     2. aligned to M64BT size
> 
> This patch documents the reason for this change and how it is done.
> 
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  .../powerpc/pci_iov_resource_on_powernv.txt        |  215 ++++++++++++++++++++
>  1 file changed, 215 insertions(+)
>  create mode 100644 Documentation/powerpc/pci_iov_resource_on_powernv.txt
> 
> diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> new file mode 100644
> index 0000000..10d4ac2
> --- /dev/null
> +++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt

I added the following two patches on top of this because I'm still confused
about the difference between the M64 window and the M64 BARs.  Several
parts of the writeup seem to imply that there are several M64 windows, but
that seems to be incorrect.

And I tried to write something about M64 BARs, too.  But it could well be
incorrect.

Please correct as necessary.  Ultimately I'll just fold everything into the
original patch so there's only one.

Bjorn


commit 6f46b79d243c24fd02c662c43aec6c829013ff64
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Fri Jan 30 11:01:59 2015 -0600

    Try to fix references to M64 window vs M64 BARs.  If there really is only
    one M64 window, I'm still a little confused about why there are so many
    places that seem to mention multiple M64 windows.

diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
index 10d4ac2f25b5..140df9cb58bd 100644
--- a/Documentation/powerpc/pci_iov_resource_on_powernv.txt
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -59,7 +59,7 @@ interrupt.
  * Outbound. That's where the tricky part is.
 
 The PHB basically has a concept of "windows" from the CPU address space to the
-PCI address space. There is one M32 window and 16 M64 windows. They have different
+PCI address space. There is one M32 window and one M64 window. They have different
 characteristics. First what they have in common: they are configured to forward a
 configurable portion of the CPU address space to the PCIe bus and must be naturally
 aligned power of two in size. The rest is different:
@@ -89,29 +89,31 @@ Ideally we would like to be able to have individual functions in PE's but that
 would mean using a completely different address allocation scheme where individual
 function BARs can be "grouped" to fit in one or more segments....
 
- - The M64 windows.
+ - The M64 window:
 
-   * Their smallest size is 1M
+   * Must be at least 256MB in size
 
-   * They do not translate addresses (the address on PCIe is the same as the
+   * Does not translate addresses (the address on PCIe is the same as the
 address on the PowerBus. There is a way to also set the top 14 bits which are
 not conveyed by PowerBus but we don't use this).
 
-   * They can be configured to be segmented or not. When segmented, they have
+   * Can be configured to be segmented or not. When segmented, it has
 256 segments, however they are not remapped. The segment number *is* the PE
 number. When no segmented, the PE number can be specified for the entire
 window.
 
-   * They support overlaps in which case there is a well defined ordering of
+   * Supports overlaps in which case there is a well defined ordering of
 matching (I don't remember off hand which of the lower or higher numbered
 window takes priority but basically it's well defined).
+^^^^^^ This sounds like there are multiple M64 windows.   Or maybe this
+paragraph is really about overlaps between M64 *BARs*, not M64 windows.
 
 We have code (fairly new compared to the M32 stuff) that exploits that for
 large BARs in 64-bit space:
 
-We create a single big M64 that covers the entire region of address space that
+We configure the M64 to cover the entire region of address space that
 has been assigned by FW for the PHB (about 64G, ignore the space for the M32,
-it comes out of a different "reserve"). We configure that window as segmented.
+it comes out of a different "reserve"). We configure it as segmented.
 
 Then we do the same thing as with M32, using the bridge alignment trick, to
 match to those giant segments.
@@ -133,15 +135,15 @@ the other ones for that "domain". We thus introduce the concept of "master PE"
 which is the one used for DMA, MSIs etc... and "secondary PEs" that are used
 for the remaining M64 segments.
 
-We would like to investigate using additional M64's in "single PE" mode to
+We would like to investigate using additional M64 BARs (?) in "single PE" mode to
 overlay over specific BARs to work around some of that, for example for devices
 with very large BARs (some GPUs), it would make sense, but we haven't done it
 yet.
 
-Finally, the plan to use M64 for SR-IOV, which will be described more in next
+Finally, the plan to use M64 BARs for SR-IOV, which will be described more in next
 two sections. So for a given IOV BAR, we need to effectively reserve the
 entire 256 segments (256 * IOV BAR size) and then "position" the BAR to start at
-the beginning of a free range of segments/PEs inside that M64.
+the beginning of a free range of segments/PEs inside that M64 BAR.
 
 The goal is of course to be able to give a separate PE for each VF...
 

commit 0f069e6a30e4c3de02f8c60aadd64fb64d434e7d
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Thu Jan 29 13:37:49 2015 -0600

    This adds description about M64 BARs.  Previously, these were mentioned,
    but I don't think there was actually anything specific about how they
    worked.

diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
index 140df9cb58bd..2e4811fae7fb 100644
--- a/Documentation/powerpc/pci_iov_resource_on_powernv.txt
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -58,7 +58,7 @@ interrupt.
 
  * Outbound. That's where the tricky part is.
 
-The PHB basically has a concept of "windows" from the CPU address space to the
+Like other PCI host bridges, the Power8 IODA2 PHB supports "windows" from the CPU address space to the
 PCI address space. There is one M32 window and one M64 window. They have different
 characteristics. First what they have in common: they are configured to forward a
 configurable portion of the CPU address space to the PCIe bus and must be naturally
@@ -140,6 +140,69 @@ overlay over specific BARs to work around some of that, for example for devices
 with very large BARs (some GPUs), it would make sense, but we haven't done it
 yet.
 
+ - The M64 BARs.
+
+IODA2 has 16 M64 "BARs."  These are not traditional PCI BARs that assign
+space for device registers or memory, and they're not normal window
+registers that describe the base and size of a bridge aperture.
+
+Rather, these M64 BARs associate pieces of an existing M64 window with PEs.
+The BAR describes a region of a window, and the region is divided into 256
+segments, just like a segmented M64 window.  As with segmented M64 windows,
+there's no lookup table: the segment number is the PE#.  The minimum size
+of a segment is 1MB, so each M64 BAR covers at least 256MB of space in an
+M64 window.
+
+The advantage of the M64 BARs is that they can be programmed to cover only
+part of an M64 window, and you can use several of them at the same time.
+That makes them useful for SR-IOV Virtual Functions, because each VF can be
+assigned to a separate PE.
+
+SR-IOV BACKGROUND
+
+The PCIe SR-IOV feature allows a single Physical Function (PF) to support
+several Virtual Functions (VFs).  Registers in the PF's SR-IOV Capability
+control the number of VFs, whether the VFs are enabled, and the MMIO
+resources assigned to the VFs.
+
+Each VF has its own VF BARs.  Software can write to a normal PCI BAR to
+discover the BAR size and assign address for it.  VF BARs aren't like that;
+the size discovery and address assignment is done via BARs in the *PF*
+SR-IOV Capability, and the BARs in VF config space are read-only zeros.
+
+When a PF SR-IOV BAR is programmed, it sets the base address for all the
+corresponding VF BARs.  For example, if the PF SR-IOV Capability is
+programmed to enable eight VFs, and it describes a 1MB BAR 0 for those VFs,
+the address in that PF BAR sets the base of an 8MB region that contains all
+eight of the VF BARs.
+
+STRATEGIES FOR ISOLATING VFs IN PEs:
+
+- M32 window: There's one M32 window, and it is split into 256
+  equally-sized segments.  The finest granularity possible is a 256MB
+  window with 1MB segments.  VF BARs that are 1MB or larger could be mapped
+  to separate PEs in this window.  Each segment can be individually mapped
+  to a PE via the lookup table, so this is quite flexible, but it works
+  best when all the VF BARs are the same size.  If they are different
+  sizes, the entire window has to be small enough that the segment matches
+  the smallest VF BAR, and larger VF BARs span several segments.
+
+- M64 window: A non-segmented M64 window is mapped entirely to a single PE,
+  so it could only isolate one VF.  A segmented M64 window could be used
+  just like the M32 window, but the segments can't be individually mapped
+  to PEs (the segment number is the PE number), so there isn't as much
+  flexibility.  A VF with multiple BARs would have to be in a "domain"
+  of multiple PEs, which is not as well isolated as a single PE.
+
+- M64 BAR: An M64 BAR effectively segments a region of an M64 window.  As
+  usual, the region is split into 256 equally-sized pieces, and as in
+  segmented M64 windows, the segment number is the PE number.  But there
+  are several M64 BARs, and they can be set to different base addresses and
+  different segment sizes.  So if we have VFs that each have a 1MB BAR and
+  a 32MB BAR, we could use one M64 BAR to assign 1MB segments and another
+  M64 BAR to assign 32MB segments.
+
+
 Finally, the plan to use M64 BARs for SR-IOV, which will be described more in next
 two sections. So for a given IOV BAR, we need to effectively reserve the
 entire 256 segments (256 * IOV BAR size) and then "position" the BAR to start at

^ permalink raw reply related	[flat|nested] 168+ messages in thread
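[Editorial note: the segmented M64 BAR behaviour described in Bjorn's documentation patch above — 256 equal segments, no lookup table, segment number equals PE number — can be modelled in a few lines. This is an illustrative userspace sketch, not hardware-accurate register programming:]

```c
#include <stdint.h>

/* A segmented IODA2 M64 BAR splits its covered region into 256 equal
 * segments, and the segment number *is* the PE number (no lookup table). */
#define M64_SEGMENTS 256

/* Segment size for an M64 BAR covering "bar_size" bytes. */
uint64_t m64_segment_size(uint64_t bar_size)
{
	return bar_size / M64_SEGMENTS;
}

/* PE number that owns a given address inside the BAR's covered region. */
unsigned m64_pe_number(uint64_t addr, uint64_t bar_base, uint64_t bar_size)
{
	return (unsigned)((addr - bar_base) / m64_segment_size(bar_size));
}
```

With the 1MB minimum segment size this also shows why each M64 BAR covers at least 256MB: 256 segments of at least 1MB each.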

* Re: [PATCH V11 00/17] Enable SRIOV on Power8
  2015-01-15  2:27       ` Wei Yang
@ 2015-02-04 23:44         ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-04 23:44 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Thu, Jan 15, 2015 at 10:27:50AM +0800, Wei Yang wrote:
> This patchset enables the SRIOV on POWER8.
> 
> The general idea is to put each VF into an individual PE and allocate the
> required resources like MMIO/DMA/MSI. The major difficulty comes from the
> MMIO allocation and adjustment for the PF's IOV BAR.
> 
> On P8, we use an M64 BAR (M64BT) to cover a PF's IOV BAR, which lets an
> individual VF sit in its own PE. This gives more flexibility, while at the
> same time it brings some restrictions on the PF's IOV BAR size and alignment.
> 
> To achieve this, we need to adjust the PCI devices' resources.
> 1. Expand the IOV BAR properly.
>    Done by pnv_pci_ioda_fixup_iov_resources().
> 2. Shift the IOV BAR properly.
>    Done by pnv_pci_vf_resource_shift().
> 3. IOV BAR alignment is calculated by arch dependent function instead of an
>    individual VF BAR size.
>    Done by pnv_pcibios_sriov_resource_alignment().
> 4. Take the IOV BAR alignment into consideration in the sizing and assigning.
>    This is achieved by commit: "PCI: Take additional IOV BAR alignment in
>    sizing and assigning"

I was hoping to merge this during the v3.20 merge window, but that will
likely open next week, and none of these patches have been in linux-next at
all yet, so I think next week would be rushing it a bit.  

Most of the changes are in arch/powerpc, which does help, but there are
some changes in pci/setup-bus.c that I would like to have some runtime on.
The changes aren't extensive, but I don't understand that code well enough
to be comfortable based on just reading the patch.

I pushed the current state of this patchset to my pci/virtualization
branch.  I think the best way forward would be for you to start with that
branch, since I've made quite a few tweaks to the patches you posted to the
list.  Then you can post a v12 with any changes you make for the next
round.

Ben, I know you chimed in earlier to help explain PEs.  Are you or another
powerpc maintainer planning to ack all this?

Bjorn

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 16/17] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported
  2015-02-04 22:05           ` Bjorn Helgaas
@ 2015-02-05  0:07             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-05  0:07 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Wed, Feb 04, 2015 at 04:05:18PM -0600, Bjorn Helgaas wrote:
>On Thu, Jan 15, 2015 at 10:28:06AM +0800, Wei Yang wrote:
>> The M64 aperture size is limited on PHB3. When the IOV BAR is too big, it
>> will exceed the limitation and fail to be assigned.
>> 
>> This patch introduces a different mechanism based on the IOV BAR size:
>> 
>> If the IOV BAR size is smaller than 64M, expand to total_pe.
>> If the IOV BAR size is bigger than 64M, round up to a power of 2.

To be consistent, IOV BAR in the change log should be VF BAR.

>
>Can you elaborate on this a little more?  I assume this is talking about an
>M64 BAR.  What is the size limit for this?
>

Yes, this is a corner case. It is really not easy to understand, and I don't
cover it in the Documentation.

>If I understand correctly, hardware always splits an M64 BAR into 256
>segments.  If you wanted each 128MB VF BAR in a separate PE, the M64 BAR
>would have to be 256 * 128MB = 32GB.  So maybe the idea is that instead of
>consuming 32GB of address space, you let each VF BAR span several segments.
>If each 128MB VF BAR spanned 4 segments, each segment would be 32MB and the
>whole M64 BAR would be 256 * 32MB = 8GB.  But you would have to use
>"companion" PEs so all 4 segments would be in the same "domain," and that
>would reduce the number of VFs you could support from 256 to 256/4 = 64.
>

You are right about the reason. If the VF BAR is big and we still use the
previous mechanism, it would use up the whole M64 window.
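Bjorn's arithmetic can be sketched as follows (an illustration with made-up
names, not kernel code): with 256 hardware segments, letting each VF BAR span
several segments shrinks the window, but also shrinks the number of separately
isolatable VFs.

```c
#include <stdint.h>

#define M64_SEGMENTS 256

/* Total M64 window needed when each VF BAR spans segs_per_vf segments:
 * each segment is vf_bar_size / segs_per_vf, and there are 256 of them. */
static uint64_t m64_window_size(uint64_t vf_bar_size, unsigned int segs_per_vf)
{
	return (vf_bar_size / segs_per_vf) * M64_SEGMENTS;
}

/* Grouping segs_per_vf segments into one "domain" divides the number of
 * individually isolatable VFs accordingly. */
static unsigned int m64_max_isolated_vfs(unsigned int segs_per_vf)
{
	return M64_SEGMENTS / segs_per_vf;
}
```

For the 128MB VF BAR example: 1 segment per VF needs a 32GB window; 4 segments
per VF needs only 8GB, but caps isolation at 64 VFs.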

>If that were the intent, I would think TotalVFs would be involved, but it's
>not.  For a device with TotalVFs=8 and a 128MB VF BAR, it would make sense
>to dedicate 4 segments to each of those BARs because you could only use 8*4
>= 32 segments total.  If the device had TotalVFs=128, your only choices
>would be 1 segment per BAR (256 segments * 128MB/segment = 32GB) or 2
>segments per BAR (256 segments * 64MB/segment = 16GB).
>

Hmm... it is time to tell you about another usage of the M64 BAR. The M64 BAR
has two modes. In the first, the M64 BAR covers a space evenly split into 256
segments, with each segment mapped to a PE. In the second, the M64 BAR
specifies a space and the whole space belongs to one PE; this is called
"Single PE" mode.

For this case, we choose the "Single PE" mode. The advantage of this choice
is that in some cases we still have a chance to map a VF into an individual
PE, which the mechanism you proposed couldn't do. I have to admit our choice
has a disadvantage: it uses more M64 BARs than yours.

Let me explain how the current choice works, so that you can see the
difference.

When we detect that the VF BAR is big, we expand the IOV BAR to a different
number of VF BARs instead of 256. The number must be a power-of-2 value,
because of the M64 BAR hardware requirement.

Then we decided to use at most 4 M64 BARs for one IOV BAR. (Sadly, we only
have 16 M64 BARs.) So how do we map it?

If the user wants to enable 4 VFs or fewer, that's great! We can use one M64
BAR to map each VF BAR, since the M64 BAR works in "Single PE" mode. In this
case, each VF sits in its own PE.

If the user wants to enable more VFs, sadly several VFs have to share one PE.

The logic for PE sharing is in the last commit.

m64_per_iov
   This is a flag.
   If it is 1, the VF BAR size is not big; use the M64 BAR in segmented mode.
   If it is M64_PER_IOV, the VF BAR size is big; use the M64 BAR in
   "Single PE" mode.
vf_groups
   Indicates how many groups one IOV BAR will be divided into; each group
   uses one M64 BAR.
   In the normal case, one M64 BAR per IOV BAR.
   In this special case, it depends on the vf_num the user wants to enable.
vf_per_group
   In the normal case, this is not used.
   In this special case, it helps to calculate the start address and size of
   each M64 BAR.
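The way vf_groups and vf_per_group carve one IOV BAR into per-group M64 BARs
could be sketched like this (assumed names and layout, not the actual pnv
code):

```c
#include <stdint.h>

struct m64_group {
	uint64_t start;	/* base address programmed into this group's M64 BAR */
	uint64_t size;	/* size covered by this group's M64 BAR */
};

/* Split an IOV BAR into vf_groups contiguous groups of vf_per_group VF
 * BARs each; each group is then backed by one M64 BAR in "Single PE"
 * mode. */
static void m64_group_layout(uint64_t iov_start, uint64_t vf_bar_size,
			     unsigned int vf_per_group, unsigned int vf_groups,
			     struct m64_group *out)
{
	for (unsigned int g = 0; g < vf_groups; g++) {
		out[g].start = iov_start +
			       (uint64_t)g * vf_per_group * vf_bar_size;
		out[g].size  = (uint64_t)vf_per_group * vf_bar_size;
	}
}
```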

>If you use 2 segments per BAR and you want NumVFs=128, you can't shift
>anything, so the PE assignments are completely fixed (VF0 in PE0, VF1 in
>PE1, etc.)
>
>It seems like you'd make different choices about pdn->max_vfs for these two
>devices, but currently you only look at the VF BAR size, not at TotalVFs.
>
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/pci-bridge.h     |    2 ++
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   33 ++++++++++++++++++++++++++---
>>  2 files changed, 32 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
>> index d61c384..7156486 100644
>> --- a/arch/powerpc/include/asm/pci-bridge.h
>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>> @@ -174,6 +174,8 @@ struct pci_dn {
>>  	u16     max_vfs;		/* number of VFs IOV BAR expended */
>>  	u16     vf_pes;			/* VF PE# under this PF */
>>  	int     offset;			/* PE# for the first VF PE */
>> +#define M64_PER_IOV 4
>> +	int     m64_per_iov;
>>  #define IODA_INVALID_M64        (-1)
>>  	int     m64_wins[PCI_SRIOV_NUM_BARS];
>>  #endif /* CONFIG_PCI_IOV */
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 94fe6e1..23ea873 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -2180,6 +2180,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>  	int i;
>>  	resource_size_t size;
>>  	struct pci_dn *pdn;
>> +	int mul, total_vfs;
>>  
>>  	if (!pdev->is_physfn || pdev->is_added)
>>  		return;
>> @@ -2190,6 +2191,32 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>  	pdn = pci_get_pdn(pdev);
>>  	pdn->max_vfs = 0;
>>  
>> +	total_vfs = pci_sriov_get_totalvfs(pdev);
>> +	pdn->m64_per_iov = 1;
>> +	mul = phb->ioda.total_pe;
>> +
>> +	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>> +		res = &pdev->resource[i];
>> +		if (!res->flags || res->parent)
>> +			continue;
>> +		if (!pnv_pci_is_mem_pref_64(res->flags)) {
>> +			dev_warn(&pdev->dev, " non M64 IOV BAR %pR on %s\n",
>> +					res, pci_name(pdev));
>> +			continue;
>> +		}
>> +
>> +		size = pci_iov_resource_size(pdev, i);
>> +
>> +		/* bigger than 64M */
>> +		if (size > (1 << 26)) {
>> +			dev_info(&pdev->dev, "PowerNV: VF BAR[%d] size "
>> +					"is bigger than 64M, roundup power2\n", i);
>> +			pdn->m64_per_iov = M64_PER_IOV;
>> +			mul = __roundup_pow_of_two(total_vfs);
>
>I think this might deserve more comment in dmesg.  "roundup power2" doesn't
>really tell me anything, especially since you mention a BAR, but you're
>actually rounding up total_vfs, not the BAR size.
>
>Does this reduce the number of possible VFs?  We already can't set NumVFs
>higher than TotalVFs.  Does this make it so we can't set NumVFs higher than
>pdn->max_vfs?
>

I hope my explanation above helps you understand this.

I am on IRC now; if you have any questions, you can ping me :-)
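The expansion rule under discussion can be sketched as follows (a simplified
model with assumed names, mirroring the quoted fixup loop):

```c
#include <stdint.h>

#define TOTAL_PE	256		/* phb->ioda.total_pe on this PHB */
#define BIG_VF_BAR	(1u << 26)	/* the 64MB threshold from the patch */

/* Round v up to the next power of two (v > 0). */
static unsigned int roundup_pow2(unsigned int v)
{
	unsigned int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Number of VF BAR copies to reserve in the IOV BAR: total_pe in the
 * normal (segmented) case, or total_vfs rounded up to a power of two
 * when any VF BAR is "big" (the "Single PE" mode case). */
static unsigned int iov_expand_factor(uint64_t max_vf_bar_size,
				      unsigned int total_vfs)
{
	if (max_vf_bar_size > BIG_VF_BAR)
		return roundup_pow2(total_vfs);
	return TOTAL_PE;
}
```

The IOV BAR is then sized as the returned factor times the VF BAR size,
matching "res->end = res->start + size * mul - 1" in the quoted code.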

>
>> +			break;
>> +		}
>> +	}
>> +
>>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>>  		res = &pdev->resource[i];
>>  		if (!res->flags || res->parent)
>> @@ -2202,13 +2229,13 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>  
>>  		dev_dbg(&pdev->dev, " Fixing VF BAR[%d] %pR to\n", i, res);
>>  		size = pci_iov_resource_size(pdev, i);
>> -		res->end = res->start + size * phb->ioda.total_pe - 1;
>> +		res->end = res->start + size * mul - 1;
>>  		dev_dbg(&pdev->dev, "                       %pR\n", res);
>>  		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
>>  				i - PCI_IOV_RESOURCES,
>> -				res, phb->ioda.total_pe);
>> +				res, mul);
>>  	}
>> -	pdn->max_vfs = phb->ioda.total_pe;
>> +	pdn->max_vfs = mul;
>
>Maybe this is the part that makes it hard to compute the size in
>sriov_init() -- you reduce pdn->max_vfs in some cases, and maybe that can
>affect the space you reserve for IOV BARs?  E.g., maybe you reduce
>pdn->max_vfs to 128 because VF BAR 3 is larger than 64MB, but you've
>already expanded the IOV space for VF BAR 1 based on 256 PEs?
>
>But then I'm still confused because the loop here in
>pnv_pci_ioda_fixup_iov_resources() always expands the resource based on
>phb->ioda.total_pe; that part doesn't depend on "mul" or pdn->max_vfs at
>all.
>
>Or maybe this is just a bug in the fixup loop, and it *should* depend on
>"mul"?
>
>>  }
>>  
>>  static void pnv_pci_ioda_fixup_sriov(struct pci_bus *bus)
>> -- 
>> 1.7.9.5
>> 

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 00/17] Enable SRIOV on Power8
  2015-02-04 23:44         ` Bjorn Helgaas
@ 2015-02-05  0:13           ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-05  0:13 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Wed, Feb 04, 2015 at 05:44:42PM -0600, Bjorn Helgaas wrote:
>On Thu, Jan 15, 2015 at 10:27:50AM +0800, Wei Yang wrote:
>> This patchset enables the SRIOV on POWER8.
>> 
>> The gerneral idea is put each VF into one individual PE and allocate required
>> resources like MMIO/DMA/MSI. The major difficulty comes from the MMIO
>> allocation and adjustment for PF's IOV BAR.
>> 
>> On P8, we use M64BT to cover a PF's IOV BAR, which could make an individual VF
>> sit in its own PE. This gives more flexiblity, while at the mean time it
>> brings on some restrictions on the PF's IOV BAR size and alignment.
>> 
>> To achieve this effect, we need to do some hack on pci devices's resources.
>> 1. Expand the IOV BAR properly.
>>    Done by pnv_pci_ioda_fixup_iov_resources().
>> 2. Shift the IOV BAR properly.
>>    Done by pnv_pci_vf_resource_shift().
>> 3. IOV BAR alignment is calculated by arch dependent function instead of an
>>    individual VF BAR size.
>>    Done by pnv_pcibios_sriov_resource_alignment().
>> 4. Take the IOV BAR alignment into consideration in the sizing and assigning.
>>    This is achieved by commit: "PCI: Take additional IOV BAR alignment in
>>    sizing and assigning"
>
>I was hoping to merge this during the v3.20 merge window, but that will
>likely open next week, and none of these patches have been in linux-next at
>all yet, so I think next week would be rushing it a bit.  
>
>Most of the changes are in arch/powerpc, which does help, but there are
>some changes in pci/setup-bus.c that I would like to have some runtime on.
>The changes aren't extensive, but I don't understand that code well enough
>to be comfortable based on just reading the patch.
>
>I pushed the current state of this patchset to my pci/virtualization
>branch.  I think the best way forward would be for you to start with that
>branch, since I've made quite a few tweaks to the patches you posted to the
>list.  Then you can post a v12 with any changes you make for the next
>round.

Thanks :-)

I will rebase the code on pci/virtualization and make the changes you
mentioned in previous mails.

Really, thanks for your review, which must have taken you a lot of time,
especially on the PE and M64 BAR stuff. Also thanks for your kindness and
tolerance of my mistakes in communication :-)

>
>Ben, I know you chimed in earlier to help explain PEs.  Are you or another
>powerpc maintainer planning to ack all this?
>
>Bjorn

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
  2015-02-04 20:53                   ` Bjorn Helgaas
@ 2015-02-05  3:01                     ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-05  3:01 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, gwshan, benh, linux-pci, linuxppc-dev

On Wed, Feb 04, 2015 at 02:53:13PM -0600, Bjorn Helgaas wrote:
>On Wed, Feb 04, 2015 at 11:34:09AM +0800, Wei Yang wrote:
>> On Tue, Feb 03, 2015 at 06:19:26PM -0600, Bjorn Helgaas wrote:
>> >On Tue, Feb 03, 2015 at 03:01:43PM +0800, Wei Yang wrote:
>> >> The actual IOV BAR range is determined by the start address and the actual
>> >> size for vf_num VFs' BARs. After shifting the IOV BAR, there is a chance
>> >> that the actual end address exceeds the limit and overlaps with other
>> >> devices.
>> >> 
>> >> This patch adds a check to make sure after shifting, the range will not
>> >> overlap with other devices.
>> >
>> >I folded this into the previous patch (the one that adds
>> >pnv_pci_vf_resource_shift()).  And I think that needs to be folded together
>> >with the following one ("powerpc/powernv: Allocate VF PE") because this one
>> >references pdn->vf_pes, which is added by "Allocate VF PE".
>> >
>> 
>> Yes. Both need this.
>> 
>> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> ---
>> >>  arch/powerpc/platforms/powernv/pci-ioda.c |   53 ++++++++++++++++++++++++++---
>> >>  1 file changed, 48 insertions(+), 5 deletions(-)
>> >> 
>> >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> >> index 8456ae8..1a1e74b 100644
>> >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> >> @@ -854,16 +854,18 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
>> >>  }
>> >>  
>> >>  #ifdef CONFIG_PCI_IOV
>> >> -static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> >> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> >>  {
>> >>  	struct pci_dn *pdn = pci_get_pdn(dev);
>> >>  	int i;
>> >>  	struct resource *res;
>> >>  	resource_size_t size;
>> >> +	u16 vf_num;
>> >>  
>> >>  	if (!dev->is_physfn)
>> >> -		return;
>> >> +		return -EINVAL;
>> >>  
>> >> +	vf_num = pdn->vf_pes;
>> >
>> >I can't actually build this, but I don't think pdn->vf_pes is defined yet.
>> >
>> 
>> pdn->vf_pes is defined in the next patch; it does not exist yet at this point.
>> 
>> I thought an incremental patch means a patch on top of the current patch set,
>> so it is defined in the last patch.
>> 
>> >>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>> >>  		res = &dev->resource[i];
>> >>  		if (!res->flags || !res->parent)
>> >> @@ -875,11 +877,49 @@ static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> >>  		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
>> >>  		size = pci_iov_resource_size(dev, i);
>> >>  		res->start += size*offset;
>> >> -
>> >>  		dev_info(&dev->dev, "                 %pR\n", res);
>> >> +
>> >> +		/*
>> >> +		 * The actual IOV BAR range is determined by the start address
>> >> +		 * and the actual size for vf_num VFs BAR. The check here is
>> >> +		 * to make sure after shifting, the range will not overlap
>> >> +		 * with other device.
>> >> +		 */
>> >> +		if ((res->start + (size * vf_num)) > res->end) {
>> >> +			dev_err(&dev->dev, "VF BAR%d: %pR will conflict with"
>> >> +					" other device after shift\n",
>> >> +					i - PCI_IOV_RESOURCES, res);
>> >
>> >sriov_init() sets up "res" with enough space to contain TotalVF copies
>> >of the VF BAR.  By the time we get here, that "res" is in the resource
>> >tree, and you should be able to see it in /proc/iomem.
>> >
>> >For example, if TotalVFs is 128 and VF BAR0 is 1MB in size, the
>> >resource size would be 128 * 1MB = 0x800_0000.  If the VF BAR0 in the
>> >SR-IOV Capability contains a base address of 0x8000_0000, the resource
>> >would be:
>> >
>> >  [mem 0x8000_0000-0x87ff_ffff]
>> >
>> >We have to assume there's another resource starting immediately after
>> >this one, i.e., at 0x8800_0000, and we have to make sure that when we
>> >change this resource and turn on SR-IOV, we don't overlap with it.
>> >
>> >The shifted resource will start at 0x8000_0000 + 1MB * "offset".  The
>> >hardware will respond to a range whose size is 1MB * NumVFs (NumVFs
>> >may be smaller than TotalVFs).
>> >
>> >If we enable 16 VFs and shift by 23, we set VF BAR0 to 0x8000_0000 +
>> >1MB * 23 = 0x8170_0000, and the size is 1MB * 16 = 0x100_0000, so the
>> >new resource will be:
>> >
>> >  [mem 0x8170_0000-0x826f_ffff]
>> >
>> >That's fine; it doesn't extend past the original end of 0x87ff_ffff.
>> >But if we enable those same 16 VFs with a shift of 120, we set VF BAR0
>> >to 0x8000_0000 + 1MB * 120 = 0x8780_0000, and the size stays the same,
>> >so the new resource will be:
>> >
>> >  [mem 0x8780_0000-0x887f_ffff]
>> >
>> >and that's a problem because we have two devices responding at
>> >0x8800_0000.
>> >
>> >Your test of "res->start + (size * vf_num)) > res->end" is not strict
>> >enough to catch this problem.
>> >
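
[Editor's sketch] Bjorn's arithmetic above can be reproduced with a short Python sketch. The base address (0x8000_0000), VF BAR size (1MB), TotalVFs (128), and NumVFs (16) are the hypothetical values from his example, not values read from real hardware:

```python
def shifted_range(base, vf_size, offset, num_vfs):
    """Range the hardware responds to after shifting the VF BAR."""
    start = base + vf_size * offset
    return start, start + vf_size * num_vfs - 1

VF_SIZE = 1 << 20                              # 1MB VF BAR0
PARENT_END = 0x80000000 + VF_SIZE * 128 - 1    # TotalVFs = 128 -> 0x87ffffff

# Shift by 23: stays inside the original resource.
assert shifted_range(0x80000000, VF_SIZE, 23, 16) == (0x81700000, 0x826fffff)

# Shift by 120: extends past the original end -> overlaps the next device.
start, end = shifted_range(0x80000000, VF_SIZE, 120, 16)
assert (start, end) == (0x87800000, 0x887fffff)
assert end > PARENT_END
```

The last assertion is exactly the `res2.end > res->end` condition in the stricter check Bjorn proposes below.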
>> 
>> Yep, you are right.
>> 
>> >I think we need something like the patch below.  I restructured it so
>> >we don't have to back out any resource changes if we fail.
>> >
>> >This shifting strategy seems to imply that the closer NumVFs is to
>> >TotalVFs, the less flexibility you have to assign PEs, e.g., if NumVFs
>> >== TotalVFs, you wouldn't be able to shift at all.  In this example,
>> >you could shift by anything from 0 to 128 - 16 = 112, but if you
>> >wanted NumVFs = 64, you could only shift by 0 to 64.  Is that true?
>> >
>> >I think your M64 BAR gets split into 256 segments, regardless of what
>> >TotalVFs is, so if you expanded the resource to 256 * 1MB for this
>> >example, you would be able to shift by up to 256 - NumVFs.  Do you
>> >actually do this somewhere?
>> >
>> 
>> Yes, after expanding the resource to 256 * 1MB, it is able to shift up to 
>> 256 - NumVFs. 
>
>Oh, I see where the expansion happens.  We started in sriov_init() with:
>
>  res->end = res->start + resource_size(res) * total - 1;
>
>where "total" is TotalVFs, and you expand it to the maximum number of PEs
>in pnv_pci_ioda_fixup_iov_resources():
>
>  res->end = res->start + size * phb->ioda.total_pe - 1;
>
>in this path:
>
>  pcibios_scan_phb
>    pci_create_root_bus
>    pci_scan_child_bus
>      ...
>        sriov_init
>	  res->end = res->start + ...	# as above
>    ppc_md.pcibios_fixup_sriov		# pnv_pci_ioda_fixup_sriov
>    pnv_pci_ioda_fixup_sriov(bus)
>      list_for_each_entry(dev, &bus->devices, ...)
>        if (dev->subordinate)
>	  pnv_pci_ioda_fixup_sriov(dev->subordinate)	# recurse
>        pnv_pci_ioda_fixup_iov_resources(dev)
>	  res->end = res->start + ...	# fixup
>
>I think this will be cleaner if you add an arch interface for use by
>sriov_init(), e.g.,
>
>  resource_size_t __weak pcibios_iov_size(struct pci_dev *dev, int resno)
>  {
>    struct resource *res = &dev->resource[resno + PCI_IOV_RESOURCES];
>
>    return resource_size(res) * dev->iov->total_VFs;
>  }
>
>  static int sriov_init(...)
>  {
>    ...
>    res->end = res->start + pcibios_iov_size(dev, i) - 1;
>    ...
>  }
>
>and powerpc could override this.  That way we would set the size once and
>we wouldn't need a fixup pass, which will keep the pcibios_scan_phb() code
>similar to the common path in pci_scan_root_bus().
>
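
[Editor's sketch] The effect of Bjorn's proposed override can be shown numerically, using the sizes from the earlier example. `default_iov_size` mirrors the generic sizing rule and `powernv_iov_size` the powerpc override; neither is the actual kernel function:

```python
def default_iov_size(vf_bar_size, total_vfs):
    # Generic rule: reserve room for TotalVFs copies of the VF BAR.
    return vf_bar_size * total_vfs

def powernv_iov_size(vf_bar_size, total_pe=256):
    # PowerNV override: one segment per PE, regardless of TotalVFs.
    return vf_bar_size * total_pe

assert default_iov_size(1 << 20, 128) == 0x8000000     # 128MB
assert powernv_iov_size(1 << 20) == 0x10000000         # 256MB
```

With the override in place, sriov_init() would set res->end once and no fixup pass would be needed.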

Bjorn,

The idea is a very good one, but when I tried to implement it, I ran into
some issues. The two pieces rely on each other.

The "fixup" goes through all the IOV BARs and calculates the number of
segments to expand to. This calculation is based on pci_iov_resource_size(),
which is still being initialized at this stage. Besides, pdev->sriov is not
set yet either.

Also, the "fixup" should be invoked just once, while pcibios_iov_size() would
be called for every IOV BAR.

>> But currently, on my system, I don't see a case that really does
>> this.
>> 
>> On my system, there is an Emulex card with 4 PFs.
>> 
>> 0006:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
>> 0006:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
>> 0006:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
>> 0006:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
>> 
>> The max VFs for them are 80, 80, 20, 20, with total number of 200 VFs.
>> 
>> be2net 0006:01:00.0:  Shifting VF BAR [mem 0x3d40 1000 0000 - 0x3d40 10ff ffff 64bit pref] to 256 segs
>> be2net 0006:01:00.0:                  [mem 0x3d40 1003 0000 - 0x3d40 10ff ffff 64bit pref]    253 segs offset 3
>> PE range [3 - 82]
>> be2net 0006:01:00.1:  Shifting VF BAR [mem 0x3d40 1100 0000 - 0x3d40 11ff ffff 64bit pref] to 256 segs
>> be2net 0006:01:00.1:                  [mem 0x3d40 1153 0000 - 0x3d40 11ff ffff 64bit pref]    173 segs offset 83
>> PE range [83 - 162]
>> be2net 0006:01:00.2:  Shifting VF BAR [mem 0x3d40 1200 0000 - 0x3d40 12ff ffff 64bit pref] to 256 segs
>> be2net 0006:01:00.2:                  [mem 0x3d40 12a3 0000 - 0x3d40 12ff ffff 64bit pref]    93  segs offset 163
>> PE range [163 - 182]
>> be2net 0006:01:00.3:  Shifting VF BAR [mem 0x3d40 1300 0000 - 0x3d40 13ff ffff 64bit pref] to 256 segs
>> be2net 0006:01:00.3:                  [mem 0x3d40 13b7 0000 - 0x3d40 13ff ffff 64bit pref]    73  segs offset 183
>> PE range [183 - 202]
>> 
>> After enabling the max number of VFs, even the last PF still has room for 73
>> VF BAR segments. So this does not trigger the limit, but it proves the shift
>> offset could be larger than (TotalVFs - NumVFs).
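
[Editor's sketch] The offsets and remaining-segment counts in the log above follow directly from the 256-segment M64 BAR. A quick sketch, with the VF counts taken from the log (the first offset of 3 reflects PE numbers 0-2 being taken already):

```python
def layout(vf_counts, first_offset=3, segments=256):
    """Return (remaining segments, PE range) per PF, mirroring the log."""
    out, offset = [], first_offset
    for nvfs in vf_counts:
        out.append((segments - offset, (offset, offset + nvfs - 1)))
        offset += nvfs
    return out

# Four Emulex PFs with 80, 80, 20, 20 max VFs:
assert layout([80, 80, 20, 20]) == [(253, (3, 82)), (173, (83, 162)),
                                    (93, (163, 182)), (73, (183, 202))]
```

The computed values match the dmesg output: 253/173/93/73 segments left and PE ranges [3-82], [83-162], [163-182], [183-202].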
>
>You expanded the overall resource from "TotalVFs * size" to "256 * size".
>So the offset can be larger than "TotalVFs - NumVFs" but it still cannot be
>larger than "256 - NumVFs".  The point is that the range claimed by the
>hardware cannot extend past the range we told the resource tree about.
>That's what the "if (res2.end > res->end)" test is checking.
>
>Normally we compute res->end based on TotalVFs.  For PHB3, you compute
>res->end based on 256.  Either way, we need to make sure we don't program
>the BAR with an address that causes the hardware to respond to addresses
>past res->end.
>
>Bjorn
>
>> >+		/*
>> >+		 * The actual IOV BAR range is determined by the start address
>> >+		 * and the actual size for vf_num VFs BAR.  This check is to
>> >+		 * make sure that after shifting, the range will not overlap
>> >+		 * with another device.
>> >+		 */
>> >+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> >+		res2.flags = res->flags;
>> >+		res2.start = res->start + (size * offset);
>> >+		res2.end = res2.start + (size * vf_num) - 1;
>> >+
>> >+		if (res2.end > res->end) {
>> >+			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
>> >+				i, &res2, res, vf_num, offset);
>> >+			return -EBUSY;
>> >+		}

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting
@ 2015-02-05  3:01                     ` Wei Yang
  0 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-05  3:01 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev, gwshan

On Wed, Feb 04, 2015 at 02:53:13PM -0600, Bjorn Helgaas wrote:
>On Wed, Feb 04, 2015 at 11:34:09AM +0800, Wei Yang wrote:
>> On Tue, Feb 03, 2015 at 06:19:26PM -0600, Bjorn Helgaas wrote:
>> >On Tue, Feb 03, 2015 at 03:01:43PM +0800, Wei Yang wrote:
>> >> The actual IOV BAR range is determined by the start address and the actual
>> >> size for vf_num VFs BAR. After shifting the IOV BAR, there would be a
>> >> chance the actual end address exceed the limit and overlap with other
>> >> devices.
>> >> 
>> >> This patch adds a check to make sure after shifting, the range will not
>> >> overlap with other devices.
>> >
>> >I folded this into the previous patch (the one that adds
>> >pnv_pci_vf_resource_shift()).  And I think that needs to be folded together
>> >with the following one ("powerpc/powernv: Allocate VF PE") because this one
>> >references pdn->vf_pes, which is added by "Allocate VF PE".
>> >
>> 
>> Yes. Both need this.
>> 
>> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> ---
>> >>  arch/powerpc/platforms/powernv/pci-ioda.c |   53 ++++++++++++++++++++++++++---
>> >>  1 file changed, 48 insertions(+), 5 deletions(-)
>> >> 
>> >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> >> index 8456ae8..1a1e74b 100644
>> >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> >> @@ -854,16 +854,18 @@ static unsigned int pnv_ioda_dma_weight(struct pci_dev *dev)
>> >>  }
>> >>  
>> >>  #ifdef CONFIG_PCI_IOV
>> >> -static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> >> +static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> >>  {
>> >>  	struct pci_dn *pdn = pci_get_pdn(dev);
>> >>  	int i;
>> >>  	struct resource *res;
>> >>  	resource_size_t size;
>> >> +	u16 vf_num;
>> >>  
>> >>  	if (!dev->is_physfn)
>> >> -		return;
>> >> +		return -EINVAL;
>> >>  
>> >> +	vf_num = pdn->vf_pes;
>> >
>> >I can't actually build this, but I don't think pdn->vf_pes is defined yet.
>> >
>> 
>> The pdn->vf_pes is defined in the next patch, it is not defined yet.
>> 
>> I thought the incremental patch means a patch on top of the current patch set,
>> so it is defined as the last patch.
>> 
>> >>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++) {
>> >>  		res = &dev->resource[i];
>> >>  		if (!res->flags || !res->parent)
>> >> @@ -875,11 +877,49 @@ static void pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>> >>  		dev_info(&dev->dev, " Shifting VF BAR %pR to\n", res);
>> >>  		size = pci_iov_resource_size(dev, i);
>> >>  		res->start += size*offset;
>> >> -
>> >>  		dev_info(&dev->dev, "                 %pR\n", res);
>> >> +
>> >> +		/*
>> >> +		 * The actual IOV BAR range is determined by the start address
>> >> +		 * and the actual size for vf_num VFs BAR. The check here is
>> >> +		 * to make sure after shifting, the range will not overlap
>> >> +		 * with other device.
>> >> +		 */
>> >> +		if ((res->start + (size * vf_num)) > res->end) {
>> >> +			dev_err(&dev->dev, "VF BAR%d: %pR will conflict with"
>> >> +					" other device after shift\n");
>> >
>> >sriov_init() sets up "res" with enough space to contain TotalVF copies
>> >of the VF BAR.  By the time we get here, that "res" is in the resource
>> >tree, and you should be able to see it in /proc/iomem.
>> >
>> >For example, if TotalVFs is 128 and VF BAR0 is 1MB in size, the
>> >resource size would be 128 * 1MB = 0x800_0000.  If the VF BAR0 in the
>> >SR-IOV Capability contains a base address of 0x8000_0000, the resource
>> >would be:
>> >
>> >  [mem 0x8000_0000-0x87ff_ffff]
>> >
>> >We have to assume there's another resource starting immediately after
>> >this one, i.e., at 0x8800_0000, and we have to make sure that when we
>> >change this resource and turn on SR-IOV, we don't overlap with it.
>> >
>> >The shifted resource will start at 0x8000_0000 + 1MB * "offset".  The
>> >hardware will respond to a range whose size is 1MB * NumVFs (NumVFs
>> >may be smaller than TotalVFs).
>> >
>> >If we enable 16 VFs and shift by 23, we set VF BAR0 to 0x8000_0000 +
>> >1MB * 23 = 0x8170_0000, and the size is 1MB * 16 = 0x100_0000, so the
>> >new resource will be:
>> >
>> >  [mem 0x8170_0000-0x826f_ffff]
>> >
>> >That's fine; it doesn't extend past the original end of 0x87ff_ffff.
>> >But if we enable those same 16 VFs with a shift of 120, we set VF BAR0
>> >to 0x8000_0000 + 1MB * 120 = 0x8780_0000, and the size stays the same,
>> >so the new resource will be:
>> >
>> >  [mem 0x8780_0000-0x887f_ffff]
>> >
>> >and that's a problem because we have two devices responding at
>> >0x8800_0000.
>> >
>> >Your test of "res->start + (size * vf_num)) > res->end" is not strict
>> >enough to catch this problem.
>> >
>> 
>> Yep, you are right.
>> 
>> >I think we need something like the patch below.  I restructured it so
>> >we don't have to back out any resource changes if we fail.
>> >
>> >This shifting strategy seems to imply that the closer NumVFs is to
>> >TotalVFs, the less flexibility you have to assign PEs, e.g., if NumVFs
>> >== TotalVFs, you wouldn't be able to shift at all.  In this example,
>> >you could shift by anything from 0 to 128 - 16 = 112, but if you
>> >wanted NumVFs = 64, you could only shift by 0 to 64.  Is that true?
>> >
>> >I think your M64 BAR gets split into 256 segments, regardless of what
>> >TotalVFs is, so if you expanded the resource to 256 * 1MB for this
>> >example, you would be able to shift by up to 256 - NumVFs.  Do you
>> >actually do this somewhere?
>> >
>> 
>> Yes, after expanding the resource to 256 * 1MB, it is able to shift up to 
>> 256 - NumVFs. 
>
>Oh, I see where the expansion happens.  We started in sriov_init() with:
>
>  res->end = res->start + resource_size(res) * total - 1;
>
>where "total" is TotalVFs, and you expand it to the maximum number of PEs
>in pnv_pci_ioda_fixup_iov_resources():
>
>  res->end = res->start + size * phb->ioda.total_pe - 1;
>
>in this path:
>
>  pcibios_scan_phb
>    pci_create_root_bus
>    pci_scan_child_bus
>      ...
>        sriov_init
>	  res->end = res->start + ...	# as above
>    ppc_md.pcibios_fixup_sriov		# pnv_pci_ioda_fixup_sriov
>    pnv_pci_ioda_fixup_sriov(bus)
>      list_for_each_entry(dev, &bus->devices, ...)
>        if (dev->subordinate)
>	  pnv_pci_ioda_fixup_sriov(dev->subordinate)	# recurse
>        pnv_pci_ioda_fixup_iov_resources(dev)
>	  res->end = res->start + ...	# fixup
>
>I think this will be cleaner if you add an arch interface for use by
>sriov_init(), e.g.,
>
>  resource_size_t __weak pcibios_iov_size(struct pci_dev *dev, int resno)
>  {
>    struct resource *res = &dev->resource[resno + PCI_IOV_RESOURCES];
>
>    return resource_size(res) * dev->iov->total_VFs;
>  }
>
>  static int sriov_init(...)
>  {
>    ...
>    res->end = res->start + pcibios_iov_size(dev, i) - 1;
>    ...
>  }
>
>and powerpc could override this.  That way we would set the size once and
>we wouldn't need a fixup pass, which will keep the pcibios_scan_phb() code
>similar to the common path in pci_scan_root_bus().
>

Bjorn,

The idea is a very good one, but when I tried to implement it, I ran into
some issues. The two pieces rely on each other.

The "fixup" goes through all the IOV BARs and calculates the number of
segments to expand to. This calculation is based on pci_iov_resource_size(),
which is still being initialized at this stage. Besides, pdev->sriov is not
set yet either.

Also, the "fixup" should be invoked just once, while pcibios_iov_size() would
be called for every IOV BAR.

>> But currently, on my system, I don't see a case that really does
>> this.
>> 
>> On my system, there is an Emulex card with 4 PFs.
>> 
>> 0006:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
>> 0006:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
>> 0006:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
>> 0006:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
>> 
>> The max VFs for them are 80, 80, 20, 20, with total number of 200 VFs.
>> 
>> be2net 0006:01:00.0:  Shifting VF BAR [mem 0x3d40 1000 0000 - 0x3d40 10ff ffff 64bit pref] to 256 segs
>> be2net 0006:01:00.0:                  [mem 0x3d40 1003 0000 - 0x3d40 10ff ffff 64bit pref]    253 segs offset 3
>> PE range [3 - 82]
>> be2net 0006:01:00.1:  Shifting VF BAR [mem 0x3d40 1100 0000 - 0x3d40 11ff ffff 64bit pref] to 256 segs
>> be2net 0006:01:00.1:                  [mem 0x3d40 1153 0000 - 0x3d40 11ff ffff 64bit pref]    173 segs offset 83
>> PE range [83 - 162]
>> be2net 0006:01:00.2:  Shifting VF BAR [mem 0x3d40 1200 0000 - 0x3d40 12ff ffff 64bit pref] to 256 segs
>> be2net 0006:01:00.2:                  [mem 0x3d40 12a3 0000 - 0x3d40 12ff ffff 64bit pref]    93  segs offset 163
>> PE range [163 - 182]
>> be2net 0006:01:00.3:  Shifting VF BAR [mem 0x3d40 1300 0000 - 0x3d40 13ff ffff 64bit pref] to 256 segs
>> be2net 0006:01:00.3:                  [mem 0x3d40 13b7 0000 - 0x3d40 13ff ffff 64bit pref]    73  segs offset 183
>> PE range [183 - 202]
>> 
>> After enabling the max number of VFs, even the last PF still has room for 73
>> VF BAR segments. So this does not trigger the limit, but it proves the shift
>> offset could be larger than (TotalVFs - NumVFs).
>
>You expanded the overall resource from "TotalVFs * size" to "256 * size".
>So the offset can be larger than "TotalVFs - NumVFs" but it still cannot be
>larger than "256 - NumVFs".  The point is that the range claimed by the
>hardware cannot extend past the range we told the resource tree about.
>That's what the "if (res2.end > res->end)" test is checking.
>
>Normally we compute res->end based on TotalVFs.  For PHB3, you compute
>res->end based on 256.  Either way, we need to make sure we don't program
>the BAR with an address that causes the hardware to respond to addresses
>past res->end.
>
>Bjorn
>
>> >+		/*
>> >+		 * The actual IOV BAR range is determined by the start address
>> >+		 * and the actual size for vf_num VFs BAR.  This check is to
>> >+		 * make sure that after shifting, the range will not overlap
>> >+		 * with another device.
>> >+		 */
>> >+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>> >+		res2.flags = res->flags;
>> >+		res2.start = res->start + (size * offset);
>> >+		res2.end = res2.start + (size * vf_num) - 1;
>> >+
>> >+		if (res2.end > res->end) {
>> >+			dev_err(&dev->dev, "VF BAR%d: %pR would extend past %pR (trying to enable %d VFs shifted by %d)\n",
>> >+				i, &res2, res, vf_num, offset);
>> >+			return -EBUSY;
>> >+		}

-- 
Richard Yang
Help you, Help me

^ permalink raw reply	[flat|nested] 168+ messages in thread

* [PATCH 0/3] Code adjustment on pci/virtualization
  2015-02-04 23:44         ` Bjorn Helgaas
@ 2015-02-05  6:34           ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-05  6:34 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

Bjorn,

I gave your pci/virtualization branch a try; it works fine after applying the
following three patches. The reasons are explained in the change logs. I have
also verified that they apply cleanly on top of the original series.

As for pnv_pci_ioda_fixup_sriov(), which you suggested merging into
sriov_init(), I found it is not that simple. Maybe I misunderstood your
meaning.

I will be online tonight to see what else we can improve. Thanks for your time
:-)

Wei Yang (3):
  fix on Store individual VF BAR size in struct pci_sriov
  fix Reserve additional space for IOV BAR, with m64_per_iov supported
  remove the unused end in pnv_pci_vf_resource_shift()

 arch/powerpc/platforms/powernv/pci-ioda.c |    6 +++---
 drivers/pci/iov.c                         |    6 ++++--
 2 files changed, 7 insertions(+), 5 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 168+ messages in thread

* [PATCH 1/3] fix on Store individual VF BAR size in struct pci_sriov
  2015-02-05  6:34           ` Wei Yang
@ 2015-02-05  6:34             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-05  6:34 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

__pci_read_base() returns 1 when the BAR is 64-bit, which makes the loop
variable stop matching the resource index. So i cannot be used as the index
in this case.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 721987b..b348b72 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -444,10 +444,12 @@ found:
 			rc = -EIO;
 			goto failed;
 		}
-		iov->barsz[i] = resource_size(res);
+		iov->barsz[res - dev->resource - PCI_IOV_RESOURCES] =
+			resource_size(res);
 		res->end = res->start + resource_size(res) * total - 1;
 		dev_info(&dev->dev, "VF BAR%d: %pR (contains BAR%d for %d VFs)\n",
-			 i, res, i, total);
+			 (int)(res - dev->resource - PCI_IOV_RESOURCES), res,
+			 (int)(res - dev->resource - PCI_IOV_RESOURCES), total);
 		nres++;
 	}
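
[Editor's sketch] The index mismatch described in the change log can be illustrated with a small model. This only models the general PCI rule that a 64-bit BAR occupies two consecutive 32-bit registers; it is not the kernel loop itself:

```python
def map_bars(bar_widths):
    """Map each resource slot to the register slot its BAR starts at."""
    mapping = []          # (resource_index, register_index) pairs
    reg = 0
    for res_idx, width in enumerate(bar_widths):
        mapping.append((res_idx, reg))
        reg += 2 if width == 64 else 1   # a 64-bit BAR eats two registers
    return mapping

# One 64-bit VF BAR followed by two 32-bit ones: after the first BAR,
# the register index no longer equals the resource index.
assert map_bars([64, 32, 32]) == [(0, 0), (1, 2), (2, 3)]
```

This divergence is why the patch recomputes the array index from the resource pointer (`res - dev->resource - PCI_IOV_RESOURCES`) instead of reusing the register counter.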
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH 2/3] fix Reserve additional space for IOV BAR, with m64_per_iov supported
  2015-02-05  6:34           ` Wei Yang
@ 2015-02-05  6:34             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-05  6:34 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

When the IOV BAR is bigger than 64MB, we reserve a power-of-2 number of
segments instead.

I guess this change was lost during the rebase.
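
[Editor's sketch] A sketch of the reservation rule being restored here. The 64MB threshold and the power-of-2 rounding are assumptions inferred from the series' m64_per_iov discussion, not guaranteed by this diff:

```python
def expansion_segments(vf_bar_size, total_vfs, total_pe=256):
    # Hypothetical rule: normally expand the IOV BAR to one segment per
    # PE; for large VF BARs (> 64MB) only round TotalVFs up to a power
    # of two ("mul" in the diff above).
    if vf_bar_size > (64 << 20):
        return 1 << (total_vfs - 1).bit_length()
    return total_pe

assert expansion_segments(1 << 20, 128) == 256    # small BAR: 256 segments
assert expansion_segments(128 << 20, 20) == 32    # big BAR: round 20 up to 32
```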

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 70a0d24..1776b36 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2372,10 +2372,10 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 
 		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
 		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
-		res->end = res->start + size * phb->ioda.total_pe - 1;
+		res->end = res->start + size * mul - 1;
 		dev_dbg(&pdev->dev, "                       %pR\n", res);
 		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
-				i, res, phb->ioda.total_pe);
+				i, res, mul);
 	}
 	pdn->max_vfs = mul;
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* [PATCH 3/3] remove the unused end in pnv_pci_vf_resource_shift()
  2015-02-05  6:34           ` Wei Yang
@ 2015-02-05  6:34             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-05  6:34 UTC (permalink / raw)
  To: bhelgaas, benh, gwshan; +Cc: linux-pci, linuxppc-dev, Wei Yang

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/pci-ioda.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1776b36..f1fc7cf 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -854,7 +854,7 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
 	struct pci_dn *pdn = pci_get_pdn(dev);
 	int i;
 	struct resource *res, res2;
-	resource_size_t size, end;
+	resource_size_t size;
 	u16 vf_num;
 
 	if (!dev->is_physfn)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 00/17] Enable SRIOV on Power8
  2015-02-04 23:44         ` Bjorn Helgaas
@ 2015-02-10  0:25           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-10  0:25 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, gwshan, linux-pci, linuxppc-dev

On Wed, 2015-02-04 at 17:44 -0600, Bjorn Helgaas wrote:
> Ben, I know you chimed in earlier to help explain PEs.  Are you or
> another powerpc maintainer planning to ack all this?

I'll get through it in the next day or so.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 02/17] PCI/IOV: add VF enable/disable hook
  2015-01-15  2:27         ` Wei Yang
@ 2015-02-10  0:26           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-10  0:26 UTC (permalink / raw)
  To: Wei Yang; +Cc: bhelgaas, gwshan, linux-pci, linuxppc-dev

On Thu, 2015-01-15 at 10:27 +0800, Wei Yang wrote:
> +       if ((retval = pcibios_sriov_enable(dev, initial))) {
> +               dev_err(&dev->dev, "Failure %d from
> pcibios_sriov_setup()\n",
> +                       retval);
> +               return retval;
> +       }
> +

Don't we want pcibios_sriov_enable() to be able to crop the number
of VFs or do we think any resource limits have been applied
already ?

Ben.



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 03/17] PCI: Add weak pcibios_iov_resource_alignment() interface
  2015-01-15  2:27         ` Wei Yang
@ 2015-02-10  0:32           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-10  0:32 UTC (permalink / raw)
  To: Wei Yang; +Cc: bhelgaas, gwshan, linux-pci, linuxppc-dev

On Thu, 2015-01-15 at 10:27 +0800, Wei Yang wrote:
> The alignment of a PF's IOV BAR is designed to be the size of an individual
> VF BAR. This works fine for many platforms, but on the PowerNV platform it
> needs some change.
> 
> The original alignment works, since at sizing and assigning stage the
> requirement is from an individual VF's BAR size instead of the PF's IOV
> BAR.  This is the reason for the original code to just retrieve the
> individual VF BAR size as the alignment.
> 
> On PowerNV platform, it is required to align the whole PF IOV BAR to a
> hardware segment. Based on this fact, the alignment of PF's IOV BAR should
> be calculated separately.
> 
> This patch introduces a weak pcibios_iov_resource_alignment() interface,
> which gives platform a chance to implement specific method to calculate
> the PF's IOV BAR alignment.

While the patch is probably fine, I find the above explanation quite
confusing :)

From my memory (vague now) of the scheme we put in place, we need to
practically reserve a portion of address space that corresponds to
VF_size * Number_of_PEs. I.e., it's not just the alignment that has
constraints but also the size that needs to be allocated.

Now I suppose if we make the alignment to be the size of the M64
window and if the core also bounces the allocated size to the
alignment boundary, then we are fine, but that should be explained.

Cheers,
Ben.


> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  drivers/pci/iov.c   |   11 ++++++++++-
>  include/linux/pci.h |    3 +++
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 933d8cc..5f48201 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -556,6 +556,12 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
>  		4 * (resno - PCI_IOV_RESOURCES);
>  }
>  
> +resource_size_t __weak pcibios_iov_resource_alignment(struct pci_dev *dev,
> +		int resno, resource_size_t align)
> +{
> +	return align;
> +}
> +
>  /**
>   * pci_sriov_resource_alignment - get resource alignment for VF BAR
>   * @dev: the PCI device
> @@ -570,12 +576,15 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
>  {
>  	struct resource tmp;
>  	int reg = pci_iov_resource_bar(dev, resno);
> +	resource_size_t align;
>  
>  	if (!reg)
>  		return 0;
>  
>  	__pci_read_base(dev, pci_bar_unknown, &tmp, reg);
> -	return resource_alignment(&tmp);
> +	align = resource_alignment(&tmp);
> +
> +	return pcibios_iov_resource_alignment(dev, resno, align);
>  }
>  
>  /**
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 74ef944..ae7a7ea 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1163,6 +1163,9 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
>  void pci_setup_bridge(struct pci_bus *bus);
>  resource_size_t pcibios_window_alignment(struct pci_bus *bus,
>  					 unsigned long type);
> +resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev,
> +						 int resno,
> +						 resource_size_t align);
>  
>  #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
>  #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 07/17] powerpc/pci: Don't unset pci resources for VFs
  2015-01-15  2:27         ` Wei Yang
@ 2015-02-10  0:36           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-10  0:36 UTC (permalink / raw)
  To: Wei Yang; +Cc: bhelgaas, gwshan, linux-pci, linuxppc-dev

On Thu, 2015-01-15 at 10:27 +0800, Wei Yang wrote:
> If we're going to reassign resources with the PCI_REASSIGN_ALL_RSRC flag,
> all resources will be cleaned out during device header fixup and then get
> reassigned by the PCI core. However, the VF resources won't be reassigned,
> so we shouldn't clean them out.
> 
> This patch adds a condition: if the pci_dev is a VF, skip the resource
> unset process.

I don't understand this, can you elaborate ? Why wouldn't we reassign
the IOV resource just like everything else ?

Ben.

> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/pci-common.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index 37d512d..889f743 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
>  		       pci_name(dev));
>  		return;
>  	}
> +
> +	if (dev->is_virtfn)
> +		return;
> +
>  	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
>  		struct resource *res = dev->resource + i;
>  		struct pci_bus_region reg;



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 06/17] powerpc/pci: Add PCI resource alignment documentation
  2015-02-04 23:44           ` Bjorn Helgaas
@ 2015-02-10  1:02             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-10  1:02 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, gwshan, linux-pci, linuxppc-dev

On Wed, 2015-02-04 at 17:44 -0600, Bjorn Helgaas wrote:
> > 
> > diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> > new file mode 100644
> > index 0000000..10d4ac2
> > --- /dev/null
> > +++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> 
> I added the following two patches on top of this because I'm still confused
> about the difference between the M64 window and the M64 BARs.  Several
> parts of the writeup seem to imply that there are several M64 windows, but
> that seems to be incorrect.
> 
> And I tried to write something about M64 BARs, too.  But it could well be
> incorrect.
> 
> Please correct as necessary.  Ultimately I'll just fold everything into the
> original patch so there's only one.

The way the HW works is that 2 windows of the CPU address space are
routed to each PHB. One is used for 32-bit stuff and one is used for
64-bit stuff (it doesn't have to be that way, and it's not fixed in HW
which is which; it's just two windows of the fabric being forwarded, but
that's how we use them). The FW configures them; one is 4G and the other
is currently 64G, though that might be increased at some point.

(Actually there's a 3rd window, but it's exclusively used for the PHB's
own registers, so we can ignore it here.)

Once an MMIO cycle hits one of the above windows on the PowerBus, it
gets forwarded to the PHB.

Now the PHB itself contains a number of "BARs" which aren't the same
thing as device BARs, so it's confusing, and I tend to call them
"windows" for that reason. They are made of pairs of registers indicating
an address and a size (sort of; the M64 ones are actually in some CAM in
the chip, but that's a register access method detail that is not relevant
here).

 - One M32. It's limited to 4G in size and has the specific attribute
that the top bits of the address from the PowerBus are dropped (and
replaced with the contents of a register), thus allowing this "window" to
target the 32-bit MMIO space from anywhere in the CPU 50-bit bus space.
This is set up at boot time, and we can probably ignore it here. It has
its own segmenting for PEs, which is a bit different from the 64-bit
stuff, as it goes through a remapping table that configures which PE each
segment maps to.

 - 16 M64's. Each of these can be configured individually to pass a
portion of the above "window" space to the PCIe bus. There is no
remapping in that case (the PowerBus addresses are passed 1:1). Each of
those M64's can be configured to have either a single PE (in which case
the PE number can be configured) or to be segmented (256 PEs, but the PE
number cannot be configured and is equal to the segment number).

Additionally, the M64's can overlap, in which case we have a well
defined precedence order, which allows us to create a "backing" M64
that covers the entire 64G window going to the PCIe for "normal" 64-bit
BARs, and to overlay on top of it M64's appropriately sized and
positioned to cover IOV BARs (or in some cases, single-PE M64's to cover
very large device BARs in order to avoid using too many PEs in the
"backing" M64).

Cheers,
Ben.

> Bjorn
> 
> 
> commit 6f46b79d243c24fd02c662c43aec6c829013ff64
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Fri Jan 30 11:01:59 2015 -0600
> 
>     Try to fix references to M64 window vs M64 BARs.  If there really is only
>     one M64 window, I'm still a little confused about why there are so many
>     places that seem to mention multiple M64 windows.
> 
> diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> index 10d4ac2f25b5..140df9cb58bd 100644
> --- a/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> +++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> @@ -59,7 +59,7 @@ interrupt.
>   * Outbound. That's where the tricky part is.
>  
>  The PHB basically has a concept of "windows" from the CPU address space to the
> -PCI address space. There is one M32 window and 16 M64 windows. They have different
> +PCI address space. There is one M32 window and one M64 window. They have different
>  characteristics. First what they have in common: they are configured to forward a
>  configurable portion of the CPU address space to the PCIe bus and must be naturally
>  aligned power of two in size. The rest is different:
> @@ -89,29 +89,31 @@ Ideally we would like to be able to have individual functions in PE's but that
>  would mean using a completely different address allocation scheme where individual
>  function BARs can be "grouped" to fit in one or more segments....
>  
> - - The M64 windows.
> + - The M64 window:
>  
> -   * Their smallest size is 1M
> +   * Must be at least 256MB in size
>  
> -   * They do not translate addresses (the address on PCIe is the same as the
> +   * Does not translate addresses (the address on PCIe is the same as the
>  address on the PowerBus. There is a way to also set the top 14 bits which are
>  not conveyed by PowerBus but we don't use this).
>  
> -   * They can be configured to be segmented or not. When segmented, they have
> +   * Can be configured to be segmented or not. When segmented, it has
>  256 segments, however they are not remapped. The segment number *is* the PE
>  number. When not segmented, the PE number can be specified for the entire
>  window.
>  
> -   * They support overlaps in which case there is a well defined ordering of
> +   * Supports overlaps in which case there is a well defined ordering of
>  matching (I don't remember off hand which of the lower or higher numbered
>  window takes priority but basically it's well defined).
> +^^^^^^ This sounds like there are multiple M64 windows.   Or maybe this
> +paragraph is really about overlaps between M64 *BARs*, not M64 windows.
>  
>  We have code (fairly new compared to the M32 stuff) that exploits that for
>  large BARs in 64-bit space:
>  
> -We create a single big M64 that covers the entire region of address space that
> +We configure the M64 to cover the entire region of address space that
>  has been assigned by FW for the PHB (about 64G, ignore the space for the M32,
> -it comes out of a different "reserve"). We configure that window as segmented.
> +it comes out of a different "reserve"). We configure it as segmented.
>  
>  Then we do the same thing as with M32, using the bridge alignment trick, to
>  match to those giant segments.
> @@ -133,15 +135,15 @@ the other ones for that "domain". We thus introduce the concept of "master PE"
>  which is the one used for DMA, MSIs etc... and "secondary PEs" that are used
>  for the remaining M64 segments.
>  
> -We would like to investigate using additional M64's in "single PE" mode to
> +We would like to investigate using additional M64 BARs (?) in "single PE" mode to
>  overlay over specific BARs to work around some of that, for example for devices
>  with very large BARs (some GPUs), it would make sense, but we haven't done it
>  yet.
>  
> -Finally, the plan to use M64 for SR-IOV, which will be described more in next
> +Finally, the plan to use M64 BARs for SR-IOV, which will be described more in next
>  two sections. So for a given IOV BAR, we need to effectively reserve the
>  entire 256 segments (256 * IOV BAR size) and then "position" the BAR to start at
> -the beginning of a free range of segments/PEs inside that M64.
> +the beginning of a free range of segments/PEs inside that M64 BAR.
>  
>  The goal is of course to be able to give a separate PE for each VF...
>  
> 
> commit 0f069e6a30e4c3de02f8c60aadd64fb64d434e7d
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Thu Jan 29 13:37:49 2015 -0600
> 
>     This adds description about M64 BARs.  Previously, these were mentioned,
>     but I don't think there was actually anything specific about how they
>     worked.
> 
> diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> index 140df9cb58bd..2e4811fae7fb 100644
> --- a/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> +++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> @@ -58,7 +58,7 @@ interrupt.
>  
>   * Outbound. That's where the tricky part is.
>  
> -The PHB basically has a concept of "windows" from the CPU address space to the
> +Like other PCI host bridges, the Power8 IODA2 PHB supports "windows" from the CPU address space to the
>  PCI address space. There is one M32 window and one M64 window. They have different
>  characteristics. First what they have in common: they are configured to forward a
>  configurable portion of the CPU address space to the PCIe bus and must be naturally
> @@ -140,6 +140,69 @@ overlay over specific BARs to work around some of that, for example for devices
>  with very large BARs (some GPUs), it would make sense, but we haven't done it
>  yet.
>  
> + - The M64 BARs.
> +
> +IODA2 has 16 M64 "BARs."  These are not traditional PCI BARs that assign
> +space for device registers or memory, and they're not normal window
> +registers that describe the base and size of a bridge aperture.
> +
> +Rather, these M64 BARs associate pieces of an existing M64 window with PEs.
> +The BAR describes a region of a window, and the region is divided into 256
> +segments, just like a segmented M64 window.  As with segmented M64 windows,
> +there's no lookup table: the segment number is the PE#.  The minimum size
> +of a segment is 1MB, so each M64 BAR covers at least 256MB of space in an
> +M64 window.
> +
> +The advantage of the M64 BARs is that they can be programmed to cover only
> +part of an M64 window, and you can use several of them at the same time.
> +That makes them useful for SR-IOV Virtual Functions, because each VF can be
> +assigned to a separate PE.
> +
> +SR-IOV BACKGROUND
> +
> +The PCIe SR-IOV feature allows a single Physical Function (PF) to support
> +several Virtual Functions (VFs).  Registers in the PF's SR-IOV Capability
> +control the number of VFs, whether the VFs are enabled, and the MMIO
> +resources assigned to the VFs.
> +
> +Each VF has its own VF BARs.  Software can write to a normal PCI BAR to
> +discover the BAR size and assign address for it.  VF BARs aren't like that;
> +the size discovery and address assignment is done via BARs in the *PF*
> +SR-IOV Capability, and the BARs in VF config space are read-only zeros.
> +
> +When a PF SR-IOV BAR is programmed, it sets the base address for all the
> +corresponding VF BARs.  For example, if the PF SR-IOV Capability is
> +programmed to enable eight VFs, and it describes a 1MB BAR 0 for those VFs,
> +the address in that PF BAR sets the base of an 8MB region that contains all
> +eight of the VF BARs.
> +
> +STRATEGIES FOR ISOLATING VFs IN PEs:
> +
> +- M32 window: There's one M32 window, and it is split into 256
> +  equally-sized segments.  The finest granularity possible is a 256MB
> +  window with 1MB segments.  VF BARs that are 1MB or larger could be mapped
> +  to separate PEs in this window.  Each segment can be individually mapped
> +  to a PE via the lookup table, so this is quite flexible, but it works
> +  best when all the VF BARs are the same size.  If they are different
> +  sizes, the entire window has to be small enough that the segment matches
> +  the smallest VF BAR, and larger VF BARs span several segments.
> +
> +- M64 window: A non-segmented M64 window is mapped entirely to a single PE,
> +  so it could only isolate one VF.  A segmented M64 window could be used
> +  just like the M32 window, but the segments can't be individually mapped
> +  to PEs (the segment number is the PE number), so there isn't as much
> +  flexibility.  A VF with multiple BARs would have to be in a "domain"
> +  of multiple PEs, which is not as well isolated as a single PE.
> +
> +- M64 BAR: An M64 BAR effectively segments a region of an M64 window.  As
> +  usual, the region is split into 256 equally-sized pieces, and as in
> +  segmented M64 windows, the segment number is the PE number.  But there
> +  are several M64 BARs, and they can be set to different base addresses and
> +  different segment sizes.  So if we have VFs that each have a 1MB BAR and
> +  a 32MB BAR, we could use one M64 BAR to assign 1MB segments and another
> +  M64 BAR to assign 32MB segments.
> +
> +
>  Finally, the plan to use M64 BARs for SR-IOV, which will be described more in next
>  two sections. So for a given IOV BAR, we need to effectively reserve the
>  entire 256 segments (256 * IOV BAR size) and then "position" the BAR to start at



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 06/17] powerpc/pci: Add PCI resource alignment documentation
@ 2015-02-10  1:02             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-10  1:02 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Wei Yang, linuxppc-dev, gwshan

On Wed, 2015-02-04 at 17:44 -0600, Bjorn Helgaas wrote:
> > 
> > diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> > new file mode 100644
> > index 0000000..10d4ac2
> > --- /dev/null
> > +++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> 
> I added the following two patches on top of this because I'm still confused
> about the difference between the M64 window and the M64 BARs.  Several
> parts of the writeup seem to imply that there are several M64 windows, but
> that seems to be incorrect.
> 
> And I tried to write something about M64 BARs, too.  But it could well be
> incorrect.
> 
> Please correct as necessary.  Ultimately I'll just fold everything into the
> original patch so there's only one.

The way the HW works is that 2 windows of the CPU address space are
routed to each PHB. One is used for 32-bit stuff and one is used for
64-bit stuff (it doesn't have to be and it's not fixed in HW which is
which, it's just two windows of the fabric being forwarded but that's
how we use them). The FW configures them, one is 4G and the other one is
today 64G but that might get increased at some point.

(Actually there's a 3rd window but it's exclusively used for the PHB
own registers so we can ignore it here).

Once an MMIO cycle hit one of the above window on the powerbus it gets
forwarded to the PHB.

Now the PHB itself contains a number of "BARs" which aren't the same
thing as device BARs so it's confusing and I tend to call them "windows"
for that reason. They are made of pairs of registers indicating an
address and a size (sort-of, the M64 ones are actually in some CAM in
the chip but that's a register access method detail that is not relevant
here).

 - One M32. It's limited to 4G in size, and has the specific attribute
that the top bits of the address from the powerbus are dropped (and
replaced with the content of a register) thus allowing this "window" to
target the 32-bit MMIO space from anywhere in the CPU 50-bit bus space.
This is setup at boot time, and we can probably ignore it here. It has
it's own segmenting for PEs which is a bit different from 64-bit stuff
as it goes through a remapping table allowing to configure which PE each
segment maps to.

 - 16 M64's. Each of these can be configured individually to pass a
portion of the above "window" space to the PCIe bus. There is no
remapping in that case (the powerbus addresses are passed 1:1). Each of
those M64's can be configured to have either a single PE (in which case
the PE number can be configured) or to be segmented (256 PE's but the PE
number cannot be configured and is equal to the segment number).

Additionally, the M64's can overlap, in which case we have a well
defined precedence order, which allows us to create a "backing" M64
that cover the entire 64G window going to the PCIe for "normal" 64-bit
BARs and overlap on top of that M64's appropriately sized and positioned
to cover IOV BARs (or in some case, single-PE M64's to cover very large
device BARs in order to avoid using too many PE's in the "backing" M64).

Cheers,
Ben.

> Bjorn
> 
> 
> commit 6f46b79d243c24fd02c662c43aec6c829013ff64
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Fri Jan 30 11:01:59 2015 -0600
> 
>     Try to fix references to M64 window vs M64 BARs.  If there really is only
>     one M64 window, I'm still a little confused about why there are so many
>     places that seem to mention multiple M64 windows.
> 
> diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> index 10d4ac2f25b5..140df9cb58bd 100644
> --- a/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> +++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> @@ -59,7 +59,7 @@ interrupt.
>   * Outbound. That's where the tricky part is.
>  
>  The PHB basically has a concept of "windows" from the CPU address space to the
> -PCI address space. There is one M32 window and 16 M64 windows. They have different
> +PCI address space. There is one M32 window and one M64 window. They have different
>  characteristics. First what they have in common: they are configured to forward a
>  configurable portion of the CPU address space to the PCIe bus and must be naturally
>  aligned power of two in size. The rest is different:
> @@ -89,29 +89,31 @@ Ideally we would like to be able to have individual functions in PE's but that
>  would mean using a completely different address allocation scheme where individual
>  function BARs can be "grouped" to fit in one or more segments....
>  
> - - The M64 windows.
> + - The M64 window:
>  
> -   * Their smallest size is 1M
> +   * Must be at least 256MB in size
>  
> -   * They do not translate addresses (the address on PCIe is the same as the
> +   * Does not translate addresses (the address on PCIe is the same as the
>  address on the PowerBus. There is a way to also set the top 14 bits which are
>  not conveyed by PowerBus but we don't use this).
>  
> -   * They can be configured to be segmented or not. When segmented, they have
> +   * Can be configured to be segmented or not. When segmented, it has
>  256 segments, however they are not remapped. The segment number *is* the PE
>  number. When no segmented, the PE number can be specified for the entire
>  window.
>  
> -   * They support overlaps in which case there is a well defined ordering of
> +   * Supports overlaps in which case there is a well defined ordering of
>  matching (I don't remember off hand which of the lower or higher numbered
>  window takes priority but basically it's well defined).
> +^^^^^^ This sounds like there are multiple M64 windows.   Or maybe this
> +paragraph is really about overlaps between M64 *BARs*, not M64 windows.
>  
>  We have code (fairly new compared to the M32 stuff) that exploits that for
>  large BARs in 64-bit space:
>  
> -We create a single big M64 that covers the entire region of address space that
> +We configure the M64 to cover the entire region of address space that
>  has been assigned by FW for the PHB (about 64G, ignore the space for the M32,
> -it comes out of a different "reserve"). We configure that window as segmented.
> +it comes out of a different "reserve"). We configure it as segmented.
>  
>  Then we do the same thing as with M32, using the bridge aligment trick, to
>  match to those giant segments.
> @@ -133,15 +135,15 @@ the other ones for that "domain". We thus introduce the concept of "master PE"
>  which is the one used for DMA, MSIs etc... and "secondary PEs" that are used
>  for the remaining M64 segments.
>  
> -We would like to investigate using additional M64's in "single PE" mode to
> +We would like to investigate using additional M64 BARs (?) in "single PE" mode to
>  overlay over specific BARs to work around some of that, for example for devices
>  with very large BARs (some GPUs), it would make sense, but we haven't done it
>  yet.
>  
> -Finally, the plan to use M64 for SR-IOV, which will be described more in next
> +Finally, the plan to use M64 BARs for SR-IOV, which will be described more in next
>  two sections. So for a given IOV BAR, we need to effectively reserve the
>  entire 256 segments (256 * IOV BAR size) and then "position" the BAR to start at
> -the beginning of a free range of segments/PEs inside that M64.
> +the beginning of a free range of segments/PEs inside that M64 BAR.
>  
>  The goal is of course to be able to give a separate PE for each VF...
>  
> 
> commit 0f069e6a30e4c3de02f8c60aadd64fb64d434e7d
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Thu Jan 29 13:37:49 2015 -0600
> 
>     This adds description about M64 BARs.  Previously, these were mentioned,
>     but I don't think there was actually anything specific about how they
>     worked.
> 
> diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> index 140df9cb58bd..2e4811fae7fb 100644
> --- a/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> +++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> @@ -58,7 +58,7 @@ interrupt.
>  
>   * Outbound. That's where the tricky part is.
>  
> -The PHB basically has a concept of "windows" from the CPU address space to the
> +Like other PCI host bridges, the Power8 IODA2 PHB supports "windows" from the CPU address space to the
>  PCI address space. There is one M32 window and one M64 window. They have different
>  characteristics. First what they have in common: they are configured to forward a
>  configurable portion of the CPU address space to the PCIe bus and must be naturally
> @@ -140,6 +140,69 @@ overlay over specific BARs to work around some of that, for example for devices
>  with very large BARs (some GPUs), it would make sense, but we haven't done it
>  yet.
>  
> + - The M64 BARs.
> +
> +IODA2 has 16 M64 "BARs."  These are not traditional PCI BARs that assign
> +space for device registers or memory, and they're not normal window
> +registers that describe the base and size of a bridge aperture.
> +
> +Rather, these M64 BARs associate pieces of an existing M64 window with PEs.
> +The BAR describes a region of a window, and the region is divided into 256
> +segments, just like a segmented M64 window.  As with segmented M64 windows,
> +there's no lookup table: the segment number is the PE#.  The minimum size
> +of a segment is 1MB, so each M64 BAR covers at least 256MB of space in an
> +M64 window.
> +
> +The advantage of the M64 BARs is that they can be programmed to cover only
> +part of an M64 window, and you can use several of them at the same time.
> +That makes them useful for SR-IOV Virtual Functions, because each VF can be
> +assigned to a separate PE.
> +
> +SR-IOV BACKGROUND
> +
> +The PCIe SR-IOV feature allows a single Physical Function (PF) to support
> +several Virtual Functions (VFs).  Registers in the PF's SR-IOV Capability
> +control the number of VFs, whether the VFs are enabled, and the MMIO
> +resources assigned to the VFs.
> +
> +Each VF has its own VF BARs.  Software can write to a normal PCI BAR to
> +discover the BAR size and assign address for it.  VF BARs aren't like that;
> +the size discovery and address assignment is done via BARs in the *PF*
> +SR-IOV Capability, and the BARs in VF config space are read-only zeros.
> +
> +When a PF SR-IOV BAR is programmed, it sets the base address for all the
> +corresponding VF BARs.  For example, if the PF SR-IOV Capability is
> +programmed to enable eight VFs, and it describes a 1MB BAR 0 for those VFs,
> +the address in that PF BAR sets the base of an 8MB region that contains all
> +eight of the VF BARs.
> +
> +STRATEGIES FOR ISOLATING VFs IN PEs:
> +
> +- M32 window: There's one M32 window, and it is split into 256
> +  equally-sized segments.  The finest granularity possible is a 256MB
> +  window with 1MB segments.  VF BARs that are 1MB or larger could be mapped
> +  to separate PEs in this window.  Each segment can be individually mapped
> +  to a PE via the lookup table, so this is quite flexible, but it works
> +  best when all the VF BARs are the same size.  If they are different
> +  sizes, the entire window has to be small enough that the segment matches
> +  the smallest VF BAR, and larger VF BARs span several segments.
> +
> +- M64 window: A non-segmented M64 window is mapped entirely to a single PE,
> +  so it could only isolate one VF.  A segmented M64 window could be used
> +  just like the M32 window, but the segments can't be individually mapped
> +  to PEs (the segment number is the PE number), so there isn't as much
> +  flexibility.  A VF with multiple BARs would have to be in a "domain"
> +  of multiple PEs, which is not as well isolated as a single PE.
> +
> +- M64 BAR: An M64 BAR effectively segments a region of an M64 window.  As
> +  usual, the region is split into 256 equally-sized pieces, and as in
> +  segmented M64 windows, the segment number is the PE number.  But there
> +  are several M64 BARs, and they can be set to different base addresses and
> +  different segment sizes.  So if we have VFs that each have a 1MB BAR and
> +  a 32MB BAR, we could use one M64 BAR to assign 1MB segments and another
> +  M64 BAR to assign 32MB segments.
> +
> +
>  Finally, we plan to use M64 BARs for SR-IOV, which will be described more in the
>  next two sections. So for a given IOV BAR, we need to effectively reserve the
>  entire 256 segments (256 * IOV BAR size) and then "position" the BAR to start at
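
As a concrete illustration of the two mechanisms described in the documentation above, the arithmetic can be sketched in C. The helper names below are invented for illustration and are not kernel API; the base addresses and sizes are made-up values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * A PF SR-IOV BAR sets the base of a contiguous region holding one
 * copy of the VF BAR per VF, so VF n's copy starts n * size in.
 */
static uint64_t vf_bar_base(uint64_t pf_sriov_bar_base,
                            uint64_t vf_bar_size, unsigned int vf_index)
{
	return pf_sriov_bar_base + (uint64_t)vf_index * vf_bar_size;
}

/*
 * In a segmented M64 window (or an M64 BAR covering part of one), the
 * region is split into 256 equal segments and there is no lookup
 * table: the segment number is the PE number.
 */
static unsigned int m64_pe_for_addr(uint64_t region_base,
                                    uint64_t region_size, uint64_t addr)
{
	uint64_t seg_size = region_size / 256;

	return (unsigned int)((addr - region_base) / seg_size);
}
```

With a 256MB region the segment size is the 1MB minimum, so a VF whose 1MB BAR copy lands in segment n is isolated in PE n.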

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 02/17] PCI/IOV: add VF enable/disable hook
  2015-02-10  0:26           ` Benjamin Herrenschmidt
@ 2015-02-10  1:35             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-10  1:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wei Yang, bhelgaas, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 10, 2015 at 11:26:19AM +1100, Benjamin Herrenschmidt wrote:
>On Thu, 2015-01-15 at 10:27 +0800, Wei Yang wrote:
>> +       if ((retval = pcibios_sriov_enable(dev, initial))) {
>> +               dev_err(&dev->dev, "Failure %d from
>> pcibios_sriov_setup()\n",
>> +                       retval);
>> +               return retval;
>> +       }
>> +
>
>Don't we want pcibios_sriov_enable() to be able to crop the number
>of VFs or do we think any resource limits have been applied
>already ?

The second parameter "initial" is the number of VFs to be enabled. The
arch-dependent function will check the resources for this number of VFs.

Did I understand your question correctly?

>
>Ben.
>

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 03/17] PCI: Add weak pcibios_iov_resource_alignment() interface
  2015-02-10  0:32           ` Benjamin Herrenschmidt
@ 2015-02-10  1:44             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-10  1:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wei Yang, bhelgaas, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 10, 2015 at 11:32:59AM +1100, Benjamin Herrenschmidt wrote:
>On Thu, 2015-01-15 at 10:27 +0800, Wei Yang wrote:
>> The alignment of PF's IOV BAR is designed to be the size of an individual
>> VF BAR. This works fine for many platforms, but on PowerNV platform
>> it needs some change.
>> 
>> The original alignment works, since at sizing and assigning stage the
>> requirement is from an individual VF's BAR size instead of the PF's IOV
>> BAR.  This is the reason for the original code to just retrieve the
>> individual VF BAR size as the alignment.
>> 
>> On PowerNV platform, it is required to align the whole PF IOV BAR to a
>> hardware segment. Based on this fact, the alignment of PF's IOV BAR should
>> be calculated separately.
>> 
>> This patch introduces a weak pcibios_iov_resource_alignment() interface,
>> which gives platform a chance to implement specific method to calculate
>> the PF's IOV BAR alignment.
>
>While the patch is probably fine, I find the above explanation quite
>confusing :)
>

I will try to make it more clear.

>>From my memory (vague now) of the scheme we put in place, we need to
>practically reserve a portion of address space that corresponds to
>VF_size * Number_of_PEs. IE, it's not just the alignment that has
>constraints but also the size that need to be allocated.
>
>Now I suppose if we make the alignment to be the size of the M64
>window and if the core also bounces the allocated size to the
>alignment boundary, then we are fine, but that should be explained.
>

The purpose of this patch is to give different archs a chance to calculate
the alignment of the PF's IOV BAR.

How about I move the detailed explanation of the powernv platform into the
following patch, and focus on what this patch does in this log?

>Cheers,
>Ben.
>
>
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  drivers/pci/iov.c   |   11 ++++++++++-
>>  include/linux/pci.h |    3 +++
>>  2 files changed, 13 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index 933d8cc..5f48201 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -556,6 +556,12 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
>>  		4 * (resno - PCI_IOV_RESOURCES);
>>  }
>>  
>> +resource_size_t __weak pcibios_iov_resource_alignment(struct pci_dev *dev,
>> +		int resno, resource_size_t align)
>> +{
>> +	return align;
>> +}
>> +
>>  /**
>>   * pci_sriov_resource_alignment - get resource alignment for VF BAR
>>   * @dev: the PCI device
>> @@ -570,12 +576,15 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
>>  {
>>  	struct resource tmp;
>>  	int reg = pci_iov_resource_bar(dev, resno);
>> +	resource_size_t align;
>>  
>>  	if (!reg)
>>  		return 0;
>>  
>>  	__pci_read_base(dev, pci_bar_unknown, &tmp, reg);
>> -	return resource_alignment(&tmp);
>> +	align = resource_alignment(&tmp);
>> +
>> +	return pcibios_iov_resource_alignment(dev, resno, align);
>>  }
>>  
>>  /**
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 74ef944..ae7a7ea 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -1163,6 +1163,9 @@ unsigned char pci_bus_max_busnr(struct pci_bus *bus);
>>  void pci_setup_bridge(struct pci_bus *bus);
>>  resource_size_t pcibios_window_alignment(struct pci_bus *bus,
>>  					 unsigned long type);
>> +resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev,
>> +						 int resno,
>> +						 resource_size_t align);
>>  
>>  #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
>>  #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
>

-- 
Richard Yang
Help you, Help me
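
To make the intent of the weak hook concrete, here is a hypothetical arch override in the spirit of what powernv needs: round the per-VF alignment up to an assumed M64 segment size so the whole IOV BAR lands on a segment boundary. The function name, the 32MB segment size, and the body are assumptions for illustration only; the real powernv implementation appears later in the series.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t resource_size_t;

/* Assumed hardware segment size, for illustration only. */
#define M64_SEGMENT_SIZE (32ULL << 20)

/*
 * Hypothetical arch override of the weak hook: the generic code passes
 * in the default alignment (one VF BAR size) and the arch may enlarge
 * it so the PF's IOV BAR aligns to a hardware segment.
 */
static resource_size_t arch_iov_resource_alignment(resource_size_t align)
{
	return align < M64_SEGMENT_SIZE ? M64_SEGMENT_SIZE : align;
}
```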


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 07/17] powerpc/pci: Don't unset pci resources for VFs
  2015-02-10  0:36           ` Benjamin Herrenschmidt
@ 2015-02-10  1:51             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-10  1:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wei Yang, bhelgaas, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 10, 2015 at 11:36:24AM +1100, Benjamin Herrenschmidt wrote:
>On Thu, 2015-01-15 at 10:27 +0800, Wei Yang wrote:
>> If we're going to reassign resources with flag PCI_REASSIGN_ALL_RSRC, all
>> resources will be cleaned out during device header fixup time and then get
>> reassigned by PCI core. However, the VF resources won't be reassigned and
>> thus, we shouldn't clean them out.
>> 
>> This patch adds a condition. If the pci_dev is a VF, skip the resource
>> unset process.
>
>I don't understand this, can you elaborate ? Why wouldn't we reassign
>the IOV resource just like everything else ?

Sure.

VFs work a little differently from normal devices. On the powernv platform,
we have PCI_REASSIGN_ALL_RSRC set, which means all resources retrieved from
hardware will be cleaned out and re-assigned by the kernel. VF resources,
however, are calculated from the PF's IOV BAR in virtfn_add(), and after
that there is no re-assign process for VFs.

>
>Ben.
>
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kernel/pci-common.c |    4 ++++
>>  1 file changed, 4 insertions(+)
>> 
>> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
>> index 37d512d..889f743 100644
>> --- a/arch/powerpc/kernel/pci-common.c
>> +++ b/arch/powerpc/kernel/pci-common.c
>> @@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
>>  		       pci_name(dev));
>>  		return;
>>  	}
>> +
>> +	if (dev->is_virtfn)
>> +		return;
>> +
>>  	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
>>  		struct resource *res = dev->resource + i;
>>  		struct pci_bus_region reg;
>

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 02/17] PCI/IOV: add VF enable/disable hook
  2015-02-10  1:35             ` Wei Yang
@ 2015-02-10  2:13               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-10  2:13 UTC (permalink / raw)
  To: Wei Yang; +Cc: bhelgaas, gwshan, linux-pci, linuxppc-dev

On Tue, 2015-02-10 at 09:35 +0800, Wei Yang wrote:
> >Don't we want pcibios_sriov_enable() to be able to crop the number
> >of VFs or do we think any resource limits have been applied
> >already ?
> 
> The second parameter "initial" is the number of VFs will be enabled.
> Arch
> dependent function will check the resources for these number of VFs.
> 
> Do I catch your question correctly?

I was wondering: if the number of VFs whose resources can be enabled is
smaller, should the arch function be able to return that smaller number,
and we would then enable that many?

Ie, have the arch function be able to "update" the value of
"initial" (by passing it by pointer for example).
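
A minimal sketch of that suggestion, with invented names rather than the actual hook signature: the arch function receives the requested VF count by pointer and may crop it to what its resources allow.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch only; the name and the PE limit are invented. */
static int pcibios_sriov_enable_sketch(uint16_t *nr_vfs)
{
	const uint16_t available_pes = 8;	/* assumed platform limit */

	/* Crop the request instead of failing outright. */
	if (*nr_vfs > available_pes)
		*nr_vfs = available_pes;
	return 0;
}
```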

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 07/17] powerpc/pci: Don't unset pci resources for VFs
  2015-02-10  1:51             ` Wei Yang
@ 2015-02-10  2:14               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-10  2:14 UTC (permalink / raw)
  To: Wei Yang; +Cc: bhelgaas, gwshan, linux-pci, linuxppc-dev

On Tue, 2015-02-10 at 09:51 +0800, Wei Yang wrote:
> On Tue, Feb 10, 2015 at 11:36:24AM +1100, Benjamin Herrenschmidt wrote:
> >On Thu, 2015-01-15 at 10:27 +0800, Wei Yang wrote:
> >> If we're going to reassign resources with flag PCI_REASSIGN_ALL_RSRC, all
> >> resources will be cleaned out during device header fixup time and then get
> >> reassigned by PCI core. However, the VF resources won't be reassigned and
> >> thus, we shouldn't clean them out.
> >> 
> >> This patch adds a condition. If the pci_dev is a VF, skip the resource
> >> unset process.
> >
> >I don't understand this, can you elaborate ? Why wouldn't we reassign
> >the IOV resource just like everything else ?
> 
> Sure.
> 
> VFs work a little bit different from normal devices. On powernv platform, we
> have PCI_REASSIGN_ALL_RSRC set, which means all resource retrieved from
> hardware will be cleaned and re-assigned by kernel. While VF's resources are
> calculated from PF's IOV BAR, in virtfn_add(). And after this, there is not
> re-assign process for VFs.

I still don't understand. You mean SR-IOV is assigned before we assign
everybody else ? That doesn't make sense to me...

Ben.

> >
> >Ben.
> >
> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/kernel/pci-common.c |    4 ++++
> >>  1 file changed, 4 insertions(+)
> >> 
> >> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> >> index 37d512d..889f743 100644
> >> --- a/arch/powerpc/kernel/pci-common.c
> >> +++ b/arch/powerpc/kernel/pci-common.c
> >> @@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
> >>  		       pci_name(dev));
> >>  		return;
> >>  	}
> >> +
> >> +	if (dev->is_virtfn)
> >> +		return;
> >> +
> >>  	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
> >>  		struct resource *res = dev->resource + i;
> >>  		struct pci_bus_region reg;
> >
> 



^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 02/17] PCI/IOV: add VF enable/disable hook
  2015-02-10  2:13               ` Benjamin Herrenschmidt
@ 2015-02-10  6:18                 ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-10  6:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wei Yang, bhelgaas, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 10, 2015 at 01:13:14PM +1100, Benjamin Herrenschmidt wrote:
>On Tue, 2015-02-10 at 09:35 +0800, Wei Yang wrote:
>> >Don't we want pcibios_sriov_enable() to be able to crop the number
>> >of VFs or do we think any resource limits have been applied
>> >already ?
>> 
>> The second parameter "initial" is the number of VFs will be enabled.
>> Arch
>> dependent function will check the resources for these number of VFs.
>> 
>> Do I catch your question correctly?
>
>I was wondering if the number of resource that can be enabled is
>smaller, should the arch function be able to return that smaller
>number and we would still enable that number ?
>
>Ie, have the arch function be able to "update" the value of
>"initial" (by passing it by pointer for example).

This would increase the time needed to enable SR-IOV and block other
drivers from enabling SR-IOV.

On the powernv platform, the resources needed are M64 BARs and PE numbers.
Currently they are acquired separately, and we have a lock protecting each
of them. If we applied the logic you mentioned, we would need a "bigger"
lock protecting both while we try different values, since at the same time
other drivers may want to enable SR-IOV too. We have to protect against
this contention.

Another example: a PF has two IOV BARs but just one M64 BAR is left in the
system. In this case, no matter how many VFs we try to enable, it will fail.

>
>Cheers,
>Ben.
>
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 07/17] powerpc/pci: Don't unset pci resources for VFs
  2015-02-10  2:14               ` Benjamin Herrenschmidt
@ 2015-02-10  6:25                 ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-02-10  6:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wei Yang, bhelgaas, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 10, 2015 at 01:14:11PM +1100, Benjamin Herrenschmidt wrote:
>On Tue, 2015-02-10 at 09:51 +0800, Wei Yang wrote:
>> On Tue, Feb 10, 2015 at 11:36:24AM +1100, Benjamin Herrenschmidt wrote:
>> >On Thu, 2015-01-15 at 10:27 +0800, Wei Yang wrote:
>> >> If we're going to reassign resources with flag PCI_REASSIGN_ALL_RSRC, all
>> >> resources will be cleaned out during device header fixup time and then get
>> >> reassigned by PCI core. However, the VF resources won't be reassigned and
>> >> thus, we shouldn't clean them out.
>> >> 
>> >> This patch adds a condition. If the pci_dev is a VF, skip the resource
>> >> unset process.
>> >
>> >I don't understand this, can you elaborate ? Why wouldn't we reassign
>> >the IOV resource just like everything else ?
>> 
>> Sure.
>> 
>> VFs work a little bit different from normal devices. On powernv platform, we
>> have PCI_REASSIGN_ALL_RSRC set, which means all resource retrieved from
>> hardware will be cleaned and re-assigned by kernel. While VF's resources are
>> calculated from PF's IOV BAR, in virtfn_add(). And after this, there is not
>> re-assign process for VFs.
>
>I still don't undertand, you mean SR-IOV is assigned before we assign
>everybody else ? That doesn't make sense to me...
>

The PF's resources are assigned first, including its normal BARs and IOV
BARs.

Then the PF's driver creates the VFs, in virtfn_add(). In this function,
each VF's resources are calculated from its PF's IOV BAR.

If you clear a VF's resources the way you do a PF's, no one will try to
assign them again.
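
The calculation described above can be sketched as follows. The struct and helper are invented stand-ins for illustration; the real virtfn_add() works on struct resource via the PF's IOV resources.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct resource; names are illustrative. */
struct res {
	uint64_t start, end;
};

/*
 * Sketch of how a VF's BAR is derived from the PF's IOV BAR: the IOV
 * BAR consists of total_vfs equal slices, and VF vf_id owns its slice.
 */
static struct res vf_resource(const struct res *pf_iov,
                              uint16_t total_vfs, uint16_t vf_id)
{
	uint64_t size = (pf_iov->end - pf_iov->start + 1) / total_vfs;
	struct res r = {
		.start = pf_iov->start + (uint64_t)vf_id * size,
		.end   = pf_iov->start + (uint64_t)(vf_id + 1) * size - 1,
	};

	return r;
}
```

Because these values are computed from the PF rather than read back from hardware, clearing them during header fixup would leave nothing to re-assign them.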

>Ben.
>
>> >
>> >Ben.
>> >
>> >> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> >> ---
>> >>  arch/powerpc/kernel/pci-common.c |    4 ++++
>> >>  1 file changed, 4 insertions(+)
>> >> 
>> >> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
>> >> index 37d512d..889f743 100644
>> >> --- a/arch/powerpc/kernel/pci-common.c
>> >> +++ b/arch/powerpc/kernel/pci-common.c
>> >> @@ -788,6 +788,10 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
>> >>  		       pci_name(dev));
>> >>  		return;
>> >>  	}
>> >> +
>> >> +	if (dev->is_virtfn)
>> >> +		return;
>> >> +
>> >>  	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
>> >>  		struct resource *res = dev->resource + i;
>> >>  		struct pci_bus_region reg;
>> >
>> 
>

-- 
Richard Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 07/17] powerpc/pci: Don't unset pci resources for VFs
  2015-02-10  6:25                 ` Wei Yang
@ 2015-02-10  8:14                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-10  8:14 UTC (permalink / raw)
  To: Wei Yang; +Cc: bhelgaas, gwshan, linux-pci, linuxppc-dev

On Tue, 2015-02-10 at 14:25 +0800, Wei Yang wrote:
> The PF's resources will be assigned first, including normal BARs and
> IOV BARs.
> 
> Then the PF's driver will create the VFs, in virtfn_add(). In this
> function, each VF's resources are calculated from its PF's IOV BAR.
> 
> If you clear a VF's resources the way you do for PFs, nothing will ever
> try to assign them again.

So the problem is that the flag indicating a VF is lost? I.e., we should
still mark them unset, but preserve that flag?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 168+ messages in thread


* Re: [PATCH V11 06/17] powerpc/pci: Add PCI resource alignment documentation
  2015-02-10  1:02             ` Benjamin Herrenschmidt
@ 2015-02-20  0:56               ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-20  0:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Wei Yang, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 10, 2015 at 12:02:31PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2015-02-04 at 17:44 -0600, Bjorn Helgaas wrote:
> > > 
> > > diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> > > new file mode 100644
> > > index 0000000..10d4ac2
> > > --- /dev/null
> > > +++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
> > 
> > I added the following two patches on top of this because I'm still confused
> > about the difference between the M64 window and the M64 BARs.  Several
> > parts of the writeup seem to imply that there are several M64 windows, but
> > that seems to be incorrect.
> > 
> > And I tried to write something about M64 BARs, too.  But it could well be
> > incorrect.
> > 
> > Please correct as necessary.  Ultimately I'll just fold everything into the
> > original patch so there's only one.
> 
> The way the HW works is that 2 windows of the CPU address space are
> routed to each PHB. One is used for 32-bit stuff and one is used for
> 64-bit stuff (it doesn't have to be and it's not fixed in HW which is
> which, it's just two windows of the fabric being forwarded but that's
> how we use them). The FW configures them: one is 4G and the other is
> currently 64G, though that may be increased at some point.
> 
> (Actually there's a 3rd window, but it's exclusively used for the PHB's
> own registers, so we can ignore it here.)
> 
> Once an MMIO cycle hits one of the above windows on the powerbus, it gets
> forwarded to the PHB.
> 
> Now the PHB itself contains a number of "BARs" which aren't the same
> thing as device BARs so it's confusing and I tend to call them "windows"
> for that reason. They are made of pairs of registers indicating an
> address and a size (sort-of, the M64 ones are actually in some CAM in
> the chip but that's a register access method detail that is not relevant
> here).
> 
>  - One M32. It's limited to 4G in size, and has the specific attribute
> that the top bits of the address from the powerbus are dropped (and
> replaced with the contents of a register), thus allowing this "window" to
> target the 32-bit MMIO space from anywhere in the CPU's 50-bit bus space.
> This is set up at boot time, and we can probably ignore it here. It has
> its own segmenting for PEs, which is a bit different from the 64-bit
> stuff, as it goes through a remapping table that allows us to configure
> which PE each segment maps to.
> 
>  - 16 M64's. Each of these can be configured individually to pass a
> portion of the above "window" space to the PCIe bus. There is no
> remapping in that case (the powerbus addresses are passed 1:1). Each of
> those M64's can be configured to have either a single PE (in which case
> the PE number can be configured) or to be segmented (256 PE's but the PE
> number cannot be configured and is equal to the segment number).
> 
> Additionally, the M64's can overlap, in which case we have a well
> defined precedence order, which allows us to create a "backing" M64
> that covers the entire 64G window going to the PCIe for "normal" 64-bit
> BARs, and to overlay on top of it M64's appropriately sized and
> positioned to cover IOV BARs (or in some cases, single-PE M64's to cover
> very large device BARs in order to avoid using too many PEs in the
> "backing" M64).

So there are the two windows of CPU address space that are routed to the
PHB.  And the PHB contains one M32 window and sixteen M64 windows.  What
happens if the PHB receives an access to something that was in one of the
two CPU address space windows, but is not contained in M32 or one of the
M64 windows?

If that is an error or is nonsensical, then the only windows relevant to
PCI would be the M32 and M64 windows, and we could just ignore the two
top-level windows.

I squashed all my doc updates into the original and pushed it here:

https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/virtualization&id=5449d1a812d561bafe0d458132ef356765505507

If I made it say something wrong, a patch would be the best way to fix it.

Bjorn

^ permalink raw reply	[flat|nested] 168+ messages in thread


* Re: [PATCH V11 06/17] powerpc/pci: Add PCI resource alignment documentation
  2015-02-20  0:56               ` Bjorn Helgaas
@ 2015-02-20  2:41                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-20  2:41 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, gwshan, linux-pci, linuxppc-dev

On Thu, 2015-02-19 at 18:56 -0600, Bjorn Helgaas wrote:

> So there are the two windows of CPU address space that are routed to the
> PHB.  And the PHB contains one M32 window and sixteen M64 windows.  What
> happens if the PHB receives an access to something that was in one of the
> two CPU address space windows, but is not contained in M32 or one of the
> M64 windows?

Some kind of error, I don't know which one at this point, possibly fatal
(checkstop or similar) or maybe a fence of the PHB. Don't do it :-)

> If that is an error or is non-sensical, then the only windows relevant to
> PCI would be the M32 and M64 windows, and we could just ignore the
> top-level two windows.

Right. In fact we pretty much hard-wire that one of the top-level windows
is small and used for M32 with a fixed layout, and the other is big and
used for all M64's. We use one of the M64's to cover it entirely, which
gives us our "base" set of segments for allocating buses/BARs, and then we
overlay the remaining M64's on top of that first one for things like
SR-IOV.

> I squashed all my doc updates into the original and pushed it here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/virtualization&id=5449d1a812d561bafe0d458132ef356765505507
> 
> If I made it say something wrong, a patch would be the best way to fix it.

Thanks, I'll have a look

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 168+ messages in thread


* Re: [PATCH V11 01/17] PCI/IOV: Export interface for retrieve VF's BDF
  2015-01-15  2:27         ` Wei Yang
@ 2015-02-20 23:09           ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-20 23:09 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Thu, Jan 15, 2015 at 10:27:51AM +0800, Wei Yang wrote:
> When implementing SR-IOV on the PowerNV platform, some resource reservation
> is needed for VFs, which don't exist at boot time. To match resources
> with VFs, the code needs to get the VF's BDF in advance.
> 
> In this patch, it exports the interface to retrieve VF's BDF:
>    * Make the virtfn_bus as an interface
>    * Make the virtfn_devfn as an interface
>    * Rename them with more specific name
>    * Code cleanup in pci_sriov_resource_alignment()

You use these in this path:

    pci_enable_sriov
      sriov_enable
	pcibios_sriov_enable
	  add_dev_pci_info
	    for (i = 0; i < pci_sriov_get_totalvfs(pdev))
	      add_one_dev_pci_info(..., pci_iov_virtfn_bus(), ...)  <---
		pdn = kzalloc
		pdn->busno = busno
		pdn->devfn = devfn
		list_add_tail(&pdn->list, &parent->child_list)

It looks like this sets up a struct pci_dn for each VF.

Could the struct pci_dn setup be done in pcibios_add_device() instead?
Then each VF we enumerate would set up its own struct pci_dn.  That would
be a lot nicer than using this hook to iterate over all possible VFs.

You also use them in some PE setup in a similar path:

    pci_enable_sriov
      sriov_enable
        pcibios_sriov_enable
          pnv_pci_sriov_enable
            pnv_pci_vf_assign_m64
            pnv_pci_vf_resource_shift
            pnv_ioda_setup_vf_PE
              for (i = 0; i < vf_num; i++)
                pe->rid = pci_iov_virtfn_bus(...)  <---
                pnv_ioda_configure_pe(phb, pe)
                pe->tce32_table = kzalloc
                pnv_pci_ioda2_setup_dma_pe

Could this PE setup also be done in pcibios_add_device() when the VF device
itself is enumerated?  I think that would be a nicer design if it's
possible.

I'd prefer to avoid exporting pci_iov_virtfn_bus() and
pci_iov_virtfn_devfn() if possible, because they're only safe to call when
VFs are enabled (because offset & stride depend on numVFs).

> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  drivers/pci/iov.c   |   22 +++++++++++++---------
>  include/linux/pci.h |   11 +++++++++++
>  2 files changed, 24 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index ea3a82c..e76d1a0 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -19,14 +19,18 @@
>  
>  #define VIRTFN_ID_LEN	16
>  
> -static inline u8 virtfn_bus(struct pci_dev *dev, int id)
> +int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
>  {
> +	if (!dev->is_physfn)
> +		return -EINVAL;
>  	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
>  				    dev->sriov->stride * id) >> 8);
>  }
>  
> -static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
>  {
> +	if (!dev->is_physfn)
> +		return -EINVAL;
>  	return (dev->devfn + dev->sriov->offset +
>  		dev->sriov->stride * id) & 0xff;
>  }
> @@ -62,7 +66,7 @@ static inline void pci_iov_max_bus_range(struct pci_dev *dev)
>  
>  	for ( ; total >= 0; total--) {
>  		pci_iov_set_numvfs(dev, total);
> -		busnr = virtfn_bus(dev, iov->total_VFs - 1);
> +		busnr = pci_iov_virtfn_bus(dev, iov->total_VFs - 1);
>  		if (busnr > max)
>  			max = busnr;
>  	}
> @@ -108,7 +112,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>  	struct pci_bus *bus;
>  
>  	mutex_lock(&iov->dev->sriov->lock);
> -	bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
> +	bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
>  	if (!bus)
>  		goto failed;
>  
> @@ -116,7 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>  	if (!virtfn)
>  		goto failed0;
>  
> -	virtfn->devfn = virtfn_devfn(dev, id);
> +	virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
>  	virtfn->vendor = dev->vendor;
>  	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
>  	pci_setup_device(virtfn);
> @@ -179,8 +183,8 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
>  	struct pci_sriov *iov = dev->sriov;
>  
>  	virtfn = pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
> -					     virtfn_bus(dev, id),
> -					     virtfn_devfn(dev, id));
> +					     pci_iov_virtfn_bus(dev, id),
> +					     pci_iov_virtfn_devfn(dev, id));
>  	if (!virtfn)
>  		return;
>  
> @@ -255,7 +259,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  	iov->offset = offset;
>  	iov->stride = stride;
>  
> -	if (virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
> +	if (pci_iov_virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
>  		dev_err(&dev->dev, "SR-IOV: bus number out of range\n");
>  		return -ENOMEM;
>  	}
> @@ -551,7 +555,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
>  	if (!reg)
>  		return 0;
>  
> -	 __pci_read_base(dev, pci_bar_unknown, &tmp, reg);
> +	__pci_read_base(dev, pci_bar_unknown, &tmp, reg);
>  	return resource_alignment(&tmp);
>  }
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 360a966..74ef944 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1658,6 +1658,9 @@ int pci_ext_cfg_avail(void);
>  void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
>  
>  #ifdef CONFIG_PCI_IOV
> +int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
> +
>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>  void pci_disable_sriov(struct pci_dev *dev);
>  int pci_num_vf(struct pci_dev *dev);
> @@ -1665,6 +1668,14 @@ int pci_vfs_assigned(struct pci_dev *dev);
>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
>  #else
> +static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
> +{
> +	return -ENOSYS;
> +}
> +static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
> +{
> +	return -ENOSYS;
> +}
>  static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>  { return -ENODEV; }
>  static inline void pci_disable_sriov(struct pci_dev *dev) { }
> -- 
> 1.7.9.5
> 

^ permalink raw reply	[flat|nested] 168+ messages in thread


* Re: [PATCH V11 08/17] powerpc/pci: Refactor pci_dn
  2015-01-15  2:27       ` [PATCH V11 08/17] powerpc/pci: Refactor pci_dn Wei Yang
@ 2015-02-20 23:19           ` Bjorn Helgaas
  0 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-20 23:19 UTC (permalink / raw)
  To: Wei Yang; +Cc: benh, gwshan, linux-pci, linuxppc-dev

On Thu, Jan 15, 2015 at 10:27:58AM +0800, Wei Yang wrote:
> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
> 
> pci_dn is the extension of a PCI device node and is created from the
> device node. Unfortunately, VFs are enabled dynamically by the PF's
> driver and don't have corresponding device nodes or pci_dn structures.
> This patch refactors pci_dn to support VFs:
> 
>    * pci_dn is organized as a hierarchical tree. A VF's pci_dn is put
>      on the child list of the pci_dn of the PF's bridge. The pci_dn of
>      any other device is put on the child list of the pci_dn of its
>      upstream bridge.
> 
>    * A VF's pci_dn is created dynamically when the PF enables VFs and
>      destroyed when the PF disables them. The pci_dn of any other
>      device is still created from the device node, as before.
> 
>    * For any particular PCI device (VF or not), its pci_dn can be
>      found via pdev->dev.archdata.firmware_data, PCI_DN(devnode),
>      or the parent's child list. The fast path (fetching pci_dn
>      through the PCI device instance) is populated at early fixup time.
> 
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/device.h         |    3 +
>  arch/powerpc/include/asm/pci-bridge.h     |   14 +-
>  arch/powerpc/kernel/pci_dn.c              |  242 ++++++++++++++++++++++++++++-
>  arch/powerpc/platforms/powernv/pci-ioda.c |   16 ++
>  4 files changed, 270 insertions(+), 5 deletions(-)
> ...

> +#ifdef CONFIG_PCI_IOV
> +static struct pci_dn *add_one_dev_pci_info(struct pci_dn *parent,
> +					   struct pci_dev *pdev,
> +					   int busno, int devfn)
> +{
> +	struct pci_dn *pdn;
> +
> +	/* Except PHB, we always have parent firmware data */
> +	if (!parent)
> +		return NULL;
> +
> +	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
> +	if (!pdn) {
> +		pr_warn("%s: Out of memory !\n", __func__);
> +		return NULL;
> +	}
> +
> +	pdn->phb = parent->phb;
> +	pdn->parent = parent;
> +	pdn->busno = busno;
> +	pdn->devfn = devfn;
> +#ifdef CONFIG_PPC_POWERNV
> +	pdn->pe_number = IODA_INVALID_PE;
> +#endif
> +	INIT_LIST_HEAD(&pdn->child_list);
> +	INIT_LIST_HEAD(&pdn->list);
> +	list_add_tail(&pdn->list, &parent->child_list);
> +
> +	/*
> +	 * If we already have PCI device instance, lets
> +	 * bind them.
> +	 */
> +	if (pdev)
> +		pdev->dev.archdata.firmware_data = pdn;
> +
> +	return pdn;

I'd like to see this done in pcibios_add_device(), as I mentioned in
response to "[PATCH V11 01/17] PCI/IOV: Export interface for retrieve VF's
BDF".  Maybe that's not feasible for some reason, but it would be a nicer
design if it's possible.

The remove_dev_pci_info() work would be done in pcibios_release_device()
then, of course.

> +}
> +#endif // CONFIG_PCI_IOV
> +
> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
> +{
> +#ifdef CONFIG_PCI_IOV
> +	struct pci_dn *parent, *pdn;
> +	int i;
> +
> +	/* Only support IOV for now */
> +	if (!pdev->is_physfn)
> +		return pci_get_pdn(pdev);
> +
> +	/* Check if VFs have been populated */
> +	pdn = pci_get_pdn(pdev);
> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
> +		return NULL;
> +
> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
> +	parent = pci_bus_to_pdn(pdev->bus);
> +	if (!parent)
>  		return NULL;
> -	return PCI_DN(dn);
> +
> +	for (i = 0; i < vf_num; i++) {
> +		pdn = add_one_dev_pci_info(parent, NULL,
> +					   pci_iov_virtfn_bus(pdev, i),
> +					   pci_iov_virtfn_devfn(pdev, i));
> +		if (!pdn) {
> +			pr_warn("%s: Cannot create firmware data "
> +				"for VF#%d of %s\n",
> +				__func__, i, pci_name(pdev));
> +			return NULL;
> +		}
> +	}
> +#endif
> +
> +	return pci_get_pdn(pdev);
> +}
> +
> +void remove_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
> +{
> +#ifdef CONFIG_PCI_IOV
> +	struct pci_dn *parent;
> +	struct pci_dn *pdn, *tmp;
> +	int i;
> +
> +	/* Only support IOV PF for now */
> +	if (!pdev->is_physfn)
> +		return;
> +
> +	/* Check if VFs have been populated */
> +	pdn = pci_get_pdn(pdev);
> +	if (!pdn || !(pdn->flags & PCI_DN_FLAG_IOV_VF))
> +		return;
> +
> +	pdn->flags &= ~PCI_DN_FLAG_IOV_VF;
> +	parent = pci_bus_to_pdn(pdev->bus);
> +	if (!parent)
> +		return;
> +
> +	/*
> +	 * We might introduce flag to pci_dn in future
> +	 * so that we can release VF's firmware data in
> +	 * a batch mode.
> +	 */
> +	for (i = 0; i < vf_num; i++) {
> +		list_for_each_entry_safe(pdn, tmp,
> +			&parent->child_list, list) {
> +			if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
> +			    pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
> +				continue;
> +
> +			if (!list_empty(&pdn->list))
> +				list_del(&pdn->list);
> +			kfree(pdn);
> +		}
> +	}
> +#endif
>  }

^ permalink raw reply	[flat|nested] 168+ messages in thread

* Re: [PATCH V11 07/17] powerpc/pci: Don't unset pci resources for VFs
  2015-02-10  8:14                   ` Benjamin Herrenschmidt
@ 2015-02-20 23:47                     ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-20 23:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Wei Yang, gwshan, linux-pci, linuxppc-dev

On Tue, Feb 10, 2015 at 07:14:45PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2015-02-10 at 14:25 +0800, Wei Yang wrote:
> > The PF's resources will be assigned first, including normal BARs and
> > IOV BARs.
> > 
> > Then the PF's driver will create VFs in virtfn_add(). In this function,
> > the VF's resources are calculated from its PF's IOV BAR.
> > 
> > If you reset the VF's resources like the PF's, no one will try to
> > assign them again.
> 
> So the problem is that the flag indicating a VF is lost? I.e., we should
> still mark them unset, but preserve that flag?

I think the problem is that the normal path for PCI_REASSIGN_ALL_RSRC is
at boot-time, where we do this:

    pcibios_init
      pcibios_scan_phb
        pci_scan_child_bus
          ...
            pci_device_add
              pci_fixup_device(pci_fixup_header)
		pcibios_fixup_resources                       # header fixup
		  for (i = 0; i < DEVICE_COUNT_RESOURCE; i++)
		    dev->resource[i].start = 0
      pcibios_resource_survey
        pcibios_allocate_resources

and we assign dev->resource[] for everything in
pcibios_allocate_resources().

But VFs are enumerated later, when they are enabled by the PF driver after
boot, so we have this path:

    pci_enable_sriov
      sriov_enable
        virtfn_add(vf_id)
          for (i = 0; i < 6; i++)
            vf->resource[i].start = pf->resource[IOV + i].start + (size * vf_id)
          pci_device_add
            pci_fixup_device(pci_fixup_header)
              pcibios_fixup_resources                   # header fixup
                for (i = 0; i < DEVICE_COUNT_RESOURCE; i++)
                  vf->resource[i].start = 0

Here, we clear out vf->resource[0..5] in the header fixup, but we're not
going to call pcibios_allocate_resources() again to reassign them.

So I think the *intent* of PCI_REASSIGN_ALL_RSRC is preserved if
pcibios_fixup_resources() leaves the VF resources alone, because the VF
resources are completely determined by the PF resources, and the PF
resources have already been reassigned.

If my understanding is correct, I think the patch is reasonable, and I
would try to put some of this explanation into the changelog.

Bjorn


* Re: [PATCH V11 08/17] powrepc/pci: Refactor pci_dn
  2015-02-20 23:19           ` Bjorn Helgaas
@ 2015-02-23  0:13             ` Gavin Shan
  -1 siblings, 0 replies; 168+ messages in thread
From: Gavin Shan @ 2015-02-23  0:13 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Fri, Feb 20, 2015 at 05:19:17PM -0600, Bjorn Helgaas wrote:
>On Thu, Jan 15, 2015 at 10:27:58AM +0800, Wei Yang wrote:
>> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> 
>> pci_dn is the extension of a PCI device node and it's created from
>> the device node. Unfortunately, VFs are enabled dynamically by the
>> PF's driver, so they have no corresponding device nodes and thus
>> no pci_dn. The patch refactors pci_dn to support VFs:
>> 
>>    * pci_dn is organized as a hierarchy tree. A VF's pci_dn is put
>>      on the child list of the pci_dn of the PF's bridge. The pci_dn
>>      of any other device is put on the child list of the pci_dn of
>>      its upstream bridge.
>> 
>>    * A VF's pci_dn is created dynamically when the PF enables VFs
>>      and destroyed when the PF disables them. The pci_dn of any
>>      other device is still created from its device node as before.
>> 
>>    * For one particular PCI device (VF or not), its pci_dn can be
>>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
>>      or parent's list. The fast path (fetching pci_dn through PCI
>>      device instance) is populated during early fixup time.
>> 
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/device.h         |    3 +
>>  arch/powerpc/include/asm/pci-bridge.h     |   14 +-
>>  arch/powerpc/kernel/pci_dn.c              |  242 ++++++++++++++++++++++++++++-
>>  arch/powerpc/platforms/powernv/pci-ioda.c |   16 ++
>>  4 files changed, 270 insertions(+), 5 deletions(-)
>> ...
>
>> +#ifdef CONFIG_PCI_IOV
>> +static struct pci_dn *add_one_dev_pci_info(struct pci_dn *parent,
>> +					   struct pci_dev *pdev,
>> +					   int busno, int devfn)
>> +{
>> +	struct pci_dn *pdn;
>> +
>> +	/* Except PHB, we always have parent firmware data */
>> +	if (!parent)
>> +		return NULL;
>> +
>> +	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
>> +	if (!pdn) {
>> +		pr_warn("%s: Out of memory !\n", __func__);
>> +		return NULL;
>> +	}
>> +
>> +	pdn->phb = parent->phb;
>> +	pdn->parent = parent;
>> +	pdn->busno = busno;
>> +	pdn->devfn = devfn;
>> +#ifdef CONFIG_PPC_POWERNV
>> +	pdn->pe_number = IODA_INVALID_PE;
>> +#endif
>> +	INIT_LIST_HEAD(&pdn->child_list);
>> +	INIT_LIST_HEAD(&pdn->list);
>> +	list_add_tail(&pdn->list, &parent->child_list);
>> +
>> +	/*
>> +	 * If we already have PCI device instance, lets
>> +	 * bind them.
>> +	 */
>> +	if (pdev)
>> +		pdev->dev.archdata.firmware_data = pdn;
>> +
>> +	return pdn;
>
>I'd like to see this done in pcibios_add_device(), as I mentioned in
>response to "[PATCH V11 01/17] PCI/IOV: Export interface for retrieve VF's
>BDF".  Maybe that's not feasible for some reason, but it would be a nicer
>design if it's possible.
>
>The remove_dev_pci_info() work would be done in pcibios_release_device()
>then, of course.
>

Yes, it's not feasible: the PCI config accessors rely on the VF's pci_dn.
Before pcibios_add_device() is called, we need to access the VF's config
space, so we need the VF's pci_dn before pci_setup_device(), as follows:

    sriov_enable()
        pcibios_sriov_enable();     /* Currently, VF's pci_dn is created at this point */
        virtfn_add();
            virtfn_add_bus();       /* Create virtual bus if necessary */
                                    /* ---> A */
            pci_alloc_dev();        /* ---> B */
            pci_setup_device(vf);   /* Access VF's config space */
                pci_read_config_byte(vf, PCI_HEADER_TYPE);
                pci_read_config_dword(vf, PCI_CLASS_REVISION);
                pci_fixup_device(pci_fixup_early, vf);
                pci_read_irq();
                pci_read_bases();
            pci_device_add(vf);
                device_initialize(&vf->dev);
                pci_fixup_device(pci_fixup_header, vf);
                pci_init_capabilities(vf);
                pcibios_add_device(vf);

We have a couple of options here:

1) Keep current code. VF's pci_dn is going to be destroyed in
   pcibios_sriov_disable() as we're doing currently.
2) Introduce pcibios_iov_virtfn_add() (at A) for platform to override.
   VF's pci_dn is going to be destroyed in pcibios_release_device().
3) Introduce pcibios_alloc_dev() (at B) for platform to override. The
   VF's pci_dn is going to be destroyed in pcibios_release_device().

Thanks,
Gavin

>> +}
>> +#endif // CONFIG_PCI_IOV
>> +
>> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
>> +{
>> +#ifdef CONFIG_PCI_IOV
>> +	struct pci_dn *parent, *pdn;
>> +	int i;
>> +
>> +	/* Only support IOV for now */
>> +	if (!pdev->is_physfn)
>> +		return pci_get_pdn(pdev);
>> +
>> +	/* Check if VFs have been populated */
>> +	pdn = pci_get_pdn(pdev);
>> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
>> +		return NULL;
>> +
>> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
>> +	parent = pci_bus_to_pdn(pdev->bus);
>> +	if (!parent)
>>  		return NULL;
>> -	return PCI_DN(dn);
>> +
>> +	for (i = 0; i < vf_num; i++) {
>> +		pdn = add_one_dev_pci_info(parent, NULL,
>> +					   pci_iov_virtfn_bus(pdev, i),
>> +					   pci_iov_virtfn_devfn(pdev, i));
>> +		if (!pdn) {
>> +			pr_warn("%s: Cannot create firmware data "
>> +				"for VF#%d of %s\n",
>> +				__func__, i, pci_name(pdev));
>> +			return NULL;
>> +		}
>> +	}
>> +#endif
>> +
>> +	return pci_get_pdn(pdev);
>> +}
>> +
>> +void remove_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
>> +{
>> +#ifdef CONFIG_PCI_IOV
>> +	struct pci_dn *parent;
>> +	struct pci_dn *pdn, *tmp;
>> +	int i;
>> +
>> +	/* Only support IOV PF for now */
>> +	if (!pdev->is_physfn)
>> +		return;
>> +
>> +	/* Check if VFs have been populated */
>> +	pdn = pci_get_pdn(pdev);
>> +	if (!pdn || !(pdn->flags & PCI_DN_FLAG_IOV_VF))
>> +		return;
>> +
>> +	pdn->flags &= ~PCI_DN_FLAG_IOV_VF;
>> +	parent = pci_bus_to_pdn(pdev->bus);
>> +	if (!parent)
>> +		return;
>> +
>> +	/*
>> +	 * We might introduce flag to pci_dn in future
>> +	 * so that we can release VF's firmware data in
>> +	 * a batch mode.
>> +	 */
>> +	for (i = 0; i < vf_num; i++) {
>> +		list_for_each_entry_safe(pdn, tmp,
>> +			&parent->child_list, list) {
>> +			if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
>> +			    pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
>> +				continue;
>> +
>> +			if (!list_empty(&pdn->list))
>> +				list_del(&pdn->list);
>> +			kfree(pdn);
>> +		}
>> +	}
>> +#endif
>>  }
>



* Re: [PATCH V11 08/17] powrepc/pci: Refactor pci_dn
  2015-02-23  0:13             ` Gavin Shan
@ 2015-02-24  8:13               ` Bjorn Helgaas
  -1 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:13 UTC (permalink / raw)
  To: Gavin Shan; +Cc: Wei Yang, benh, linux-pci, linuxppc-dev

On Mon, Feb 23, 2015 at 11:13:49AM +1100, Gavin Shan wrote:
> On Fri, Feb 20, 2015 at 05:19:17PM -0600, Bjorn Helgaas wrote:
> >On Thu, Jan 15, 2015 at 10:27:58AM +0800, Wei Yang wrote:
> >> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> 
> >> pci_dn is the extension of a PCI device node and it's created from
> >> the device node. Unfortunately, VFs are enabled dynamically by the
> >> PF's driver, so they have no corresponding device nodes and thus
> >> no pci_dn. The patch refactors pci_dn to support VFs:
> >> 
> >>    * pci_dn is organized as a hierarchy tree. A VF's pci_dn is put
> >>      on the child list of the pci_dn of the PF's bridge. The pci_dn
> >>      of any other device is put on the child list of the pci_dn of
> >>      its upstream bridge.
> >> 
> >>    * A VF's pci_dn is created dynamically when the PF enables VFs
> >>      and destroyed when the PF disables them. The pci_dn of any
> >>      other device is still created from its device node as before.
> >> 
> >>    * For one particular PCI device (VF or not), its pci_dn can be
> >>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
> >>      or parent's list. The fast path (fetching pci_dn through PCI
> >>      device instance) is populated during early fixup time.
> >> 
> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/include/asm/device.h         |    3 +
> >>  arch/powerpc/include/asm/pci-bridge.h     |   14 +-
> >>  arch/powerpc/kernel/pci_dn.c              |  242 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/platforms/powernv/pci-ioda.c |   16 ++
> >>  4 files changed, 270 insertions(+), 5 deletions(-)
> >> ...
> >
> >> +#ifdef CONFIG_PCI_IOV
> >> +static struct pci_dn *add_one_dev_pci_info(struct pci_dn *parent,
> >> +					   struct pci_dev *pdev,
> >> +					   int busno, int devfn)
> >> +{
> >> +	struct pci_dn *pdn;
> >> +
> >> +	/* Except PHB, we always have parent firmware data */
> >> +	if (!parent)
> >> +		return NULL;
> >> +
> >> +	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
> >> +	if (!pdn) {
> >> +		pr_warn("%s: Out of memory !\n", __func__);
> >> +		return NULL;
> >> +	}
> >> +
> >> +	pdn->phb = parent->phb;
> >> +	pdn->parent = parent;
> >> +	pdn->busno = busno;
> >> +	pdn->devfn = devfn;
> >> +#ifdef CONFIG_PPC_POWERNV
> >> +	pdn->pe_number = IODA_INVALID_PE;
> >> +#endif
> >> +	INIT_LIST_HEAD(&pdn->child_list);
> >> +	INIT_LIST_HEAD(&pdn->list);
> >> +	list_add_tail(&pdn->list, &parent->child_list);
> >> +
> >> +	/*
> >> +	 * If we already have PCI device instance, lets
> >> +	 * bind them.
> >> +	 */
> >> +	if (pdev)
> >> +		pdev->dev.archdata.firmware_data = pdn;
> >> +
> >> +	return pdn;
> >
> >I'd like to see this done in pcibios_add_device(), as I mentioned in
> >response to "[PATCH V11 01/17] PCI/IOV: Export interface for retrieve VF's
> >BDF".  Maybe that's not feasible for some reason, but it would be a nicer
> >design if it's possible.
> >
> >The remove_dev_pci_info() work would be done in pcibios_release_device()
> >then, of course.
> >
> 
> Yes, it's not feasible: the PCI config accessors rely on the VF's pci_dn.
> Before pcibios_add_device() is called, we need to access the VF's config
> space, so we need the VF's pci_dn before pci_setup_device(), as follows:
> 
>     sriov_enable()
>         pcibios_sriov_enable();     /* Currently, VF's pci_dn is created at this point */
>         virtfn_add();
>             virtfn_add_bus();       /* Create virtual bus if necessary */
>                                     /* ---> A */
>             pci_alloc_dev();        /* ---> B */
>             pci_setup_device(vf);   /* Access VF's config space */
>                 pci_read_config_byte(vf, PCI_HEADER_TYPE);
>                 pci_read_config_dword(vf, PCI_CLASS_REVISION);
>                 pci_fixup_device(pci_fixup_early, vf);
>                 pci_read_irq();
>                 pci_read_bases();
>             pci_device_add(vf);
>                 device_initialize(&vf->dev);
>                 pci_fixup_device(pci_fixup_header, vf);
>                 pci_init_capabilities(vf);
>                 pcibios_add_device(vf);
> 
> We have a couple of options here:
> 
> 1) Keep current code. VF's pci_dn is going to be destroyed in
>    pcibios_sriov_disable() as we're doing currently.
> 2) Introduce pcibios_iov_virtfn_add() (at A) for platform to override.
>    VF's pci_dn is going to be destroyed in pcibios_release_device().
> 3) Introduce pcibios_alloc_dev() (at B) for platform to override. The
>    VF's pci_dn is going to be destroyed in pcibios_release_device().

Ah, yes, now I see the problem.  I don't really like having to export
pci_iov_virtfn_bus() and pci_iov_virtfn_devfn(), but it's probably not
worth the hassle of changing it, and I think adding more pcibios interfaces
would be even worse.

So let's leave it as-is for now.

> >> +}
> >> +#endif // CONFIG_PCI_IOV
> >> +
> >> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
> >> +{
> >> +#ifdef CONFIG_PCI_IOV
> >> +	struct pci_dn *parent, *pdn;
> >> +	int i;
> >> +
> >> +	/* Only support IOV for now */
> >> +	if (!pdev->is_physfn)
> >> +		return pci_get_pdn(pdev);
> >> +
> >> +	/* Check if VFs have been populated */
> >> +	pdn = pci_get_pdn(pdev);
> >> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
> >> +		return NULL;
> >> +
> >> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
> >> +	parent = pci_bus_to_pdn(pdev->bus);
> >> +	if (!parent)
> >>  		return NULL;
> >> -	return PCI_DN(dn);
> >> +
> >> +	for (i = 0; i < vf_num; i++) {
> >> +		pdn = add_one_dev_pci_info(parent, NULL,
> >> +					   pci_iov_virtfn_bus(pdev, i),
> >> +					   pci_iov_virtfn_devfn(pdev, i));
> >> +		if (!pdn) {
> >> +			pr_warn("%s: Cannot create firmware data "
> >> +				"for VF#%d of %s\n",
> >> +				__func__, i, pci_name(pdev));
> >> +			return NULL;
> >> +		}
> >> +	}
> >> +#endif
> >> +
> >> +	return pci_get_pdn(pdev);
> >> +}
> >> +
> >> +void remove_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
> >> +{
> >> +#ifdef CONFIG_PCI_IOV
> >> +	struct pci_dn *parent;
> >> +	struct pci_dn *pdn, *tmp;
> >> +	int i;
> >> +
> >> +	/* Only support IOV PF for now */
> >> +	if (!pdev->is_physfn)
> >> +		return;
> >> +
> >> +	/* Check if VFs have been populated */
> >> +	pdn = pci_get_pdn(pdev);
> >> +	if (!pdn || !(pdn->flags & PCI_DN_FLAG_IOV_VF))
> >> +		return;
> >> +
> >> +	pdn->flags &= ~PCI_DN_FLAG_IOV_VF;
> >> +	parent = pci_bus_to_pdn(pdev->bus);
> >> +	if (!parent)
> >> +		return;
> >> +
> >> +	/*
> >> +	 * We might introduce flag to pci_dn in future
> >> +	 * so that we can release VF's firmware data in
> >> +	 * a batch mode.
> >> +	 */
> >> +	for (i = 0; i < vf_num; i++) {
> >> +		list_for_each_entry_safe(pdn, tmp,
> >> +			&parent->child_list, list) {
> >> +			if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
> >> +			    pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
> >> +				continue;
> >> +
> >> +			if (!list_empty(&pdn->list))
> >> +				list_del(&pdn->list);
> >> +			kfree(pdn);
> >> +		}
> >> +	}
> >> +#endif
> >>  }
> >
> 


* Re: [PATCH V11 08/17] powrepc/pci: Refactor pci_dn
@ 2015-02-24  8:13               ` Bjorn Helgaas
  0 siblings, 0 replies; 168+ messages in thread
From: Bjorn Helgaas @ 2015-02-24  8:13 UTC (permalink / raw)
  To: Gavin Shan; +Cc: linux-pci, Wei Yang, benh, linuxppc-dev

On Mon, Feb 23, 2015 at 11:13:49AM +1100, Gavin Shan wrote:
> On Fri, Feb 20, 2015 at 05:19:17PM -0600, Bjorn Helgaas wrote:
> >On Thu, Jan 15, 2015 at 10:27:58AM +0800, Wei Yang wrote:
> >> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> 
> >> pci_dn is the extension of the PCI device node and is created from
> >> the device node. Unfortunately, VFs are enabled dynamically by the
> >> PF's driver and have no corresponding device node, hence no pci_dn.
> >> The patch refactors pci_dn to support VFs:
> >> 
> >>    * pci_dn is organized as a hierarchy tree. A VF's pci_dn is put
> >>      on the child list of the pci_dn of the PF's bridge; any other
> >>      device's pci_dn goes on the child list of the pci_dn of its
> >>      upstream bridge.
> >> 
> >>    * A VF's pci_dn is created dynamically when the PF enables VFs
> >>      and destroyed when the PF disables them. The pci_dn of any
> >>      other device is still created from its device node as before.
> >> 
> >>    * For one particular PCI device (VF or not), its pci_dn can be
> >>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
> >>      or the parent's list. The fast path (fetching pci_dn through
> >>      the PCI device instance) is populated during early fixup time.
> >> 
> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/include/asm/device.h         |    3 +
> >>  arch/powerpc/include/asm/pci-bridge.h     |   14 +-
> >>  arch/powerpc/kernel/pci_dn.c              |  242 ++++++++++++++++++++++++++++-
> >>  arch/powerpc/platforms/powernv/pci-ioda.c |   16 ++
> >>  4 files changed, 270 insertions(+), 5 deletions(-)
> >> ...
> >
> >> +#ifdef CONFIG_PCI_IOV
> >> +static struct pci_dn *add_one_dev_pci_info(struct pci_dn *parent,
> >> +					   struct pci_dev *pdev,
> >> +					   int busno, int devfn)
> >> +{
> >> +	struct pci_dn *pdn;
> >> +
> >> +	/* Except PHB, we always have parent firmware data */
> >> +	if (!parent)
> >> +		return NULL;
> >> +
> >> +	pdn = kzalloc(sizeof(*pdn), GFP_KERNEL);
> >> +	if (!pdn) {
> >> +		pr_warn("%s: Out of memory !\n", __func__);
> >> +		return NULL;
> >> +	}
> >> +
> >> +	pdn->phb = parent->phb;
> >> +	pdn->parent = parent;
> >> +	pdn->busno = busno;
> >> +	pdn->devfn = devfn;
> >> +#ifdef CONFIG_PPC_POWERNV
> >> +	pdn->pe_number = IODA_INVALID_PE;
> >> +#endif
> >> +	INIT_LIST_HEAD(&pdn->child_list);
> >> +	INIT_LIST_HEAD(&pdn->list);
> >> +	list_add_tail(&pdn->list, &parent->child_list);
> >> +
> >> +	/*
> >> +	 * If we already have PCI device instance, lets
> >> +	 * bind them.
> >> +	 */
> >> +	if (pdev)
> >> +		pdev->dev.archdata.firmware_data = pdn;
> >> +
> >> +	return pdn;
> >
> >I'd like to see this done in pcibios_add_device(), as I mentioned in
> >response to "[PATCH V11 01/17] PCI/IOV: Export interface for retrieve VF's
> >BDF".  Maybe that's not feasible for some reason, but it would be a nicer
> >design if it's possible.
> >
> >The remove_dev_pci_info() work would be done in pcibios_release_device()
> >then, of course.
> >
> 
> Yes, it's not feasible. The PCI config accessors rely on the VF's pci_dn.
> Before pcibios_add_device() is called, we need to access the VF's config
> space. That means we need the VF's pci_dn before pci_setup_device(), as
> follows:
> 
>     sriov_enable()
>         pcibios_sriov_enable();     /* Currently, VF's pci_dn is created at this point */
>         virtfn_add();
>             virtfn_add_bus();       /* Create virtual bus if necessary */
>                                     /* ---> A */
>             pci_alloc_dev();        /* ---> B */
>             pci_setup_device(vf);   /* Access VF's config space */
>                 pci_read_config_byte(vf, PCI_HEADER_TYPE);
>                 pci_read_config_dword(vf, PCI_CLASS_REVISION);
>                 pci_fixup_device(pci_fixup_early, vf);
>                 pci_read_irq();
>                 pci_read_bases();
>             pci_device_add(vf);
>                 device_initialize(&vf->dev);
>                 pci_fixup_device(pci_fixup_header, vf);
>                 pci_init_capabilities(vf);
>                 pcibios_add_device(vf);
> 
> We have couple of options here:
> 
> 1) Keep current code. VF's pci_dn is going to be destroyed in
>    pcibios_sriov_disable() as we're doing currently.
> 2) Introduce pcibios_iov_virtfn_add() (at A) for platform to override.
>    VF's pci_dn is going to be destroyed in pcibios_release_device().
> 3) Introduce pcibios_alloc_dev() (at B) for platform to override. The
>    VF's pci_dn is going to be destroyed in pcibios_release_device().

Ah, yes, now I see the problem.  I don't really like having to export
pci_iov_virtfn_bus() and pci_iov_virtfn_devfn(), but it's probably not
worth the hassle of changing it, and I think adding more pcibios interfaces
would be even worse.

So let's leave it as-is for now.

> >> +}
> >> +#endif // CONFIG_PCI_IOV
> >> +
> >> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
> >> +{
> >> +#ifdef CONFIG_PCI_IOV
> >> +	struct pci_dn *parent, *pdn;
> >> +	int i;
> >> +
> >> +	/* Only support IOV for now */
> >> +	if (!pdev->is_physfn)
> >> +		return pci_get_pdn(pdev);
> >> +
> >> +	/* Check if VFs have been populated */
> >> +	pdn = pci_get_pdn(pdev);
> >> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
> >> +		return NULL;
> >> +
> >> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
> >> +	parent = pci_bus_to_pdn(pdev->bus);
> >> +	if (!parent)
> >>  		return NULL;
> >> -	return PCI_DN(dn);
> >> +
> >> +	for (i = 0; i < vf_num; i++) {
> >> +		pdn = add_one_dev_pci_info(parent, NULL,
> >> +					   pci_iov_virtfn_bus(pdev, i),
> >> +					   pci_iov_virtfn_devfn(pdev, i));
> >> +		if (!pdn) {
> >> +			pr_warn("%s: Cannot create firmware data "
> >> +				"for VF#%d of %s\n",
> >> +				__func__, i, pci_name(pdev));
> >> +			return NULL;
> >> +		}
> >> +	}
> >> +#endif
> >> +
> >> +	return pci_get_pdn(pdev);
> >> +}
> >> +
> >> +void remove_dev_pci_info(struct pci_dev *pdev, u16 vf_num)
> >> +{
> >> +#ifdef CONFIG_PCI_IOV
> >> +	struct pci_dn *parent;
> >> +	struct pci_dn *pdn, *tmp;
> >> +	int i;
> >> +
> >> +	/* Only support IOV PF for now */
> >> +	if (!pdev->is_physfn)
> >> +		return;
> >> +
> >> +	/* Check if VFs have been populated */
> >> +	pdn = pci_get_pdn(pdev);
> >> +	if (!pdn || !(pdn->flags & PCI_DN_FLAG_IOV_VF))
> >> +		return;
> >> +
> >> +	pdn->flags &= ~PCI_DN_FLAG_IOV_VF;
> >> +	parent = pci_bus_to_pdn(pdev->bus);
> >> +	if (!parent)
> >> +		return;
> >> +
> >> +	/*
> >> +	 * We might introduce flag to pci_dn in future
> >> +	 * so that we can release VF's firmware data in
> >> +	 * a batch mode.
> >> +	 */
> >> +	for (i = 0; i < vf_num; i++) {
> >> +		list_for_each_entry_safe(pdn, tmp,
> >> +			&parent->child_list, list) {
> >> +			if (pdn->busno != pci_iov_virtfn_bus(pdev, i) ||
> >> +			    pdn->devfn != pci_iov_virtfn_devfn(pdev, i))
> >> +				continue;
> >> +
> >> +			if (!list_empty(&pdn->list))
> >> +				list_del(&pdn->list);
> >> +			kfree(pdn);
> >> +		}
> >> +	}
> >> +#endif
> >>  }
> >
> 


* Re: [PATCH V11 08/17] powerpc/pci: Refactor pci_dn
  2015-02-24  8:13               ` Bjorn Helgaas
@ 2015-02-24  8:25                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 168+ messages in thread
From: Benjamin Herrenschmidt @ 2015-02-24  8:25 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Gavin Shan, Wei Yang, linux-pci, linuxppc-dev

On Tue, 2015-02-24 at 02:13 -0600, Bjorn Helgaas wrote:
> 
> Ah, yes, now I see the problem.  I don't really like having to export
> pci_iov_virtfn_bus() and pci_iov_virtfn_devfn(), but it's probably not
> worth the hassle of changing it, and I think adding more pcibios
> interfaces
> would be even worse.

Aren't we going to eventually turn them all into host bridge ops ? :-)

Cheers,
Ben.





* Re: [PATCH V11 01/17] PCI/IOV: Export interface for retrieve VF's BDF
  2015-02-20 23:09           ` Bjorn Helgaas
@ 2015-03-02  6:05             ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-03-02  6:05 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Wei Yang, benh, gwshan, linux-pci, linuxppc-dev

On Fri, Feb 20, 2015 at 05:09:04PM -0600, Bjorn Helgaas wrote:
>On Thu, Jan 15, 2015 at 10:27:51AM +0800, Wei Yang wrote:
>> When implementing SR-IOV on the PowerNV platform, some resource reservation
>> is needed for VFs, which don't exist at boot time. To match resources to
>> VFs, the code needs to get the VF's BDF in advance.
>> 
>> In this patch, it exports the interface to retrieve VF's BDF:
>>    * Make the virtfn_bus as an interface
>>    * Make the virtfn_devfn as an interface
>>    * Rename them with more specific name
>>    * Code cleanup in pci_sriov_resource_alignment()
>
>You use these in this path:
>
>    pci_enable_sriov
>      sriov_enable
>	pcibios_sriov_enable
>	  add_dev_pci_info
>	    for (i = 0; i < pci_sriov_get_totalvfs(pdev))
>	      add_one_dev_pci_info(..., pci_iov_virtfn_bus(), ...)  <---
>		pdn = kzalloc
>		pdn->busno = busno
>		pdn->devfn = devfn
>		list_add_tail(&pdn->list, &parent->child_list)
>
>It looks like this sets up a struct pci_dn for each VF.
>
>Could the struct pci_dn setup be done in pcibios_add_device() instead?
>Then each VF we enumerate would set up its own struct pci_dn.  That would
>be a lot nicer than using this hook to iterate over all possible VFs.
>

It doesn't look feasible.

The current powernv platform uses pdn->busno/devfn to access the
configuration space of a device:

    virtfn_add
        pci_setup_device(virtfn)     ---  (1)
        pci_device_add()
            pcibios_add_device()     ---  (2)

So you suggest fixing up the pdn in (2), right?

If we did that, step (1) would need to access the virtfn's configuration
space, which would fail because the pdn doesn't yet hold enough information.

This means pdn->busno/devfn must be initialized before the kernel can access
the device on the powernv platform. For PFs this step is done in
update_dn_pci_info(), where the information is retrieved from the device
node. But there is no device node for a VF, which is why we need to create
the pci_dn ourselves.

>You also use them in some PE setup in a similar path:
>
>    pci_enable_sriov
>      sriov_enable
>        pcibios_sriov_enable
>          pnv_pci_sriov_enable
>            pnv_pci_vf_assign_m64
>            pnv_pci_vf_resource_shift
>            pnv_ioda_setup_vf_PE
>              for (i = 0; i < vf_num; i++)
>                pe->rid = pci_iov_virtfn_bus(...)  <---
>                pnv_ioda_configure_pe(phb, pe)
>                pe->tce32_table = kzalloc
>                pnv_pci_ioda2_setup_dma_pe
>
>Could this PE setup also be done in pcibios_add_device() when the VF device
>itself is enumerated?  I think that would be a nicer design if it's
>possible.

Hmm... it looks possible, though I need to do some investigation.

>
>I'd prefer to avoid exporting pci_iov_virtfn_bus() and
>pci_iov_virtfn_devfn() if possible, because they're only safe to call when
>VFs are enabled (because offset & stride depend on numVFs).

Yep, I know your concern.

pci_enable_sriov
    sriov_enable
        pci_iov_set_numvfs(dev, nr_virtfn);
        iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
        pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
        pcibios_sriov_enable(dev, initial);

So in the current implementation, these two functions are called after VFs
are enabled. But yes, we can't guarantee other callers won't use them
improperly.

Could we return an error code when (iov->num_VFs == 0) or when
!(iov->ctrl & PCI_SRIOV_CTRL_VFE) in these two functions? These two fields
reflect whether VFs are enabled or not.

>
>> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>> ---
>>  drivers/pci/iov.c   |   22 +++++++++++++---------
>>  include/linux/pci.h |   11 +++++++++++
>>  2 files changed, 24 insertions(+), 9 deletions(-)
>> 
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index ea3a82c..e76d1a0 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -19,14 +19,18 @@
>>  
>>  #define VIRTFN_ID_LEN	16
>>  
>> -static inline u8 virtfn_bus(struct pci_dev *dev, int id)
>> +int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
>>  {
>> +	if (!dev->is_physfn)
>> +		return -EINVAL;
>>  	return dev->bus->number + ((dev->devfn + dev->sriov->offset +
>>  				    dev->sriov->stride * id) >> 8);
>>  }
>>  
>> -static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
>> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
>>  {
>> +	if (!dev->is_physfn)
>> +		return -EINVAL;
>>  	return (dev->devfn + dev->sriov->offset +
>>  		dev->sriov->stride * id) & 0xff;
>>  }
>> @@ -62,7 +66,7 @@ static inline void pci_iov_max_bus_range(struct pci_dev *dev)
>>  
>>  	for ( ; total >= 0; total--) {
>>  		pci_iov_set_numvfs(dev, total);
>> -		busnr = virtfn_bus(dev, iov->total_VFs - 1);
>> +		busnr = pci_iov_virtfn_bus(dev, iov->total_VFs - 1);
>>  		if (busnr > max)
>>  			max = busnr;
>>  	}
>> @@ -108,7 +112,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>>  	struct pci_bus *bus;
>>  
>>  	mutex_lock(&iov->dev->sriov->lock);
>> -	bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
>> +	bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
>>  	if (!bus)
>>  		goto failed;
>>  
>> @@ -116,7 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>>  	if (!virtfn)
>>  		goto failed0;
>>  
>> -	virtfn->devfn = virtfn_devfn(dev, id);
>> +	virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
>>  	virtfn->vendor = dev->vendor;
>>  	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
>>  	pci_setup_device(virtfn);
>> @@ -179,8 +183,8 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
>>  	struct pci_sriov *iov = dev->sriov;
>>  
>>  	virtfn = pci_get_domain_bus_and_slot(pci_domain_nr(dev->bus),
>> -					     virtfn_bus(dev, id),
>> -					     virtfn_devfn(dev, id));
>> +					     pci_iov_virtfn_bus(dev, id),
>> +					     pci_iov_virtfn_devfn(dev, id));
>>  	if (!virtfn)
>>  		return;
>>  
>> @@ -255,7 +259,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>>  	iov->offset = offset;
>>  	iov->stride = stride;
>>  
>> -	if (virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
>> +	if (pci_iov_virtfn_bus(dev, nr_virtfn - 1) > dev->bus->busn_res.end) {
>>  		dev_err(&dev->dev, "SR-IOV: bus number out of range\n");
>>  		return -ENOMEM;
>>  	}
>> @@ -551,7 +555,7 @@ resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno)
>>  	if (!reg)
>>  		return 0;
>>  
>> -	 __pci_read_base(dev, pci_bar_unknown, &tmp, reg);
>> +	__pci_read_base(dev, pci_bar_unknown, &tmp, reg);
>>  	return resource_alignment(&tmp);
>>  }
>>  
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 360a966..74ef944 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -1658,6 +1658,9 @@ int pci_ext_cfg_avail(void);
>>  void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
>>  
>>  #ifdef CONFIG_PCI_IOV
>> +int pci_iov_virtfn_bus(struct pci_dev *dev, int id);
>> +int pci_iov_virtfn_devfn(struct pci_dev *dev, int id);
>> +
>>  int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
>>  void pci_disable_sriov(struct pci_dev *dev);
>>  int pci_num_vf(struct pci_dev *dev);
>> @@ -1665,6 +1668,14 @@ int pci_vfs_assigned(struct pci_dev *dev);
>>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
>>  #else
>> +static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
>> +{
>> +	return -ENOSYS;
>> +}
>> +static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id)
>> +{
>> +	return -ENOSYS;
>> +}
>>  static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
>>  { return -ENODEV; }
>>  static inline void pci_disable_sriov(struct pci_dev *dev) { }
>> -- 
>> 1.7.9.5
>> 
>--
>To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Richard Yang
Help you, Help me




* Re: [PATCH V11 07/17] powerpc/pci: Don't unset pci resources for VFs
  2015-02-20 23:47                     ` Bjorn Helgaas
@ 2015-03-02  6:09                       ` Wei Yang
  -1 siblings, 0 replies; 168+ messages in thread
From: Wei Yang @ 2015-03-02  6:09 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Wei Yang, gwshan, linux-pci, linuxppc-dev

On Fri, Feb 20, 2015 at 05:47:09PM -0600, Bjorn Helgaas wrote:
>On Tue, Feb 10, 2015 at 07:14:45PM +1100, Benjamin Herrenschmidt wrote:
>> On Tue, 2015-02-10 at 14:25 +0800, Wei Yang wrote:
>> > PF's resource will be assigned first, including normal BARs and IOV
>> > BARs.
>> > 
>> > Then PF's driver will create VFs, in virtfn_add(). In this function,
>> > VF's
>> > resources is calculated from its PF's IOV BAR.
>> > 
>> > If you reset VF's resource as PFs, no one will try to assign it again.
>> 
>> So the problem is that the flag indicating VF is lost ? IE. We should
>> still mark them unset, but preserve that flag ?
>
>I think the problem is that the normal path for PCI_REASSIGN_ALL_RSRC is
>at boot-time, where we do this:
>
>    pcibios_init
>      pcibios_scan_phb
>        pci_scan_child_bus
>          ...
>            pci_device_add
>              pci_fixup_device(pci_fixup_header)
>		pcibios_fixup_resources                       # header fixup
>		  for (i = 0; i < DEVICE_COUNT_RESOURCE; i++)
>		    dev->resource[i].start = 0
>      pcibios_resource_survey
>        pcibios_allocate_resources
>
>and we assign dev->resource[] for everything in
>pcibios_allocate_resources().
>
>But VFs are enumerated later, when they are enabled by the PF driver after
>boot, so we have this path:
>
>    pci_enable_sriov
>      sriov_enable
>        virtfn_add(vf_id)
>          for (i = 0; i < 6; i++)
>            vf->resource[i].start = pf->resource[IOV + i].start + (size * vf_id)
>          pci_device_add
>            pci_fixup_device(pci_fixup_header)
>              pcibios_fixup_resources                   # header fixup
>                for (i = 0; i < DEVICE_COUNT_RESOURCE; i++)
>                  vf->resource[i].start = 0
>
>Here, we clear out vf->resource[0..5] in the header fixup, but we're not
>going to call pcibios_allocate_resources() again to reassign them.
>
>So I think the *intent* of PCI_REASSIGN_ALL_RSRC is preserved if
>pcibios_fixup_resources() leaves the VF resources alone, because the VF
>resources are completely determined by the PF resources, and the PF
>resources have already been reassigned.
>
>If my understanding is correct, I think the patch is reasonable, and I
>would try to put some of this explanation into the changelog.

Yep, it is correct, thanks for your explanation.

I had a chat on IRC with Ben; I think he's got the idea :-)

>
>Bjorn

-- 
Richard Yang
Help you, Help me




end of thread, other threads:[~2015-03-02  6:10 UTC | newest]

Thread overview: 168+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-22  5:54 [PATCH V10 00/17] Enable SRIOV on Power8 Wei Yang
2014-12-22  5:54 ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 01/17] PCI/IOV: Export interface for retrieve VF's BDF Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 02/17] PCI/IOV: add VF enable/disable hook Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 03/17] PCI: Add weak pcibios_iov_resource_alignment() interface Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 04/17] PCI: Store VF BAR size in pci_sriov Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 05/17] PCI: Take additional PF's IOV BAR alignment in sizing and assigning Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 06/17] powerpc/pci: Add PCI resource alignment documentation Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 07/17] powerpc/pci: Don't unset pci resources for VFs Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 08/17] powrepc/pci: Refactor pci_dn Wei Yang
2014-12-22  5:54 ` [PATCH V10 09/17] powerpc/pci: remove pci_dn->pcidev field Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 10/17] powerpc/powernv: Use pci_dn in PCI config accessor Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 11/17] powerpc/powernv: Allocate pe->iommu_table dynamically Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 12/17] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 13/17] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 14/17] powerpc/powernv: Shift VF resource with an offset Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 15/17] powerpc/powernv: Allocate VF PE Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 16/17] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  5:54 ` [PATCH V10 17/17] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3 Wei Yang
2014-12-22  5:54   ` Wei Yang
2014-12-22  6:05 ` [PATCH V10 00/17] Enable SRIOV on Power8 Wei Yang
2014-12-22  6:05   ` Wei Yang
2015-01-13 18:05   ` Bjorn Helgaas
2015-01-13 18:05     ` Bjorn Helgaas
2015-01-15  2:27     ` [PATCH V11 " Wei Yang
2015-01-15  2:27       ` Wei Yang
2015-01-15  2:27       ` [PATCH V11 01/17] PCI/IOV: Export interface for retrieve VF's BDF Wei Yang
2015-01-15  2:27         ` Wei Yang
2015-02-20 23:09         ` Bjorn Helgaas
2015-02-20 23:09           ` Bjorn Helgaas
2015-03-02  6:05           ` Wei Yang
2015-03-02  6:05             ` Wei Yang
2015-01-15  2:27       ` [PATCH V11 02/17] PCI/IOV: add VF enable/disable hook Wei Yang
2015-01-15  2:27         ` Wei Yang
2015-02-10  0:26         ` Benjamin Herrenschmidt
2015-02-10  0:26           ` Benjamin Herrenschmidt
2015-02-10  1:35           ` Wei Yang
2015-02-10  1:35             ` Wei Yang
2015-02-10  2:13             ` Benjamin Herrenschmidt
2015-02-10  2:13               ` Benjamin Herrenschmidt
2015-02-10  6:18               ` Wei Yang
2015-02-10  6:18                 ` Wei Yang
2015-01-15  2:27       ` [PATCH V11 03/17] PCI: Add weak pcibios_iov_resource_alignment() interface Wei Yang
2015-01-15  2:27         ` Wei Yang
2015-02-10  0:32         ` Benjamin Herrenschmidt
2015-02-10  0:32           ` Benjamin Herrenschmidt
2015-02-10  1:44           ` Wei Yang
2015-02-10  1:44             ` Wei Yang
2015-01-15  2:27       ` [PATCH V11 04/17] PCI: Store VF BAR size in pci_sriov Wei Yang
2015-01-15  2:27         ` Wei Yang
2015-01-15  2:27       ` [PATCH V11 05/17] PCI: Take additional PF's IOV BAR alignment in sizing and assigning Wei Yang
2015-01-15  2:27         ` Wei Yang
2015-01-15  2:27       ` [PATCH V11 06/17] powerpc/pci: Add PCI resource alignment documentation Wei Yang
2015-01-15  2:27         ` Wei Yang
2015-02-04 23:44         ` Bjorn Helgaas
2015-02-04 23:44           ` Bjorn Helgaas
2015-02-10  1:02           ` Benjamin Herrenschmidt
2015-02-10  1:02             ` Benjamin Herrenschmidt
2015-02-20  0:56             ` Bjorn Helgaas
2015-02-20  0:56               ` Bjorn Helgaas
2015-02-20  2:41               ` Benjamin Herrenschmidt
2015-02-20  2:41                 ` Benjamin Herrenschmidt
2015-01-15  2:27       ` [PATCH V11 07/17] powerpc/pci: Don't unset pci resources for VFs Wei Yang
2015-01-15  2:27         ` Wei Yang
2015-02-10  0:36         ` Benjamin Herrenschmidt
2015-02-10  0:36           ` Benjamin Herrenschmidt
2015-02-10  1:51           ` Wei Yang
2015-02-10  1:51             ` Wei Yang
2015-02-10  2:14             ` Benjamin Herrenschmidt
2015-02-10  2:14               ` Benjamin Herrenschmidt
2015-02-10  6:25               ` Wei Yang
2015-02-10  6:25                 ` Wei Yang
2015-02-10  8:14                 ` Benjamin Herrenschmidt
2015-02-10  8:14                   ` Benjamin Herrenschmidt
2015-02-20 23:47                   ` Bjorn Helgaas
2015-02-20 23:47                     ` Bjorn Helgaas
2015-03-02  6:09                     ` Wei Yang
2015-03-02  6:09                       ` Wei Yang
2015-01-15  2:27       ` [PATCH V11 08/17] powrepc/pci: Refactor pci_dn Wei Yang
2015-02-20 23:19         ` Bjorn Helgaas
2015-02-20 23:19           ` Bjorn Helgaas
2015-02-23  0:13           ` Gavin Shan
2015-02-23  0:13             ` Gavin Shan
2015-02-24  8:13             ` Bjorn Helgaas
2015-02-24  8:13               ` Bjorn Helgaas
2015-02-24  8:25               ` Benjamin Herrenschmidt
2015-02-24  8:25                 ` Benjamin Herrenschmidt
2015-01-15  2:27       ` [PATCH V11 09/17] powerpc/pci: remove pci_dn->pcidev field Wei Yang
2015-01-15  2:27         ` Wei Yang
2015-01-15  2:28       ` [PATCH V11 10/17] powerpc/powernv: Use pci_dn in PCI config accessor Wei Yang
2015-01-15  2:28         ` Wei Yang
2015-01-15  2:28       ` [PATCH V11 11/17] powerpc/powernv: Allocate pe->iommu_table dynamically Wei Yang
2015-01-15  2:28         ` Wei Yang
2015-01-15  2:28       ` [PATCH V11 12/17] powerpc/powernv: Reserve additional space for IOV BAR according to the number of total_pe Wei Yang
2015-01-15  2:28         ` Wei Yang
2015-02-04 21:26         ` Bjorn Helgaas
2015-02-04 21:26           ` Bjorn Helgaas
2015-02-04 23:08           ` Wei Yang
2015-02-04 23:08             ` Wei Yang
2015-01-15  2:28       ` [PATCH V11 13/17] powerpc/powernv: Implement pcibios_iov_resource_alignment() on powernv Wei Yang
2015-01-15  2:28         ` Wei Yang
2015-02-04 21:26         ` Bjorn Helgaas
2015-02-04 21:26           ` Bjorn Helgaas
2015-02-04 22:45           ` Wei Yang
2015-02-04 22:45             ` Wei Yang
2015-01-15  2:28       ` [PATCH V11 14/17] powerpc/powernv: Shift VF resource with an offset Wei Yang
2015-01-15  2:28         ` Wei Yang
2015-01-30 23:08         ` Bjorn Helgaas
2015-01-30 23:08           ` Bjorn Helgaas
2015-02-03  1:30           ` Wei Yang
2015-02-03  1:30             ` Wei Yang
2015-02-03  7:01           ` [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting Wei Yang
2015-02-03  7:01             ` Wei Yang
2015-02-04  0:19             ` Bjorn Helgaas
2015-02-04  0:19               ` Bjorn Helgaas
2015-02-04  3:34               ` Wei Yang
2015-02-04  3:34                 ` Wei Yang
2015-02-04 14:19                 ` Bjorn Helgaas
2015-02-04 14:19                   ` Bjorn Helgaas
2015-02-04 15:20                   ` Wei Yang
2015-02-04 15:20                     ` Wei Yang
2015-02-04 16:08                   ` [PATCH] pci/iov: fix memory leak introduced in "PCI: Store individual VF BAR size in struct pci_sriov" Wei Yang
2015-02-04 16:08                     ` Wei Yang
2015-02-04 16:28                     ` Bjorn Helgaas
2015-02-04 16:28                       ` Bjorn Helgaas
2015-02-04 20:53                 ` [PATCH] powerpc/powernv: make sure the IOV BAR will not exceed limit after shifting Bjorn Helgaas
2015-02-04 20:53                   ` Bjorn Helgaas
2015-02-05  3:01                   ` Wei Yang
2015-02-05  3:01                     ` Wei Yang
2015-01-15  2:28       ` [PATCH V11 15/17] powerpc/powernv: Allocate VF PE Wei Yang
2015-01-15  2:28         ` Wei Yang
2015-01-15  2:28       ` [PATCH V11 16/17] powerpc/powernv: Reserve additional space for IOV BAR, with m64_per_iov supported Wei Yang
2015-01-15  2:28         ` Wei Yang
2015-02-04 22:05         ` Bjorn Helgaas
2015-02-04 22:05           ` Bjorn Helgaas
2015-02-05  0:07           ` Wei Yang
2015-02-05  0:07             ` Wei Yang
2015-01-15  2:28       ` [PATCH V11 17/17] powerpc/powernv: Group VF PE when IOV BAR is big on PHB3 Wei Yang
2015-01-15  2:28         ` Wei Yang
2015-02-04 23:44       ` [PATCH V11 00/17] Enable SRIOV on Power8 Bjorn Helgaas
2015-02-04 23:44         ` Bjorn Helgaas
2015-02-05  0:13         ` Wei Yang
2015-02-05  0:13           ` Wei Yang
2015-02-05  6:34         ` [PATCH 0/3] Code adjustment on pci/virtualization Wei Yang
2015-02-05  6:34           ` Wei Yang
2015-02-05  6:34           ` [PATCH 1/3] fix on Store individual VF BAR size in struct pci_sriov Wei Yang
2015-02-05  6:34             ` Wei Yang
2015-02-05  6:34           ` [PATCH 2/3] fix Reserve additional space for IOV BAR, with m64_per_iov supported Wei Yang
2015-02-05  6:34             ` Wei Yang
2015-02-05  6:34           ` [PATCH 3/3] remove the unused end in pnv_pci_vf_resource_shift() Wei Yang
2015-02-05  6:34             ` Wei Yang
2015-02-10  0:25         ` [PATCH V11 00/17] Enable SRIOV on Power8 Benjamin Herrenschmidt
2015-02-10  0:25           ` Benjamin Herrenschmidt
