linuxppc-dev.lists.ozlabs.org archive mirror
From: Bjorn Helgaas <bhelgaas@google.com>
To: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: linux-pci@vger.kernel.org, benh@au1.ibm.com,
	linuxppc-dev@lists.ozlabs.org, gwshan@linux.vnet.ibm.com
Subject: Re: [PATCH V9 08/18] powrepc/pci: Refactor pci_dn
Date: Thu, 20 Nov 2014 12:05:41 -0700
Message-ID: <20141120190541.GA5110@google.com>
In-Reply-To: <20141120072057.GC8562@richard>

On Thu, Nov 20, 2014 at 03:20:57PM +0800, Wei Yang wrote:
> On Wed, Nov 19, 2014 at 04:30:24PM -0700, Bjorn Helgaas wrote:
> >On Sun, Nov 02, 2014 at 11:41:24PM +0800, Wei Yang wrote:
> >> From: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> 
> >> pci_dn is the extension of PCI device node and it's created from
> >> device node. Unfortunately, VFs are enabled dynamically by the
> >> PF's driver, so they have no corresponding device nodes or
> >> pci_dn. The patch refactors pci_dn to support VFs:
> >> 
> >>    * pci_dn is organized as a hierarchy tree. A VF's pci_dn is put
> >>      on the child list of the pci_dn of the PF's bridge. The pci_dn
> >>      of any other device is put on the child list of the pci_dn of
> >>      its upstream bridge.
> >> 
> >>    * VF's pci_dn is expected to be created dynamically when applying
> >>      final fixup to PF. VF's pci_dn will be destroyed when releasing
> >>      PF's pci_dev instance. pci_dn of other device is still created
> >>      from device node as before.
> >> 
> >>    * For one particular PCI device (VF or not), its pci_dn can be
> >>      found from pdev->dev.archdata.firmware_data, PCI_DN(devnode),
> >>      or parent's list. The fast path (fetching pci_dn through PCI
> >>      device instance) is populated during early fixup time.
> >> 
> >> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> >> ---
> >> ...
> >
> >> +struct pci_dn *add_dev_pci_info(struct pci_dev *pdev)
> >> +{
> >> +#ifdef CONFIG_PCI_IOV
> >> +	struct pci_dn *parent, *pdn;
> >> +	int i;
> >> +
> >> +	/* Only support IOV for now */
> >> +	if (!pdev->is_physfn)
> >> +		return pci_get_pdn(pdev);
> >> +
> >> +	/* Check if VFs have been populated */
> >> +	pdn = pci_get_pdn(pdev);
> >> +	if (!pdn || (pdn->flags & PCI_DN_FLAG_IOV_VF))
> >> +		return NULL;
> >> +
> >> +	pdn->flags |= PCI_DN_FLAG_IOV_VF;
> >> +	parent = pci_bus_to_pdn(pdev->bus);
> >> +	if (!parent)
> >> +		return NULL;
> >> +
> >> +	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
> >> +		pdn = add_one_dev_pci_info(parent, NULL,
> >> +					   pci_iov_virtfn_bus(pdev, i),
> >> +					   pci_iov_virtfn_devfn(pdev, i));
> >
> >I'm not sure this makes sense, but I certainly don't know this code, so
> >maybe I'm missing something.
> >
> >pci_iov_virtfn_bus() and pci_iov_virtfn_devfn() depend on
> >pdev->sriov->stride and pdev->sriov->offset.  These are read from VF Stride
> >and First VF Offset in the SR-IOV capability by sriov_init(), which is
> >called before add_dev_pci_info():
> >
> >  pci_scan_child_bus
> >    pci_scan_slot
> >      pci_scan_single_device
> >	pci_device_add
> >	  pci_init_capabilities
> >	    pci_iov_init(PF)
> >	      sriov_init(PF, pos)
> >		pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, 0)
> >		pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset)
> >		pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride)
> >		iov->offset = offset
> >		iov->stride = stride
> >
> >  pci_bus_add_devices
> >    pci_bus_add_device
> >      pci_fixup_device(pci_fixup_final)
> >	add_dev_pci_info
> >	  pci_iov_virtfn_bus
> >	    return ... + sriov->offset + (sriov->stride * id) ...
> >
> >But both First VF Offset and VF Stride change when ARI Capable Hierarchy or
> >NumVFs changes (SR-IOV spec sec 3.3.9, 3.3.10).  We set NumVFs to zero in
> >sriov_init() above.  We will change NumVFs to something different when a
> >driver calls pci_enable_sriov():
> >
> >  pci_enable_sriov
> >    sriov_enable
> >      pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn)
> >
> >Now First VF Offset and VF Stride have changed from what they were when we
> >called pci_iov_virtfn_bus() above.
> >
> 
> Oops, I saw that ARI would affect those values, but missed that NumVFs
> would too.
> 
> Let's look at the problem one by one.
> 
> 1. The ARI capability.
> ===============================================================================
> The kernel initializes the capability like this:
> 
> pci_init_capabilities()
> 	pci_configure_ari()
> 	pci_iov_init()
> 		iov->offset = offset
> 		iov->stride = stride
> 
> When offset/stride is retrieved at this point, the ARI capability is taken
> into consideration.

PCI_SRIOV_CTRL_ARI is currently only changed at the time we enumerate the
PF, so I don't think this is really a problem.

> 2. The PF's NumVFs field
> ===============================================================================
> 2.1 Potential problem in current code
> ===============================================================================
> First, does the current PCI code have a potential problem?
> 
> sriov_enable()
> 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset);
> 	pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride);
> 	iov->offset = offset;
> 	iov->stride = stride;
> 	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
> 	virtfn_add()
> 		...
> 		virtfn->devfn = pci_iov_virtfn_devfn(dev, id);
> 
> sriov_enable() retrieves the offset/stride and then writes NumVFs. According
> to the spec, the offset/stride may change at this moment, but I don't see
> any code that retrieves and stores those values again, even though these
> fields are used in virtfn_add().
> 
> If my understanding is correct, I suggest moving the retrieve-and-store
> operation to after NumVFs is written.

Yep, it looks like the existing code has similar problems.  We might want
to add a simple function that writes PCI_SRIOV_NUM_VF, then reads
PCI_SRIOV_VF_OFFSET and PCI_SRIOV_VF_STRIDE and refreshes the cached values
in dev->sriov.

Then we'd at least know that virtfn_bus() and virtfn_devfn() return values
that are correct until the next NumVFs change.
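
As a rough illustration of that helper, here is a minimal userspace sketch.
It is not kernel code: struct mock_pf, mock_read_word(), mock_write_word()
and mock_set_numvfs() are hypothetical names, and the way the mock device
derives offset/stride from NumVFs is invented purely for the example.  Only
the register offsets are taken from pci_regs.h.  The point is the ordering:
write NumVFs first, then re-read and cache First VF Offset and VF Stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets within the SR-IOV capability, as in pci_regs.h. */
#define PCI_SRIOV_NUM_VF	0x10
#define PCI_SRIOV_VF_OFFSET	0x14
#define PCI_SRIOV_VF_STRIDE	0x16

/* Hypothetical stand-in for a PF: one device register plus cached copies. */
struct mock_pf {
	uint16_t hw_num_vfs;	/* NumVFs register inside the mock device */
	uint16_t offset;	/* stands in for dev->sriov->offset */
	uint16_t stride;	/* stands in for dev->sriov->stride */
};

/* Invented device behaviour: offset/stride depend on the current NumVFs. */
static uint16_t mock_read_word(const struct mock_pf *pf, int reg)
{
	switch (reg) {
	case PCI_SRIOV_NUM_VF:
		return pf->hw_num_vfs;
	case PCI_SRIOV_VF_OFFSET:
		return pf->hw_num_vfs > 8 ? 4 : 1;
	case PCI_SRIOV_VF_STRIDE:
		return pf->hw_num_vfs > 8 ? 2 : 1;
	default:
		return 0;
	}
}

/* Only NumVFs is writable in this mock; other writes are ignored. */
static void mock_write_word(struct mock_pf *pf, int reg, uint16_t val)
{
	if (reg == PCI_SRIOV_NUM_VF)
		pf->hw_num_vfs = val;
}

/* Write NumVFs first, then refresh the cached offset and stride. */
static void mock_set_numvfs(struct mock_pf *pf, uint16_t nr_virtfn)
{
	assert(pf != NULL);
	mock_write_word(pf, PCI_SRIOV_NUM_VF, nr_virtfn);
	pf->offset = mock_read_word(pf, PCI_SRIOV_VF_OFFSET);
	pf->stride = mock_read_word(pf, PCI_SRIOV_VF_STRIDE);
}
```

With this mock, calling mock_set_numvfs(&pf, 4) and then
mock_set_numvfs(&pf, 16) leaves different cached values after each call,
which is exactly the staleness the current ordering would miss.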

> 2.2 The IOV bus range may not be correct in pci_scan_child_bus()?
> ===============================================================================
> In the current PCI core, enumerating the PCI tree works like this:
> 
> pci_scan_child_bus()
> 	pci_scan_slot()
> 		pci_scan_single_device()
> 			pci_device_add()
> 				pci_init_capabilities()
> 					pci_iov_init()
> 	max += pci_iov_bus_range(bus);
> 		busnr = pci_iov_virtfn_bus(dev, dev->sriov->total_VFs - 1);
> 	max = pci_scan_bridge(bus, dev, max, pass);
> 
> From this, we see that the PCI core reserves a bus range for VFs. The
> calculation is based on the offset/stride at this moment, and enumeration
> continues with the new bus number.
> 
> sriov_enable() could be called several times by the driver to enable SR-IOV
> with different nr_virtfn values. If the offset/stride changes each time
> NumVFs is written, does this mean we may try to address an extra bus we
> didn't reserve? Or does it mean this is out of our control?

This looks like a problem, too.  I don't have a good suggestion for fixing
it.

> 2.3 How can I reserve bus range in FW?
> ===============================================================================
> This question comes from the previous one.
> 
> Based on my understanding, the current PCI core relies on the bus numbers in
> HW if pcibios_assign_all_busses() is not set. If we want to support VFs that
> sit on a different bus from the PF, we need to reserve a bus range and write
> the correct secondary/subordinate values in the bridge. Otherwise, VFs on a
> different bus may not be addressable.
> 
> Currently I am writing code in FW to reserve the range with the same
> mechanism as the PCI core. But since, as you mentioned, the offset/stride may
> change after sriov_enable(), I am not sure whether this is the correct way.

If your firmware knows something about the device and can compute the
number of buses it will likely need, it can set up bridges with appropriate
bus number ranges, and Linux will generally leave those alone.

I don't know the best way to figure out the number of buses, though.  It
seems like you almost need to experimentally set NumVFs and read the
resulting offset/stride, because I think it's really up to the device to
decide how to number the VFs.  Maybe pci_iov_bus_range() needs to do
something similar.
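
To make the "experimentally set NumVFs" idea concrete, here is a small
userspace sketch under stated assumptions.  struct mock_pf and every mock_*
or extra_buses_* name is hypothetical, and the device's offset/stride rule is
invented for illustration.  The bus computation follows the shape of the
pci_iov_virtfn_bus() expression quoted earlier (offset + stride * id, with the
bits above the low eight carried into the bus number).  A real probe would
write NumVFs in config space where the sketch just assigns a field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a PF and its SR-IOV state. */
struct mock_pf {
	uint8_t bus;		/* bus number the PF sits on */
	uint8_t devfn;		/* PF device/function number */
	uint16_t total_vfs;	/* TotalVFs */
	uint16_t num_vfs;	/* current NumVFs */
};

/* Invented device behaviour: numbering changes above 64 VFs. */
static uint16_t mock_vf_offset(const struct mock_pf *pf)
{
	return pf->num_vfs > 64 ? 128 : 1;
}

static uint16_t mock_vf_stride(const struct mock_pf *pf)
{
	return pf->num_vfs > 64 ? 4 : 1;
}

/* Bus of VF 'id' under the offset/stride in effect for the current NumVFs. */
static unsigned int mock_virtfn_bus(const struct mock_pf *pf, unsigned int id)
{
	assert(id < pf->total_vfs);
	return pf->bus + ((pf->devfn + mock_vf_offset(pf) +
			   mock_vf_stride(pf) * id) >> 8);
}

/* Estimate from whatever offset/stride happen to be current (scan time). */
static unsigned int extra_buses_at_scan(const struct mock_pf *pf)
{
	return mock_virtfn_bus(pf, pf->total_vfs - 1u) - pf->bus;
}

/* Try every NumVFs value and keep the highest bus any last VF lands on. */
static unsigned int extra_buses_probed(struct mock_pf *pf)
{
	uint16_t saved = pf->num_vfs;
	unsigned int max_bus = pf->bus;
	unsigned int n;

	for (n = 1; n <= pf->total_vfs; n++) {
		unsigned int last;

		pf->num_vfs = (uint16_t)n;	/* stands in for a NumVFs write */
		last = mock_virtfn_bus(pf, n - 1u);
		if (last > max_bus)
			max_bus = last;
	}
	pf->num_vfs = saved;			/* restore the original NumVFs */
	return max_bus - pf->bus;
}
```

For a mock PF on bus 1, devfn 0, with 128 TotalVFs, the scan-time estimate
reserves no extra buses while the probe finds that two are needed, which is
the kind of mismatch discussed above.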

> 2.4 The potential problem for [Patch 08/18]
> ===============================================================================
> According to the spec, the offset/stride can change after each
> sriov_enable(). This means the bus/devfn can change after each
> sriov_enable().
> 
> My current thought is to fix it up in virtfn_add(). If the total number of
> VFs does not change, we could create those pci_dn at the beginning and fix
> the bus/devfn each time a VF is actually created.

By "fix it up," I assume you mean call an arch function that does the
pci_dn setup you need.

It sounds reasonable to do this either in virtfn_add()/virtfn_remove() or
at the points where we write PCI_SRIOV_CTRL_VFE, i.e., in sriov_init(),
sriov_enable(), sriov_disable(), and sriov_restore_state().  From a
hardware point of view, the VFs exist whenever PCI_SRIOV_CTRL_VFE is set,
so it might be nice to have this setup connected to that.
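
A rough sketch of the second option, tying the arch setup to
PCI_SRIOV_CTRL_VFE transitions, might look like the following.  This is a
userspace mock, not a proposed patch: struct mock_iov, mock_arch_vf_setup(),
mock_arch_vf_teardown() and mock_write_sriov_ctrl() are hypothetical names,
and the hooks only count calls.  The idea shown is that the arch code runs
only when the VFE bit actually changes, not on every write to the control
register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_SRIOV_CTRL_VFE	0x01	/* VF Enable bit in SR-IOV Control */

/* Hypothetical stand-in for the PF's SR-IOV state. */
struct mock_iov {
	uint16_t ctrl;		/* last value written to SR-IOV Control */
	int arch_setups;	/* times the arch setup hook ran */
	int arch_teardowns;	/* times the arch teardown hook ran */
};

/* Hypothetical arch hooks; real ones would create or destroy pci_dn. */
static void mock_arch_vf_setup(struct mock_iov *iov)
{
	iov->arch_setups++;
}

static void mock_arch_vf_teardown(struct mock_iov *iov)
{
	iov->arch_teardowns++;
}

/* Single write path for SR-IOV Control that notices VFE transitions. */
static void mock_write_sriov_ctrl(struct mock_iov *iov, uint16_t ctrl)
{
	int was_enabled, now_enabled;

	assert(iov != NULL);
	was_enabled = !!(iov->ctrl & PCI_SRIOV_CTRL_VFE);
	now_enabled = !!(ctrl & PCI_SRIOV_CTRL_VFE);

	iov->ctrl = ctrl;	/* stands in for the config write */
	if (!was_enabled && now_enabled)
		mock_arch_vf_setup(iov);	/* VFs now exist */
	else if (was_enabled && !now_enabled)
		mock_arch_vf_teardown(iov);	/* VFs are gone */
}
```

Funnelling all control-register writes through one function like this would
cover the enable, disable and restore paths without each caller having to
remember the arch hook, though whether the hook should run before or after
the hardware write is a separate question the sketch does not settle.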

Bjorn
