On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote:
> On Tue, Nov 12, 2019 at 11:29:55PM +0530, Vidya Sagar wrote:
> > On 11/12/2019 7:51 PM, Bjorn Helgaas wrote:
> > > On Tue, Nov 12, 2019 at 01:59:23PM +0100, Thierry Reding wrote:
> > > > On Mon, Nov 11, 2019 at 04:32:35PM -0600, Bjorn Helgaas wrote:
> > > > > On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
> > > > > > On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
> > > > > 
> > > > > > > Based on Vidya's backtrace, I think the resume path with problems
> > > > > > > is this:
> > > > > > > 
> > > > > > >     pci_pm_resume_noirq
> > > > > > >       pci_pm_default_resume_early
> > > > > > >         pci_power_up
> > > > > > >           if (platform_pci_power_manageable(dev))
> > > > > > >             platform_pci_set_power_state(dev, PCI_D0)  # <-- FW delay here?
> > > > > > >           pci_raw_set_power_state
> > > > > > >           pci_update_current_state
> > > > > > >             pci_device_is_present        # <-- config read returns CRS
> > > > > > > 
> > > > > > > So I think your suggestion is that Vidya's firmware should be
> > > > > > > doing the delay inside platform_pci_set_power_state()?
> > > > > > > 
> > > > > > > Vidya, you typically work on Tegra, so I assume this is on an
> > > > > > > arm64 system?  Does it have ACPI?  Do you have access to the
> > > > > > > firmware developers to ask about who they expect to do the delays?
> > > > > > 
> > > > > > Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
> > > > > > other firmware for that matter. PCIe is brought up directly in the
> > > > > > kernel.
> > > > > 
> > > > > I assume that your device is coming out of D3cold because apparently
> > > > > you're seeing a CRS status from the config read when
> > > > > pci_update_current_state() calls pci_device_is_present().  CRS status
> > > > > should only happen after reset or power-on from D3cold, and you're not
> > > > > doing a reset.
> > > > > 
> > > > > I'm pretty sure platform_pci_power_manageable() returns false on
> > > > > your system (can you confirm that?) because the only scenarios with
> > > > > platform power management are MID (Intel platform) and ACPI (which you
> > > > > don't have).
> > > > > 
> > > > > Maybe you have some other platform-specific mechanism that controls
> > > > > power to PCI devices, and it's not integrated into the
> > > > > platform_pci_*() framework?
> > > > 
> > > > My understanding after reading the PCIe specification is that CRS is a
> > > > mechanism that allows an endpoint to signal that it isn't ready yet for
> > > > operation after reset or power-on from D3cold. There's nothing in there
> > > > that's platform specific. This is really only for specific endpoints.
> > > > 
> > > > I don't see how adding platform specific PM code would help in this
> > > > case. At a platform level we don't know if users are going to plug in a
> > > > PCI endpoint that needs a long delay before it's ready after reset and/
> > > > or exit from D3cold.
> > > 
> > > Right, see below.
> > > 
> > > > I do understand that perhaps pci_device_is_present() is perhaps not the
> > > > best place to do complex CRS handling, but if a mechanism is clearly
> > > > described in the specification, isn't it something that should be dealt
> > > > with in the core? That way we don't have to quirk this for every device
> > > > and platform.
> > > 
> > > Definitely; we don't want quirks for endpoints (unless they're
> > > actually broken) or for platforms (unless there's a platform hardware
> > > or firmware defect).
> > > 
> > > There's no question that we need to delay and handle CRS after
> > > power-on from D3cold.  I'm trying to get at the point that PCI itself
> > > doesn't tell us how to do that power-on.  The mechanisms defined by
> > > PCI rely on config space, which is only accessible in D0-D3hot, not
> > > D3cold.  Power-on from D3cold can only happen via ACPI methods or
> > > other platform-specific mechanisms, and the current design abstracts
> > > those via platform_pci_set_power_state().  This is partly based on
> > > Rafael's response in [1].
> > > 
> > > > The PCIe specification says that:
> > > > 
> > > > 	Software that intends to take advantage of this mechanism must
> > > > 	ensure that the first access made to a device following a valid
> > > > 	reset condition is a Configuration Read Request accessing both
> > > > 	bytes of the Vendor ID field in the device's Configuration Space
> > > > 	header.
> > > > 
> > > > So doesn't that mean that pci_device_is_present() is already much too
> > > > late because we've potentially made other configuration read requests in
> > > > the meantime?
> > > > 
> > > > Wouldn't it make more sense to push the CRS handling up a bit? The
> > > > existing pci_power_up() function seems like it would be a good place.
> > > > For example, adding code to deal with CRS right after the platform PCI
> > > > PM calls but before pci_raw_set_power_state() seems like it would fit
> > > > the restrictions given in the above quote from the specification.
> > > 
> > > Yep, I think that's the right point.  I'm trying to figure out how to
> > > integrate it.  Rafael suggests that delays may be platform-specific
> > > and should be in platform_pci_set_power_state(), but the CRS handling
> > > itself isn't platform-specific and maybe could be higher up.
> > > 
> > > I'm fishing to see if Tegra has some kind of power control for
> > > endpoints that is not related to the platform_pci_*() framework.  How
> > > did the endpoint get put in D3cold in the first place?  I assume
> > > something in the suspend path did that?  Maybe this happens when we
> > > suspend the Tegra RC itself, e.g., tegra_pcie_pm_suspend()?
> > > 
> > >    tegra_pcie_pm_suspend
> > >      tegra_pcie_phy_power_off
> > >      tegra_pcie_power_off
> > > 
> > >    tegra_pcie_pm_resume
> > >      tegra_pcie_power_on
> > >      tegra_pcie_phy_power_on
> > > 
> > > If a path like tegra_pcie_pm_resume() is causing the D3cold -> D0
> > > transition for the endpoint, I don't think we want to do CRS handling
> > > there because that path shouldn't be touching the endpoint.  But maybe
> > > it should be doing the delays required by PCIe r5.0, sec 6.6.1, before
> > > any config accesses are issued to devices.
> >
> > Here, I'm exercising suspend-to-RAM sequence and the PCIe endpoint of
> > concern is Intel 750 NVMe drive.
> > PCIe host controller driver already has 100ms of delay as per the spec,
> 
> To make this more concrete, where specifically is this delay?
> 
> > but this particular device is taking 1023ms to get ready to respond to
> > configuration space requests (till that time, it responds with
> > configuration request retry statuses)
> > I've put a dump_stack () to see the path resume sequence takes and here it is
> > 
> >  Call trace:
> >   dump_backtrace+0x0/0x158
> >   show_stack+0x14/0x20
> >   dump_stack+0xb0/0xf4
> >   pci_bus_generic_read_dev_vendor_id+0x19c/0x1a0
> >   pci_bus_read_dev_vendor_id+0x48/0x70
> >   pci_update_current_state+0x68/0xd8
> >   pci_power_up+0x40/0x50
> >   pci_pm_resume_noirq+0x7c/0x138
> >   dpm_run_callback.isra.16+0x20/0x70
> >   device_resume_noirq+0x120/0x238
> >   async_resume_noirq+0x24/0x58
> >   async_run_entry_fn+0x40/0x148
> >   process_one_work+0x1e8/0x360
> >   worker_thread+0x40/0x488
> >   kthread+0x118/0x120
> >   ret_from_fork+0x10/0x1c
> >  pci 0005:01:00.0: ready after 1023ms
> > 
> > Spec also mentions the following
> >     Unless Readiness Notifications mechanisms are used (see Section
> >     6.23), the Root Complex and/or system software must allow at
> >     least 1.0 s after a Conventional Reset of a device, before it
> >     may determine that a device which fails to return a Successful
> >     Completion status for a valid Configuration Request is a broken
> >     device. This period is independent of how quickly Link training
> >     completes.
> > 
> > My understanding is that this 1sec waiting is supposed to be done by
> > the PCIe sub-system and not the host controller driver.
> 
> That doesn't say we must wait 1s; it only says we can't decide the
> device is broken before 1s.  But regardless, I agree that the CRS
> handling doesn't belong in the driver for either the endpoint or the
> PCIe host controller.
> 
> My question is whether this wait should be connected somehow with
> platform_pci_set_power_state().  It sounds like the tegra host
> controller driver already does the platform-specific delays, and I'm
> not sure it's reasonable for platform_pci_set_power_state() to do the
> CRS polling.  Maybe something like this?  I'd really like to get
> Rafael's thinking here.
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e7982af9a5d8..052fa316c917 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev)
>   */
>  void pci_power_up(struct pci_dev *dev)
>  {
> +	pci_power_state_t prev_state = dev->current_state;
> +
>  	if (platform_pci_power_manageable(dev))
>  		platform_pci_set_power_state(dev, PCI_D0);
>  
> +	if (prev_state == PCI_D3cold)
> +		pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);

Is there any reason in particular why you chose to call pci_dev_wait()?
It seems to me like that's a little broader than pci_bus_wait_crs(). The
latter is static in drivers/pci/probe.c so we'd need to change that in
order to use it from drivers/pci/pci.c, but it sounds like the more
explicit thing to do.

Thierry

> +
>  	pci_raw_set_power_state(dev, PCI_D0);
>  	pci_update_current_state(dev, PCI_D0);
>  }
> 
> > FWIW, this particular device is taking a little more time than 1 sec
> > (i.e. 1023 ms) I'm now wondering why is it that the CRS has a
> > timeout of 60 secs than just 1 sec?
> 
> pci_bus_wait_crs() does an exponential backoff, i.e., it does reads
> after 0ms, 2ms, 4ms, 8ms, ..., 512ms, 1024ms, so we don't know
> *exactly* when the device became ready.  All we know is that it
> became ready somewhere between 512ms and 1024ms.
> 
> But 821cdad5c46c ("PCI: Wait up to 60 seconds for device to become
> ready after FLR") does suggest that the Intel 750 NVMe may require
> more than 1s.  I think the 60s timeout is just a convenient large
> number and I'm not sure it's derived from anything in the spec.
> 
> > > [1] https://lore.kernel.org/r/11429373.7ySiFsEkgL@kreacher
>