All of lore.kernel.org
 help / color / mirror / Atom feed
From: Thierry Reding <treding@nvidia.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Vidya Sagar <vidyas@nvidia.com>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Sinan Kaya <okaya@kernel.org>,
	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>,
	<jonathanh@nvidia.com>, <linux-tegra@vger.kernel.org>,
	<linux-pci@vger.kernel.org>, <kthota@nvidia.com>,
	<mmaddireddy@nvidia.com>, <sagar.tv@gmail.com>,
	Andrew Murray <andrew.murray@arm.com>,
	Lukas Wunner <lukas@wunner.de>
Subject: Re: [PATCH] PCI: Add CRS timeout for pci_device_is_present()
Date: Wed, 13 Nov 2019 12:20:43 +0100	[thread overview]
Message-ID: <20191113112043.GA329424@ulmo> (raw)
In-Reply-To: <20191112185844.GA191944@google.com>

[-- Attachment #1: Type: text/plain, Size: 10635 bytes --]

On Tue, Nov 12, 2019 at 12:58:44PM -0600, Bjorn Helgaas wrote:
> On Tue, Nov 12, 2019 at 11:29:55PM +0530, Vidya Sagar wrote:
> > On 11/12/2019 7:51 PM, Bjorn Helgaas wrote:
> > > On Tue, Nov 12, 2019 at 01:59:23PM +0100, Thierry Reding wrote:
> > > > On Mon, Nov 11, 2019 at 04:32:35PM -0600, Bjorn Helgaas wrote:
> > > > > On Mon, Nov 11, 2019 at 11:31:18AM +0530, Vidya Sagar wrote:
> > > > > > On 11/6/2019 10:11 PM, Bjorn Helgaas wrote:
> > > > > 
> > > > > > > Based on Vidya's backtrace, I think the resume path with problems
> > > > > > > is this:
> > > > > > > 
> > > > > > >     pci_pm_resume_noirq
> > > > > > >       pci_pm_default_resume_early
> > > > > > >         pci_power_up
> > > > > > >           if (platform_pci_power_manageable(dev))
> > > > > > >             platform_pci_set_power_state(dev, PCI_D0)  # <-- FW delay here?
> > > > > > >           pci_raw_set_power_state
> > > > > > >           pci_update_current_state
> > > > > > >             pci_device_is_present        # <-- config read returns CRS
> > > > > > > 
> > > > > > > So I think your suggestion is that Vidya's firmware should be
> > > > > > > doing the delay inside platform_pci_set_power_state()?
> > > > > > > 
> > > > > > > Vidya, you typically work on Tegra, so I assume this is on an
> > > > > > > arm64 system?  Does it have ACPI?  Do you have access to the
> > > > > > > firmware developers to ask about who they expect to do the delays?
> > > > > > 
> > > > > > Yes. This is on arm64 (Tegra) and we don't have any ACPI or any
> > > > > > other firmware for that matter. PCIe is brought up directly in the
> > > > > > kernel.
> > > > > 
> > > > > I assume that your device is coming out of D3cold because apparently
> > > > > you're seeing a CRS status from the config read when
> > > > > pci_update_current_state() calls pci_device_is_present().  CRS status
> > > > > should only happen after reset or power-on from D3cold, and you're not
> > > > > doing a reset.
> > > > > 
> > > > > I'm pretty sure platform_pci_power_manageable() returns false on
> > > > > your system (can you confirm that?) because the only scenarios with
> > > > > platform power management are MID (Intel platform) and ACPI (which you
> > > > > don't have).
> > > > > 
> > > > > Maybe you have some other platform-specific mechanism that controls
> > > > > power to PCI devices, and it's not integrated into the
> > > > > platform_pci_*() framework?
> > > > 
> > > > My understanding after reading the PCIe specification is that CRS is a
> > > > mechanism that allows an endpoint to signal that it isn't ready yet for
> > > > operation after reset or power-on from D3cold. There's nothing in there
> > > > that's platform specific. This is really only for specific endpoints.
> > > > 
> > > > I don't see how adding platform specific PM code would help in this
> > > > case. At a platform level we don't know if users are going to plug in a
> > > > PCI endpoint that needs a long delay before it's ready after reset and/
> > > > or exit from D3cold.
> > > 
> > > Right, see below.
> > > 
> > > > I do understand that perhaps pci_device_is_present() is perhaps not the
> > > > best place to do complex CRS handling, but if a mechanism is clearly
> > > > described in the specification, isn't it something that should be dealt
> > > > with in the core? That way we don't have to quirk this for every device
> > > > and platform.
> > > 
> > > Definitely; we don't want quirks for endpoints (unless they're
> > > actually broken) or for platforms (unless there's a platform hardware
> > > or firmware defect).
> > > 
> > > There's no question that we need to delay and handle CRS after
> > > power-on from D3cold.  I'm trying to get at the point that PCI itself
> > > doesn't tell us how to do that power-on.  The mechanisms defined by
> > > PCI rely on config space, which is only accessible in D0-D3hot, not
> > > D3cold.  Power-on from D3cold can only happen via ACPI methods or
> > > other platform-specific mechanisms, and the current design abstracts
> > > those via platform_pci_set_power_state().  This is partly based on
> > > Rafael's response in [1].
> > > 
> > > > The PCIe specification says that:
> > > > 
> > > > 	Software that intends to take advantage of this mechanism must
> > > > 	ensure that the first access made to a device following a valid
> > > > 	reset condition is a Configuration Read Request accessing both
> > > > 	bytes of the Vendor ID field in the device's Configuration Space
> > > > 	header.
> > > > 
> > > > So doesn't that mean that pci_device_is_present() is already much too
> > > > late because we've potentially made other configuration read requests in
> > > > the meantime?
> > > > 
> > > > Wouldn't it make more sense to push the CRS handling up a bit? The
> > > > existing pci_power_up() function seems like it would be a good place.
> > > > For example, adding code to deal with CRS right after the platform PCI
> > > > PM calls but before pci_raw_set_power_state() seems like it would fit
> > > > the restrictions given in the above quote from the specification.
> > > 
> > > Yep, I think that's the right point.  I'm trying to figure out how to
> > > integrate it.  Rafael suggests that delays may be platform-specific
> > > and should be in platform_pci_set_power_state(), but the CRS handling
> > > itself isn't platform-specific and maybe could be higher up.
> > > 
> > > I'm fishing to see if Tegra has some kind of power control for
> > > endpoints that is not related to the platform_pci_*() framework.  How
> > > did the endpoint get put in D3cold in the first place?  I assume
> > > something in the suspend path did that?  Maybe this happens when we
> > > suspend the Tegra RC itself, e.g., tegra_pcie_pm_suspend()?
> > > 
> > >    tegra_pcie_pm_suspend
> > >      tegra_pcie_phy_power_off
> > >      tegra_pcie_power_off
> > > 
> > >    tegra_pcie_pm_resume
> > >      tegra_pcie_power_on
> > >      tegra_pcie_phy_power_on
> > > 
> > > If a path like tegra_pcie_pm_resume() is causing the D3cold -> D0
> > > transition for the endpoint, I don't think we want to do CRS handling
> > > there because that path shouldn't be touching the endpoint.  But maybe
> > > it should be doing the delays required by PCIe r5.0, sec 6.6.1, before
> > > any config accesses are issued to devices.
> >
> > Here, I'm exercising suspend-to-RAM sequence and the PCIe endpoint of
> > concern is Intel 750 NVMe drive.
> > PCIe host controller driver already has 100ms of delay as per the spec,
> 
> To make this more concrete, where specifically is this delay?
> 
> > but this particular device is taking 1023ms to get ready to respond to
> > configuration space requests (till that time, it responds with
> > configuration request retry statuses)
> > I've put a dump_stack () to see the path resume sequence takes and here it is
> > 
> >  Call trace:
> >   dump_backtrace+0x0/0x158
> >   show_stack+0x14/0x20
> >   dump_stack+0xb0/0xf4
> >   pci_bus_generic_read_dev_vendor_id+0x19c/0x1a0
> >   pci_bus_read_dev_vendor_id+0x48/0x70
> >   pci_update_current_state+0x68/0xd8
> >   pci_power_up+0x40/0x50
> >   pci_pm_resume_noirq+0x7c/0x138
> >   dpm_run_callback.isra.16+0x20/0x70
> >   device_resume_noirq+0x120/0x238
> >   async_resume_noirq+0x24/0x58
> >   async_run_entry_fn+0x40/0x148
> >   process_one_work+0x1e8/0x360
> >   worker_thread+0x40/0x488
> >   kthread+0x118/0x120
> >   ret_from_fork+0x10/0x1c
> >  pci 0005:01:00.0: ready after 1023ms
> > 
> > Spec also mentions the following
> >     Unless Readiness Notifications mechanisms are used (see Section
> >     6.23), the Root Complex and/or system software must allow at
> >     least 1.0 s after a Conventional Reset of a device, before it
> >     may determine that a device which fails to return a Successful
> >     Completion status for a valid Configuration Request is a broken
> >     device. This period is independent of how quickly Link training
> >     completes.
> > 
> > My understanding is that this 1sec waiting is supposed to be done by
> > the PCIe sub-system and not the host controller driver.
> 
> That doesn't say we must wait 1s; it only says we can't decide the
> device is broken before 1s.  But regardless, I agree that the CRS
> handling doesn't belong in the driver for either the endpoint or the
> PCIe host controller.
> 
> My question is whether this wait should be connected somehow with
> platform_pci_set_power_state().  It sounds like the tegra host
> controller driver already does the platform-specific delays, and I'm
> not sure it's reasonable for platform_pci_set_power_state() to do the
> CRS polling.  Maybe something like this?  I'd really like to get
> Rafael's thinking here.
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e7982af9a5d8..052fa316c917 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -964,9 +964,14 @@ void pci_refresh_power_state(struct pci_dev *dev)
>   */
>  void pci_power_up(struct pci_dev *dev)
>  {
> +	pci_power_state_t prev_state = dev->current_state;
> +
>  	if (platform_pci_power_manageable(dev))
>  		platform_pci_set_power_state(dev, PCI_D0);
>  
> +	if (prev_state == PCI_D3cold)
> +		pci_dev_wait(dev, "D3cold->D0", PCIE_RESET_READY_POLL_MS);

Is there any reason in particular why you chose to call pci_dev_wait()?
It seems to me like that's a little broader than pci_bus_wait_crs(). The
latter is static in drivers/pci/probe.c so we'd need to change that in
order to use it from drivers/pci/pci.c, but it sounds like the more
explicit thing to do.

Thierry

> +
>  	pci_raw_set_power_state(dev, PCI_D0);
>  	pci_update_current_state(dev, PCI_D0);
>  }
> 
> > FWIW, this particular device is taking a little more time than 1 sec
> > (i.e. 1023 ms) I'm now wondering why is it that the CRS has a
> > timeout of 60 secs than just 1 sec?
> 
> pci_bus_wait_crs() does an exponential backoff, i.e., it does reads
> after 0ms, 2ms, 4ms, 8ms, ..., 512ms, 1024ms, so we don't know
> *exactly* when the device became ready.  All we know is that it
> became ready somewhere between 512ms and 1024ms.
> 
> But 821cdad5c46c ("PCI: Wait up to 60 seconds for device to become
> ready after FLR") does suggest that the Intel 750 NVMe may require
> more than 1s.  I think the 60s timeout is just a convenient large
> number and I'm not sure it's derived from anything in the spec.
> 
> > > [1] https://lore.kernel.org/r/11429373.7ySiFsEkgL@kreacher
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2019-11-13 11:20 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-05 18:21 [PATCH] PCI: Add CRS timeout for pci_device_is_present() Vidya Sagar
2019-10-14  8:20 ` Thierry Reding
2019-10-14 20:21   ` Sinan Kaya
2019-10-15  9:30     ` Thierry Reding
2019-10-15 11:10       ` Sinan Kaya
2019-10-15 12:14         ` Vidya Sagar
     [not found]           ` <afa16546-e63d-6eba-8be0-8e52339cd100@nvidia.com>
2019-10-25 11:58             ` Vidya Sagar
2019-10-26 13:59               ` Sinan Kaya
2019-11-04 11:43                 ` Vidya Sagar
2019-11-04 16:52                   ` Lorenzo Pieralisi
2019-11-04 17:39           ` Bjorn Helgaas
2019-11-05 10:55             ` Rafael J. Wysocki
2019-11-06 16:41               ` Bjorn Helgaas
2019-11-11  6:01                 ` Vidya Sagar
2019-11-11 22:32                   ` Bjorn Helgaas
2019-11-12 12:59                     ` Thierry Reding
2019-11-12 14:21                       ` Bjorn Helgaas
2019-11-12 17:59                         ` Vidya Sagar
2019-11-12 18:58                           ` Bjorn Helgaas
2019-11-13  5:39                             ` Vidya Sagar
2019-11-13 11:20                             ` Thierry Reding [this message]
2019-11-14 18:36                               ` Bjorn Helgaas
2019-11-15 10:04                                 ` Vidya Sagar
2019-11-15 22:36                                   ` Bjorn Helgaas
2019-11-18 15:18                                     ` Vidya Sagar
2019-11-12 17:59                     ` Vidya Sagar
2019-10-15 12:03       ` Vidya Sagar
2019-10-15 11:34     ` Vidya Sagar
2019-10-14 10:45 ` Andrew Murray

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191113112043.GA329424@ulmo \
    --to=treding@nvidia.com \
    --cc=andrew.murray@arm.com \
    --cc=helgaas@kernel.org \
    --cc=jonathanh@nvidia.com \
    --cc=kthota@nvidia.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-tegra@vger.kernel.org \
    --cc=lorenzo.pieralisi@arm.com \
    --cc=lukas@wunner.de \
    --cc=mmaddireddy@nvidia.com \
    --cc=okaya@kernel.org \
    --cc=rjw@rjwysocki.net \
    --cc=sagar.tv@gmail.com \
    --cc=vidyas@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.