From: Mika Westerberg <mika.westerberg@linux.intel.com>
To: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
"Rafael J. Wysocki" <rjw@rjwysocki.net>,
Len Brown <lenb@kernel.org>, Lukas Wunner <lukas@wunner.de>,
Keith Busch <keith.busch@intel.com>,
Alex Williamson <alex.williamson@redhat.com>,
Alexandru Gagniuc <mr.nuke.me@gmail.com>,
ACPI Devel Maling List <linux-acpi@vger.kernel.org>,
Linux PCI <linux-pci@vger.kernel.org>
Subject: Re: [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec
Date: Thu, 6 Jun 2019 14:24:30 +0300 [thread overview]
Message-ID: <20190606112430.GZ2781@lahna.fi.intel.com> (raw)
In-Reply-To: <CAJZ5v0hnb9LBoo+63Qvpa+-QQfPsFBoN7CDCLrWhcKzUppbw4A@mail.gmail.com>
On Thu, Jun 06, 2019 at 10:40:48AM +0200, Rafael J. Wysocki wrote:
> On Wed, Jun 5, 2019 at 4:58 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > Currently Linux does not follow PCIe spec regarding the required delays
> > after reset. A concrete example is a Thunderbolt add-in-card that
> > consists of a PCIe switch and two PCIe endpoints:
> >
> > +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
> > +-01.0-[04-36]-- DS hotplug port
> > +-02.0-[37]----00.0 xHCI controller
> > \-04.0-[38-6b]-- DS hotplug port
> >
> > The root port (1b.0) and the PCIe switch downstream ports are all PCIe
> > gen3 so they support 8GT/s link speeds.
> >
> > We wait for the PCIe hierarchy to enter D3cold (runtime):
> >
> > pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
> >
> > When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
> > PCIe switch is put to reset and its power is re-applied. This means that
> > we must follow the rules in PCIe 4.0 section 6.6.1.
> >
> > For the PCIe gen3 ports we are dealing with here, the following applies:
> >
> > With a Downstream Port that supports Link speeds greater than 5.0
> > GT/s, software must wait a minimum of 100 ms after Link training
> > completes before sending a Configuration Request to the device
> > immediately below that Port. Software can determine when Link training
> > completes by polling the Data Link Layer Link Active bit or by setting
> > up an associated interrupt (see Section 6.7.3.3).
> >
> > Translating this into the above topology we would need to do this (DLLLA
> > stands for Data Link Layer Link Active):
> >
> > pcieport 0000:00:1b.0: wait for 100ms after DLLLA is set before access to 0000:01:00.0
> > pcieport 0000:02:00.0: wait for 100ms after DLLLA is set before access to 0000:03:00.0
> > pcieport 0000:02:02.0: wait for 100ms after DLLLA is set before access to 0000:37:00.0
> >
> > I've instrumented the kernel with additional logging so we can see the
> > actual delays the kernel performs:
> >
> > pcieport 0000:00:1b.0: power state changed by ACPI to D0
> > pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
> > pcieport 0000:00:1b.0: waking up bus
> > pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
> > pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
> > ...
> > pcieport 0000:00:1b.0: PME# disabled
> > pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > ...
> > pcieport 0000:01:00.0: PME# disabled
> > pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > ...
> > pcieport 0000:02:00.0: PME# disabled
> > pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > ...
> > pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> > pcieport 0000:02:01.0: PME# disabled
> > pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > ...
> > pcieport 0000:02:02.0: PME# disabled
> > pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > ...
> > pcieport 0000:02:04.0: PME# disabled
> > pcieport 0000:02:01.0: PME# enabled
> > pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
> > pcieport 0000:02:04.0: PME# enabled
> > pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
> > thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
> > ...
> > thunderbolt 0000:03:00.0: PME# disabled
> > xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
> > ...
> > xhci_hcd 0000:37:00.0: PME# disabled
> >
> > For the switch upstream port (01:00.0) we wait for 100ms but not taking
> > into account the DLLLA requirement. We then wait 10ms for D3hot -> D0
> > transition of the root port and the two downstream hotplug ports. This
> > means that we deviate from what the spec requires.
> >
> > Performing the same check for system sleep (s2idle) transitions we can
> > see following when resuming from s2idle:
> >
> > pcieport 0000:00:1b.0: power state changed by ACPI to D0
> > pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
> > ...
> > pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > ...
> > pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
> > pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
> > pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
> > pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
> > pcieport 0000:02:01.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
> > pcieport 0000:02:02.0: restoring config space at offset 0x20 (was 0x0, writing 0x73f073f0)
> > pcieport 0000:02:04.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
> > pcieport 0000:02:01.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
> > pcieport 0000:02:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
> > pcieport 0000:02:02.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
> > pcieport 0000:02:04.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
> > pcieport 0000:02:01.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1ff10001)
> > pcieport 0000:02:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
> > pcieport 0000:02:02.0: restoring config space at offset 0x18 (was 0x0, writing 0x373702)
> > pcieport 0000:02:04.0: restoring config space at offset 0x24 (was 0x10001, writing 0x49f12001)
> > pcieport 0000:02:01.0: restoring config space at offset 0x20 (was 0x0, writing 0x73e05c00)
> > pcieport 0000:02:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
> > pcieport 0000:02:04.0: restoring config space at offset 0x20 (was 0x0, writing 0x89f07400)
> > pcieport 0000:02:01.0: restoring config space at offset 0x1c (was 0x101, writing 0x5151)
> > pcieport 0000:02:00.0: restoring config space at offset 0x20 (was 0x0, writing 0x8a008a00)
> > pcieport 0000:02:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> > pcieport 0000:02:04.0: restoring config space at offset 0x1c (was 0x101, writing 0x6161)
> > pcieport 0000:02:01.0: restoring config space at offset 0x18 (was 0x0, writing 0x360402)
> > pcieport 0000:02:00.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
> > pcieport 0000:02:04.0: restoring config space at offset 0x18 (was 0x0, writing 0x6b3802)
> > pcieport 0000:02:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> > pcieport 0000:02:00.0: restoring config space at offset 0x18 (was 0x0, writing 0x30302)
> > pcieport 0000:02:01.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> > pcieport 0000:02:04.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> > pcieport 0000:02:00.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
> > pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> > pcieport 0000:02:04.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> > pcieport 0000:02:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
> > xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
> > ...
> > thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
> >
> > This is even worse. None of the mandatory delays are performed. If this
> > would be S3 instead of s2idle then according to PCI FW spec 3.2 section
> > 4.6.8. there is a specific _DSM that allows the OS to skip the delays
> > but this platform does not provide the _DSM and does not go to S3 anyway
> > so no firmware is involved that could already handle these delays.
> >
> > In this particular Intel Coffee Lake platform these delays are not
> > actually needed because there is an additional delay as part of the ACPI
> > power resource that is used to turn on power to the hierarchy but since
> > that additional delay is not required by any of standards (PCIe, ACPI)
> > it is not present in the Intel Ice Lake, for example where missing the
> > mandatory delays causes pciehp to start tearing down the stack too early
> > (links are not yet trained).
> >
> > For this reason, change the PCIe portdrv PM resume hooks so that they
> > perform the mandatory delays before the downstream component gets
> > resumed. We perform the delays before port services are resumed because
> > otherwise pciehp might find that the link is not up (even if it is just
> > training) and tears-down the hierarchy.
> >
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
>
> Generally
>
> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Thanks!
> with a couple of nits below.
>
> > ---
> > drivers/pci/pci.c | 29 +++++++++------
> > drivers/pci/pci.h | 1 +
> > drivers/pci/pcie/portdrv_core.c | 62 +++++++++++++++++++++++++++++++++
> > 3 files changed, 82 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 8abc843b1615..87a1f902fa8e 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -1004,15 +1004,10 @@ static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
> > if (state == PCI_D0) {
> > pci_platform_power_transition(dev, PCI_D0);
> > /*
> > - * Mandatory power management transition delays, see
> > - * PCI Express Base Specification Revision 2.0 Section
> > - * 6.6.1: Conventional Reset. Do not delay for
> > - * devices powered on/off by corresponding bridge,
> > - * because have already delayed for the bridge.
> > + * Mandatory power management transition delays are
> > + * handled in the PCIe portdrv resume hooks.
> > */
> > if (dev->runtime_d3cold) {
> > - if (dev->d3cold_delay && !dev->imm_ready)
> > - msleep(dev->d3cold_delay);
> > /*
> > * When powering on a bridge from D3cold, the
> > * whole hierarchy may be powered on into
> > @@ -4568,14 +4563,16 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
> >
> > return pci_dev_wait(dev, "PM D3->D0", PCIE_RESET_READY_POLL_MS);
> > }
> > +
> > /**
> > - * pcie_wait_for_link - Wait until link is active or inactive
> > + * pcie_wait_for_link_delay - Wait until link is active or inactive
> > * @pdev: Bridge device
> > * @active: waiting for active or inactive?
> > + * @delay: Delay to wait after link has become active (in ms)
> > *
> > * Use this to wait till link becomes active or inactive.
> > */
> > -bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> > +bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay)
> > {
> > int timeout = 1000;
> > bool ret;
> > @@ -4612,13 +4609,25 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> > timeout -= 10;
> > }
> > if (active && ret)
> > - msleep(100);
> > + msleep(delay);
> > else if (ret != active)
> > pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
> > active ? "set" : "cleared");
> > return ret == active;
> > }
> >
> > +/**
> > + * pcie_wait_for_link - Wait until link is active or inactive
> > + * @pdev: Bridge device
> > + * @active: waiting for active or inactive?
> > + *
> > + * Use this to wait till link becomes active or inactive.
> > + */
> > +bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> > +{
> > + return pcie_wait_for_link_delay(pdev, active, 100);
> > +}
> > +
> > void pci_reset_secondary_bus(struct pci_dev *dev)
> > {
> > u16 ctrl;
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 9cb99380c61e..59802b3def4b 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -493,6 +493,7 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
> > void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
> > u32 service);
> >
> > +bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay);
> > bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
> > #ifdef CONFIG_PCIEASPM
> > void pcie_aspm_init_link_state(struct pci_dev *pdev);
> > diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> > index 1b330129089f..88d151a54be6 100644
> > --- a/drivers/pci/pcie/portdrv_core.c
> > +++ b/drivers/pci/pcie/portdrv_core.c
> > @@ -9,6 +9,7 @@
> > #include <linux/module.h>
> > #include <linux/pci.h>
> > #include <linux/kernel.h>
> > +#include <linux/delay.h>
> > #include <linux/errno.h>
> > #include <linux/pm.h>
> > #include <linux/pm_runtime.h>
> > @@ -378,6 +379,58 @@ static int pm_iter(struct device *dev, void *data)
> > return 0;
> > }
> >
> > +static int get_downstream_delay(struct pci_bus *bus)
> > +{
> > + struct pci_dev *pdev;
> > + int min_delay = 100;
> > + int max_delay = 0;
> > +
> > + list_for_each_entry(pdev, &bus->devices, bus_list) {
> > + if (!pdev->imm_ready)
> > + min_delay = 0;
> > + else if (pdev->d3cold_delay < min_delay)
> > + min_delay = pdev->d3cold_delay;
> > + if (pdev->d3cold_delay > max_delay)
> > + max_delay = pdev->d3cold_delay;
> > + }
> > +
> > + return max(min_delay, max_delay);
> > +}
> > +
> > +static void wait_for_downstream_link(struct pci_dev *pdev)
> > +{
> > + /*
> > + * Handle delays according to PCIe 4.0 section 6.6.1 before
> > + * configuration access to the downstream component is permitted.
> > + *
> > + * This blocks PCI core resume of the hierarchy below this port
> > + * until the link is trained.
> > + */
>
> Shouldn't the above go to a kerneldoc comment?
Yup I'll move it there.
> > + if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
> > + pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM) &&
> > + pdev->subordinate && !list_empty(&pdev->subordinate->devices) &&
> > + pdev->bridge_d3 && !pci_dev_is_disconnected(pdev)) {
>
> There's nothing except for this if () in the function, so maybe just
> return here if the condition is not satisfied?
OK
> Also maybe split it into a port type check (you can return right away
> if it fails), disconnected check and the rest?
OK
> > + int delay;
> > +
> > + delay = get_downstream_delay(pdev->subordinate);
> > + if (!delay)
> > + return;
> > +
> > + dev_dbg(&pdev->dev, "waiting downstream link for %d ms\n", delay);
> > +
> > + /*
> > + * If downstream port does not support speeds greater than
> > + * 5 GT/s need to wait 100ms. For higher speeds (gen3) we
> > + * need to wait first for the data link layer to become
> > + * active.
> > + */
> > + if (pcie_get_speed_cap(pdev) <= PCIE_SPEED_5_0GT)
> > + msleep(delay);
> > + else
> > + pcie_wait_for_link_delay(pdev, true, delay);
> > + }
> > +}
> > +
> > /**
> > * pcie_port_device_suspend - suspend port services associated with a PCIe port
> > * @dev: PCI Express port to handle
> > @@ -391,6 +444,13 @@ int pcie_port_device_suspend(struct device *dev)
> > int pcie_port_device_resume_noirq(struct device *dev)
> > {
> > size_t off = offsetof(struct pcie_port_service_driver, resume_noirq);
> > +
> > + /*
> > + * Wait for the link to be fully up before resuming port services.
> > + * This prevents pciehp from starting to tear-down the hierarchy
> > + * too soon.
> > + */
> > + wait_for_downstream_link(to_pci_dev(dev));
> > return device_for_each_child(dev, &off, pm_iter);
> > }
> >
> > @@ -421,6 +481,8 @@ int pcie_port_device_runtime_suspend(struct device *dev)
> > int pcie_port_device_runtime_resume(struct device *dev)
> > {
> > size_t off = offsetof(struct pcie_port_service_driver, runtime_resume);
> > +
>
> The comment from pcie_port_device_resume_noirq() is applicable here
> too, so maybe move it to a common place (kerneldoc?).
Sure.
>
> > + wait_for_downstream_link(to_pci_dev(dev));
> > return device_for_each_child(dev, &off, pm_iter);
> > }
> > #endif /* PM */
> > --
> > 2.20.1
> >
next prev parent reply other threads:[~2019-06-06 11:24 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-06-05 14:58 [PATCH 0/3] PCI: Power management improvements Mika Westerberg
2019-06-05 14:58 ` [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec Mika Westerberg
2019-06-06 8:40 ` Rafael J. Wysocki
2019-06-06 11:24 ` Mika Westerberg [this message]
2019-06-05 14:58 ` [PATCH 2/3] PCI: Do not poll for PME if the device is in D3cold Mika Westerberg
2019-06-05 19:05 ` Lukas Wunner
2019-06-06 11:21 ` Mika Westerberg
2019-06-09 18:38 ` Lukas Wunner
2019-06-10 11:35 ` Rafael J. Wysocki
2019-06-05 14:58 ` [PATCH 3/3] PCI / ACPI: Handle sibling devices sharing power resources Mika Westerberg
2019-06-06 8:54 ` Rafael J. Wysocki
2019-06-06 11:26 ` Mika Westerberg
2019-06-06 13:44 ` Mika Westerberg
2019-06-06 14:08 ` Rafael J. Wysocki
2019-06-06 14:17 ` Mika Westerberg
2019-06-06 14:27 ` Rafael J. Wysocki
2019-06-06 14:36 ` Mika Westerberg
2019-06-12 22:38 ` Rafael J. Wysocki
2019-06-13 12:52 ` Mika Westerberg
2019-06-13 13:51 ` Rafael J. Wysocki
2019-06-13 14:27 ` Mika Westerberg
2019-06-09 18:58 ` Lukas Wunner
2019-06-10 10:57 ` Rafael J. Wysocki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190606112430.GZ2781@lahna.fi.intel.com \
--to=mika.westerberg@linux.intel.com \
--cc=alex.williamson@redhat.com \
--cc=bhelgaas@google.com \
--cc=keith.busch@intel.com \
--cc=lenb@kernel.org \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=mr.nuke.me@gmail.com \
--cc=rafael@kernel.org \
--cc=rjw@rjwysocki.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).