linux-pci.vger.kernel.org archive mirror
* Thunderbolt, direct-complete and long suspend/resume time of Suspend-to-idle
From: Kai-Heng Feng @ 2020-03-11  5:39 UTC
  To: Rafael Wysocki, Mika Westerberg
  Cc: Shih-Yuan Lee (FourDollars), Tiffany, Linux PCI, Linux PM, open list

Hi,

I am currently investigating the long suspend and resume times of suspend-to-idle.
The cause is that Thunderbolt bridges need to wait for 1100ms [1] to runtime-resume on system suspend, and again on system resume.

I made a quick hack to the USB driver and xHCI driver to support direct-complete, but I failed to do the same for the parent PCIe bridge, as the PM core always disables direct-complete for it [2] since device_may_wakeup() returns true for the device:

	/* Avoid direct_complete to let wakeup_path propagate. */
		if (device_may_wakeup(dev) || dev->power.wakeup_path)
			dev->power.direct_complete = false;

Once direct-complete is disabled, the regular system suspend/resume path is used, hence the delay in [1] makes the resume really slow.
So how do we make suspend-to-idle faster? I have some ideas, but I am not sure whether they are feasible:
- Make the PM core aware that runtime_suspend() already uses the same wakeup setting as suspend(), so it doesn't need the device_may_wakeup() check to decide on direct-complete.
- Remove the DPM_FLAG_NEVER_SKIP flag from the pcieport driver, and use pm_request_resume() in its complete() callback to avoid blocking the resume process (see the sketch after this list).
- Reduce the 1100ms delay. Maybe someone knows the values used in macOS and Windows...
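
A minimal sketch of the second idea, just to illustrate the direction
(the callback name and placement are made up, and this is not a tested
patch):

	#include <linux/device.h>
	#include <linux/pm_runtime.h>

	/*
	 * Hypothetical ->complete() callback for the PCIe port driver:
	 * queue an asynchronous runtime resume instead of resuming
	 * synchronously, so the system resume path is not blocked by
	 * the long link wait.
	 */
	static void pcie_port_pm_complete(struct device *dev)
	{
		/* Non-blocking: the resume runs later via the PM workqueue. */
		pm_request_resume(dev);
	}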

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pci.c#n4621
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/power/main.c#n1748

Kai-Heng


* Re: Thunderbolt, direct-complete and long suspend/resume time of Suspend-to-idle
From: Mika Westerberg @ 2020-03-11 10:38 UTC
  To: Kai-Heng Feng
  Cc: Rafael Wysocki, Shih-Yuan Lee (FourDollars),
	Tiffany, Linux PCI, Linux PM, open list

On Wed, Mar 11, 2020 at 01:39:51PM +0800, Kai-Heng Feng wrote:
> Hi,
> 
> I am currently investigating long suspend and resume time of suspend-to-idle.
> It's because Thunderbolt bridges need to wait for 1100ms [1] for runtime-resume on system suspend, and also for system resume.
> 
> I made a quick hack to the USB driver and xHCI driver to support direct-complete, but I failed to do so for the parent PCIe bridge as it always disables the direct-complete [2], since device_may_wakeup() returns true for the device:
> 
> 	/* Avoid direct_complete to let wakeup_path propagate. */
> 		if (device_may_wakeup(dev) || dev->power.wakeup_path)
> 			dev->power.direct_complete = false;

You need to be careful here, because otherwise you end up in a situation where
the link is not properly trained and we tear down the whole tree of
devices, which is worse than waiting a bit more for resume.

> Once the direct-complete is disabled, system suspend/resume is used hence the delay in [1] is making the resume really slow. 
> So how do we make suspend-to-idle faster? I have some ideas but I am not sure if they are feasible:
> - Make PM core know the runtime_suspend() already use the same wakeup as suspend(), so it doesn't need to use device_may_wakeup() check to determine direct-complete.
> - Remove the DPM_FLAG_NEVER_SKIP flag in pcieport driver, and use pm_request_resume() in its complete() callback to prevent blocking the resume process.
> - Reduce the 1100ms delay. Maybe someone knows the values used in macOS and Windows...

Which system is this? ICL? I think it is only the TBT root ports that do
not support active link reporting. The PCIe spec is not entirely clear
about root ports since it explicitly mentions only downstream ports, so
one option would be to check for a root port that supports gen 3 speeds
and, based on that, wait for at most, say, 2 * 100ms or something like
that.
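
Something along these lines is what such a check might look like (an
untested, illustrative sketch; the helper name and the 200ms cap are
made up):

	#include <linux/pci.h>

	/*
	 * Rough sketch only: use a shorter worst-case wait for Root Ports
	 * that lack link active reporting but are gen 3 capable, instead
	 * of the 1000ms + delay used when the link state cannot be
	 * observed.
	 */
	static int hypothetical_link_wait_ms(struct pci_dev *bridge, int delay)
	{
		if (!bridge->link_active_reporting &&
		    pci_pcie_type(bridge) == PCI_EXP_TYPE_ROOT_PORT &&
		    pcie_get_speed_cap(bridge) >= PCIE_SPEED_8_0GT)
			return 2 * 100;	/* e.g. cap the wait at 200ms */

		return 1000 + delay;	/* conservative wait otherwise */
	}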


* Re: Thunderbolt, direct-complete and long suspend/resume time of Suspend-to-idle
From: Kai-Heng Feng @ 2020-03-12  4:41 UTC
  To: Mika Westerberg
  Cc: Rafael Wysocki, Shih-Yuan Lee (FourDollars),
	Tiffany, Linux PCI, Linux PM, open list



> On Mar 11, 2020, at 18:38, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> 
> On Wed, Mar 11, 2020 at 01:39:51PM +0800, Kai-Heng Feng wrote:
>> Hi,
>> 
>> I am currently investigating long suspend and resume time of suspend-to-idle.
>> It's because Thunderbolt bridges need to wait for 1100ms [1] for runtime-resume on system suspend, and also for system resume.
>> 
>> I made a quick hack to the USB driver and xHCI driver to support direct-complete, but I failed to do so for the parent PCIe bridge as it always disables the direct-complete [2], since device_may_wakeup() returns true for the device:
>> 
>> 	/* Avoid direct_complete to let wakeup_path propagate. */
>> 		if (device_may_wakeup(dev) || dev->power.wakeup_path)
>> 			dev->power.direct_complete = false;
> 
> You need to be careful here because otherwise you end up situation where
> the link is not properly trained and we tear down the whole tree of
> devices which is worse than waiting bit more for resume.

My idea is to use direct-complete when there's no PCI or USB device plugged into the TBT, and use pm_request_resume() in complete() so it won't block resume() or resume_noirq().

> 
>> Once the direct-complete is disabled, system suspend/resume is used hence the delay in [1] is making the resume really slow. 
>> So how do we make suspend-to-idle faster? I have some ideas but I am not sure if they are feasible:
>> - Make PM core know the runtime_suspend() already use the same wakeup as suspend(), so it doesn't need to use device_may_wakeup() check to determine direct-complete.
>> - Remove the DPM_FLAG_NEVER_SKIP flag in pcieport driver, and use pm_request_resume() in its complete() callback to prevent blocking the resume process.
>> - Reduce the 1100ms delay. Maybe someone knows the values used in macOS and Windows...
> 
> Which system this is? ICL?

CML-H + Titan Ridge.

> I think it is the TBT root ports only that do
> not support active link reporting. The PCIe spec is not entirely clear
> about root ports since it explictly mentions only downstream ports so
> one option would be to check for root port and that it supports gen 3
> speeds and based on that wait for max say 2 * 100ms or something like
> that.

So 200ms for the root port, but still 1100ms for downstream ports?

Kai-Heng


* Re: Thunderbolt, direct-complete and long suspend/resume time of Suspend-to-idle
From: Mika Westerberg @ 2020-03-12  8:15 UTC
  To: Kai-Heng Feng
  Cc: Rafael Wysocki, Shih-Yuan Lee (FourDollars),
	Tiffany, Linux PCI, Linux PM, open list

On Thu, Mar 12, 2020 at 12:41:08PM +0800, Kai-Heng Feng wrote:
> 
> 
> > On Mar 11, 2020, at 18:38, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> > 
> > On Wed, Mar 11, 2020 at 01:39:51PM +0800, Kai-Heng Feng wrote:
> >> Hi,
> >> 
> >> I am currently investigating long suspend and resume time of suspend-to-idle.
> >> It's because Thunderbolt bridges need to wait for 1100ms [1] for runtime-resume on system suspend, and also for system resume.
> >> 
> >> I made a quick hack to the USB driver and xHCI driver to support direct-complete, but I failed to do so for the parent PCIe bridge as it always disables the direct-complete [2], since device_may_wakeup() returns true for the device:
> >> 
> >> 	/* Avoid direct_complete to let wakeup_path propagate. */
> >> 		if (device_may_wakeup(dev) || dev->power.wakeup_path)
> >> 			dev->power.direct_complete = false;
> > 
> > You need to be careful here because otherwise you end up situation where
> > the link is not properly trained and we tear down the whole tree of
> > devices which is worse than waiting bit more for resume.
> 
> My idea is to direct-complete when there's no PCI or USB device
> plugged into the TBT, and use pm_reuqest_resume() in complete() so it
> won't block resume() or resume_noirq().

Before doing that..

> >> Once the direct-complete is disabled, system suspend/resume is used hence the delay in [1] is making the resume really slow. 
> >> So how do we make suspend-to-idle faster? I have some ideas but I am not sure if they are feasible:
> >> - Make PM core know the runtime_suspend() already use the same wakeup as suspend(), so it doesn't need to use device_may_wakeup() check to determine direct-complete.
> >> - Remove the DPM_FLAG_NEVER_SKIP flag in pcieport driver, and use pm_request_resume() in its complete() callback to prevent blocking the resume process.
> >> - Reduce the 1100ms delay. Maybe someone knows the values used in macOS and Windows...
> > 
> > Which system this is? ICL?
> 
> CML-H + Titan Ridge.

.. we should really understand this better because CML-H PCH root ports
and Titan/Alpine Ridge downstream ports all support active link
reporting so instead of the 1000+100ms you should see something like
this:

  1. Wait for the link + 100ms for the root port
  2. Wait for the link + 100ms for the Titan Ridge downstream ports
    (these are run in parallel wrt all Titan Ridge downstream ports that have
     something connected)

If there is a TBT device connected then 2. is repeated for it and so on.

So the 1000ms+ is really unexpected. Are you running a mainline kernel,
and if so, can you share dmesg with CONFIG_PCI_DEBUG=y so we can see the
delays there? Maybe also add some debugging to
pcie_wait_for_link_delay() where it checks for
!pdev->link_active_reporting and waits for 1100ms.
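
For example (illustrative only; the helper below is hypothetical), a
print along these lines next to that check would show when the fixed
1000ms + delay sleep is taken and for which port:

	/*
	 * Hypothetical debug helper: report when the
	 * !link_active_reporting path and its fixed sleep would be taken.
	 */
	static void report_fixed_link_wait(struct pci_dev *pdev, int delay)
	{
		if (!pdev->link_active_reporting)
			pci_info(pdev, "no link active reporting, waiting %d ms\n",
				 1000 + delay);
	}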


* Re: Thunderbolt, direct-complete and long suspend/resume time of Suspend-to-idle
From: Kai-Heng Feng @ 2020-03-12 10:10 UTC
  To: Mika Westerberg
  Cc: Rafael Wysocki, Shih-Yuan Lee (FourDollars),
	Tiffany, Linux PCI, Linux PM, open list



> On Mar 12, 2020, at 16:15, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> 
> On Thu, Mar 12, 2020 at 12:41:08PM +0800, Kai-Heng Feng wrote:
>> 
>> 
>>> On Mar 11, 2020, at 18:38, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
>>> 
>>> On Wed, Mar 11, 2020 at 01:39:51PM +0800, Kai-Heng Feng wrote:
>>>> Hi,
>>>> 
>>>> I am currently investigating long suspend and resume time of suspend-to-idle.
>>>> It's because Thunderbolt bridges need to wait for 1100ms [1] for runtime-resume on system suspend, and also for system resume.
>>>> 
>>>> I made a quick hack to the USB driver and xHCI driver to support direct-complete, but I failed to do so for the parent PCIe bridge as it always disables the direct-complete [2], since device_may_wakeup() returns true for the device:
>>>> 
>>>> 	/* Avoid direct_complete to let wakeup_path propagate. */
>>>> 		if (device_may_wakeup(dev) || dev->power.wakeup_path)
>>>> 			dev->power.direct_complete = false;
>>> 
>>> You need to be careful here because otherwise you end up situation where
>>> the link is not properly trained and we tear down the whole tree of
>>> devices which is worse than waiting bit more for resume.
>> 
>> My idea is to direct-complete when there's no PCI or USB device
>> plugged into the TBT, and use pm_reuqest_resume() in complete() so it
>> won't block resume() or resume_noirq().
> 
> Before doing that..
> 
>>>> Once the direct-complete is disabled, system suspend/resume is used hence the delay in [1] is making the resume really slow. 
>>>> So how do we make suspend-to-idle faster? I have some ideas but I am not sure if they are feasible:
>>>> - Make PM core know the runtime_suspend() already use the same wakeup as suspend(), so it doesn't need to use device_may_wakeup() check to determine direct-complete.
>>>> - Remove the DPM_FLAG_NEVER_SKIP flag in pcieport driver, and use pm_request_resume() in its complete() callback to prevent blocking the resume process.
>>>> - Reduce the 1100ms delay. Maybe someone knows the values used in macOS and Windows...
>>> 
>>> Which system this is? ICL?
>> 
>> CML-H + Titan Ridge.
> 
> .. we should really understand this better because CML-H PCH root ports
> and Titan/Alpine Ridge downstream ports all support active link
> reporting so instead of the 1000+100ms you should see something like
> this:

Root port for discrete graphics:
# lspci -vvnn -s 00:01.0                    
00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 02) (prog-if 00 [Normal decode])
        Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
                LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
   
Thunderbolt ports:
# lspci -vvvv -s 04:00
04:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
        Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
                LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

# lspci -vvnn -s 04:01
04:01.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
        Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
                LnkCap: Port #1, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

# lspci -vvnn -s 04:02 
04:02.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
        Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
                LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

So both CML-H PCH and TBT ports report "LLActRep-".

> 
>  1. Wait for the link + 100ms for the root port
>  2. Wait for the link + 100ms for the Titan Ridge downstream ports
>    (these are run paraller wrt all Titan Ridge downstream ports that have
>     something connected)
> 
> If there is a TBT device connected then 2. is repeated for it and so on.
> 
> So the 1000ms+ is really unexpected. Are you running mainline kernel and
> if so, can you share dmesg with CONFIG_PCI_DEBUG=y so we can see the
> delays there? Maybe also add some debugging to
> pcie_wait_for_link_delay() where it checks for the
> !pdev->link_active_reporting and waits for 1100ms.

I added the debug log in another thread, and it does reach the !pdev->link_active_reporting branch.

Let me see if patching link active reporting for the ports in PCI quirks can help.
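
A quirk along these lines is roughly what is meant (illustrative only,
and probably not a correct fix: forcing the flag on hardware that
genuinely cannot report DLL Link Active would be wrong; the device ID
is taken from the lspci output above):

	#include <linux/pci.h>

	/*
	 * Illustrative sketch: pretend the Titan Ridge ports support
	 * link active reporting so the shorter wait path is taken.
	 */
	static void quirk_force_link_active_reporting(struct pci_dev *pdev)
	{
		pdev->link_active_reporting = 1;
	}
	DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x15e7,
				quirk_force_link_active_reporting);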

Kai-Heng



* Re: Thunderbolt, direct-complete and long suspend/resume time of Suspend-to-idle
From: Mika Westerberg @ 2020-03-12 10:41 UTC
  To: Kai-Heng Feng
  Cc: Rafael Wysocki, Shih-Yuan Lee (FourDollars),
	Tiffany, Linux PCI, Linux PM, open list

On Thu, Mar 12, 2020 at 06:10:45PM +0800, Kai-Heng Feng wrote:
> 
> 
> > On Mar 12, 2020, at 16:15, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> > 
> > On Thu, Mar 12, 2020 at 12:41:08PM +0800, Kai-Heng Feng wrote:
> >> 
> >> 
> >>> On Mar 11, 2020, at 18:38, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> >>> 
> >>> On Wed, Mar 11, 2020 at 01:39:51PM +0800, Kai-Heng Feng wrote:
> >>>> Hi,
> >>>> 
> >>>> I am currently investigating long suspend and resume time of suspend-to-idle.
> >>>> It's because Thunderbolt bridges need to wait for 1100ms [1] for runtime-resume on system suspend, and also for system resume.
> >>>> 
> >>>> I made a quick hack to the USB driver and xHCI driver to support direct-complete, but I failed to do so for the parent PCIe bridge as it always disables the direct-complete [2], since device_may_wakeup() returns true for the device:
> >>>> 
> >>>> 	/* Avoid direct_complete to let wakeup_path propagate. */
> >>>> 		if (device_may_wakeup(dev) || dev->power.wakeup_path)
> >>>> 			dev->power.direct_complete = false;
> >>> 
> >>> You need to be careful here because otherwise you end up situation where
> >>> the link is not properly trained and we tear down the whole tree of
> >>> devices which is worse than waiting bit more for resume.
> >> 
> >> My idea is to direct-complete when there's no PCI or USB device
> >> plugged into the TBT, and use pm_reuqest_resume() in complete() so it
> >> won't block resume() or resume_noirq().
> > 
> > Before doing that..
> > 
> >>>> Once the direct-complete is disabled, system suspend/resume is used hence the delay in [1] is making the resume really slow. 
> >>>> So how do we make suspend-to-idle faster? I have some ideas but I am not sure if they are feasible:
> >>>> - Make PM core know the runtime_suspend() already use the same wakeup as suspend(), so it doesn't need to use device_may_wakeup() check to determine direct-complete.
> >>>> - Remove the DPM_FLAG_NEVER_SKIP flag in pcieport driver, and use pm_request_resume() in its complete() callback to prevent blocking the resume process.
> >>>> - Reduce the 1100ms delay. Maybe someone knows the values used in macOS and Windows...
> >>> 
> >>> Which system this is? ICL?
> >> 
> >> CML-H + Titan Ridge.
> > 
> > .. we should really understand this better because CML-H PCH root ports
> > and Titan/Alpine Ridge downstream ports all support active link
> > reporting so instead of the 1000+100ms you should see something like
> > this:
> 
> Root port for discrete graphics:
> # lspci -vvnn -s 00:01.0                    
> 00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 02) (prog-if 00 [Normal decode])
>         Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
>                 LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us
>                         ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
>                 LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

Interesting, Titan Ridge is connected to the graphics slot, no? What
system is this?

> Thunderbolt ports:
> # lspci -vvvv -s 04:00
> 04:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
>         Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
>                 LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
>                         ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
>                 LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

This one leads to the TBT NHI.

> # lspci -vvnn -s 04:01
> 04:01.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
>         Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
>                 LnkCap: Port #1, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
>                         ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
>                 LnkCtl: ASPM L1 Enabled; Disabled- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

This one is one of the extension downstream ports and it supports active
link reporting.

> # lspci -vvnn -s 04:02 
> 04:02.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
>         Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
>                 LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
>                         ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
>                 LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

This one leads to the xHCI.

> So both CML-H PCH and TBT ports report "LLActRep-".

So in pci_bridge_wait_for_secondary_bus() we only call
pcie_wait_for_link_delay() if the port supports speeds higher than 5
GT/s (gen2). Now if I read the above correctly, all the ports except the
root port support only 2.5 GT/s (gen1) speeds, so we should go to the
msleep(delay) branch and not call pcie_wait_for_link_delay() at all:

        if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
                pci_dbg(dev, "waiting %d ms for downstream link\n", delay);
                msleep(delay);
        } else {
                pci_dbg(dev, "waiting %d ms for downstream link, after activation\n",
                        delay);
                if (!pcie_wait_for_link_delay(dev, true, delay)) {
                        /* Did not train, no need to wait any further */
                        return;
                }
        }

The only explanation I have is that the delay itself is set to 1000ms for
some reason. Can you check whether that's the case, and then maybe check
where that delay is coming from?
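
One way to check that (a hypothetical helper, and it assumes the delay
is derived from the child devices' d3cold_delay values, which would
still need to be verified against pci.c) would be to dump those values
for the bus in question:

	#include <linux/pci.h>

	/*
	 * Hypothetical debugging helper: print each child device's
	 * d3cold_delay so an unexpectedly large value (e.g. 1000ms)
	 * stands out.
	 */
	static void dump_d3cold_delays(struct pci_bus *bus)
	{
		struct pci_dev *child;

		list_for_each_entry(child, &bus->devices, bus_list)
			pci_info(child, "d3cold_delay = %u ms\n",
				 child->d3cold_delay);
	}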

> >  1. Wait for the link + 100ms for the root port
> >  2. Wait for the link + 100ms for the Titan Ridge downstream ports
> >    (these are run paraller wrt all Titan Ridge downstream ports that have
> >     something connected)
> > 
> > If there is a TBT device connected then 2. is repeated for it and so on.
> > 
> > So the 1000ms+ is really unexpected. Are you running mainline kernel and
> > if so, can you share dmesg with CONFIG_PCI_DEBUG=y so we can see the
> > delays there? Maybe also add some debugging to
> > pcie_wait_for_link_delay() where it checks for the
> > !pdev->link_active_reporting and waits for 1100ms.
> 
> I added the debug log in another thread and it does reach !pdev->link_active_reporting.

Hmm, based on the above that should not happen :-(

> Let me see if patch link active reporting for the ports in PCI quirks can help.

Let's first investigate a bit more to understand what is going on.

I suggest creating a kernel.org bugzilla about this. Please include at
least the full dmesg and 'sudo lspci -vv' output, and of course the steps
you use to reproduce this.


* Re: Thunderbolt, direct-complete and long suspend/resume time of Suspend-to-idle
From: Kai-Heng Feng @ 2020-03-13  5:07 UTC
  To: Mika Westerberg
  Cc: Rafael Wysocki, Shih-Yuan Lee (FourDollars),
	Tiffany, Linux PCI, Linux PM, open list



> On Mar 12, 2020, at 18:41, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> 
> On Thu, Mar 12, 2020 at 06:10:45PM +0800, Kai-Heng Feng wrote:
>> 
>> 
>>> On Mar 12, 2020, at 16:15, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
>>> 
>>> On Thu, Mar 12, 2020 at 12:41:08PM +0800, Kai-Heng Feng wrote:
>>>> 
>>>> 
>>>>> On Mar 11, 2020, at 18:38, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
>>>>> 
>>>>> On Wed, Mar 11, 2020 at 01:39:51PM +0800, Kai-Heng Feng wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I am currently investigating long suspend and resume time of suspend-to-idle.
>>>>>> It's because Thunderbolt bridges need to wait for 1100ms [1] for runtime-resume on system suspend, and also for system resume.
>>>>>> 
>>>>>> I made a quick hack to the USB driver and xHCI driver to support direct-complete, but I failed to do so for the parent PCIe bridge as it always disables the direct-complete [2], since device_may_wakeup() returns true for the device:
>>>>>> 
>>>>>> 	/* Avoid direct_complete to let wakeup_path propagate. */
>>>>>> 		if (device_may_wakeup(dev) || dev->power.wakeup_path)
>>>>>> 			dev->power.direct_complete = false;
>>>>> 
>>>>> You need to be careful here because otherwise you end up situation where
>>>>> the link is not properly trained and we tear down the whole tree of
>>>>> devices which is worse than waiting bit more for resume.
>>>> 
>>>> My idea is to direct-complete when there's no PCI or USB device
>>>> plugged into the TBT, and use pm_reuqest_resume() in complete() so it
>>>> won't block resume() or resume_noirq().
>>> 
>>> Before doing that..
>>> 
>>>>>> Once the direct-complete is disabled, system suspend/resume is used hence the delay in [1] is making the resume really slow. 
>>>>>> So how do we make suspend-to-idle faster? I have some ideas but I am not sure if they are feasible:
>>>>>> - Make PM core know the runtime_suspend() already use the same wakeup as suspend(), so it doesn't need to use device_may_wakeup() check to determine direct-complete.
>>>>>> - Remove the DPM_FLAG_NEVER_SKIP flag in pcieport driver, and use pm_request_resume() in its complete() callback to prevent blocking the resume process.
>>>>>> - Reduce the 1100ms delay. Maybe someone knows the values used in macOS and Windows...
>>>>> 
>>>>> Which system this is? ICL?
>>>> 
>>>> CML-H + Titan Ridge.
>>> 
>>> .. we should really understand this better because CML-H PCH root ports
>>> and Titan/Alpine Ridge downstream ports all support active link
>>> reporting so instead of the 1000+100ms you should see something like
>>> this:
>> 
>> Root port for discrete graphics:
>> # lspci -vvnn -s 00:01.0                    
>> 00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 02) (prog-if 00 [Normal decode])
>>        Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
>>                LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us
>>                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
>>                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
>>                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> 
> Interesting, Titan Ridge is connected to the graphics slot, no? What
> system this is?

No, TBT connects to another port, which supports link active reporting.
This is just to show that not all CML-H ports support that.

> 
>> Thunderbolt ports:
>> # lspci -vvvv -s 04:00
>> 04:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
>>        Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
>>                LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
>>                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
>>                LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
>>                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> 
> This one leads to the TBT NHI.
> 
>> # lspci -vvnn -s 04:01
>> 04:01.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
>>        Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
>>                LnkCap: Port #1, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
>>                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
>>                LnkCtl: ASPM L1 Enabled; Disabled- CommClk-
>>                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> 
> This one is one of the extension downstream ports and it supports active
> link reporting.
> 
>> # lspci -vvnn -s 04:02 
>> 04:02.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
>>        Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
>>                LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
>>                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
>>                LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
>>                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> 
> This one leads to the xHCI.
> 
>> So both CML-H PCH and TBT ports report "LLActRep-".
> 
> So in pci_bridge_wait_for_secondary_bus() we only call
> pcie_wait_for_link_delay() if the port supports speeds higher than 5
> GT/s (gen2). Now if I read the above correct all the ports except the
> root port support 2.5 GT/s (gen1) speeds so we should go to the
> msleep(delay) branch and not call pcie_wait_for_link_delay() at all:
> 
>        if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
>                pci_dbg(dev, "waiting %d ms for downstream link\n", delay);
>                msleep(delay);
>        } else {
>                pci_dbg(dev, "waiting %d ms for downstream link, after activation\n",
>                        delay);
>                if (!pcie_wait_for_link_delay(dev, true, delay)) {
>                        /* Did not train, no need to wait any further */
>                        return;
>                }
>        }
> 
> Only explanation I have is that delay itself is set to 1000ms for some
> reason. Can you check if that's the case and then maybe check where that
> delay is coming from?
> 
>>> 1. Wait for the link + 100ms for the root port
>>> 2. Wait for the link + 100ms for the Titan Ridge downstream ports
>>>   (these are run paraller wrt all Titan Ridge downstream ports that have
>>>    something connected)
>>> 
>>> If there is a TBT device connected then 2. is repeated for it and so on.
>>> 
>>> So the 1000ms+ is really unexpected. Are you running mainline kernel and
>>> if so, can you share dmesg with CONFIG_PCI_DEBUG=y so we can see the
>>> delays there? Maybe also add some debugging to
>>> pcie_wait_for_link_delay() where it checks for the
>>> !pdev->link_active_reporting and waits for 1100ms.
>> 
>> I added the debug log in another thread and it does reach !pdev->link_active_reporting.
> 
> Hmm, based on the above that should not happen :-(
> 
>> Let me see if patch link active reporting for the ports in PCI quirks can help.
> 
> Let's first investigate bit more to understand what is going on.
> 
> I suggest to create kernel.org bugzilla about this. Please include full
> dmesg and 'sudo lspci -vv' output at least and of course the steps you
> use to reproduce this.

https://bugzilla.kernel.org/show_bug.cgi?id=206837

Kai-Heng



* Re: Thunderbolt, direct-complete and long suspend/resume time of Suspend-to-idle
From: Mika Westerberg @ 2020-03-13 11:31 UTC
  To: Kai-Heng Feng
  Cc: Rafael Wysocki, Shih-Yuan Lee (FourDollars),
	Tiffany, Linux PCI, Linux PM, open list

On Fri, Mar 13, 2020 at 01:07:35PM +0800, Kai-Heng Feng wrote:
> 
> 
> > On Mar 12, 2020, at 18:41, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> > 
> > On Thu, Mar 12, 2020 at 06:10:45PM +0800, Kai-Heng Feng wrote:
> >> 
> >> 
> >>> On Mar 12, 2020, at 16:15, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> >>> 
> >>> On Thu, Mar 12, 2020 at 12:41:08PM +0800, Kai-Heng Feng wrote:
> >>>> 
> >>>> 
> >>>>> On Mar 11, 2020, at 18:38, Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> >>>>> 
> >>>>> On Wed, Mar 11, 2020 at 01:39:51PM +0800, Kai-Heng Feng wrote:
> >>>>>> Hi,
> >>>>>> 
> >>>>>> I am currently investigating long suspend and resume time of suspend-to-idle.
> >>>>>> It's because Thunderbolt bridges need to wait for 1100ms [1] for runtime-resume on system suspend, and also for system resume.
> >>>>>> 
> >>>>>> I made a quick hack to the USB driver and xHCI driver to support direct-complete, but I failed to do so for the parent PCIe bridge as it always disables the direct-complete [2], since device_may_wakeup() returns true for the device:
> >>>>>> 
> >>>>>> 	/* Avoid direct_complete to let wakeup_path propagate. */
> >>>>>> 		if (device_may_wakeup(dev) || dev->power.wakeup_path)
> >>>>>> 			dev->power.direct_complete = false;
> >>>>> 
> >>>>> You need to be careful here because otherwise you end up situation where
> >>>>> the link is not properly trained and we tear down the whole tree of
> >>>>> devices which is worse than waiting bit more for resume.
> >>>> 
> >>>> My idea is to direct-complete when there's no PCI or USB device
> >>>> plugged into the TBT, and use pm_reuqest_resume() in complete() so it
> >>>> won't block resume() or resume_noirq().
> >>> 
> >>> Before doing that..
> >>> 
> >>>>>> Once the direct-complete is disabled, system suspend/resume is used hence the delay in [1] is making the resume really slow. 
> >>>>>> So how do we make suspend-to-idle faster? I have some ideas but I am not sure if they are feasible:
> >>>>>> - Make PM core know the runtime_suspend() already use the same wakeup as suspend(), so it doesn't need to use device_may_wakeup() check to determine direct-complete.
> >>>>>> - Remove the DPM_FLAG_NEVER_SKIP flag in pcieport driver, and use pm_request_resume() in its complete() callback to prevent blocking the resume process.
> >>>>>> - Reduce the 1100ms delay. Maybe someone knows the values used in macOS and Windows...
> >>>>> 
> >>>>> Which system this is? ICL?
> >>>> 
> >>>> CML-H + Titan Ridge.
> >>> 
> >>> .. we should really understand this better because CML-H PCH root ports
> >>> and Titan/Alpine Ridge downstream ports all support active link
> >>> reporting so instead of the 1000+100ms you should see something like
> >>> this:
> >> 
> >> Root port for discrete graphics:
> >> # lspci -vvnn -s 00:01.0                    
> >> 00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 02) (prog-if 00 [Normal decode])
> >>        Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
> >>                LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us
> >>                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
> >>                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
> >>                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > 
> > Interesting, Titan Ridge is connected to the graphics slot, no? What
> > system this is?
> 
> No, TBT connects to another port, which supports link active reporting.
> This is just to show not all CML-H ports support that.

Right.

> >> Thunderbolt ports:
> >> # lspci -vvvv -s 04:00
> >> 04:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
> >>        Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
> >>                LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
> >>                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
> >>                LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> >>                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > 
> > This one leads to the TBT NHI.
> > 
> >> # lspci -vvnn -s 04:01
> >> 04:01.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
> >>        Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
> >>                LnkCap: Port #1, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
> >>                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
> >>                LnkCtl: ASPM L1 Enabled; Disabled- CommClk-
> >>                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > 
> > This one is one of the extension downstream ports and it supports active
> > link reporting.
> > 
> >> # lspci -vvnn -s 04:02 
> >> 04:02.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 2C 2018] [8086:15e7] (rev 06) (prog-if 00 [Normal decode])
> >>        Capabilities: [c0] Express (v2) Downstream Port (Slot+), MSI 00
> >>                LnkCap: Port #2, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <64ns, L1 <1us
> >>                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
> >>                LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> >>                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > 
> > This one leads to the xHCI.
> > 
> >> So both CML-H PCH and TBT ports report "LLActRep-".
> > 
> > So in pci_bridge_wait_for_secondary_bus() we only call
> > pcie_wait_for_link_delay() if the port supports speeds higher than 5
> > GT/s (gen2). Now if I read the above correct all the ports except the
> > root port support 2.5 GT/s (gen1) speeds so we should go to the
> > msleep(delay) branch and not call pcie_wait_for_link_delay() at all:
> > 
> >        if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
> >                pci_dbg(dev, "waiting %d ms for downstream link\n", delay);
> >                msleep(delay);
> >        } else {
> >                pci_dbg(dev, "waiting %d ms for downstream link, after activation\n",
> >                        delay);
> >                if (!pcie_wait_for_link_delay(dev, true, delay)) {
> >                        /* Did not train, no need to wait any further */
> >                        return;
> >                }
> >        }
> > 
> > Only explanation I have is that delay itself is set to 1000ms for some
> > reason. Can you check if that's the case and then maybe check where that
> > delay is coming from?
> > 
> >>> 1. Wait for the link + 100ms for the root port
> >>> 2. Wait for the link + 100ms for the Titan Ridge downstream ports
> >>>   (these are run paraller wrt all Titan Ridge downstream ports that have
> >>>    something connected)
> >>> 
> >>> If there is a TBT device connected then 2. is repeated for it and so on.
> >>> 
> >>> So the 1000ms+ is really unexpected. Are you running mainline kernel and
> >>> if so, can you share dmesg with CONFIG_PCI_DEBUG=y so we can see the
> >>> delays there? Maybe also add some debugging to
> >>> pcie_wait_for_link_delay() where it checks for the
> >>> !pdev->link_active_reporting and waits for 1100ms.
> >> 
> >> I added the debug log in another thread and it does reach !pdev->link_active_reporting.
> > 
> > Hmm, based on the above that should not happen :-(
> > 
> >> Let me see if patch link active reporting for the ports in PCI quirks can help.
> > 
> > Let's first investigate bit more to understand what is going on.
> > 
> > I suggest to create kernel.org bugzilla about this. Please include full
> > dmesg and 'sudo lspci -vv' output at least and of course the steps you
> > use to reproduce this.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=206837

Thanks!

