All of lore.kernel.org
 help / color / mirror / Atom feed
* [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
@ 2023-08-21 10:39 Kamil Paral
  2023-08-21 13:12 ` Mika Westerberg
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Kamil Paral @ 2023-08-21 10:39 UTC (permalink / raw)
  To: linux-pci; +Cc: regressions, mika.westerberg, bhelgaas, chris.chiu

= Summary =
A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock connected
can no longer resume from suspend. The problem was introduced in
e8b908146d44 "PCI/PM: Increase wait time after resume".

= Detailed description =
When running a kernel containing the identified commit and trying to
resume the laptop from sleep, the laptop's power light changes from
blinking (sleep state) to shining (running state), but the display
stays black and it doesn't respond to any keyboard input, nor to
ping/ssh, and no logs are written to the disk (which means I don't
know how to gather error logs). It needs to be force-rebooted. I
bisected the kernel and identified the commit which causes this
behavior. I used the vanilla kernel with a Fedora kernel config.

The reproducer is:
1. Connect the dock to the laptop.
2. Boot the laptop (in my case, to the gdm).
3. Suspend the laptop.
4. Resume the laptop. This is successful before the identified commit
(the last tested good commit was cc8a983d0fce), and unsuccessful
(black screen, frozen system) after the identified commit
(e8b908146d44).

The reproducibility is 100%, I tested it many many times in a row.

When the dock is unplugged, suspend and resume works as expected. When
I connect a different laptop to the dock (Thinkpad P1 gen3), I don't
see any resume failure. So this is somehow related to the particular
combination of Thinkpad T480s and Thinkpad Thunderbolt 3 Dock. The
dock is running the latest firmware. I also tested "pcie_aspm=off",
and that allows the laptop to resume properly, but then there's a race
condition whether devices on the dock are visible to the OS or not
after resume, so this is not useful even just as a workaround.


I already created a downstream Fedora bug report in Red Hat Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=2230357

lspci of the laptop:
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1982541
git bisect log:
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1983351


The commit which broke resume is the following:

e8b908146d44310473e43b3382eca126e12d279c is the first bad commit
commit e8b908146d44310473e43b3382eca126e12d279c
Author: Mika Westerberg <mika.westerberg.com>
Date:   Tue Apr 4 08:27:13 2023 +0300

    PCI/PM: Increase wait time after resume

    PCIe r6.0 sec 6.6.1 prescribes that a device must be able to respond to
    config requests within 1.0 s (PCI_RESET_WAIT) after exiting conventional
    reset and this same delay is prescribed when coming out of D3cold (as that
    involves reset too).

    A device that requires more than 1 second to initialize after reset may
    respond to config requests with Request Retry Status completions (sec
    2.3.1), and we accommodate that in Linux with a 60 second cap
    (PCIE_RESET_READY_POLL_MS).

    Previously we waited up to PCIE_RESET_READY_POLL_MS only in the reset code
    path, not in the resume path.  However, a device has surfaced, namely Intel
    Titan Ridge xHCI, which requires a longer delay also in the resume code
    path.

    Make the resume code path to use this same extended delay as the reset
    path.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=216728
    Link: https://lore.kernel.org/r/20230404052714.51315-2-mika.westerberg@linux.intel.com
    Reported-by: Chris Chiu <chris.chiu>
    Signed-off-by: Mika Westerberg <mika.westerberg.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas>
    Cc: Lukas Wunner <lukas>

 drivers/pci/pci-driver.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


I'm happy to add further details, perform additional debugging, or
test some experimental patches in order to resolve this regression.
Please CC me in your replies, I'm not subscribed to this list. Thank
you!
Kamil Páral


#regzbot introduced: e8b908146d44


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-21 10:39 [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume" Kamil Paral
@ 2023-08-21 13:12 ` Mika Westerberg
  2023-08-22 16:43   ` Kamil Paral
  2023-08-21 19:10 ` Bjorn Helgaas
  2023-11-01 10:59 ` Linux regression tracking (Thorsten Leemhuis)
  2 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-08-21 13:12 UTC (permalink / raw)
  To: Kamil Paral; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

Hi,

On Mon, Aug 21, 2023 at 12:39:02PM +0200, Kamil Paral wrote:
> = Summary =
> A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock connected
> can no longer resume from suspend. The problem was introduced in
> e8b908146d44 "PCI/PM: Increase wait time after resume".

Thanks for the detailed description! There was follow up patch that made
the timeout shorter for slow links, I wonder if you could try that first
and see if that changes anything? It is this commit in the mainline:

7b3ba09febf4 PCI/PM: Shorten pci_bridge_wait_for_secondary_bus() wait time for slow links

That should at least make the resume faster too but let's see.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-21 10:39 [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume" Kamil Paral
  2023-08-21 13:12 ` Mika Westerberg
@ 2023-08-21 19:10 ` Bjorn Helgaas
  2023-08-22 16:36   ` Kamil Paral
  2023-11-01 10:59 ` Linux regression tracking (Thorsten Leemhuis)
  2 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-08-21 19:10 UTC (permalink / raw)
  To: Kamil Paral; +Cc: linux-pci, regressions, mika.westerberg, bhelgaas, chris.chiu

On Mon, Aug 21, 2023 at 12:39:02PM +0200, Kamil Paral wrote:
> = Summary =
> A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock connected
> can no longer resume from suspend. The problem was introduced in
> e8b908146d44 "PCI/PM: Increase wait time after resume".
> 
> = Detailed description =
> When running a kernel containing the identified commit and trying to
> resume the laptop from sleep, the laptop's power light changes from
> blinking (sleep state) to shining (running state), but the display
> stays black and it doesn't respond to any keyboard input, nor to
> ping/ssh, and no logs are written to the disk (which means I don't
> know how to gather error logs). It needs to be force-rebooted. I
> bisected the kernel and identified the commit which causes this
> behavior. I used the vanilla kernel with a Fedora kernel config.
> 
> The reproducer is:
> 1. Connect the dock to the laptop.
> 2. Boot the laptop (in my case, to the gdm).
> 3. Suspend the laptop.
> 4. Resume the laptop. This is successful before the identified commit
> (the last tested good commit was cc8a983d0fce), and unsuccessful
> (black screen, frozen system) after the identified commit
> (e8b908146d44).
> 
> The reproducibility is 100%, I tested it many many times in a row.
> 
> When the dock is unplugged, suspend and resume works as expected. When
> I connect a different laptop to the dock (Thinkpad P1 gen3), I don't
> see any resume failure. So this is somehow related to the particular
> combination of Thinkpad T480s and Thinkpad Thunderbolt 3 Dock. The
> dock is running the latest firmware.

Thanks very much for the report.

Wow, this is super interesting.  e8b908146d44 literally just increases
a timeout; the complete patch is:

   static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
   {
  -       pci_bridge_wait_for_secondary_bus(pci_dev, "resume", PCI_RESET_WAIT);
  +       pci_bridge_wait_for_secondary_bus(pci_dev, "resume",
  +                                         PCIE_RESET_READY_POLL_MS);

Increasing a timeout should never cause a failure like this, so
there must be something really unexpected going on.

> I also tested "pcie_aspm=off",
> and that allows the laptop to resume properly, but then there's a race
> condition whether devices on the dock are visible to the OS or not
> after resume, so this is not useful even just as a workaround.

Wow, also shocking.  I see you have lspci output attached to the RH
bugzilla, but it doesn't include the details.  Would you mind
collecting the output of "sudo lspci -vv" both with and without
"pcie_aspm=off"?  No need to try suspend/resume to collect these.

Also, what does this race condition look like?  Dock devices are
visible before suspend, but sometimes none of them are visible *after*
resume?  We don't re-enumerate on resume, so does this mean they still
appear in lspci output but they just don't work?

If you can collect the complete dmesg log and "sudo lspci -vv" output
after a resume when the dock devices aren't visible, maybe there would
be a clue there.

> I already created a downstream Fedora bug report in Red Hat Bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=2230357
> 
> lspci of the laptop:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1982541
> git bisect log:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1983351
> 
> 
> The commit which broke resume is the following:
> 
> e8b908146d44310473e43b3382eca126e12d279c is the first bad commit
> commit e8b908146d44310473e43b3382eca126e12d279c
> Author: Mika Westerberg <mika.westerberg.com>
> Date:   Tue Apr 4 08:27:13 2023 +0300
> 
>     PCI/PM: Increase wait time after resume
> 
>     PCIe r6.0 sec 6.6.1 prescribes that a device must be able to respond to
>     config requests within 1.0 s (PCI_RESET_WAIT) after exiting conventional
>     reset and this same delay is prescribed when coming out of D3cold (as that
>     involves reset too).
> 
>     A device that requires more than 1 second to initialize after reset may
>     respond to config requests with Request Retry Status completions (sec
>     2.3.1), and we accommodate that in Linux with a 60 second cap
>     (PCIE_RESET_READY_POLL_MS).
> 
>     Previously we waited up to PCIE_RESET_READY_POLL_MS only in the reset code
>     path, not in the resume path.  However, a device has surfaced, namely Intel
>     Titan Ridge xHCI, which requires a longer delay also in the resume code
>     path.
> 
>     Make the resume code path to use this same extended delay as the reset
>     path.
> 
>     Link: https://bugzilla.kernel.org/show_bug.cgi?id=216728
>     Link: https://lore.kernel.org/r/20230404052714.51315-2-mika.westerberg@linux.intel.com
>     Reported-by: Chris Chiu <chris.chiu>
>     Signed-off-by: Mika Westerberg <mika.westerberg.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas>
>     Cc: Lukas Wunner <lukas>
> 
>  drivers/pci/pci-driver.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> 
> I'm happy to add further details, perform additional debugging, or
> test some experimental patches in order to resolve this regression.
> Please CC me in your replies, I'm not subscribed to this list. Thank
> you!
> Kamil Páral
> 
> 
> #regzbot introduced: e8b908146d44
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-21 19:10 ` Bjorn Helgaas
@ 2023-08-22 16:36   ` Kamil Paral
  0 siblings, 0 replies; 44+ messages in thread
From: Kamil Paral @ 2023-08-22 16:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, regressions, mika.westerberg, bhelgaas, chris.chiu

On Mon, Aug 21, 2023 at 9:20 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> Wow, this is super interesting.  e8b908146d44 literally just increases
> a timeout; the complete patch is:
>
>    static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
>    {
>   -       pci_bridge_wait_for_secondary_bus(pci_dev, "resume", PCI_RESET_WAIT);
>   +       pci_bridge_wait_for_secondary_bus(pci_dev, "resume",
>   +                                         PCIE_RESET_READY_POLL_MS);
>
> Increasing a timeout should never cause a failure like this, so
> there must be something really unexpected going on.

Hello Bjorn, thanks for a quick response.

Your reply helped me discover that the laptop doesn't really *fail* to
resume, it just makes the resume much *longer*. I just never waited
that long. PCI_RESET_WAIT is 1 second, PCIE_RESET_READY_POLL_MS is 60
seconds. If I wait long enough, the laptop finally resumes correctly
after roughly 70 seconds (before the patch the resume took roughly 5
seconds). Sorry for not spotting that earlier!

I also tested this with the current git master tip (commit
f7757129e3de). Without any adjustments, the resume delay is roughly 70
seconds. But if I change PCIE_RESET_READY_POLL_MS from 60 seconds to 2
seconds and recompile it, the resume delay is roughly 6 seconds.

With the latest kernel f7757129e3de, here are some debugging logs:
* dmesg collected after delayed resume (extra 60 seconds):
  https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984636
* system journal after delayed resume:
  https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984637
* lspci -vv before suspend:
  https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984638
* lspci -vv after delayed resume:
  https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984639


> Would you mind
> collecting the output of "sudo lspci -vv" both with and without
> "pcie_aspm=off"?  No need to try suspend/resume to collect these.
>
> Also, what does this race condition look like?  Dock devices are
> visible before suspend, but sometimes none of them are visible *after*
> resume?  We don't re-enumerate on resume, so does this mean they still
> appear in lspci output but they just don't work?

I didn't manage to debug this today. Given the newly discovered
circumstances described above, I wonder whether your request still
applies. If it does, I can provide it tomorrow.

Thanks for looking into this,
Kamil Páral


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-21 13:12 ` Mika Westerberg
@ 2023-08-22 16:43   ` Kamil Paral
  2023-08-23  5:07     ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Kamil Paral @ 2023-08-22 16:43 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

On Mon, Aug 21, 2023 at 3:13 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
> Thanks for the detailed description! There was follow up patch that made
> the timeout shorter for slow links, I wonder if you could try that first
> and see if that changes anything? It is this commit in the mainline:
>
> 7b3ba09febf4 PCI/PM: Shorten pci_bridge_wait_for_secondary_bus() wait time for slow links
>

Hello Mika, thanks for a quick response. Please see my reply to Bjorn,
the resume is just 60+ seconds delayed, not completely failing, I was
wrong about that. I attached my logs there.

To respond to your question, I applied 7b3ba09febf4 on top of
fe15c26ee26e (which is the parent of e8b908146d44) and tested that.
Hopefully I understood the instructions correctly. That kernel resumes
just fine, with the expected speed (~5 seconds), no extra delay. (If
there's a speed-up, I can't really tell).

Thank you for looking into this.
Kamil Páral


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-22 16:43   ` Kamil Paral
@ 2023-08-23  5:07     ` Mika Westerberg
  2023-08-23  7:00       ` Kamil Paral
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-08-23  5:07 UTC (permalink / raw)
  To: Kamil Paral; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

Hi,

On Tue, Aug 22, 2023 at 06:43:54PM +0200, Kamil Paral wrote:
> On Mon, Aug 21, 2023 at 3:13 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> > Thanks for the detailed description! There was follow up patch that made
> > the timeout shorter for slow links, I wonder if you could try that first
> > and see if that changes anything? It is this commit in the mainline:
> >
> > 7b3ba09febf4 PCI/PM: Shorten pci_bridge_wait_for_secondary_bus() wait time for slow links
> >
> 
> Hello Mika, thanks for a quick response. Please see my reply to Bjorn,
> the resume is just 60+ seconds delayed, not completely failing, I was
> wrong about that. I attached my logs there.
> 
> To respond to your question, I applied 7b3ba09febf4 on top of
> fe15c26ee26e (which is the parent of e8b908146d44) and tested that.
> Hopefully I understood the instructions correctly. That kernel resumes
> just fine, with the expected speed (~5 seconds), no extra delay. (If
> there's a speed-up, I can't really tell).

Okay if I understand correctly with commit 7b3ba09febf4 things works as
before and you don't see any delays in resume?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-23  5:07     ` Mika Westerberg
@ 2023-08-23  7:00       ` Kamil Paral
  2023-08-23  7:44         ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Kamil Paral @ 2023-08-23  7:00 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Aug 23, 2023 at 7:07 AM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
> Okay if I understand correctly with commit 7b3ba09febf4 things works as
> before and you don't see any delays in resume?

If commit 7b3ba09febf4 is included but commit e8b908146d44 is
excluded, the resume works as usual. Once e8b908146d44 is included (it
doesn't matter whether 7b3ba09febf4 is included or not), the resume
gets delayed by ~60 seconds.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-23  7:00       ` Kamil Paral
@ 2023-08-23  7:44         ` Mika Westerberg
  2023-08-23  7:56           ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-08-23  7:44 UTC (permalink / raw)
  To: Kamil Paral; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Aug 23, 2023 at 09:00:39AM +0200, Kamil Paral wrote:
> On Wed, Aug 23, 2023 at 7:07 AM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> > Okay if I understand correctly with commit 7b3ba09febf4 things works as
> > before and you don't see any delays in resume?
> 
> If commit 7b3ba09febf4 is included but commit e8b908146d44 is
> excluded, the resume works as usual. Once e8b908146d44 is included (it
> doesn't matter whether 7b3ba09febf4 is included or not), the resume
> gets delayed by ~60 seconds.

Okay thanks for clarifying. The ~60s means that the PCIe link does not
come up and this is unexpected so most likely there is something else
going on during resume. I will check the logs you shared if there is
something but what is expected is that regardless of the timeout the
PCIe tunnel gets established to the dock pretty quickly and the OS does
not notice that the link is even down.

I guess even without the 60s delay you see in the logs that the PCIe
link is down and Linux starts tearing the device stack towards the dock
on resume?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-23  7:44         ` Mika Westerberg
@ 2023-08-23  7:56           ` Mika Westerberg
  2023-08-23  8:20             ` Kamil Paral
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-08-23  7:56 UTC (permalink / raw)
  To: Kamil Paral; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Aug 23, 2023 at 10:44:47AM +0300, Mika Westerberg wrote:
> I guess even without the 60s delay you see in the logs that the PCIe
> link is down and Linux starts tearing the device stack towards the dock
> on resume?

Okay this is what happens (from your log):

srp 22 18:19:50 fenix kernel: pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
srp 22 18:19:50 fenix kernel: pcieport 0000:07:00.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: pcieport 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: pcieport 0000:08:04.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: pcieport 0000:08:01.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: pcieport 0000:08:03.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: pcieport 0000:08:02.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: not ready 1023ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: not ready 1023ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: not ready 2047ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: not ready 2047ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: not ready 4095ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: not ready 4095ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: not ready 8191ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: not ready 8191ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: not ready 16383ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: not ready 16383ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: not ready 32767ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: not ready 32767ms after resume; waiting
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: not ready 65535ms after resume; giving up
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: not ready 65535ms after resume; giving up
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: ACPI: EC: event unblocked
srp 22 18:19:50 fenix kernel: i915 0000:00:02.0: [drm] [ENCODER:94:DDI A/PHY A] is disabled/in DSI mode with an ungated DDI clock, gate it
srp 22 18:19:50 fenix kernel: i915 0000:00:02.0: [drm] [ENCODER:102:DDI B/PHY B] is disabled/in DSI mode with an ungated DDI clock, gate it
srp 22 18:19:50 fenix kernel: i915 0000:00:02.0: [drm] [ENCODER:118:DDI C/PHY C] is disabled/in DSI mode with an ungated DDI clock, gate it
srp 22 18:19:50 fenix kernel: pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
srp 22 18:19:50 fenix kernel: pcieport 0000:08:04.0: pciehp: pcie_do_write_cmd: no response from device
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: Controller not ready at resume -19
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: PCI post-resume error -19!
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: HC died; cleaning up
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: Controller not ready at resume -19
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: PCI post-resume error -19!
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: HC died; cleaning up
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xf0 returns -19
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xf0 returns -19
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:09:00.0: PM: failed to resume async: error -19
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:0b:00.0: PM: failed to resume async: error -19
srp 22 18:19:50 fenix kernel: xhci_hcd 0000:3c:00.0: xHC error in resume, USBSTS 0x401, Reinit

So directly after resume the PCIe link (tunnel) from the Thunderbolt host router
PCIe downstream port does not get re-established and this brings down
the whole device hierarchy behind that port. The delay just made it take
longer but the actual problem is not the delay but why the tunnel is not
re-established properly.

Next question is that what's the Thunderbolt firmware version? You can
check it throughs sysfs: /sys/bus/thunderbolt/devices/0-0/nvm_version. I
see the BIOS you have seems to be quite recent (06/12/2023) so that
probably should be good enough.

@Chris Chiu, have you guys run Linux on that particular Lenovo system?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-23  7:56           ` Mika Westerberg
@ 2023-08-23  8:20             ` Kamil Paral
  2023-08-23  9:05               ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Kamil Paral @ 2023-08-23  8:20 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Aug 23, 2023 at 9:56 AM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
> So directly after resume the PCIe link (tunnel) from the Thunderbolt host router
> PCIe downstream port does not get re-established and this brings down
> the whole device hierarchy behind that port. The delay just made it take
> longer but the actual problem is not the delay but why the tunnel is not
> re-established properly.

If you want to compare it to a "fast" resume (~5 sec, before commit
e8b908146d44), here's dmesg:
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984726

Even when the resume is fast, it takes a few extra seconds before the
devices on the dock are usable in the OS. For example, my USB mouse
connected to the dock doesn't work immediately, I have to wait a few
more seconds. The ethernet on the dock also reconnects only after a
few extra seconds.

> Next question is that what's the Thunderbolt firmware version? You can
> check it throughs sysfs: /sys/bus/thunderbolt/devices/0-0/nvm_version

$ sudo cat /sys/bus/thunderbolt/devices/0-0/nvm_version
20.0

Here's whole `fwupdmgr get-devices` output, if that helps:
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984728

Before reporting this bug, I updated the firmware on the Dock itself
to the latest version (had to use Windows for that). The dock should
have now this firmware:
https://pcsupport.lenovo.com/us/en/downloads/DS506176
Which is:
    Tool package V1.0.25
    TBT FW: C44
    PD FW: 1.38.07
    DP hub: 3.13.005
    Audio: 04-0E-87_Rev_0087
according to the Readme file. That seems to match the "44" version in
the fwupdmgr output.

> I
> see the BIOS you have seems to be quite recent (06/12/2023) so that
> probably should be good enough.

Lenovo seems to support it through LVFS, so that's what I use for
updating the BIOS. Version 0.1.54 was updated quite recently and it
seems to be also the latest version they have on their website.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-23  8:20             ` Kamil Paral
@ 2023-08-23  9:05               ` Mika Westerberg
  2023-08-23 14:02                 ` Kamil Paral
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-08-23  9:05 UTC (permalink / raw)
  To: Kamil Paral; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

Hi,

On Wed, Aug 23, 2023 at 10:20:52AM +0200, Kamil Paral wrote:
> On Wed, Aug 23, 2023 at 9:56 AM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> > So directly after resume the PCIe link (tunnel) from the Thunderbolt host router
> > PCIe downstream port does not get re-established and this brings down
> > the whole device hierarchy behind that port. The delay just made it take
> > longer but the actual problem is not the delay but why the tunnel is not
> > re-established properly.
> 
> If you want to compare it to a "fast" resume (~5 sec, before commit
> e8b908146d44), here's dmesg:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984726
> 
> Even when the resume is fast, it takes a few extra seconds before the
> devices on the dock are usable in the OS. For example, my USB mouse
> connected to the dock doesn't work immediately, I have to wait a few
> more seconds. The ethernet on the dock also reconnects only after a
> few extra seconds.

Yes, this is exactly the issue. The PCIe tunnel is not up and that makes
the OS to remove the devices during resume. The delay just makes it more
"visible".

> > Next question is that what's the Thunderbolt firmware version? You can
> > check it throughs sysfs: /sys/bus/thunderbolt/devices/0-0/nvm_version
> 
> $ sudo cat /sys/bus/thunderbolt/devices/0-0/nvm_version
> 20.0

OK, this is "Alpine Ridge Low Power" and for that the firmware seems to
be quite recent.

> Here's whole `fwupdmgr get-devices` output, if that helps:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984728
> 
> Before reporting this bug, I updated the firmware on the Dock itself
> to the latest version (had to use Windows for that). The dock should
> have now this firmware:
> https://pcsupport.lenovo.com/us/en/downloads/DS506176
> Which is:
>     Tool package V1.0.25
>     TBT FW: C44
>     PD FW: 1.38.07
>     DP hub: 3.13.005
>     Audio: 04-0E-87_Rev_0087
> according to the Readme file. That seems to match the "44" version in
> the fwupdmgr output.

[You could do that in Linux too but Lenovo does not seem to ship the
firmware through LVFS.]

> > I
> > see the BIOS you have seems to be quite recent (06/12/2023) so that
> > probably should be good enough.
> 
> Lenovo seems to support it through LVFS, so that's what I use for
> updating the BIOS. Version 0.1.54 was updated quite recently and it
> seems to be also the latest version they have on their website.

OK. Did you change any BIOS settings from the defaults that might have
affect on this? Sometimes these are exposed to through the BIOS menu and
the user can change those (Lenovo typically does not, expose them
though).

Can you also attach output of acpidump to and dmesg with
"thunderbolt.dyndbg=+p" in the command line?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-23  9:05               ` Mika Westerberg
@ 2023-08-23 14:02                 ` Kamil Paral
  2023-08-24 11:43                   ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Kamil Paral @ 2023-08-23 14:02 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Aug 23, 2023 at 11:05 AM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
> OK. Did you change any BIOS settings from the defaults that might have
> affect on this? Sometimes these are exposed to through the BIOS menu and
> the user can change those (Lenovo typically does not, expose them
> though).

These BIOS pages seem they could be relevant:
https://imgur.com/a/vEltxpj

And heureka, "Thunderbolt BIOS Assist Mode" affects this! It was
disabled (not sure if that's the default or I changed it before).
After I enable it, the resume 60+ second delay is gone, it resumes in
standard ~5 seconds. The dock devices (mouse, LAN) still take a few
extra seconds to activate. The dmesg for a resume with TB assist mode
is here:
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984785

Unfortunately, it seems to have its own quirks. There seems to be
something like ~20% chance that the dock devices are no longer
available after resume. I see the dock as connected in boltctl, but I
no longer see the devices in lsusb. Reconnecting the dock doesn't
help, more suspend&resume cycles don't help, only system reboot helps.
I captured this situation in this dmesg (there are multiple resume
events, the devices disappear after the last one):
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984786

This might be the same problem that I described regarding
pcie_aspm=off to Bjorn, but I'd need to check. Either way, this wasn't
happening when the TB assist mode was disabled.

> Can you also attach output of acpidump to and dmesg with
> "thunderbolt.dyndbg=+p" in the command line?

acpidump:
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984802

dmesg with "thunderbolt.dyndbg=+p" and one suspend&resume cycle:
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984803

I'm currently running kernel 6.4.8 packaged in Fedora 38 for these
experiments, I hope that's OK. If needed, I can switch to the latest
kernel.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-23 14:02                 ` Kamil Paral
@ 2023-08-24 11:43                   ` Mika Westerberg
  2023-08-25  8:42                     ` Kamil Paral
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-08-24 11:43 UTC (permalink / raw)
  To: Kamil Paral; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Aug 23, 2023 at 04:02:06PM +0200, Kamil Paral wrote:
> On Wed, Aug 23, 2023 at 11:05 AM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> > OK. Did you change any BIOS settings from the defaults that might have
> > affect on this? Sometimes these are exposed to through the BIOS menu and
> > the user can change those (Lenovo typically does not, expose them
> > though).
> 
> These BIOS pages seem they could be relevant:
> https://imgur.com/a/vEltxpj
> 
> And heureka, "Thunderbolt BIOS Assist Mode" affects this! It was
> disabled (not sure if that's the default or I changed it before).
> After I enable it, the resume 60+ second delay is gone, it resumes in
> standard ~5 seconds. The dock devices (mouse, LAN) still take a few
> extra seconds to activate. The dmesg for a resume with TB assist mode
> is here:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984785

I think the correct setting for this is to have it disabled. This means
it is native PCIe with the runtime D3 support and looking at the ACPI
dump you shared this seems to be the case.

> Unfortunately, it seems to have its own quirks. There seems to be
> something like ~20% chance that the dock devices are no longer
> available after resume. I see the dock as connected in boltctl, but I
> no longer see the devices in lsusb. Reconnecting the dock doesn't
> help, more suspend&resume cycles don't help, only system reboot helps.
> I captured this situation in this dmesg (there are multiple resume
> events, the devices disappear after the last one):
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984786
> 
> This might be the same problem that I described regarding
> pcie_aspm=off to Bjorn, but I'd need to check. Either way, this wasn't
> happening when the TB assist mode was disabled.
> 
> > Can you also attach output of acpidump to and dmesg with
> > "thunderbolt.dyndbg=+p" in the command line?
> 
> acpidump:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984802
> 
> dmesg with "thunderbolt.dyndbg=+p" and one suspend&resume cycle:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984803
> 
> I'm currently running kernel 6.4.8 packaged in Fedora 38 for these
> experiments, I hope that's OK. If needed, I can switch to the latest
> kernel.

I did not find anything "unusual" in the ACPI dump but I keep looking.

One thing I noticed, probably has nothing to do with this, but you have
the "security level" set to "secure". Now this is fine and actually
recommended but I wonder if anything changes if you switch that
temporarily to "user"? What is happening here is that once the system
enters S3 the Thunderbolt driver tells the firmware to save the
connected device list, and then once it exits S3 it is expected to
re-connect the PCIe tunnels of the devices on that list but this is not
happening and that's why the dock "dissappears" during resume.

In any case we can conclude that the commit in question has nothing to
do with the issue. This is completely Thunderbolt related problem.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-24 11:43                   ` Mika Westerberg
@ 2023-08-25  8:42                     ` Kamil Paral
  2023-08-25  9:46                       ` Mika Westerberg
  2023-09-23 22:46                       ` Bjorn Helgaas
  0 siblings, 2 replies; 44+ messages in thread
From: Kamil Paral @ 2023-08-25  8:42 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

On Thu, Aug 24, 2023 at 1:43 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
> One thing I noticed, probably has nothing to do with this, but you have
> the "security level" set to "secure". Now this is fine and actually
> recommended but I wonder if anything changes if you switch that
> temporarily to "user"? What is happening here is that once the system
> enters S3 the Thunderbolt driver tells the firmware to save the
> connected device list, and then once it exits S3 it is expected to
> re-connect the PCIe tunnels of the devices on that list but this is not
> happening and that's why the dock "dissappears" during resume.

That was a great suggestion. After switching to the user security
level, the resume delay is gone, and my dock devices seem to be
working almost immediately after resume! The dmesg for that is here:
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1985262

I've done tens of cycles and haven't found any race conditions, unlike
with the TB assist mode. (Only once, my USB mouse wasn't working at
all, but that's something that occasionally happens on most docks I've
worked with and seems to be some different issue).

I'm sorry I haven't found this earlier myself. I did try switching
these options, but I bundled it together with enabling the TB assist
mode, which has quirks, so I didn't realize switching just this one
option might have an impact.

> In any case we can conclude that the commit in question has nothing to
> do with the issue. This is completely Thunderbolt related problem.

Considering the information above, does this appear to be a solely
dock-related issue (bugged firmware), or does it make sense to follow
up on this in some different kernel list? I have to say I'm completely
OK with running the laptop using the "user" TB security level, but if
you think I should follow up somewhere to get the "secure" level fixed
(or some workaround applied, etc), I can.

Thanks a lot for all your help debugging this, Mika and Bjorn.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-25  8:42                     ` Kamil Paral
@ 2023-08-25  9:46                       ` Mika Westerberg
  2023-08-25 11:42                         ` Kamil Paral
  2023-09-23 22:46                       ` Bjorn Helgaas
  1 sibling, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-08-25  9:46 UTC (permalink / raw)
  To: Kamil Paral; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

Hi,

On Fri, Aug 25, 2023 at 10:42:55AM +0200, Kamil Paral wrote:
> On Thu, Aug 24, 2023 at 1:43 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> > One thing I noticed, probably has nothing to do with this, but you have
> > the "security level" set to "secure". Now this is fine and actually
> > recommended but I wonder if anything changes if you switch that
> > temporarily to "user"? What is happening here is that once the system
> > enters S3 the Thunderbolt driver tells the firmware to save the
> > connected device list, and then once it exits S3 it is expected to
> > re-connect the PCIe tunnels of the devices on that list but this is not
> > happening and that's why the dock "dissappears" during resume.
> 
> That was a great suggestion. After switching to the user security
> level, the resume delay is gone, and my dock devices seem to be
> working almost immediately after resume! The dmesg for that is here:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1985262

Yeah looks good now.

> I've done tens of cycles and haven't found any race conditions, unlike
> with the TB assist mode. (Only once, my USB mouse wasn't working at
> all, but that's something that occasionally happens on most docks I've
> worked with and seems to be some different issue).
> 
> I'm sorry I haven't found this earlier myself. I did try switching
> these options, but I bundled it together with enabling the TB assist
> mode, which has quirks, so I didn't realize switching just this one
> option might have an impact.
> 
> > In any case we can conclude that the commit in question has nothing to
> > do with the issue. This is completely Thunderbolt related problem.
> 
> Considering the information above, does this appear to be a solely
> dock-related issue (bugged firmware), or does it make sense to follow
> up on this in some different kernel list? I have to say I'm completely
> OK with running the laptop using the "user" TB security level, but if
> you think I should follow up somewhere to get the "secure" level fixed
> (or some workaround applied, etc), I can.

I would start by contacting Lenovo because I suspect this is Thunderbolt
firmware (host side not dock) issue so they may pull a newer one for
this one from Intel.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-25  9:46                       ` Mika Westerberg
@ 2023-08-25 11:42                         ` Kamil Paral
  0 siblings, 0 replies; 44+ messages in thread
From: Kamil Paral @ 2023-08-25 11:42 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: linux-pci, regressions, bhelgaas, chris.chiu

On Fri, Aug 25, 2023 at 11:46 AM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
> I would start by contacting Lenovo because I suspect this is Thunderbolt
> firmware (host side not dock) issue so they may pull a newer one for
> this one from Intel.

Thanks. For anyone who wants to follow this, I did that in my
downstream bug report here:
https://bugzilla.redhat.com/show_bug.cgi?id=2230357#c23

Thanks again for all your help.
Kamil Páral


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-25  8:42                     ` Kamil Paral
  2023-08-25  9:46                       ` Mika Westerberg
@ 2023-09-23 22:46                       ` Bjorn Helgaas
  2023-09-24 13:27                         ` Mika Westerberg
  1 sibling, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-09-23 22:46 UTC (permalink / raw)
  To: Kamil Paral; +Cc: Mika Westerberg, linux-pci, regressions, bhelgaas, chris.chiu

On Fri, Aug 25, 2023 at 10:42:55AM +0200, Kamil Paral wrote:
> On Thu, Aug 24, 2023 at 1:43 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> > One thing I noticed, probably has nothing to do with this, but you have
> > the "security level" set to "secure". Now this is fine and actually
> > recommended but I wonder if anything changes if you switch that
> > temporarily to "user"? What is happening here is that once the system
> > enters S3 the Thunderbolt driver tells the firmware to save the
> > connected device list, and then once it exits S3 it is expected to
> > re-connect the PCIe tunnels of the devices on that list but this is not
> > happening and that's why the dock "dissappears" during resume.
> 
> That was a great suggestion. After switching to the user security
> level, the resume delay is gone, and my dock devices seem to be
> working almost immediately after resume! The dmesg for that is here:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1985262
> 
> I've done tens of cycles and haven't found any race conditions, unlike
> with the TB assist mode. (Only once, my USB mouse wasn't working at
> all, but that's something that occasionally happens on most docks I've
> worked with and seems to be some different issue).
> 
> I'm sorry I haven't found this earlier myself. I did try switching
> these options, but I bundled it together with enabling the TB assist
> mode, which has quirks, so I didn't realize switching just this one
> option might have an impact.
> 
> > In any case we can conclude that the commit in question has nothing to
> > do with the issue. This is completely Thunderbolt related problem.
> 
> Considering the information above, does this appear to be a solely
> dock-related issue (bugged firmware), or does it make sense to follow
> up on this in some different kernel list? I have to say I'm completely
> OK with running the laptop using the "user" TB security level, but if
> you think I should follow up somewhere to get the "secure" level fixed
> (or some workaround applied, etc), I can.

I'm confused about this issue.  Correct me if I go wrong:

The hierarchy is:

  00:1c.4 Root Port to [bus 04-3c]
  04:00.0 Upstream Port (Thunderbolt) to [bus 05-3c]
  05:01.0 Downstream Port (Thunderbolt) to [bus 07-3b]
  07:00.0 Upstream Port (Thunderbolt) to [bus 08-3b]

With security level=secure, before e8b908146d44 ("PCI/PM: Increase
wait time after resume"), resume takes ~5 seconds, but the hierarchy
below 05:01.0 gets removed and re-enumerated (dmesg [1]).  After
e8b908146d44, the same thing happens except the resume takes 60+
seconds (dmesg [2]).  In both cases, the devices (USB mouse, LAN, etc)
below 05:01.0 work after resume.

With security level=user, resume takes << 5 seconds regardless of
e8b908146d44, and the hierarchy below 05:01.0 does not get removed and
re-enumerated (dmesg [3]).

So if that's all accurate, it sounds like we've always had some
problem with security level=secure that causes the hierarchy to get
removed and re-enumerated, and e8b908146d44 just makes this problem
much more visible?

I don't know anything at all about how Thunderbolt security levels
work.  If "secure" means the hierarchy must be re-enumerated after
resume, we can detect that case immediately and get on with it without
having to wait for a timeout?

Bjorn

[1] https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984726
[2] https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984803
[3] https://bugzilla-attachments.redhat.com/attachment.cgi?id=1985262

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-23 22:46                       ` Bjorn Helgaas
@ 2023-09-24 13:27                         ` Mika Westerberg
  2023-09-24 20:18                           ` Bjorn Helgaas
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-09-24 13:27 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Sat, Sep 23, 2023 at 05:46:10PM -0500, Bjorn Helgaas wrote:
> On Fri, Aug 25, 2023 at 10:42:55AM +0200, Kamil Paral wrote:
> > On Thu, Aug 24, 2023 at 1:43 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > > One thing I noticed, probably has nothing to do with this, but you have
> > > the "security level" set to "secure". Now this is fine and actually
> > > recommended but I wonder if anything changes if you switch that
> > > temporarily to "user"? What is happening here is that once the system
> > > enters S3 the Thunderbolt driver tells the firmware to save the
> > > connected device list, and then once it exits S3 it is expected to
> > > re-connect the PCIe tunnels of the devices on that list but this is not
> > > happening and that's why the dock "dissappears" during resume.
> > 
> > That was a great suggestion. After switching to the user security
> > level, the resume delay is gone, and my dock devices seem to be
> > working almost immediately after resume! The dmesg for that is here:
> > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1985262
> > 
> > I've done tens of cycles and haven't found any race conditions, unlike
> > with the TB assist mode. (Only once, my USB mouse wasn't working at
> > all, but that's something that occasionally happens on most docks I've
> > worked with and seems to be some different issue).
> > 
> > I'm sorry I haven't found this earlier myself. I did try switching
> > these options, but I bundled it together with enabling the TB assist
> > mode, which has quirks, so I didn't realize switching just this one
> > option might have an impact.
> > 
> > > In any case we can conclude that the commit in question has nothing to
> > > do with the issue. This is completely Thunderbolt related problem.
> > 
> > Considering the information above, does this appear to be a solely
> > dock-related issue (bugged firmware), or does it make sense to follow
> > up on this in some different kernel list? I have to say I'm completely
> > OK with running the laptop using the "user" TB security level, but if
> > you think I should follow up somewhere to get the "secure" level fixed
> > (or some workaround applied, etc), I can.
> 
> I'm confused about this issue.  Correct me if I go wrong:
> 
> The hierarchy is:
> 
>   00:1c.4 Root Port to [bus 04-3c]
>   04:00.0 Upstream Port (Thunderbolt) to [bus 05-3c]
>   05:01.0 Downstream Port (Thunderbolt) to [bus 07-3b]
>   07:00.0 Upstream Port (Thunderbolt) to [bus 08-3b]
> 
> With security level=secure, before e8b908146d44 ("PCI/PM: Increase
> wait time after resume"), resume takes ~5 seconds, but the hierarchy
> below 05:01.0 gets removed and re-enumerated (dmesg [1]).  After
> e8b908146d44, the same thing happens except the resume takes 60+
> seconds (dmesg [2]).  In both cases, the devices (USB mouse, LAN, etc)
> below 05:01.0 work after resume.
> 
> With security level=user, resume takes << 5 seconds regardless of
> e8b908146d44, and the hierarchy below 05:01.0 does not get removed and
> re-enumerated (dmesg [3]).
> 
> So if that's all accurate, it sounds like we've always had some
> problem with security level=secure that causes the hierarchy to get
> removed and re-enumerated, and e8b908146d44 just makes this problem
> much more visible?

Yes.

> I don't know anything at all about how Thunderbolt security levels
> work.  If "secure" means the hierarchy must be re-enumerated after
> resume, we can detect that case immediately and get on with it without
> having to wait for a timeout?

"secure" means that the Thunderbolt device that is connected can "prove"
it is the device we "authorized". It basically has a random number we
generated flashed on the NVM. This is the security "measure" used before
PC world aligned to use IOMMU instead.

(there is an explanation of all these here:
https://docs.kernel.org/admin-guide/thunderbolt.html#security-levels-and-how-to-use-them).

Now, in case of resume the Thunderbolt firmware is expected to connect
the PCIe tunnel before the OS gets to resume its PCIe stack but this is
not happening in this particular system when the security level is set
to "secure". It could be firmware issue, and also if the BIOS settings
get changed from the defaults it is entirely possible that the system
enters paths that are not fully validated. Yes, changing security level
should definitely work and the PCIe tunnel should be properly
established but in any case this is Thunderbolt issue not PCIe issue.

I don't recall if I suggested this already but if not, try to see there
is a firmware update for your system. Lenovo supports LVFS so if there
is newer one fwupd should allow you to upgrade it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-24 13:27                         ` Mika Westerberg
@ 2023-09-24 20:18                           ` Bjorn Helgaas
  2023-09-25  4:59                             ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-09-24 20:18 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Sun, Sep 24, 2023 at 04:27:22PM +0300, Mika Westerberg wrote:
> On Sat, Sep 23, 2023 at 05:46:10PM -0500, Bjorn Helgaas wrote:
> > On Fri, Aug 25, 2023 at 10:42:55AM +0200, Kamil Paral wrote:
> > > On Thu, Aug 24, 2023 at 1:43 PM Mika Westerberg
> > > <mika.westerberg@linux.intel.com> wrote:
> > > > One thing I noticed, probably has nothing to do with this, but you have
> > > > the "security level" set to "secure". Now this is fine and actually
> > > > recommended but I wonder if anything changes if you switch that
> > > > temporarily to "user"? What is happening here is that once the system
> > > > enters S3 the Thunderbolt driver tells the firmware to save the
> > > > connected device list, and then once it exits S3 it is expected to
> > > > re-connect the PCIe tunnels of the devices on that list but this is not
> > > > happening and that's why the dock "dissappears" during resume.
> > > 
> > > That was a great suggestion. After switching to the user security
> > > level, the resume delay is gone, and my dock devices seem to be
> > > working almost immediately after resume! The dmesg for that is here:
> > > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1985262
> > > 
> > > I've done tens of cycles and haven't found any race conditions, unlike
> > > with the TB assist mode. (Only once, my USB mouse wasn't working at
> > > all, but that's something that occasionally happens on most docks I've
> > > worked with and seems to be some different issue).
> > > 
> > > I'm sorry I haven't found this earlier myself. I did try switching
> > > these options, but I bundled it together with enabling the TB assist
> > > mode, which has quirks, so I didn't realize switching just this one
> > > option might have an impact.
> > > 
> > > > In any case we can conclude that the commit in question has nothing to
> > > > do with the issue. This is completely Thunderbolt related problem.
> > > 
> > > Considering the information above, does this appear to be a solely
> > > dock-related issue (bugged firmware), or does it make sense to follow
> > > up on this in some different kernel list? I have to say I'm completely
> > > OK with running the laptop using the "user" TB security level, but if
> > > you think I should follow up somewhere to get the "secure" level fixed
> > > (or some workaround applied, etc), I can.
> > 
> > I'm confused about this issue.  Correct me if I go wrong:
> > 
> > The hierarchy is:
> > 
> >   00:1c.4 Root Port to [bus 04-3c]
> >   04:00.0 Upstream Port (Thunderbolt) to [bus 05-3c]
> >   05:01.0 Downstream Port (Thunderbolt) to [bus 07-3b]
> >   07:00.0 Upstream Port (Thunderbolt) to [bus 08-3b]
> > 
> > With security level=secure, before e8b908146d44 ("PCI/PM: Increase
> > wait time after resume"), resume takes ~5 seconds, but the hierarchy
> > below 05:01.0 gets removed and re-enumerated (dmesg [1]).  After
> > e8b908146d44, the same thing happens except the resume takes 60+
> > seconds (dmesg [2]).  In both cases, the devices (USB mouse, LAN, etc)
> > below 05:01.0 work after resume.
> > 
> > With security level=user, resume takes << 5 seconds regardless of
> > e8b908146d44, and the hierarchy below 05:01.0 does not get removed and
> > re-enumerated (dmesg [3]).
> > 
> > So if that's all accurate, it sounds like we've always had some
> > problem with security level=secure that causes the hierarchy to get
> > removed and re-enumerated, and e8b908146d44 just makes this problem
> > much more visible?
> 
> Yes.
> 
> > I don't know anything at all about how Thunderbolt security levels
> > work.  If "secure" means the hierarchy must be re-enumerated after
> > resume, we can detect that case immediately and get on with it without
> > having to wait for a timeout?
> 
> "secure" means that the Thunderbolt device that is connected can "prove"
> it is the device we "authorized". It basically has a random number we
> generated flashed on the NVM. This is the security "measure" used before
> PC world aligned to use IOMMU instead.
> 
> (there is an explanation of all these here:
> https://docs.kernel.org/admin-guide/thunderbolt.html#security-levels-and-how-to-use-them).

So is there some user-level software that runs between the removal and
re-enumeration?  Something that authorizes the 07:00.0 Upstream Port?

> Now, in case of resume the Thunderbolt firmware is expected to connect
> the PCIe tunnel before the OS gets to resume its PCIe stack but this is
> not happening in this particular system when the security level is set
> to "secure". It could be firmware issue, and also if the BIOS settings
> get changed from the defaults it is entirely possible that the system
> enters paths that are not fully validated. Yes, changing security level
> should definitely work and the PCIe tunnel should be properly
> established but in any case this is Thunderbolt issue not PCIe issue.
> 
> I don't recall if I suggested this already but if not, try to see there
> is a firmware update for your system. Lenovo supports LVFS so if there
> is newer one fwupd should allow you to upgrade it.

I think you have suggested a firmware update, but I don't think that's
a great solution for most users.  An ordinary user who has the
security level set to "secure" and updates to a v6.4 kernel is just
going to think resume is broken.  That user will not be willing or
able to diagnose it as a security setting that could be changed or
firmware that could be updated.

It would be ideal if we could make "secure" resume as fast as "user"
resume, but at the very least, I think we need to make it no worse
than it was in v6.3.

Bjorn

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-24 20:18                           ` Bjorn Helgaas
@ 2023-09-25  4:59                             ` Mika Westerberg
  2023-09-25 13:48                               ` Bjorn Helgaas
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-09-25  4:59 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Sun, Sep 24, 2023 at 03:18:00PM -0500, Bjorn Helgaas wrote:
> On Sun, Sep 24, 2023 at 04:27:22PM +0300, Mika Westerberg wrote:
> > On Sat, Sep 23, 2023 at 05:46:10PM -0500, Bjorn Helgaas wrote:
> > > On Fri, Aug 25, 2023 at 10:42:55AM +0200, Kamil Paral wrote:
> > > > On Thu, Aug 24, 2023 at 1:43 PM Mika Westerberg
> > > > <mika.westerberg@linux.intel.com> wrote:
> > > > > One thing I noticed, probably has nothing to do with this, but you have
> > > > > the "security level" set to "secure". Now this is fine and actually
> > > > > recommended but I wonder if anything changes if you switch that
> > > > > temporarily to "user"? What is happening here is that once the system
> > > > > enters S3 the Thunderbolt driver tells the firmware to save the
> > > > > connected device list, and then once it exits S3 it is expected to
> > > > > re-connect the PCIe tunnels of the devices on that list but this is not
> > > > > happening and that's why the dock "dissappears" during resume.
> > > > 
> > > > That was a great suggestion. After switching to the user security
> > > > level, the resume delay is gone, and my dock devices seem to be
> > > > working almost immediately after resume! The dmesg for that is here:
> > > > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1985262
> > > > 
> > > > I've done tens of cycles and haven't found any race conditions, unlike
> > > > with the TB assist mode. (Only once, my USB mouse wasn't working at
> > > > all, but that's something that occasionally happens on most docks I've
> > > > worked with and seems to be some different issue).
> > > > 
> > > > I'm sorry I haven't found this earlier myself. I did try switching
> > > > these options, but I bundled it together with enabling the TB assist
> > > > mode, which has quirks, so I didn't realize switching just this one
> > > > option might have an impact.
> > > > 
> > > > > In any case we can conclude that the commit in question has nothing to
> > > > > do with the issue. This is completely Thunderbolt related problem.
> > > > 
> > > > Considering the information above, does this appear to be a solely
> > > > dock-related issue (bugged firmware), or does it make sense to follow
> > > > up on this in some different kernel list? I have to say I'm completely
> > > > OK with running the laptop using the "user" TB security level, but if
> > > > you think I should follow up somewhere to get the "secure" level fixed
> > > > (or some workaround applied, etc), I can.
> > > 
> > > I'm confused about this issue.  Correct me if I go wrong:
> > > 
> > > The hierarchy is:
> > > 
> > >   00:1c.4 Root Port to [bus 04-3c]
> > >   04:00.0 Upstream Port (Thunderbolt) to [bus 05-3c]
> > >   05:01.0 Downstream Port (Thunderbolt) to [bus 07-3b]
> > >   07:00.0 Upstream Port (Thunderbolt) to [bus 08-3b]
> > > 
> > > With security level=secure, before e8b908146d44 ("PCI/PM: Increase
> > > wait time after resume"), resume takes ~5 seconds, but the hierarchy
> > > below 05:01.0 gets removed and re-enumerated (dmesg [1]).  After
> > > e8b908146d44, the same thing happens except the resume takes 60+
> > > seconds (dmesg [2]).  In both cases, the devices (USB mouse, LAN, etc)
> > > below 05:01.0 work after resume.
> > > 
> > > With security level=user, resume takes << 5 seconds regardless of
> > > e8b908146d44, and the hierarchy below 05:01.0 does not get removed and
> > > re-enumerated (dmesg [3]).
> > > 
> > > So if that's all accurate, it sounds like we've always had some
> > > problem with security level=secure that causes the hierarchy to get
> > > removed and re-enumerated, and e8b908146d44 just makes this problem
> > > much more visible?
> > 
> > Yes.
> > 
> > > I don't know anything at all about how Thunderbolt security levels
> > > work.  If "secure" means the hierarchy must be re-enumerated after
> > > resume, we can detect that case immediately and get on with it without
> > > having to wait for a timeout?
> > 
> > "secure" means that the Thunderbolt device that is connected can "prove"
> > it is the device we "authorized". It basically has a random number we
> > generated flashed on the NVM. This is the security "measure" used before
> > PC world aligned to use IOMMU instead.
> > 
> > (there is an explanation of all these here:
> > https://docs.kernel.org/admin-guide/thunderbolt.html#security-levels-and-how-to-use-them).
> 
> So is there some user-level software that runs between the removal and
> re-enumeration?  Something that authorizes the 07:00.0 Upstream Port?

No.

They get "authorized" upon plug by boltd based on user decision and
after that the firmware should keep them authorized as long as the
device is connected, including also resume.

> > Now, in case of resume the Thunderbolt firmware is expected to connect
> > the PCIe tunnel before the OS gets to resume its PCIe stack but this is
> > not happening in this particular system when the security level is set
> > to "secure". It could be firmware issue, and also if the BIOS settings
> > get changed from the defaults it is entirely possible that the system
> > enters paths that are not fully validated. Yes, changing security level
> > should definitely work and the PCIe tunnel should be properly
> > established but in any case this is Thunderbolt issue not PCIe issue.
> > 
> > I don't recall if I suggested this already but if not, try to see there
> > is a firmware update for your system. Lenovo supports LVFS so if there
> > is newer one fwupd should allow you to upgrade it.
> 
> I think you have suggested a firmware update, but I don't think that's
> a great solution for most users.  An ordinary user who has the
> security level set to "secure" and updates to a v6.4 kernel is just
> going to think resume is broken.  That user will not be willing or
> able to diagnose it as a security setting that could be changed or
> firmware that could be updated.

It does not affect every single system there with security level set to
"secure". It is just this one AFAICT. Like I said the firmware is
expected to connect the PCIe tunnel (and for some unknown reason in this
particular system it does not).

> It would be ideal if we could make "secure" resume as fast as "user"
> resume, but at the very least, I think we need to make it no worse
> than it was in v6.3.

Well what else can we do here? The link goes down regardless what we do
in the PCI stack. If you want to revert the patch that caused the delay,
fine but that does not cure the problem tha thet device stack get torn
down upon resume. For instance if there is USB stick connected to the
dock and the filesystem is mounted, all that is gone upon resume
regardless of the delay. If you want to detect the "secure" vs. "user"
fine feel free to do so but keep in mind that there are other systems
out there where this works just fine so avoid breaking them.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-25  4:59                             ` Mika Westerberg
@ 2023-09-25 13:48                               ` Bjorn Helgaas
  2023-09-25 14:19                                 ` Lukas Wunner
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-09-25 13:48 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Mon, Sep 25, 2023 at 07:59:28AM +0300, Mika Westerberg wrote:
> On Sun, Sep 24, 2023 at 03:18:00PM -0500, Bjorn Helgaas wrote:
> ...

> > So is there some user-level software that runs between the removal and
> > re-enumeration?  Something that authorizes the 07:00.0 Upstream Port?
> 
> No.
> 
> They get "authorized" upon plug by boltd based on user decision and
> after that the firmware should keep them authorized as long as the
> device is connected, including also resume.

I'm trying to understand the events involved in bringing that link up.
When we resume, evidently the link is not up, and it doesn't come up
by itself no matter how long we wait.  From the log at [1],

  [    0.494521] pci 0000:05:01.0: PCI bridge to [bus 07-3b]

The hierarchy that's a problem is bus 07-3b.

  [  117.485355] ACPI: PM: Preparing to enter system sleep state S3

Suspended.

  [  117.528216] ACPI: PM: Waking up from system sleep state S3
  [  117.606664] ACPI: EC: interrupt unblocked
  [  118.915870] thunderbolt 0000:06:00.0: control channel starting...

Resuming.

  [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
  [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
  [  190.376024] pci_bus 0000:09: busn_res: [bus 09] is released
  [  190.376089] pci_bus 0000:0a: busn_res: [bus 0a] is released
  [  190.376168] pci_bus 0000:0b: busn_res: [bus 0b] is released
  [  190.376850] pci_bus 0000:0c: busn_res: [bus 0c] is released
  [  190.376883] pci_bus 0000:0d: busn_res: [bus 0d-3b] is released
  [  190.376912] pci_bus 0000:08: busn_res: [bus 08-3b] is released

Link from 05:01.0 didn't come up, so pciehp thinks the "slot" is empty
and we removed the problem hierarchy (bus 07-3b).

  [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
  [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
  [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
  [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
  [  191.943518] pcieport 0000:05:01.0: pciehp: Slot(1): Link Up
  [  192.074593] pci 0000:07:00.0: [8086:15d3] type 01 class 0x060400

Now pciehp thinks the slot is occupied and the link is up, so we
re-enumerate the hierarchy.  Is this because thunderbolt did something
to 06:00.0 that made the link from 05:01.0 come up?

Does Windows suffer from the same 60+ second resume time on this
platform?

Bjorn

[1] https://bugzilla-attachments.redhat.com/attachment.cgi?id=1984803

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-25 13:48                               ` Bjorn Helgaas
@ 2023-09-25 14:19                                 ` Lukas Wunner
  2023-09-26 17:55                                   ` Bjorn Helgaas
  0 siblings, 1 reply; 44+ messages in thread
From: Lukas Wunner @ 2023-09-25 14:19 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Mika Westerberg, Kamil Paral, linux-pci, regressions, bhelgaas,
	chris.chiu

On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> Now pciehp thinks the slot is occupied and the link is up, so we
> re-enumerate the hierarchy.  Is this because thunderbolt did something
> to 06:00.0 that made the link from 05:01.0 come up?

PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
alongside DisplayPort and other data over the same physical link.

For this to work, PCIe tunnels need to be set up between the Thunderbolt
host controller and attached devices.  Once a tunnel is established,
the PCIe link magically goes up and TLPs can be transmitted.

There are two ways to establish those tunnels:

1/ By a firmware in the Thunderbolt host controller.
   (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)

2/ Natively by the kernel.
   (software connection manager)

I'm assuming that the laptop in question exclusively uses the firmware
connection manager, hence the kernel is reliant on that firmware to
establish tunnels and can't really do anything if it fails to do so.

IMO it goes to show that the native approach is more robust.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-25 14:19                                 ` Lukas Wunner
@ 2023-09-26 17:55                                   ` Bjorn Helgaas
  2023-09-27  5:16                                     ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-09-26 17:55 UTC (permalink / raw)
  To: Lukas Wunner, Mika Westerberg
  Cc: Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > Now pciehp thinks the slot is occupied and the link is up, so we
> > re-enumerate the hierarchy.  Is this because thunderbolt did something
> > to 06:00.0 that made the link from 05:01.0 come up?
> 
> PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> alongside DisplayPort and other data over the same physical link.
> 
> For this to work, PCIe tunnels need to be set up between the Thunderbolt
> host controller and attached devices.  Once a tunnel is established,
> the PCIe link magically goes up and TLPs can be transmitted.
> 
> There are two ways to establish those tunnels:
> 
> 1/ By a firmware in the Thunderbolt host controller.
>    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> 
> 2/ Natively by the kernel.
>    (software connection manager)
> 
> I'm assuming that the laptop in question exclusively uses the firmware
> connection manager, hence the kernel is reliant on that firmware to
> establish tunnels and can't really do anything if it fails to do so.

Thanks for the background; that improves my meager understanding a
lot.

Since this seems to be a firmware issue, it does sound like this
laptop uses a firmware connection manager.  But there still seems to
be some kernel connection because pre-e8b908146d44, the link came up
in <5 seconds, and after the minor e8b908146d44 change, it takes >60
seconds.

I'm kind of at a loss here because I don't have a clear path forward.
What I'm hearing is that the real fix is a firmware update or a BIOS
setting change (Thunderbolt "user" instead of "secure" mode).

That is problematic for users, who will think resume got broken and
they don't know how to fix it.  It's problematic for me, because it
*looks* like a PCI issue and a PCI change exposed it, so I'll have to
deal with the reports.

Bjorn

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-26 17:55                                   ` Bjorn Helgaas
@ 2023-09-27  5:16                                     ` Mika Westerberg
  2023-09-27 11:57                                       ` Bjorn Helgaas
       [not found]                                       ` <CA+cBOTds9k1Q2haC_gTpsUvjP02dHOv9vSconFEAu-Fsxwf36A@mail.gmail.com>
  0 siblings, 2 replies; 44+ messages in thread
From: Mika Westerberg @ 2023-09-27  5:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Tue, Sep 26, 2023 at 12:55:30PM -0500, Bjorn Helgaas wrote:
> On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> > On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > > Now pciehp thinks the slot is occupied and the link is up, so we
> > > re-enumerate the hierarchy.  Is this because thunderbolt did something
> > > to 06:00.0 that made the link from 05:01.0 come up?
> > 
> > PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> > alongside DisplayPort and other data over the same physical link.
> > 
> > For this to work, PCIe tunnels need to be set up between the Thunderbolt
> > host controller and attached devices.  Once a tunnel is established,
> > the PCIe link magically goes up and TLPs can be transmitted.
> > 
> > There are two ways to establish those tunnels:
> > 
> > 1/ By a firmware in the Thunderbolt host controller.
> >    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> > 
> > 2/ Natively by the kernel.
> >    (software connection manager)
> > 
> > I'm assuming that the laptop in question exclusively uses the firmware
> > connection manager, hence the kernel is reliant on that firmware to
> > establish tunnels and can't really do anything if it fails to do so.
> 
> Thanks for the background; that improves my meager understanding a
> lot.
> 
> Since this seems to be a firmware issue, it does sound like this
> laptop uses a firmware connection manager.  But there still seems to
> be some kernel connection because pre-e8b908146d44, the link came up
> in <5 seconds, and after the minor e8b908146d44 change, it takes >60
> seconds.

In both cases (with or without) the commit what happens is that after
resume is finished the firmware connection manager notices the
connection, announces it to the Thunderbolt driver that exposes it to
the userspace where boltd re-authorizes the device. This brings up the
PCIe tunnel again and things get working.

(What is expected to happen is that during the resume the firmware
 connection manager re-connects the PCIe tunnel.)

This took previously the ~5s before resume is complete so that the above
steps can happen where as after the commit it got delayed more up to the
arbitrary ~60s because we started to use that with the commit
e8b908146d44 (PCIE_RESET_READY_POLL_MS).

> I'm kind of at a loss here because I don't have a clear path forward.
> What I'm hearing is that the real fix is a firmware update or a BIOS
> setting change (Thunderbolt "user" instead of "secure" mode).

There are lots of firmares involved so, say if any of them are turned
from the default value the system may enter code paths that are not
fully validated unfortunately.

I would also try to change all the BIOS settings back to defaults, see
that it works (it is probably in "user" security level then), then
switch back to "secure" (only change this one option) and try if it now
works. It could be that some setting just did not get commited properly.

> That is problematic for users, who will think resume got broken and
> they don't know how to fix it.  It's problematic for me, because it
> *looks* like a PCI issue and a PCI change exposed it, so I'll have to
> deal with the reports.

I'm sorry about that. Trying best I can to remedy this.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27  5:16                                     ` Mika Westerberg
@ 2023-09-27 11:57                                       ` Bjorn Helgaas
  2023-09-27 12:47                                         ` Mika Westerberg
       [not found]                                       ` <CA+cBOTds9k1Q2haC_gTpsUvjP02dHOv9vSconFEAu-Fsxwf36A@mail.gmail.com>
  1 sibling, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-09-27 11:57 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 08:16:02AM +0300, Mika Westerberg wrote:
> On Tue, Sep 26, 2023 at 12:55:30PM -0500, Bjorn Helgaas wrote:
> > On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> > > On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > > > Now pciehp thinks the slot is occupied and the link is up, so we
> > > > re-enumerate the hierarchy.  Is this because thunderbolt did something
> > > > to 06:00.0 that made the link from 05:01.0 come up?
> > > 
> > > PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> > > alongside DisplayPort and other data over the same physical link.
> > > 
> > > For this to work, PCIe tunnels need to be set up between the Thunderbolt
> > > host controller and attached devices.  Once a tunnel is established,
> > > the PCIe link magically goes up and TLPs can be transmitted.
> > > 
> > > There are two ways to establish those tunnels:
> > > 
> > > 1/ By a firmware in the Thunderbolt host controller.
> > >    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> > > 
> > > 2/ Natively by the kernel.
> > >    (software connection manager)
> > > 
> > > I'm assuming that the laptop in question exclusively uses the firmware
> > > connection manager, hence the kernel is reliant on that firmware to
> > > establish tunnels and can't really do anything if it fails to do so.
> > 
> > Thanks for the background; that improves my meager understanding a
> > lot.
> > 
> > Since this seems to be a firmware issue, it does sound like this
> > laptop uses a firmware connection manager.  But there still seems to
> > be some kernel connection because pre-e8b908146d44, the link came up
> > in <5 seconds, and after the minor e8b908146d44 change, it takes >60
> > seconds.
> 
> In both cases (with or without) the commit what happens is that after
> resume is finished the firmware connection manager notices the
> connection, announces it to the Thunderbolt driver that exposes it to
> the userspace where boltd re-authorizes the device. This brings up the
> PCIe tunnel again and things get working.
> 
> (What is expected to happen is that during the resume the firmware
>  connection manager re-connects the PCIe tunnel.)
> 
> This took previously the ~5s before resume is complete so that the above
> steps can happen where as after the commit it got delayed more up to the
> arbitrary ~60s because we started to use that with the commit
> e8b908146d44 (PCIE_RESET_READY_POLL_MS).

Why does the kernel delay affect the timing of when the firmware
connection manager notices the connection?  It seems like Linux waits
for the timeout, then Linux does something that kicks the firmware
connection manager.  That's why I asked about this sequence:

  [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
  [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
  [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
  [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
  [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
  [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present

where we wait for the timeout, decide the device is gone, remove
everything, and then the thunderbolt driver does something, and we
notice the device is magically back.

> I would also try to change all the BIOS settings back to defaults, see
> that it works (it is probably in "user" security level then), then
> switch back to "secure" (only change this one option) and try if it now
> works. It could be that some setting just did not get commited properly.

If this might lead to fixing a Linux defect, I'm all for this kind of
experimentation.  But if it only leads to understanding a firmware
defect better or figuring out better advice to users, I'm not, because
I don't want to address this with a release note.

Bjorn

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 11:57                                       ` Bjorn Helgaas
@ 2023-09-27 12:47                                         ` Mika Westerberg
  2023-09-27 14:31                                           ` Lukas Wunner
  2023-09-27 16:50                                           ` Bjorn Helgaas
  0 siblings, 2 replies; 44+ messages in thread
From: Mika Westerberg @ 2023-09-27 12:47 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 06:57:03AM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 27, 2023 at 08:16:02AM +0300, Mika Westerberg wrote:
> > On Tue, Sep 26, 2023 at 12:55:30PM -0500, Bjorn Helgaas wrote:
> > > On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> > > > On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > > > > Now pciehp thinks the slot is occupied and the link is up, so we
> > > > > re-enumerate the hierarchy.  Is this because thunderbolt did something
> > > > > to 06:00.0 that made the link from 05:01.0 come up?
> > > > 
> > > > PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> > > > alongside DisplayPort and other data over the same physical link.
> > > > 
> > > > For this to work, PCIe tunnels need to be set up between the Thunderbolt
> > > > host controller and attached devices.  Once a tunnel is established,
> > > > the PCIe link magically goes up and TLPs can be transmitted.
> > > > 
> > > > There are two ways to establish those tunnels:
> > > > 
> > > > 1/ By a firmware in the Thunderbolt host controller.
> > > >    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> > > > 
> > > > 2/ Natively by the kernel.
> > > >    (software connection manager)
> > > > 
> > > > I'm assuming that the laptop in question exclusively uses the firmware
> > > > connection manager, hence the kernel is reliant on that firmware to
> > > > establish tunnels and can't really do anything if it fails to do so.
> > > 
> > > Thanks for the background; that improves my meager understanding a
> > > lot.
> > > 
> > > Since this seems to be a firmware issue, it does sound like this
> > > laptop uses a firmware connection manager.  But there still seems to
> > > be some kernel connection because pre-e8b908146d44, the link came up
> > > in <5 seconds, and after the minor e8b908146d44 change, it takes >60
> > > seconds.
> > 
> > In both cases (with or without) the commit what happens is that after
> > resume is finished the firmware connection manager notices the
> > connection, announces it to the Thunderbolt driver that exposes it to
> > the userspace where boltd re-authorizes the device. This brings up the
> > PCIe tunnel again and things get working.
> > 
> > (What is expected to happen is that during the resume the firmware
> >  connection manager re-connects the PCIe tunnel.)
> > 
> > This took previously the ~5s before resume is complete so that the above
> > steps can happen where as after the commit it got delayed more up to the
> > arbitrary ~60s because we started to use that with the commit
> > e8b908146d44 (PCIE_RESET_READY_POLL_MS).
> 
> Why does the kernel delay affect the timing of when the firmware
> connection manager notices the connection?  It seems like Linux waits
> for the timeout, then Linux does something that kicks the firmware
> connection manager.  That's why I asked about this sequence:
> 
>   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
>   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
>   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
>   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
>   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
>   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> 
> where we wait for the timeout, decide the device is gone, remove
> everything, and then the thunderbolt driver does something, and we
> notice the device is magically back.

Well the delay delays the whole resume and this includes Thunderbolt
driver resume too, and userspace (where the bolt daemon authorizes the
device again).

> > I would also try to change all the BIOS settings back to defaults, see
> > that it works (it is probably in "user" security level then), then
> > switch back to "secure" (only change this one option) and try if it now
> > works. It could be that some setting just did not get commited properly.
> 
> If this might lead to fixing a Linux defect, I'm all for this kind of
> experimentation.  But if it only leads to understanding a firmware
> defect better or figuring out better advice to users, I'm not, because
> I don't want to address this with a release note.

This is not a Linux defect. The firmware is expected to create that
tunnel so regardless of the "delay" the devices are already back. This
is not happening.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
       [not found]                                       ` <CA+cBOTds9k1Q2haC_gTpsUvjP02dHOv9vSconFEAu-Fsxwf36A@mail.gmail.com>
@ 2023-09-27 13:53                                         ` Mika Westerberg
  2023-09-27 14:12                                           ` Kamil Paral
  2023-09-27 14:08                                         ` Kamil Paral
  1 sibling, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-09-27 13:53 UTC (permalink / raw)
  To: Kamil Paral
  Cc: Bjorn Helgaas, Lukas Wunner, linux-pci, regressions, bhelgaas,
	chris.chiu

Hi,

On Wed, Sep 27, 2023 at 03:43:29PM +0200, Kamil Paral wrote:
>    On Wed, Sep 27, 2023 at 7:16 AM Mika Westerberg
>    <[1]mika.westerberg@linux.intel.com> wrote:
> 
>    > I would also try to change all the BIOS settings back to defaults, see
>    > that it works (it is probably in "user" security level then), then
>    > switch back to "secure" (only change this one option) and try if it now
>    > works. It could be that some setting just did not get commited properly.
> 
>    Hello, I have some interesting findings.
>    I reset the BIOS ("load BIOS defaults"), however that doesn't reset the TB
>    security level option - whatever is set is kept. So I can't really find
>    out what the default value is and whether I changed it. However, to my
>    surprise, the resume was suddenly fast even with the Secure Connect
>    security level. So I went through options one by one and finally
>    discovered that the "Wake by Thunderbolt 3" option makes the difference.
>    It has the following description:
>    "Enable/Disable Wake Feature with Thunderbolt 3 ports. If Enabled, the
>    battery life during low power state may become shorter."
>    It is enabled by default, I had it disabled before. When I enable it, I
>    have a 5 sec resume even with Secure Connect TB security level. The dmesg
>    for such a resume is here:
>    [2]https://bugzilla-attachments.redhat.com/attachment.cgi?id=1990815

Okay that now looks like what is expected. The firmware re-connects the
tunnels and the resume can continue.

I actually don't know what that option does either but I would expect
that it leaves the PME and small power enabled for the port so that
probably allows the firmware to do this even in "secure" mode or so.

>    I'm sorry I haven't spotted and tested such an obviously named option
>    earlier. My understanding was that this option either wakes up my laptop
>    when I connect the cable (tested now, it doesn't appear to do so), or that
>    it wakes it up when I move/click my dock-connected mouse (tested, doesn't
>    appear to do so). So originally I wanted this disabled (and I had no
>    kernel issues with it disabled, prior to e8b908146d44), but it seems I
>    misunderstood what it does. And I'm still unsure what exactly it's
>    supposed to do and why it should kill my battery life in the process,
>    according to its description.
>    Hopefully this helps to bring more light into this?
>    However, I also discovered that even though Wake by TB3 + Secure Connect
>    gives me a fast resume in a normal case, I still see a 60+ sec resume
>    delay when I disconnect and then reconnect the TB cable while the laptop
>    is suspended (which is a frequent use case for me). That doesn't happen
>    with the User Authorization level (regardless of Wake by TB value). So I'm
>    staying with User Authorization for the moment.

If you apply the patch from here:

https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?h=pm&id=6786c2941fe1788035f99c98c932672138b3fbc5

Does the problem go away with the disconnect case too (and so that you
have "secure" enabled)?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
       [not found]                                       ` <CA+cBOTds9k1Q2haC_gTpsUvjP02dHOv9vSconFEAu-Fsxwf36A@mail.gmail.com>
  2023-09-27 13:53                                         ` Mika Westerberg
@ 2023-09-27 14:08                                         ` Kamil Paral
  1 sibling, 0 replies; 44+ messages in thread
From: Kamil Paral @ 2023-09-27 14:08 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, Lukas Wunner, linux-pci, regressions, bhelgaas,
	chris.chiu

Resending my previous mail again, because I haven't sent it as
text/plain and it was rejected by the kernel mailing list. See below.

On Wed, Sep 27, 2023 at 3:43 PM Kamil Paral <kparal@redhat.com> wrote:
>
> On Wed, Sep 27, 2023 at 7:16 AM Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
>>
>> I would also try to change all the BIOS settings back to defaults, see
>> that it works (it is probably in "user" security level then), then
>> switch back to "secure" (only change this one option) and try if it now
>> works. It could be that some setting just did not get commited properly.
>
>
> Hello, I have some interesting findings.
>
> I reset the BIOS ("load BIOS defaults"), however that doesn't reset the TB security level option - whatever is set is kept. So I can't really find out what the default value is and whether I changed it. However, to my surprise, the resume was suddenly fast even with the Secure Connect security level. So I went through options one by one and finally discovered that the "Wake by Thunderbolt 3" option makes the difference. It has the following description:
> "Enable/Disable Wake Feature with Thunderbolt 3 ports. If Enabled, the battery life during low power state may become shorter."
>
> It is enabled by default, I had it disabled before. When I enable it, I have a 5 sec resume even with Secure Connect TB security level. The dmesg for such a resume is here:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1990815
>
> I'm sorry I haven't spotted and tested such an obviously named option earlier. My understanding was that this option either wakes up my laptop when I connect the cable (tested now, it doesn't appear to do so), or that it wakes it up when I move/click my dock-connected mouse (tested, doesn't appear to do so). So originally I wanted this disabled (and I had no kernel issues with it disabled, prior to e8b908146d44), but it seems I misunderstood what it does. And I'm still unsure what exactly it's supposed to do and why it should kill my battery life in the process, according to its description.
>
> Hopefully this helps to bring more light into this?
>
> However, I also discovered that even though Wake by TB3 + Secure Connect gives me a fast resume in a normal case, I still see a 60+ sec resume delay when I disconnect and then reconnect the TB cable while the laptop is suspended (which is a frequent use case for me). That doesn't happen with the User Authorization level (regardless of Wake by TB value). So I'm staying with User Authorization for the moment.
>
> Cheers,
> Kamil Páral


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 13:53                                         ` Mika Westerberg
@ 2023-09-27 14:12                                           ` Kamil Paral
  2023-10-05 12:54                                             ` Kamil Paral
  0 siblings, 1 reply; 44+ messages in thread
From: Kamil Paral @ 2023-09-27 14:12 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, Lukas Wunner, linux-pci, regressions, bhelgaas,
	chris.chiu

On Wed, Sep 27, 2023 at 3:53 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
> If you apply the patch from here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?h=pm&id=6786c2941fe1788035f99c98c932672138b3fbc5
>
> Does the problem go away with the disconnect case too (and so that you
> have "secure" enabled)?

Thanks, I'll test it, but I can't do it this week. I'll reply next week.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 12:47                                         ` Mika Westerberg
@ 2023-09-27 14:31                                           ` Lukas Wunner
  2023-09-27 14:42                                             ` Mika Westerberg
  2023-09-27 16:50                                           ` Bjorn Helgaas
  1 sibling, 1 reply; 44+ messages in thread
From: Lukas Wunner @ 2023-09-27 14:31 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 03:47:32PM +0300, Mika Westerberg wrote:
> This is not a Linux defect. The firmware is expected to create that
> tunnel so regardless of the "delay" the devices are already back. This
> is not happening.

I recall that newer chips can be switched over to software connection
manager at runtime.

Can we determine that the ICM firmware failed to do what it should,
kick it out and let the software connection manager take over?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 14:31                                           ` Lukas Wunner
@ 2023-09-27 14:42                                             ` Mika Westerberg
  2023-09-27 15:36                                               ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-09-27 14:42 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 04:31:43PM +0200, Lukas Wunner wrote:
> On Wed, Sep 27, 2023 at 03:47:32PM +0300, Mika Westerberg wrote:
> > This is not a Linux defect. The firmware is expected to create that
> > tunnel so regardless of the "delay" the devices are already back. This
> > is not happening.
> 
> I recall that newer chips can be switched over to software connection
> manager at runtime.
> 
> Can we determine that the ICM firmware failed to do what it should,
> kick it out and let the software connection manager take over?

No that's not possible. In Macs it was "partially" possible but even
there you lose all the PM and the like. I don't even want to speculate
what happens if you run the same on PCs.

There is an option to "force" this but I do not recommend this (pass
thunderbolt.start_icm=1 in the command line).

This is of course completely different with USB4 where software CM is
the only option.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 14:42                                             ` Mika Westerberg
@ 2023-09-27 15:36                                               ` Mika Westerberg
  0 siblings, 0 replies; 44+ messages in thread
From: Mika Westerberg @ 2023-09-27 15:36 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 05:42:48PM +0300, Mika Westerberg wrote:
> On Wed, Sep 27, 2023 at 04:31:43PM +0200, Lukas Wunner wrote:
> > On Wed, Sep 27, 2023 at 03:47:32PM +0300, Mika Westerberg wrote:
> > > This is not a Linux defect. The firmware is expected to create that
> > > tunnel so regardless of the "delay" the devices are already back. This
> > > is not happening.
> > 
> > I recall that newer chips can be switched over to software connection
> > manager at runtime.
> > 
> > Can we determine that the ICM firmware failed to do what it should,
> > kick it out and let the software connection manager take over?
> 
> No that's not possible. In Macs it was "partially" possible but even
> there you lose all the PM and the like. I don't even want to speculate
> what happens if you run the same on PCs.
> 
> There is an option to "force" this but I do not recommend this (pass
> thunderbolt.start_icm=1 in the command line).

Sorry, this is the complete opposite. This starts the firmware not stops
it but anyways everything else above is true ;-)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 12:47                                         ` Mika Westerberg
  2023-09-27 14:31                                           ` Lukas Wunner
@ 2023-09-27 16:50                                           ` Bjorn Helgaas
  2023-09-27 17:01                                             ` Mika Westerberg
  1 sibling, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-09-27 16:50 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 03:47:32PM +0300, Mika Westerberg wrote:
> On Wed, Sep 27, 2023 at 06:57:03AM -0500, Bjorn Helgaas wrote:
> > On Wed, Sep 27, 2023 at 08:16:02AM +0300, Mika Westerberg wrote:
> > > On Tue, Sep 26, 2023 at 12:55:30PM -0500, Bjorn Helgaas wrote:
> > > > On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> > > > > On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > > > > > Now pciehp thinks the slot is occupied and the link is up, so we
> > > > > > re-enumerate the hierarchy.  Is this because thunderbolt did something
> > > > > > to 06:00.0 that made the link from 05:01.0 come up?
> > > > > 
> > > > > PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> > > > > alongside DisplayPort and other data over the same physical link.
> > > > > 
> > > > > For this to work, PCIe tunnels need to be set up between the Thunderbolt
> > > > > host controller and attached devices.  Once a tunnel is established,
> > > > > the PCIe link magically goes up and TLPs can be transmitted.
> > > > > 
> > > > > There are two ways to establish those tunnels:
> > > > > 
> > > > > 1/ By a firmware in the Thunderbolt host controller.
> > > > >    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> > > > > 
> > > > > 2/ Natively by the kernel.
> > > > >    (software connection manager)
> > > > > 
> > > > > I'm assuming that the laptop in question exclusively uses the firmware
> > > > > connection manager, hence the kernel is reliant on that firmware to
> > > > > establish tunnels and can't really do anything if it fails to do so.
> > > > 
> > > > Thanks for the background; that improves my meager understanding a
> > > > lot.
> > > > 
> > > > Since this seems to be a firmware issue, it does sound like this
> > > > laptop uses a firmware connection manager.  But there still seems to
> > > > be some kernel connection because pre-e8b908146d44, the link came up
> > > > in <5 seconds, and after the minor e8b908146d44 change, it takes >60
> > > > seconds.
> > > 
> > > In both cases (with or without) the commit what happens is that after
> > > resume is finished the firmware connection manager notices the
> > > connection, announces it to the Thunderbolt driver that exposes it to
> > > the userspace where boltd re-authorizes the device. This brings up the
> > > PCIe tunnel again and things get working.
> > > 
> > > (What is expected to happen is that during the resume the firmware
> > >  connection manager re-connects the PCIe tunnel.)
> > > 
> > > This took previously the ~5s before resume is complete so that the above
> > > steps can happen where as after the commit it got delayed more up to the
> > > arbitrary ~60s because we started to use that with the commit
> > > e8b908146d44 (PCIE_RESET_READY_POLL_MS).
> > 
> > Why does the kernel delay affect the timing of when the firmware
> > connection manager notices the connection?  It seems like Linux waits
> > for the timeout, then Linux does something that kicks the firmware
> > connection manager.  That's why I asked about this sequence:
> > 
> >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > 
> > where we wait for the timeout, decide the device is gone, remove
> > everything, and then the thunderbolt driver does something, and we
> > notice the device is magically back.
> 
> Well the delay delays the whole resume and this includes Thunderbolt
> driver resume too, and userspace (where the bolt daemon authorizes the
> device again).

I don't know how the Thunderbolt driver works.  I assume this refers
to "thunderbolt 0000:06:00.0"?  Is the 06:00.0 resume related to the
firmware connection manager somehow?

The removal affects the sub-hierarchy below 05:01.0 (bus 07-3b).
06:00.0 is below 05:00.0, so it's in a different sub-hierarchy.  I
don't think there's a PCIe requirement that 05:01.0 be resumed before
05:00.0, or even that they be serialized at all.

The hierarchy:

  pci 0000:00:1c.4: PCI bridge to [bus 04-3c]
  pci 0000:04:00.0: PCI bridge to [bus 05-3c]
  pci 0000:05:00.0: PCI bridge to [bus 06]
  pci 0000:05:01.0: PCI bridge to [bus 07-3b]

It looks like we start the 06:00.0 resume first (118.9), but it
doesn't complete until after the timeout (191.7):

  [  118.915870] thunderbolt 0000:06:00.0: control channel starting...
  [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
  [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
  [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
  [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
  [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
  [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present

Did the Thunderbolt driver do something to 06:00.0 that caused the
05:01.0 link to come up, or is the timing just coincidental?

> > > I would also try to change all the BIOS settings back to defaults, see
> > > that it works (it is probably in "user" security level then), then
> > > switch back to "secure" (only change this one option) and try if it now
> > > works. It could be that some setting just did not get commited properly.
> > 
> > If this might lead to fixing a Linux defect, I'm all for this kind of
> > experimentation.  But if it only leads to understanding a firmware
> > defect better or figuring out better advice to users, I'm not, because
> > I don't want to address this with a release note.
> 
> This is not a Linux defect. The firmware is expected to create that
> tunnel so regardless of the "delay" the devices are already back. This
> is not happening.

Sorry, bad phrasing on my part.  I understand it's not a Linux defect;
I meant to suggest that if we can learn something that leads to a
Linux change that avoids the delay, whether it's a quirk, design
change, whatever, that's great.

If it only leads to a documentation change or recommendation to users,
that's not great because it's only a reaction to something the user
has already tripped over.

Bjorn

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 16:50                                           ` Bjorn Helgaas
@ 2023-09-27 17:01                                             ` Mika Westerberg
  2023-09-27 17:24                                               ` Bjorn Helgaas
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-09-27 17:01 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 11:50:36AM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 27, 2023 at 03:47:32PM +0300, Mika Westerberg wrote:
> > On Wed, Sep 27, 2023 at 06:57:03AM -0500, Bjorn Helgaas wrote:
> > > On Wed, Sep 27, 2023 at 08:16:02AM +0300, Mika Westerberg wrote:
> > > > On Tue, Sep 26, 2023 at 12:55:30PM -0500, Bjorn Helgaas wrote:
> > > > > On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> > > > > > On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > > > > > > Now pciehp thinks the slot is occupied and the link is up, so we
> > > > > > > re-enumerate the hierarchy.  Is this because thunderbolt did something
> > > > > > > to 06:00.0 that made the link from 05:01.0 come up?
> > > > > > 
> > > > > > PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> > > > > > alongside DisplayPort and other data over the same physical link.
> > > > > > 
> > > > > > For this to work, PCIe tunnels need to be set up between the Thunderbolt
> > > > > > host controller and attached devices.  Once a tunnel is established,
> > > > > > the PCIe link magically goes up and TLPs can be transmitted.
> > > > > > 
> > > > > > There are two ways to establish those tunnels:
> > > > > > 
> > > > > > 1/ By a firmware in the Thunderbolt host controller.
> > > > > >    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> > > > > > 
> > > > > > 2/ Natively by the kernel.
> > > > > >    (software connection manager)
> > > > > > 
> > > > > > I'm assuming that the laptop in question exclusively uses the firmware
> > > > > > connection manager, hence the kernel is reliant on that firmware to
> > > > > > establish tunnels and can't really do anything if it fails to do so.
> > > > > 
> > > > > Thanks for the background; that improves my meager understanding a
> > > > > lot.
> > > > > 
> > > > > Since this seems to be a firmware issue, it does sound like this
> > > > > laptop uses a firmware connection manager.  But there still seems to
> > > > > be some kernel connection because pre-e8b908146d44, the link came up
> > > > > in <5 seconds, and after the minor e8b908146d44 change, it takes >60
> > > > > seconds.
> > > > 
> > > > In both cases (with or without) the commit what happens is that after
> > > > resume is finished the firmware connection manager notices the
> > > > connection, announces it to the Thunderbolt driver that exposes it to
> > > > the userspace where boltd re-authorizes the device. This brings up the
> > > > PCIe tunnel again and things get working.
> > > > 
> > > > (What is expected to happen is that during the resume the firmware
> > > >  connection manager re-connects the PCIe tunnel.)
> > > > 
> > > > This took previously the ~5s before resume is complete so that the above
> > > > steps can happen where as after the commit it got delayed more up to the
> > > > arbitrary ~60s because we started to use that with the commit
> > > > e8b908146d44 (PCIE_RESET_READY_POLL_MS).
> > > 
> > > Why does the kernel delay affect the timing of when the firmware
> > > connection manager notices the connection?  It seems like Linux waits
> > > for the timeout, then Linux does something that kicks the firmware
> > > connection manager.  That's why I asked about this sequence:
> > > 
> > >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> > >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> > >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > > 
> > > where we wait for the timeout, decide the device is gone, remove
> > > everything, and then the thunderbolt driver does something, and we
> > > notice the device is magically back.
> > 
> > Well the delay delays the whole resume and this includes Thunderbolt
> > driver resume too, and userspace (where the bolt daemon authorizes the
> > device again).
> 
> I don't know how the Thunderbolt driver works.  I assume this refers
> to "thunderbolt 0000:06:00.0"?  Is the 06:00.0 resume related to the
> firmware connection manager somehow?
> 
> The removal affects the sub-hierarchy below 05:01.0 (bus 07-3b).
> 06:00.0 is below 05:00.0, so it's in a different sub-hierarchy.  I
> don't think there's a PCIe requirement that 05:01.0 be resumed before
> 05:00.0, or even that they be serialized at all.
> 
> The hierarchy:
> 
>   pci 0000:00:1c.4: PCI bridge to [bus 04-3c]
>   pci 0000:04:00.0: PCI bridge to [bus 05-3c]
>   pci 0000:05:00.0: PCI bridge to [bus 06]
>   pci 0000:05:01.0: PCI bridge to [bus 07-3b]
> 
> It looks like we start the 06:00.0 resume first (118.9), but it
> doesn't complete until after the timeout (191.7):
> 
>   [  118.915870] thunderbolt 0000:06:00.0: control channel starting...
>   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
>   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
>   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
>   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
>   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
>   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> 
> Did the Thunderbolt driver do something to 06:00.0 that caused the
> 05:01.0 link to come up, or is the timing just coincidental?

Yes it sent the firmware a command telling that the driver is ready again, then
the firmware sends back notification that there is a new device:

[  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
[  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
[  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock

this then is send to the userspace via uevent where bolt goes and
authorizes it and this results the tunnel to be created which show in
the log as:

[  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 17:01                                             ` Mika Westerberg
@ 2023-09-27 17:24                                               ` Bjorn Helgaas
  2023-09-27 18:02                                                 ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-09-27 17:24 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 08:01:14PM +0300, Mika Westerberg wrote:
> On Wed, Sep 27, 2023 at 11:50:36AM -0500, Bjorn Helgaas wrote:
> > On Wed, Sep 27, 2023 at 03:47:32PM +0300, Mika Westerberg wrote:
> > > On Wed, Sep 27, 2023 at 06:57:03AM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Sep 27, 2023 at 08:16:02AM +0300, Mika Westerberg wrote:
> > > > > On Tue, Sep 26, 2023 at 12:55:30PM -0500, Bjorn Helgaas wrote:
> > > > > > On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> > > > > > > On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > > > > > > > Now pciehp thinks the slot is occupied and the link is up, so we
> > > > > > > > re-enumerate the hierarchy.  Is this because thunderbolt did something
> > > > > > > > to 06:00.0 that made the link from 05:01.0 come up?
> > > > > > > 
> > > > > > > PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> > > > > > > alongside DisplayPort and other data over the same physical link.
> > > > > > > 
> > > > > > > For this to work, PCIe tunnels need to be set up between the Thunderbolt
> > > > > > > host controller and attached devices.  Once a tunnel is established,
> > > > > > > the PCIe link magically goes up and TLPs can be transmitted.
> > > > > > > 
> > > > > > > There are two ways to establish those tunnels:
> > > > > > > 
> > > > > > > 1/ By a firmware in the Thunderbolt host controller.
> > > > > > >    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> > > > > > > 
> > > > > > > 2/ Natively by the kernel.
> > > > > > >    (software connection manager)
> > > > > > > 
> > > > > > > I'm assuming that the laptop in question exclusively uses the firmware
> > > > > > > connection manager, hence the kernel is reliant on that firmware to
> > > > > > > establish tunnels and can't really do anything if it fails to do so.
> > > > > > 
> > > > > > Thanks for the background; that improves my meager understanding a
> > > > > > lot.
> > > > > > 
> > > > > > Since this seems to be a firmware issue, it does sound like this
> > > > > > laptop uses a firmware connection manager.  But there still seems to
> > > > > > be some kernel connection because pre-e8b908146d44, the link came up
> > > > > > in <5 seconds, and after the minor e8b908146d44 change, it takes >60
> > > > > > seconds.
> > > > > 
> > > > > In both cases (with or without) the commit what happens is that after
> > > > > resume is finished the firmware connection manager notices the
> > > > > connection, announces it to the Thunderbolt driver that exposes it to
> > > > > the userspace where boltd re-authorizes the device. This brings up the
> > > > > PCIe tunnel again and things get working.
> > > > > 
> > > > > (What is expected to happen is that during the resume the firmware
> > > > >  connection manager re-connects the PCIe tunnel.)
> > > > > 
> > > > > This took previously the ~5s before resume is complete so that the above
> > > > > steps can happen where as after the commit it got delayed more up to the
> > > > > arbitrary ~60s because we started to use that with the commit
> > > > > e8b908146d44 (PCIE_RESET_READY_POLL_MS).
> > > > 
> > > > Why does the kernel delay affect the timing of when the firmware
> > > > connection manager notices the connection?  It seems like Linux waits
> > > > for the timeout, then Linux does something that kicks the firmware
> > > > connection manager.  That's why I asked about this sequence:
> > > > 
> > > >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> > > >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> > > >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > > >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > > >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > > >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > > > 
> > > > where we wait for the timeout, decide the device is gone, remove
> > > > everything, and then the thunderbolt driver does something, and we
> > > > notice the device is magically back.
> > > 
> > > Well the delay delays the whole resume and this includes Thunderbolt
> > > driver resume too, and userspace (where the bolt daemon authorizes the
> > > device again).
> > 
> > I don't know how the Thunderbolt driver works.  I assume this refers
> > to "thunderbolt 0000:06:00.0"?  Is the 06:00.0 resume related to the
> > firmware connection manager somehow?
> > 
> > The removal affects the sub-hierarchy below 05:01.0 (bus 07-3b).
> > 06:00.0 is below 05:00.0, so it's in a different sub-hierarchy.  I
> > don't think there's a PCIe requirement that 05:01.0 be resumed before
> > 05:00.0, or even that they be serialized at all.
> > 
> > The hierarchy:
> > 
> >   pci 0000:00:1c.4: PCI bridge to [bus 04-3c]
> >   pci 0000:04:00.0: PCI bridge to [bus 05-3c]
> >   pci 0000:05:00.0: PCI bridge to [bus 06]
> >   pci 0000:05:01.0: PCI bridge to [bus 07-3b]
> > 
> > It looks like we start the 06:00.0 resume first (118.9), but it
> > doesn't complete until after the timeout (191.7):
> > 
> >   [  118.915870] thunderbolt 0000:06:00.0: control channel starting...
> >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > 
> > Did the Thunderbolt driver do something to 06:00.0 that caused the
> > 05:01.0 link to come up, or is the timing just coincidental?
> 
> Yes it sent the firmware a command telling that the driver is ready
> again, then the firmware sends back notification that there is a new
> device:
> 
> [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> 
> this then is send to the userspace via uevent where bolt goes and
> authorizes it and this results the tunnel to be created which show in
> the log as:
> 
> [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present

So the obvious next question is why we have to wait for the 05:01.0
link timeout before sending the command to the 06:00.0 firmware, since
there's no PCI connection between them.

But there must be *some* connection between the 05:01.0 link coming up
and the 06:00.0 behavior.  Maybe this is related to the
nhi_resume_noirq() comment about "The tunneled pci bridges are
siblings of us. Use resume_noirq to reenable the tunnels asap. A
corresponding pci quirk blocks the downstream bridges resume_noirq
until we are done." Unfortunately the comment doesn't mention the NAME
of the quirk, so I lost the trail there.

Maybe there's an opportunity for a quirk that says "this Thunderbolt
device should never need a whole second for the link to come up", for
example.

Bjorn

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 17:24                                               ` Bjorn Helgaas
@ 2023-09-27 18:02                                                 ` Mika Westerberg
  2023-09-27 19:41                                                   ` Bjorn Helgaas
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-09-27 18:02 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 12:24:55PM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 27, 2023 at 08:01:14PM +0300, Mika Westerberg wrote:
> > On Wed, Sep 27, 2023 at 11:50:36AM -0500, Bjorn Helgaas wrote:
> > > On Wed, Sep 27, 2023 at 03:47:32PM +0300, Mika Westerberg wrote:
> > > > On Wed, Sep 27, 2023 at 06:57:03AM -0500, Bjorn Helgaas wrote:
> > > > > On Wed, Sep 27, 2023 at 08:16:02AM +0300, Mika Westerberg wrote:
> > > > > > On Tue, Sep 26, 2023 at 12:55:30PM -0500, Bjorn Helgaas wrote:
> > > > > > > On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> > > > > > > > On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > > > > > > > > Now pciehp thinks the slot is occupied and the link is up, so we
> > > > > > > > > re-enumerate the hierarchy.  Is this because thunderbolt did something
> > > > > > > > > to 06:00.0 that made the link from 05:01.0 come up?
> > > > > > > > 
> > > > > > > > PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> > > > > > > > alongside DisplayPort and other data over the same physical link.
> > > > > > > > 
> > > > > > > > For this to work, PCIe tunnels need to be set up between the Thunderbolt
> > > > > > > > host controller and attached devices.  Once a tunnel is established,
> > > > > > > > the PCIe link magically goes up and TLPs can be transmitted.
> > > > > > > > 
> > > > > > > > There are two ways to establish those tunnels:
> > > > > > > > 
> > > > > > > > 1/ By a firmware in the Thunderbolt host controller.
> > > > > > > >    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> > > > > > > > 
> > > > > > > > 2/ Natively by the kernel.
> > > > > > > >    (software connection manager)
> > > > > > > > 
> > > > > > > > I'm assuming that the laptop in question exclusively uses the firmware
> > > > > > > > connection manager, hence the kernel is reliant on that firmware to
> > > > > > > > establish tunnels and can't really do anything if it fails to do so.
> > > > > > > 
> > > > > > > Thanks for the background; that improves my meager understanding a
> > > > > > > lot.
> > > > > > > 
> > > > > > > Since this seems to be a firmware issue, it does sound like this
> > > > > > > laptop uses a firmware connection manager.  But there still seems to
> > > > > > > be some kernel connection because pre-e8b908146d44, the link came up
> > > > > > > in <5 seconds, and after the minor e8b908146d44 change, it takes >60
> > > > > > > seconds.
> > > > > > 
> > > > > > In both cases (with or without) the commit what happens is that after
> > > > > > resume is finished the firmware connection manager notices the
> > > > > > connection, announces it to the Thunderbolt driver that exposes it to
> > > > > > the userspace where boltd re-authorizes the device. This brings up the
> > > > > > PCIe tunnel again and things get working.
> > > > > > 
> > > > > > (What is expected to happen is that during the resume the firmware
> > > > > >  connection manager re-connects the PCIe tunnel.)
> > > > > > 
> > > > > > This took previously the ~5s before resume is complete so that the above
> > > > > > steps can happen where as after the commit it got delayed more up to the
> > > > > > arbitrary ~60s because we started to use that with the commit
> > > > > > e8b908146d44 (PCIE_RESET_READY_POLL_MS).
> > > > > 
> > > > > Why does the kernel delay affect the timing of when the firmware
> > > > > connection manager notices the connection?  It seems like Linux waits
> > > > > for the timeout, then Linux does something that kicks the firmware
> > > > > connection manager.  That's why I asked about this sequence:
> > > > > 
> > > > >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> > > > >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> > > > >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > > > >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > > > >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > > > >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > > > > 
> > > > > where we wait for the timeout, decide the device is gone, remove
> > > > > everything, and then the thunderbolt driver does something, and we
> > > > > notice the device is magically back.
> > > > 
> > > > Well the delay delays the whole resume and this includes Thunderbolt
> > > > driver resume too, and userspace (where the bolt daemon authorizes the
> > > > device again).
> > > 
> > > I don't know how the Thunderbolt driver works.  I assume this refers
> > > to "thunderbolt 0000:06:00.0"?  Is the 06:00.0 resume related to the
> > > firmware connection manager somehow?
> > > 
> > > The removal affects the sub-hierarchy below 05:01.0 (bus 07-3b).
> > > 06:00.0 is below 05:00.0, so it's in a different sub-hierarchy.  I
> > > don't think there's a PCIe requirement that 05:01.0 be resumed before
> > > 05:00.0, or even that they be serialized at all.
> > > 
> > > The hierarchy:
> > > 
> > >   pci 0000:00:1c.4: PCI bridge to [bus 04-3c]
> > >   pci 0000:04:00.0: PCI bridge to [bus 05-3c]
> > >   pci 0000:05:00.0: PCI bridge to [bus 06]
> > >   pci 0000:05:01.0: PCI bridge to [bus 07-3b]
> > > 
> > > It looks like we start the 06:00.0 resume first (118.9), but it
> > > doesn't complete until after the timeout (191.7):
> > > 
> > >   [  118.915870] thunderbolt 0000:06:00.0: control channel starting...
> > >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> > >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> > >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > > 
> > > Did the Thunderbolt driver do something to 06:00.0 that caused the
> > > 05:01.0 link to come up, or is the timing just coincidental?
> > 
> > Yes it sent the firmware a command telling that the driver is ready
> > again, then the firmware sends back notification that there is a new
> > device:
> > 
> > [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > 
> > this then is send to the userspace via uevent where bolt goes and
> > authorizes it and this results the tunnel to be created which show in
> > the log as:
> > 
> > [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> 
> So the obvious next question is why we have to wait for the 05:01.0
> link timeout before sending the command to the 06:00.0 firmware, since
> there's no PCI connection between them.

No there is not - it is a PCIe tunnel.

> But there must be *some* connection between the 05:01.0 link coming up
> and the 06:00.0 behavior.  Maybe this is related to the
> nhi_resume_noirq() comment about "The tunneled pci bridges are
> siblings of us. Use resume_noirq to reenable the tunnels asap. A
> corresponding pci quirk blocks the downstream bridges resume_noirq
> until we are done." Unfortunately the comment doesn't mention the NAME
> of the quirk, so I lost the trail there.
> 
> Maybe there's an opportunity for a quirk that says "this Thunderbolt
> device should never need a whole second for the link to come up", for
> example.

The PCIe stack needs to follow PCIe spec (regardless of what is the
actual medium the packets go through). And regadless what we do here the
PCIe link goes down and that breaks the usage of any devices behind that
link such as mounted disks so the delay is definitely not the worst
thing that can happen.

Now, If you read anything that Kamil said, this actually works now with
the "secure" mode when he turned the "Wake from Thunderbolt" option back
to the default. So now it works as expected but upon unplug there is the
issue that we fixed with the commit that marks the devices disconnected.
He will test that next week so we can be sure.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 18:02                                                 ` Mika Westerberg
@ 2023-09-27 19:41                                                   ` Bjorn Helgaas
  2023-09-28  4:42                                                     ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-09-27 19:41 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 09:02:00PM +0300, Mika Westerberg wrote:
> On Wed, Sep 27, 2023 at 12:24:55PM -0500, Bjorn Helgaas wrote:
> > On Wed, Sep 27, 2023 at 08:01:14PM +0300, Mika Westerberg wrote:
> > > On Wed, Sep 27, 2023 at 11:50:36AM -0500, Bjorn Helgaas wrote:
> > > ...

> > > > The hierarchy:
> > > > 
> > > >   pci 0000:00:1c.4: PCI bridge to [bus 04-3c]
> > > >   pci 0000:04:00.0: PCI bridge to [bus 05-3c]
> > > >   pci 0000:05:00.0: PCI bridge to [bus 06]
> > > >   pci 0000:05:01.0: PCI bridge to [bus 07-3b]
> > > > 
> > > > It looks like we start the 06:00.0 resume first (118.9), but it
> > > > doesn't complete until after the timeout (191.7):
> > > > 
> > > >   [  118.915870] thunderbolt 0000:06:00.0: control channel starting...
> > > >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> > > >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> > > >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > > >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > > >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > > >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > > > 
> > > > Did the Thunderbolt driver do something to 06:00.0 that caused the
> > > > 05:01.0 link to come up, or is the timing just coincidental?
> > > 
> > > Yes it sent the firmware a command telling that the driver is ready
> > > again, then the firmware sends back notification that there is a new
> > > device:
> > > 
> > > [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > > [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > > [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > > 
> > > this then is send to the userspace via uevent where bolt goes and
> > > authorizes it and this results the tunnel to be created which show in
> > > the log as:
> > > 
> > > [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > 
> > So the obvious next question is why we have to wait for the 05:01.0
> > link timeout before sending the command to the 06:00.0 firmware, since
> > there's no PCI connection between them.
> 
> No there is not - it is a PCIe tunnel.

You confirmed my assertion that in terms of the PCI topology, there is
no connection between 05:01.0 and 06:00.0.

So the question remains: do we need to wait for the 05:01.0 link
timeout before the 06:00.0 driver does something that leads to
firmware starting the process of bringing up the tunnel and the link?

> > But there must be *some* connection between the 05:01.0 link coming up
> > and the 06:00.0 behavior.  Maybe this is related to the
> > nhi_resume_noirq() comment about "The tunneled pci bridges are
> > siblings of us. Use resume_noirq to reenable the tunnels asap. A
> > corresponding pci quirk blocks the downstream bridges resume_noirq
> > until we are done." Unfortunately the comment doesn't mention the NAME
> > of the quirk, so I lost the trail there.
> > 
> > Maybe there's an opportunity for a quirk that says "this Thunderbolt
> > device should never need a whole second for the link to come up", for
> > example.
> 
> The PCIe stack needs to follow PCIe spec (regardless of what is the
> actual medium the packets go through). And regadless what we do here the
> PCIe link goes down and that breaks the usage of any devices behind that
> link such as mounted disks so the delay is definitely not the worst
> thing that can happen.

Sure.  But if we resume in a reasonable time, we at least have the
opportunity to notice what happened and warn about why the mounted
disk broke.  That's way better than getting bug reports that say
"resume is broken."

> Now, If you read anything that Kamil said, this actually works now with
> the "secure" mode when he turned the "Wake from Thunderbolt" option back
> to the default. So now it works as expected but upon unplug there is the
> issue that we fixed with the commit that marks the devices disconnected.
> He will test that next week so we can be sure.

I did see that.  You may have also seen my sentiment that it's painful
for users to diagnose an apparent resume failure and find a BIOS
option to fix it.

If we can't make resume work in a reasonable time, can we at least
detect the case where it's going to take too long?  If so, maybe we
can warn about it, avoid suspend, or change the way we do it.

Bjorn

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 19:41                                                   ` Bjorn Helgaas
@ 2023-09-28  4:42                                                     ` Mika Westerberg
  2023-09-28 15:49                                                       ` Bjorn Helgaas
  0 siblings, 1 reply; 44+ messages in thread
From: Mika Westerberg @ 2023-09-28  4:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Wed, Sep 27, 2023 at 02:41:45PM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 27, 2023 at 09:02:00PM +0300, Mika Westerberg wrote:
> > On Wed, Sep 27, 2023 at 12:24:55PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Sep 27, 2023 at 08:01:14PM +0300, Mika Westerberg wrote:
> > > > On Wed, Sep 27, 2023 at 11:50:36AM -0500, Bjorn Helgaas wrote:
> > > > ...
> 
> > > > > The hierarchy:
> > > > > 
> > > > >   pci 0000:00:1c.4: PCI bridge to [bus 04-3c]
> > > > >   pci 0000:04:00.0: PCI bridge to [bus 05-3c]
> > > > >   pci 0000:05:00.0: PCI bridge to [bus 06]
> > > > >   pci 0000:05:01.0: PCI bridge to [bus 07-3b]
> > > > > 
> > > > > It looks like we start the 06:00.0 resume first (118.9), but it
> > > > > doesn't complete until after the timeout (191.7):
> > > > > 
> > > > >   [  118.915870] thunderbolt 0000:06:00.0: control channel starting...
> > > > >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> > > > >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> > > > >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > > > >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > > > >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > > > >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > > > > 
> > > > > Did the Thunderbolt driver do something to 06:00.0 that caused the
> > > > > 05:01.0 link to come up, or is the timing just coincidental?
> > > > 
> > > > Yes it sent the firmware a command telling that the driver is ready
> > > > again, then the firmware sends back notification that there is a new
> > > > device:
> > > > 
> > > > [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > > > [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > > > [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > > > 
> > > > this then is send to the userspace via uevent where bolt goes and
> > > > authorizes it and this results the tunnel to be created which show in
> > > > the log as:
> > > > 
> > > > [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > > 
> > > So the obvious next question is why we have to wait for the 05:01.0
> > > link timeout before sending the command to the 06:00.0 firmware, since
> > > there's no PCI connection between them.
> > 
> > No there is not - it is a PCIe tunnel.
> 
> You confirmed my assertion that in terms of the PCI topology, there is
> no connection between 05:01.0 and 06:00.0.
> 
> So the question remains: do we need to wait for the 05:01.0 link
> timeout before the 06:00.0 driver does something that leads to
> firmware starting the process of bringing up the tunnel and the link?

In the expected case this should not matter. It is the ACPI _PR0
resource that when turned on starts bringing up all the links.

It is enough for the firmware that we follow the delays in the PCIe spec.

In this case I think we don't need to wait for the link timeout.

> > > But there must be *some* connection between the 05:01.0 link coming up
> > > and the 06:00.0 behavior.  Maybe this is related to the
> > > nhi_resume_noirq() comment about "The tunneled pci bridges are
> > > siblings of us. Use resume_noirq to reenable the tunnels asap. A
> > > corresponding pci quirk blocks the downstream bridges resume_noirq
> > > until we are done." Unfortunately the comment doesn't mention the NAME
> > > of the quirk, so I lost the trail there.
> > > 
> > > Maybe there's an opportunity for a quirk that says "this Thunderbolt
> > > device should never need a whole second for the link to come up", for
> > > example.
> > 
> > The PCIe stack needs to follow PCIe spec (regardless of what is the
> > actual medium the packets go through). And regadless what we do here the
> > PCIe link goes down and that breaks the usage of any devices behind that
> > link such as mounted disks so the delay is definitely not the worst
> > thing that can happen.
> 
> Sure.  But if we resume in a reasonable time, we at least have the
> opportunity to notice what happened and warn about why the mounted
> disk broke.  That's way better than getting bug reports that say
> "resume is broken."

Well "resume is broken" is much better than "your precious eata just got
lost, have fun."

> > Now, If you read anything that Kamil said, this actually works now with
> > the "secure" mode when he turned the "Wake from Thunderbolt" option back
> > to the default. So now it works as expected but upon unplug there is the
> > issue that we fixed with the commit that marks the devices disconnected.
> > He will test that next week so we can be sure.
> 
> I did see that.  You may have also seen my sentiment that it's painful
> for users to diagnose an apparent resume failure and find a BIOS
> option to fix it.

Unfortunately todays systems are such that if you go and change BIOS
options from the defaults you may accidentally enter untested
territories. Changing things back to defaults is a reasonable ask IMHO.

> If we can't make resume work in a reasonable time, can we at least
> detect the case where it's going to take too long?  If so, maybe we
> can warn about it, avoid suspend, or change the way we do it.

Only thing we are missing now is the result of Kamil's testing to see if
the real uplug case is working. I would expect so, this is the issue we
fixed with:

  https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?h=pm&id=6786c2941fe1788035f99c98c932672138b3fbc5

If that's true then there is nothing more to be fixed, no hacks need to
be put anywhere, and this discussion can be ended too.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-28  4:42                                                     ` Mika Westerberg
@ 2023-09-28 15:49                                                       ` Bjorn Helgaas
  2023-10-05 13:01                                                         ` Kamil Paral
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2023-09-28 15:49 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Lukas Wunner, Kamil Paral, linux-pci, regressions, bhelgaas, chris.chiu

On Thu, Sep 28, 2023 at 07:42:33AM +0300, Mika Westerberg wrote:
> ...

> Only thing we are missing now is the result of Kamil's testing to see if
> the real uplug case is working. I would expect so, this is the issue we
> fixed with:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?h=pm&id=6786c2941fe1788035f99c98c932672138b3fbc5
> 
> If that's true then there is nothing more to be fixed, no hacks need to
> be put anywhere, and this discussion can be ended too.

Do we know how suspend/resume works with Windows in this
configuration?

Bjorn

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-27 14:12                                           ` Kamil Paral
@ 2023-10-05 12:54                                             ` Kamil Paral
  2023-10-05 13:09                                               ` Mika Westerberg
  0 siblings, 1 reply; 44+ messages in thread
From: Kamil Paral @ 2023-10-05 12:54 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Bjorn Helgaas, Lukas Wunner, linux-pci, regressions, bhelgaas,
	chris.chiu

On Wed, Sep 27, 2023 at 4:12 PM Kamil Paral <kparal@redhat.com> wrote:
>
> On Wed, Sep 27, 2023 at 3:53 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> > If you apply the patch from here:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?h=pm&id=6786c2941fe1788035f99c98c932672138b3fbc5
> >
> > Does the problem go away with the disconnect case too (and so that you
> > have "secure" enabled)?
>
> Thanks, I'll test it, but I can't do it this week. I'll reply next week.

Hello Mika, sorry for taking me so long. The URL above gives me a "Bad
commit reference" error, but I found a patch mentioned in a different
thread, so I used that one, and I hope it's the correct one :-) It's
quoted below.

With the patch applied, I can confirm that disconnecting and
reconnecting the cable during suspend is no longer a problem. I tested
both "user" and "secure" Thunderbolt security levels. The resume is
fast in all cases, and I've found no issues. Thanks for working on it!

The patch I used:

> ---
>  drivers/pci/pci-driver.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index a79c110c7e51..51ec9e7e784f 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -572,7 +572,19 @@ static void pci_pm_default_resume_early(struct pci_dev *pci_dev)
>
>  static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
>  {
> -     pci_bridge_wait_for_secondary_bus(pci_dev, "resume");
> +     int ret;
> +
> +     ret = pci_bridge_wait_for_secondary_bus(pci_dev, "resume");
> +     if (ret) {
> +             /*
> +              * The downstream link failed to come up, so mark the
> +              * devices below as disconnected to make sure we don't
> +              * attempt to resume them.
> +              */
> +             pci_walk_bus(pci_dev->subordinate, pci_dev_set_disconnected,
> +                          NULL);
> +             return;
> +     }
>
>       /*
>        * When powering on a bridge from D3cold, the whole hierarchy may be
> --


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-09-28 15:49                                                       ` Bjorn Helgaas
@ 2023-10-05 13:01                                                         ` Kamil Paral
  2023-10-05 19:00                                                           ` Bjorn Helgaas
  0 siblings, 1 reply; 44+ messages in thread
From: Kamil Paral @ 2023-10-05 13:01 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Mika Westerberg, Lukas Wunner, linux-pci, regressions, bhelgaas,
	chris.chiu

On Thu, Sep 28, 2023 at 5:49 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> Do we know how suspend/resume works with Windows in this
> configuration?

Hello Bjorn. I'm not sure if you're asking in general or about my
laptop in particular. I don't have Windows installed, so I don't have
an easy way to test this. There used to be some "evaluation" version
of Windows that I think I could use to test it, but it would be a
considerable time investment for me (including backing up my data,
etc), and I'm currently low on spare time. So at this moment I'm
unfortunately unable to help with this question.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-10-05 12:54                                             ` Kamil Paral
@ 2023-10-05 13:09                                               ` Mika Westerberg
  0 siblings, 0 replies; 44+ messages in thread
From: Mika Westerberg @ 2023-10-05 13:09 UTC (permalink / raw)
  To: Kamil Paral
  Cc: Bjorn Helgaas, Lukas Wunner, linux-pci, regressions, bhelgaas,
	chris.chiu

Hi Kamil,

On Thu, Oct 05, 2023 at 02:54:23PM +0200, Kamil Paral wrote:
> On Wed, Sep 27, 2023 at 4:12 PM Kamil Paral <kparal@redhat.com> wrote:
> >
> > On Wed, Sep 27, 2023 at 3:53 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > > If you apply the patch from here:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?h=pm&id=6786c2941fe1788035f99c98c932672138b3fbc5
> > >
> > > Does the problem go away with the disconnect case too (and so that you
> > > have "secure" enabled)?
> >
> > Thanks, I'll test it, but I can't do it this week. I'll reply next week.
> 
> Hello Mika, sorry for taking me so long. The URL above gives me a "Bad
> commit reference" error, but I found a patch mentioned in a different
> thread, so I used that one, and I hope it's the correct one :-) It's
> quoted below.
> 
> With the patch applied, I can confirm that disconnecting and
> reconnecting the cable during suspend is no longer a problem. I tested
> both "user" and "secure" Thunderbolt security levels. The resume is
> fast in all cases, and I've found no issues. Thanks for working on it!
> 
> The patch I used:

Yes, that patch is the correct one. Bjorn moved it to his "for-linus"
branch so it got different Commit ID:

https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/commit/?h=for-linus&id=c82458101d5490230d735caecce14c9c27b1010c

Thanks a lot for testing!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-10-05 13:01                                                         ` Kamil Paral
@ 2023-10-05 19:00                                                           ` Bjorn Helgaas
  0 siblings, 0 replies; 44+ messages in thread
From: Bjorn Helgaas @ 2023-10-05 19:00 UTC (permalink / raw)
  To: Kamil Paral
  Cc: Mika Westerberg, Lukas Wunner, linux-pci, regressions, bhelgaas,
	chris.chiu

On Thu, Oct 05, 2023 at 03:01:56PM +0200, Kamil Paral wrote:
> On Thu, Sep 28, 2023 at 5:49 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > Do we know how suspend/resume works with Windows in this
> > configuration?
> 
> Hello Bjorn. I'm not sure if you're asking in general or about my
> laptop in particular. I don't have Windows installed, so I don't have
> an easy way to test this. There used to be some "evaluation" version
> of Windows that I think I could use to test it, but it would be a
> considerable time investment for me (including backing up my data,
> etc), and I'm currently low on spare time. So at this moment I'm
> unfortunately unable to help with this question.

That's fair, thanks!

Bjorn

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
  2023-08-21 10:39 [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume" Kamil Paral
  2023-08-21 13:12 ` Mika Westerberg
  2023-08-21 19:10 ` Bjorn Helgaas
@ 2023-11-01 10:59 ` Linux regression tracking (Thorsten Leemhuis)
  2 siblings, 0 replies; 44+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-11-01 10:59 UTC (permalink / raw)
  To: Kamil Paral, linux-pci; +Cc: regressions

[TLDR: This mail in primarily relevant for Linux kernel regression
tracking.]

On 21.08.23 12:39, Kamil Paral wrote:
> = Summary =
> A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock connected
> can no longer resume from suspend. The problem was introduced in
> e8b908146d44 "PCI/PM: Increase wait time after resume".
> [...]
> #regzbot introduced: e8b908146d44

Kamil, It looks like you found a workaround and lack time to further
look into this. Nobody else seems to have run into this and it seems
this is a odd corner case. Hence I guess there is not much value in
tracking this regression further for now. Please speak up if you disagree.

#regzbot inconclusive: workaround found, debugging stuck
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.



^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2023-11-01 10:59 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-21 10:39 [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume" Kamil Paral
2023-08-21 13:12 ` Mika Westerberg
2023-08-22 16:43   ` Kamil Paral
2023-08-23  5:07     ` Mika Westerberg
2023-08-23  7:00       ` Kamil Paral
2023-08-23  7:44         ` Mika Westerberg
2023-08-23  7:56           ` Mika Westerberg
2023-08-23  8:20             ` Kamil Paral
2023-08-23  9:05               ` Mika Westerberg
2023-08-23 14:02                 ` Kamil Paral
2023-08-24 11:43                   ` Mika Westerberg
2023-08-25  8:42                     ` Kamil Paral
2023-08-25  9:46                       ` Mika Westerberg
2023-08-25 11:42                         ` Kamil Paral
2023-09-23 22:46                       ` Bjorn Helgaas
2023-09-24 13:27                         ` Mika Westerberg
2023-09-24 20:18                           ` Bjorn Helgaas
2023-09-25  4:59                             ` Mika Westerberg
2023-09-25 13:48                               ` Bjorn Helgaas
2023-09-25 14:19                                 ` Lukas Wunner
2023-09-26 17:55                                   ` Bjorn Helgaas
2023-09-27  5:16                                     ` Mika Westerberg
2023-09-27 11:57                                       ` Bjorn Helgaas
2023-09-27 12:47                                         ` Mika Westerberg
2023-09-27 14:31                                           ` Lukas Wunner
2023-09-27 14:42                                             ` Mika Westerberg
2023-09-27 15:36                                               ` Mika Westerberg
2023-09-27 16:50                                           ` Bjorn Helgaas
2023-09-27 17:01                                             ` Mika Westerberg
2023-09-27 17:24                                               ` Bjorn Helgaas
2023-09-27 18:02                                                 ` Mika Westerberg
2023-09-27 19:41                                                   ` Bjorn Helgaas
2023-09-28  4:42                                                     ` Mika Westerberg
2023-09-28 15:49                                                       ` Bjorn Helgaas
2023-10-05 13:01                                                         ` Kamil Paral
2023-10-05 19:00                                                           ` Bjorn Helgaas
     [not found]                                       ` <CA+cBOTds9k1Q2haC_gTpsUvjP02dHOv9vSconFEAu-Fsxwf36A@mail.gmail.com>
2023-09-27 13:53                                         ` Mika Westerberg
2023-09-27 14:12                                           ` Kamil Paral
2023-10-05 12:54                                             ` Kamil Paral
2023-10-05 13:09                                               ` Mika Westerberg
2023-09-27 14:08                                         ` Kamil Paral
2023-08-21 19:10 ` Bjorn Helgaas
2023-08-22 16:36   ` Kamil Paral
2023-11-01 10:59 ` Linux regression tracking (Thorsten Leemhuis)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.