linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] PCI: Fix up L1SS capability for Intel Apollolake PCIe bridge
@ 2022-12-15  9:13 Ron Lee
  2022-12-15 15:16 ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Ron Lee @ 2022-12-15  9:13 UTC (permalink / raw)
  To: bhelgaas; +Cc: linux-kernel, linux-pci, lmajczak, rajatja, Ron Lee

On Google Coral and Reef family chromebooks, the PCIe bridge lost its
L1 PM Substates capability after resumed from D3cold, and identify that
the pointer to the this capability and capapability header are missing
from the capability list.

....
Capabilities: [150 v0] Null
Capabilities: [200 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ ...
                  PortCommonModeRestoreTime=40us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
                   T_CommonMode=40us LTR1.2_Threshold=98304ns
        L1SubCtl2: T_PwrOn=60us
...

This patch fix up the header and the pointer to the L1SS capability
after resuming from D3Cold.

Signed-off-by: Ron Lee <ron.lee@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 drivers/pci/quirks.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 285acc4aaccc..fc959be17a9d 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5992,3 +5992,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2d, dpc_log_size);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2f, dpc_log_size);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a31, dpc_log_size);
 #endif
+
+#ifdef CONFIG_PCIEASPM
+static void chromeos_fixup_apl_bridge_l1ss_capability(struct pci_dev *pdev)
+{
+	if (!dmi_match(DMI_SYS_VENDOR, "Google") ||
+		(!dmi_match(DMI_PRODUCT_FAMILY, "Google_Coral") &&
+		 !dmi_match(DMI_PRODUCT_FAMILY, "Google_Reef")))
+		return;
+
+	pci_info(pdev, "Fix up L1SS Capability\n");
+	/* Fix up the L1SS Capability Header*/
+	pci_write_config_dword(pdev, pdev->l1ss, (0x220 << 20) | (1 << 16) | (PCI_EXT_CAP_ID_L1SS));
+	/* Fix up the pointer to L1SS Capability*/
+	pci_write_config_dword(pdev, 0x150, pdev->l1ss << 20);
+}
+DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_INTEL, 0x5ad6, chromeos_fixup_apl_bridge_l1ss_capability);
+#endif

base-commit: e2ca6ba6ba0152361aa4fcbf6067db71b2c7a770
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] PCI: Fix up L1SS capability for Intel Apollolake PCIe bridge
  2022-12-15  9:13 [PATCH v2] PCI: Fix up L1SS capability for Intel Apollolake PCIe bridge Ron Lee
@ 2022-12-15 15:16 ` Bjorn Helgaas
  2022-12-16 16:29   ` Lee, Ron
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2022-12-15 15:16 UTC (permalink / raw)
  To: Ron Lee; +Cc: bhelgaas, linux-kernel, linux-pci, lmajczak, rajatja, Ron Lee

On Thu, Dec 15, 2022 at 05:13:57PM +0800, Ron Lee wrote:
> On Google Coral and Reef family chromebooks, the PCIe bridge lost its
> L1 PM Substates capability after resumed from D3cold, and identify that
> the pointer to the this capability and capapability header are missing
> from the capability list.

s/chromebooks/Chromebooks/
s/to the this/this/
s/capapability/capability/

This should say what problem we're solving.  I assume some devices
used L1 PM Substates before suspend, but after resume they do not, so
the user-visible effect is that battery life is worse after resume.

> Capabilities: [150 v0] Null
> Capabilities: [200 v1] L1 PM Substates
>         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ ...
>                   PortCommonModeRestoreTime=40us PortTPowerOnTime=10us
>         L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
>                    T_CommonMode=40us LTR1.2_Threshold=98304ns
>         L1SubCtl2: T_PwrOn=60us

I'm not sure what this snippet is telling me.  Based on the patch, I
guess before suspend, lspci would show:

  Capabilities: [150 v0] Null
  Capabilities: [200 v1] L1 PM Substates
  Capabilities: [220] <some other valid capability?>

but after resume, you see only:

  Capabilities: [150 v0] Null

Right?

> This patch fix up the header and the pointer to the L1SS capability
> after resuming from D3Cold.

The main problem here is that this patch covers up an issue without
saying what the root cause is.  Presumably this is a firmware issue.
Has that been identified?  Has it been fixed for future firmware
releases?

s/D3Cold/D3cold/ to match above.

Is there a bug report for this issue?  Include the URL here.

Is there a bug report for the firmware?

> Signed-off-by: Ron Lee <ron.lee@intel.com>
> Reported-by: kernel test robot <lkp@intel.com>
> ---

Nits:

  - Use "Apollo Lake" to match Intel usage.

  - Below the "---" line, mention what changed between v1 and v2 (I
    see that you added the "#ifdef CONFIG_PCIEASPM", but you should
    save readers the effort of figuring that out).

  - For work-in-progress, the "Reported-by: kernel test robot" is
    pointless and I will remove it.  This quirk is not fixing a bug
    reported by the robot.

>  drivers/pci/quirks.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 285acc4aaccc..fc959be17a9d 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5992,3 +5992,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2d, dpc_log_size);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a2f, dpc_log_size);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x9a31, dpc_log_size);
>  #endif
> +
> +#ifdef CONFIG_PCIEASPM
> +static void chromeos_fixup_apl_bridge_l1ss_capability(struct pci_dev *pdev)
> +{
> +	if (!dmi_match(DMI_SYS_VENDOR, "Google") ||
> +		(!dmi_match(DMI_PRODUCT_FAMILY, "Google_Coral") &&
> +		 !dmi_match(DMI_PRODUCT_FAMILY, "Google_Reef")))
> +		return;
> +
> +	pci_info(pdev, "Fix up L1SS Capability\n");
> +	/* Fix up the L1SS Capability Header*/
> +	pci_write_config_dword(pdev, pdev->l1ss, (0x220 << 20) | (1 << 16) | (PCI_EXT_CAP_ID_L1SS));

This looks like it adds a link to another capability at offset 0x220.
What is that, and how do we know this is safe?

These registers are read-only per spec (PCIe r6.0, sec 7.8.3.1), but I
guess you have device-specific knowledge that they are writable?

> +	/* Fix up the pointer to L1SS Capability*/
> +	pci_write_config_dword(pdev, 0x150, pdev->l1ss << 20);
> +}
> +DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_INTEL, 0x5ad6, chromeos_fixup_apl_bridge_l1ss_capability);
> +#endif
> 
> base-commit: e2ca6ba6ba0152361aa4fcbf6067db71b2c7a770
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH v2] PCI: Fix up L1SS capability for Intel Apollolake PCIe bridge
  2022-12-15 15:16 ` Bjorn Helgaas
@ 2022-12-16 16:29   ` Lee, Ron
  2022-12-16 17:48     ` Bjorn Helgaas
  2022-12-20  9:11     ` Lee, Ron
  0 siblings, 2 replies; 7+ messages in thread
From: Lee, Ron @ 2022-12-16 16:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, linux-kernel, linux-pci, lmajczak, Jain, Rajat, Ron Lee

> On Thu, Dec 15, 2022 at 05:13:57PM +0800, Ron Lee wrote:
> > On Google Coral and Reef family chromebooks, the PCIe bridge lost its
> > L1 PM Substates capability after resumed from D3cold, and identify
> > that the pointer to the this capability and capapability header are
> > missing from the capability list.
> 
> s/chromebooks/Chromebooks/
> s/to the this/this/
> s/capapability/capability/
I will submit patch v3 to correct these errors.
> 
> This should say what problem we're solving.  I assume some devices used L1 PM
> Substates before suspend, but after resume they do not, so the user-visible
> effect is that battery life is worse after resume.
This bug has existed since these series of Chromebooks was shipping, it seems no harm for system execution, 
and we didn't identified battery life drop after resume. However we still expect this issue could be solved and 
follow spec criteria as per PCIe spec rev6.0, section 5.5.4 L1 PM Substates Configuration

    An L1 PM Substate enable bit must only be Set in the Upstream and Downstream Ports on a Link when the
    corresponding supported capability bit is Set by both the Upstream and Downstream Ports on that Link, otherwise the
    behavior is undefined

> 
> > Capabilities: [150 v0] Null
> > Capabilities: [200 v1] L1 PM Substates
> >         L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ ...
> >                   PortCommonModeRestoreTime=40us PortTPowerOnTime=10us
> >         L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> >                    T_CommonMode=40us LTR1.2_Threshold=98304ns
> >         L1SubCtl2: T_PwrOn=60us
> 
> I'm not sure what this snippet is telling me.  Based on the patch, I guess before
> suspend, lspci would show:
> 
>   Capabilities: [150 v0] Null
>   Capabilities: [200 v1] L1 PM Substates
>   Capabilities: [220] <some other valid capability?>
> 
> but after resume, you see only:
> 
>   Capabilities: [150 v0] Null
> 
> Right?
Yes, but there is a difference in this case.
Before suspend:
        ....
        Capabilities: [140 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [150 v0] Null
        Capabilities: [200 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=40us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
                           T_CommonMode=40us LTR1.2_Threshold=98304ns
                L1SubCtl2: T_PwrOn=60us
        Kernel driver in use: pcieport

After resume:
        ....
        Capabilities: [140 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: pcieport

The following merged commit can save/restore the L1SubCap/L1SubCtl1/L1SubCtl2 registers for this bridge, 
However this bridge not only lost its capability contents but also lost the link to this capability.

    PCI/ASPM: Save/restore L1SS Capability for suspend/resume
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/pci/pcie/aspm.c?h=v6.1&id=4257f7e008ea394fcecc050f1569c3503b8bcc15

> 
> > This patch fix up the header and the pointer to the L1SS capability
> > after resuming from D3Cold.
> 
> The main problem here is that this patch covers up an issue without saying what
> the root cause is.  Presumably this is a firmware issue.
> Has that been identified?  Has it been fixed for future firmware releases?
This issue could be and should be fixed by BIOS, however the manufacturers have no resource for firmware validation and it's risky for firmware update per their assessment.
> 
> s/D3Cold/D3cold/ to match above.
> 
> Is there a bug report for this issue?  Include the URL here.
> 
> Is there a bug report for the firmware?
> 
There is a Google's internal issue tracker for this bug, seems not available for public.
Actually this bug had a discussion on this thread, and Lukasz Majczak identified this issue on Apollo Lake platform.
https://patchwork.kernel.org/project/linux-pci/patch/20220705060014.10050-1-vidyas@nvidia.com/

> > Signed-off-by: Ron Lee <ron.lee@intel.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> > ---
> 
> Nits:
> 
>   - Use "Apollo Lake" to match Intel usage.
> 
>   - Below the "---" line, mention what changed between v1 and v2 (I
>     see that you added the "#ifdef CONFIG_PCIEASPM", but you should
>     save readers the effort of figuring that out).
> 
>   - For work-in-progress, the "Reported-by: kernel test robot" is
>     pointless and I will remove it.  This quirk is not fixing a bug
>     reported by the robot.
> 
> >  drivers/pci/quirks.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index
> > 285acc4aaccc..fc959be17a9d 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -5992,3 +5992,20 @@
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL,
> > 0x9a2d, dpc_log_size);  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL,
> > 0x9a2f, dpc_log_size);  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL,
> > 0x9a31, dpc_log_size);  #endif
> > +
> > +#ifdef CONFIG_PCIEASPM
> > +static void chromeos_fixup_apl_bridge_l1ss_capability(struct pci_dev
> > +*pdev) {
> > +	if (!dmi_match(DMI_SYS_VENDOR, "Google") ||
> > +		(!dmi_match(DMI_PRODUCT_FAMILY, "Google_Coral") &&
> > +		 !dmi_match(DMI_PRODUCT_FAMILY, "Google_Reef")))
> > +		return;
> > +
> > +	pci_info(pdev, "Fix up L1SS Capability\n");
> > +	/* Fix up the L1SS Capability Header*/
> > +	pci_write_config_dword(pdev, pdev->l1ss, (0x220 << 20) | (1 << 16) |
> > +(PCI_EXT_CAP_ID_L1SS));
> 
> This looks like it adds a link to another capability at offset 0x220.
> What is that, and how do we know this is safe?
The following is the dump of this bridge config before suspend, the L1SS capability is at offset 0x200 and 
it point to offset 0x220 which is a null capability. This patch just add it to keep consistent during suspend/resume.
...
150: 00 00 00 20 00 04 00 00 00 00 00 00 00 00 00 00
160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
200: 1e 00 01 22 1f 28 28 00 0f 28 03 60 f0 00 00 00
210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
220: 00 00 00 00 00 00 00 00 00 00 00 00 7f 7f 7f 7f...

> 
> These registers are read-only per spec (PCIe r6.0, sec 7.8.3.1), but I guess you
> have device-specific knowledge that they are writable?
> 
> > +	/* Fix up the pointer to L1SS Capability*/
> > +	pci_write_config_dword(pdev, 0x150, pdev->l1ss << 20); }
> > +DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_INTEL, 0x5ad6,
> > +chromeos_fixup_apl_bridge_l1ss_capability);
> > +#endif
> >
> > base-commit: e2ca6ba6ba0152361aa4fcbf6067db71b2c7a770
> > --
> > 2.17.1
> >

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] PCI: Fix up L1SS capability for Intel Apollolake PCIe bridge
  2022-12-16 16:29   ` Lee, Ron
@ 2022-12-16 17:48     ` Bjorn Helgaas
  2022-12-20  9:11     ` Lee, Ron
  1 sibling, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2022-12-16 17:48 UTC (permalink / raw)
  To: Lee, Ron
  Cc: bhelgaas, linux-kernel, linux-pci, lmajczak, Jain, Rajat, Ron Lee

On Fri, Dec 16, 2022 at 04:29:39PM +0000, Lee, Ron wrote:
> > On Thu, Dec 15, 2022 at 05:13:57PM +0800, Ron Lee wrote:
> > > On Google Coral and Reef family chromebooks, the PCIe bridge lost its
> > > L1 PM Substates capability after resumed from D3cold, and identify
> > > that the pointer to the this capability and capapability header are
> > > missing from the capability list.

> > This should say what problem we're solving.  I assume some devices
> > used L1 PM Substates before suspend, but after resume they do not,
> > so the user-visible effect is that battery life is worse after
> > resume.
>
> This bug has existed since these series of Chromebooks was shipping,
> it seems no harm for system execution, and we didn't identified
> battery life drop after resume. However we still expect this issue
> could be solved and follow spec criteria as per PCIe spec rev6.0,
> section 5.5.4 L1 PM Substates Configuration
> 
>     An L1 PM Substate enable bit must only be Set in the Upstream
>     and Downstream Ports on a Link when the corresponding supported
>     capability bit is Set by both the Upstream and Downstream Ports
>     on that Link, otherwise the behavior is undefined

Even if you haven't seen a battery life issue, I suspect you might be
able to measure a power consumption difference if you looked for it
and likely could see issues with manual ASPM enable/disable using
sysfs.  That might be a legitimate reason for this quirk, and if it
is, we should mention it here.

> The following merged commit can save/restore the
> L1SubCap/L1SubCtl1/L1SubCtl2 registers for this bridge, However this
> bridge not only lost its capability contents but also lost the link
> to this capability.
> 
>     PCI/ASPM: Save/restore L1SS Capability for suspend/resume
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/pci/pcie/aspm.c?h=v6.1&id=4257f7e008ea394fcecc050f1569c3503b8bcc15

The current version of that code:
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/aspm.c?id=v6.1#n760
doesn't search for the L1SS capability; it uses dev->l1ss just like
your patch does.  So it should restore the capability even though the
linked list is broken.

> > > This patch fix up the header and the pointer to the L1SS
> > > capability after resuming from D3Cold.
> > 
> > The main problem here is that this patch covers up an issue
> > without saying what the root cause is.  Presumably this is a
> > firmware issue.  Has that been identified?  Has it been fixed for
> > future firmware releases?
>
> This issue could be and should be fixed by BIOS, however the
> manufacturers have no resource for firmware validation and it's
> risky for firmware update per their assessment.

This fix is risky, too, because it writes to random places in config
space and there's no guarantee that this is safe or even that the
capabilities are at those locations.

> > Is there a bug report for this issue?  Include the URL here.
> > 
> > Is there a bug report for the firmware?
> > 
> There is a Google's internal issue tracker for this bug, seems not
> available for public.

Maybe you can make a public report with any secret details removed?
A simple email would be enough.  I haven't seen the internal issue;
hopefully it has more details than are in this patch.

> Actually this bug had a discussion on this thread, and Lukasz
> Majczak identified this issue on Apollo Lake platform.
> https://patchwork.kernel.org/project/linux-pci/patch/20220705060014.10050-1-vidyas@nvidia.com/

That patch mentions Dell XPS 13, not a Chromebook, so your patch
wouldn't affect it.  Are you saying this issue is common across all
Apollo Lake platforms?  If so, maybe a fix should be more generic?

> > > +#ifdef CONFIG_PCIEASPM
> > > +static void chromeos_fixup_apl_bridge_l1ss_capability(struct pci_dev
> > > +*pdev) {
> > > +	if (!dmi_match(DMI_SYS_VENDOR, "Google") ||
> > > +		(!dmi_match(DMI_PRODUCT_FAMILY, "Google_Coral") &&
> > > +		 !dmi_match(DMI_PRODUCT_FAMILY, "Google_Reef")))
> > > +		return;
> > > +
> > > +	pci_info(pdev, "Fix up L1SS Capability\n");
> > > +	/* Fix up the L1SS Capability Header*/
> > > +	pci_write_config_dword(pdev, pdev->l1ss, (0x220 << 20) | (1 << 16) |
> > > +(PCI_EXT_CAP_ID_L1SS));
> > 
> > This looks like it adds a link to another capability at offset
> > 0x220.  What is that, and how do we know this is safe?
>
> The following is the dump of this bridge config before suspend, the
> L1SS capability is at offset 0x200 and it point to offset 0x220
> which is a null capability. This patch just add it to keep
> consistent during suspend/resume.
> ...

My point is that there is no PCI requirement for capabilities to be at
0x150 and 0x220.  The only defined way to find these capabilities is
to traverse the list starting at offset 0x100.

The L1SS capability at pdev->l1ss is reasonable since we found it by
traversing that list, but 0x150 and 0x220 are magic numbers with no
justification.  We have no reason to believe there are capabilities
there.

We might know this based on device-specific knowledge of the Root
Port.  If that's the case, please cite the Intel spec for device 5ad6
so we can tell this quirk can't be blindly applied to other Root
Ports.

> > > +	/* Fix up the pointer to L1SS Capability*/
> > > +	pci_write_config_dword(pdev, 0x150, pdev->l1ss << 20); }
> > > +DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_INTEL, 0x5ad6,
> > > +chromeos_fixup_apl_bridge_l1ss_capability);
> > > +#endif

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH v2] PCI: Fix up L1SS capability for Intel Apollolake PCIe bridge
  2022-12-16 16:29   ` Lee, Ron
  2022-12-16 17:48     ` Bjorn Helgaas
@ 2022-12-20  9:11     ` Lee, Ron
  2022-12-29 19:07       ` Bjorn Helgaas
  1 sibling, 1 reply; 7+ messages in thread
From: Lee, Ron @ 2022-12-20  9:11 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, linux-kernel, linux-pci, lmajczak, Jain, Rajat, Ron Lee

> On Fri, Dec 16, 2022 at 04:29:39PM +0000, Lee, Ron wrote:
> > > On Thu, Dec 15, 2022 at 05:13:57PM +0800, Ron Lee wrote:
> Even if you haven't seen a battery life issue, I suspect you might be 
> able to measure a power consumption difference if you looked for it 
> and likely could see issues with manual ASPM enable/disable using 
> sysfs.  That might be a legitimate reason for this quirk, and if it is, we should mention it here.
We can arrange the power measurement, but I doubt this quirk has correlation to power consumption. 
My point is that the ASPM behavior is not changed with or without this quirk.
> 
> > The following merged commit can save/restore the
> > L1SubCap/L1SubCtl1/L1SubCtl2 registers for this bridge, However this 
> > bridge not only lost its capability contents but also lost the link 
> > to this capability.
> >
> >     PCI/ASPM: Save/restore L1SS Capability for suspend/resume
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/c
> > om
> >
> mit/drivers/pci/pcie/aspm.c?h=v6.1&id=4257f7e008ea394fcecc050f1569c350
> > 3b8bcc15
> 
> The current version of that code:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> e/driver
> s/pci/pcie/aspm.c?id=v6.1#n760
> doesn't search for the L1SS capability; it uses dev->l1ss just like 
> your patch does.  So it should restore the capability even though the linked list is broken.

Vidya's patch definitely can restore the L1SS capability, the dev->l1ss is still valid and
L1SubCtl1/L1SubCtl2 are restored after resume. With Vidya's V4 patch I think although 
the L1SS capability is missing after resume , it didn't change the ASPM behavior, 
the ASPM function may still remain normal and same as before suspend.  

But Vidya's patch didn't restore L1SS capability header and the pointer to L1SS capability, 
this is the purpose of this quirk, it just reflect the bridge's current L1SS status authentically.
i.e. the user and developer can see the L1 PM substates through lspci command.

For reference, the following is PCI config dump of this bridge between suspend/resume.

Before Suspend:
...
150: 00 00 00 20 00 04 00 00 00 00 00 00 00 00 00 00    <-- point to L1SS capability at offset 0x200
160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
200: 1e 00 01 22 1f 28 28 00 0f 28 03 60 f0 00 00 00      <-- here is L1SS capability
210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
220: 00 00 00 00 00 00 00 00 00 00 00 00 7f 7f 7f 7f

After Resume:
...
150: 00 00 00 00 00 04 00 00 00 00 00 00 00 00 00 00       <-- The pointer to L1SS capability was cleared
160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
200: 00 00 00 00 1f 28 28 00 0f 28 03 60 f0 00 00 00       <-- L1SS capability header was cleared too

> 
> >
> > This issue could be and should be fixed by BIOS, however the 
> > manufacturers have no resource for firmware validation and it's 
> > risky for firmware update per their assessment.
> 
> This fix is risky, too, because it writes to random places in config 
> space and there's no guarantee that this is safe or even that the 
> capabilities are at those locations.
Agree.

> 
> > > Is there a bug report for this issue?  Include the URL here.
> > >
> > > Is there a bug report for the firmware?
> > >
> > There is a Google's internal issue tracker for this bug, seems not 
> > available for public.
> 
> Maybe you can make a public report with any secret details removed?
> A simple email would be enough.  I haven't seen the internal issue; 
> hopefully it has more details than are in this patch.
Sorry, I forget Lukasz ever filed a bug report for this, please see 
https://lore.kernel.org/linux-pci/CAFJ_xbq0cxcH-cgpXLU4Mjk30+muWyWm1aUZGK7iG53yaLBaQg@mail.gmail.com/T/#u

> 
> > Actually this bug had a discussion on this thread, and Lukasz 
> > Majczak identified this issue on Apollo Lake platform.
> > https://patchwork.kernel.org/project/linux-pci/patch/20220705060014.
> > 10
> > 050-1-vidyas@nvidia.com/
> 
> That patch mentions Dell XPS 13, not a Chromebook, so your patch 
> wouldn't affect it.  Are you saying this issue is common across all Apollo Lake platforms?
> If so, maybe a fix should be more generic?

The history is
1. Apollo Lake's bridge DSP is connected to a WiFi card, the WiFi card crash while system resume due to Vidya's save/restore L1SS patch.
2. Vidya submitted V4 save/restore L1SS patch, and it work fine now.
3. However Lukasz found the bridge's L1SS capability still disappear even applying Vidya's V1-V4 patch

Allow me to excerpt one of their discussion,
    Hi Vidya,

    The results from my previous mail are for V3 of your patch;
    Amberlake - works fine
    Apollolake - still the same issue, but here it is not related to your
    changes (we are still working on this).

    Best regards,
    Lukasz

> 
> My point is that there is no PCI requirement for capabilities to be at
> 0x150 and 0x220.  The only defined way to find these capabilities is 
> to traverse the list starting at offset 0x100.
> 
> The L1SS capability at pdev->l1ss is reasonable since we found it by 
> traversing that list, but 0x150 and 0x220 are magic numbers with no 
> justification.  We have no reason to believe there are capabilities there.
> 
I agree that, I ever try to recover the link by traversing list, but it didn't work and the capability list have no method to do reverse traversal.
One approach may save the whole capability list before suspend, and check each capability link then restore the missing one after resume.
Do you think it's practical ? It is appreciated if you could recommend a practical solution for this issue.

> We might know this based on device-specific knowledge of the Root 
> Port.  If that's the case, please cite the Intel spec for device 5ad6 
> so we can tell this quirk can't be blindly applied to other Root Ports.
> 
> > > > +	/* Fix up the pointer to L1SS Capability*/
> > > > +	pci_write_config_dword(pdev, 0x150, pdev->l1ss << 20); } 
> > > > +DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_INTEL, 0x5ad6, 
> > > > +chromeos_fixup_apl_bridge_l1ss_capability);
> > > > +#endif


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] PCI: Fix up L1SS capability for Intel Apollolake PCIe bridge
  2022-12-20  9:11     ` Lee, Ron
@ 2022-12-29 19:07       ` Bjorn Helgaas
  2022-12-30 12:13         ` Lee, Ron
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2022-12-29 19:07 UTC (permalink / raw)
  To: Lee, Ron
  Cc: bhelgaas, linux-kernel, linux-pci, lmajczak, Jain, Rajat, Ron Lee

On Tue, Dec 20, 2022 at 09:11:31AM +0000, Lee, Ron wrote:
> > On Fri, Dec 16, 2022 at 04:29:39PM +0000, Lee, Ron wrote:
> > > > On Thu, Dec 15, 2022 at 05:13:57PM +0800, Ron Lee wrote:
> > Even if you haven't seen a battery life issue, I suspect you might be 
> > able to measure a power consumption difference if you looked for it 
> > and likely could see issues with manual ASPM enable/disable using 
> > sysfs.  That might be a legitimate reason for this quirk, and if
> > it is, we should mention it here.
>
> We can arrange the power measurement, but I doubt this quirk has
> correlation to power consumption.  My point is that the ASPM
> behavior is not changed with or without this quirk.

Makes sense.

> ...
> I agree that, I ever try to recover the link by traversing list, but
> it didn't work and the capability list have no method to do reverse
> traversal.  One approach may save the whole capability list before
> suspend, and check each capability link then restore the missing one
> after resume.  Do you think it's practical ? It is appreciated if
> you could recommend a practical solution for this issue.

The issue being "lspci doesn't show L1SS after suspend/resume"?

Is the point of this basically to fix lspci output after
suspend/resume?  Or is there something else this fixes?

It sounds like ASPM and L1SS works correctly after suspend/resume
even without this patch?

Bjorn

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH v2] PCI: Fix up L1SS capability for Intel Apollolake PCIe bridge
  2022-12-29 19:07       ` Bjorn Helgaas
@ 2022-12-30 12:13         ` Lee, Ron
  0 siblings, 0 replies; 7+ messages in thread
From: Lee, Ron @ 2022-12-30 12:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, linux-kernel, linux-pci, lmajczak, Jain, Rajat, Ron Lee

> The issue being "lspci doesn't show L1SS after suspend/resume"?
> 
> Is the point of this basically to fix lspci output after suspend/resume?  Or is
> there something else this fixes?
> 
> It sounds like ASPM and L1SS works correctly after suspend/resume even
> without this patch?
> 
> Bjorn

Yes, that's my point. 
This quirk is going to recover the display of L1SS in lspci output, but not recover the L1SS contents and change its ASPM behavior.
If the commit title here make misunderstanding, I can change it per your suggestion.

Besides, with power measurement by power meter, I didn't see significant power consumption and battery life difference w/wo this patch. 
With suspend/resume stress testing, the system can pass the testing w/wo this patch and the its child iwlwifi device work fine in both situation. 
So I believe this quirk didn't change the ASPM behavior.

I have a generic implementation for this quirk.
It save the L1SS header and the link offset to L1SS at system booting, and restore them every time after resume.
It also check the L1SS header and its link, only recover them if they are missing and do nothing in case system has BIOS fix.
Do you think it's practical ?

```
#ifdef CONFIG_PCIEASPM
static u16 pos_to_l1ss;
static u32 l1ss_header;
static void chromeos_save_pci_l1ss_capability(struct pci_dev *pdev)
{
	u32 header;
	int pos = PCI_CFG_SPACE_SIZE;

	while (pos) {
		pci_read_config_dword(pdev, pos, &header);
		if (PCI_EXT_CAP_NEXT(header) == pdev->l1ss)
			pos_to_l1ss = pos;
		else if (PCI_EXT_CAP_ID(header) == PCI_EXT_CAP_ID_L1SS)
			l1ss_header = header;

		pos = PCI_EXT_CAP_NEXT(header);
	}
}

static void chromeos_fixup_apl_pci_l1ss_capability(struct pci_dev *pdev)
{
	u32 header;

	if (!pos_to_l1ss || !l1ss_header)
		return;

	pci_info(pdev, "Fixup L1SS Capability\n");
	/* Fixup the header of L1SS Capability if missing */
	pci_read_config_dword(pdev, pdev->l1ss, &header);
	if (PCI_EXT_CAP_ID(header) != PCI_EXT_CAP_ID_L1SS)
		pci_write_config_dword(pdev, pdev->l1ss, l1ss_header);

	/* Fixup the link to L1SS Capability if missing*/
	pci_read_config_dword(pdev, pos_to_l1ss, &header);
	if (PCI_EXT_CAP_NEXT(header) != pdev->l1ss)
		pci_write_config_dword(pdev, pos_to_l1ss, pdev->l1ss << 20);
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x5ad6, chromeos_save_pci_l1ss_capability);
DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_INTEL, 0x5ad6, chromeos_fixup_apl_pci_l1ss_capability);
#endif
```

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-12-30 12:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-15  9:13 [PATCH v2] PCI: Fix up L1SS capability for Intel Apollolake PCIe bridge Ron Lee
2022-12-15 15:16 ` Bjorn Helgaas
2022-12-16 16:29   ` Lee, Ron
2022-12-16 17:48     ` Bjorn Helgaas
2022-12-20  9:11     ` Lee, Ron
2022-12-29 19:07       ` Bjorn Helgaas
2022-12-30 12:13         ` Lee, Ron

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).