linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges
@ 2021-03-26 12:43 Pali Rohár
  2021-03-26 18:13 ` Toke Høiland-Jørgensen
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Pali Rohár @ 2021-03-26 12:43 UTC (permalink / raw)
  To: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún
  Cc: vtolkm, Rob Herring, Ilias Apalodimas, Thomas Petazzoni,
	Jason Cooper, linux-pci, ath10k, linux-kernel

Atheros QCA9880 and QCA9890 chips do not behave after a bus reset and also
after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
The device will throw a Link Down error and config space is not accessible
again. Retrain link can be called only when using GEN1 PCIe bridge or when
PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.

This issue was reproduced with more Compex WLE900VX cards (QCA9880 based)
on Armada 385 with pci-mvebu.c driver and also on Armada 3720 with
pci-aardvark.c driver. Also this issue was reproduced with some "noname"
card with QCA9890 WiFi chip on Armada 3720. All problematic cards with
these QCA chips have PCI device id 0x003c.

Tests showed that other WiFi cards based on AR93xx (PCI device id 0x0030)
and AR9287 (PCI device id 0x002e) chips do not have these problems.

To workaround this issue, this change introduces a new PCI quirk called
PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 for PCI device id 0x003c.

When this quirk is set then kernel disallows triggering PCI_EXP_LNKCTL_RL
bit in config space of PCIe Bridge in case PCIe Bridge is capable of higher
speed than 2.5 GT/s and higher speed is already allowed. When PCIe Bridge
has accessible LNKCTL2 register then kernel tries to force target link
speed via PCI_EXP_LNKCTL2_TLS* bits to 2.5 GT/s. After this change it is
possible to trigger PCI_EXP_LNKCTL_RL bit without causing issues on
problematic Atheros QCA98xx cards.

Currently only PCIe ASPM kernel code triggers this PCI_EXP_LNKCTL_RL bit,
so quirk check is added only into pcie/aspm.c file.

Signed-off-by: Pali Rohár <pali@kernel.org>
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Marek Behún <kabel@kernel.org>
Link: https://lore.kernel.org/linux-pci/87h7l8axqp.fsf@toke.dk/
Cc: stable@vger.kernel.org # c80851f6ce63a ("PCI: Add PCI_EXP_LNKCTL2_TLS* macros")
---
 drivers/pci/pcie/aspm.c | 43 +++++++++++++++++++++++++++++++++++++++++
 drivers/pci/quirks.c    | 15 +++++++++++++-
 include/linux/pci.h     |  2 ++
 3 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index ac0557a305af..ea5bdf6107f6 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -192,11 +192,54 @@ static void pcie_clkpm_cap_init(struct pcie_link_state *link, int blacklist)
 	link->clkpm_disable = blacklist ? 1 : 0;
 }
 
+static int pcie_change_tls_to_gen1(struct pci_dev *parent)
+{
+	u16 reg16;
+	u32 reg32;
+	int ret;
+
+	/* Check if link speed can be forced to 2.5 GT/s */
+	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
+	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
+		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* Force link speed to 2.5 GT/s */
+	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
+						 PCI_EXP_LNKCTL2_TLS,
+						 PCI_EXP_LNKCTL2_TLS_2_5GT);
+	if (ret == 0) {
+		/* Verify that new value was really set */
+		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
+		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
+			ret = -EINVAL;
+	}
+
+	if (ret != 0)
+		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
+
+	return ret;
+}
+
 static bool pcie_retrain_link(struct pcie_link_state *link)
 {
 	struct pci_dev *parent = link->pdev;
 	unsigned long end_jiffies;
 	u16 reg16;
+	u32 reg32;
+
+	if (link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) {
+		/* Check if link is capable of higher speed than 2.5 GT/s and needs quirk */
+		pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
+		if ((reg32 & PCI_EXP_LNKCAP_SLS) > PCI_EXP_LNKCAP_SLS_2_5GB) {
+			if (pcie_change_tls_to_gen1(parent) != 0) {
+				pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
+				return false;
+			}
+			pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
+		}
+	}
 
 	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
 	reg16 |= PCI_EXP_LNKCTL_RL;
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 653660e3ba9e..8ff690c7679d 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3558,6 +3558,11 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
 	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
 }
 
+static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
+{
+	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET | PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
+}
+
 /*
  * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
  * The device will throw a Link Down error on AER-capable systems and
@@ -3567,10 +3572,18 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
  */
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
 
+/*
+ * Atheros QCA9880 and QCA9890 chips do not behave after a bus reset and also
+ * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
+ * The device will throw a Link Down error and config space is not accessible
+ * again. Retrain link can be called only when using GEN1 PCIe bridge or when
+ * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
+ */
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset_and_no_retrain_link);
+
 /*
  * Root port on some Cavium CN8xxx chips do not successfully complete a bus
  * reset when used with certain child devices.  After the reset, config
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 86c799c97b77..fdbf7254e4ab 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -227,6 +227,8 @@ enum pci_dev_flags {
 	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
 	/* Don't use Relaxed Ordering for TLPs directed at this device */
 	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
+	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
+	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),
 };
 
 enum pci_irq_reroute_variant {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges
  2021-03-26 12:43 [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges Pali Rohár
@ 2021-03-26 18:13 ` Toke Høiland-Jørgensen
  2021-03-27  0:14 ` Krzysztof Wilczyński
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-03-26 18:13 UTC (permalink / raw)
  To: Pali Rohár, Bjorn Helgaas, Kalle Valo, Marek Behún
  Cc: vtolkm, Rob Herring, Ilias Apalodimas, Thomas Petazzoni,
	Jason Cooper, linux-pci, ath10k, linux-kernel

Pali Rohár <pali@kernel.org> writes:

> Atheros QCA9880 and QCA9890 chips do not behave after a bus reset and also
> after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
> The device will throw a Link Down error and config space is not accessible
> again. Retrain link can be called only when using GEN1 PCIe bridge or when
> PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
>
> This issue was reproduced with more Compex WLE900VX cards (QCA9880 based)
> on Armada 385 with pci-mvebu.c driver and also on Armada 3720 with
> pci-aardvark.c driver. Also this issue was reproduced with some "noname"
> card with QCA9890 WiFi chip on Armada 3720. All problematic cards with
> these QCA chips have PCI device id 0x003c.
>
> Tests showed that other WiFi cards based on AR93xx (PCI device id 0x0030)
> and AR9287 (PCI device id 0x002e) chips do not have these problems.
>
> To workaround this issue, this change introduces a new PCI quirk called
> PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 for PCI device id 0x003c.
>
> When this quirk is set then kernel disallows triggering PCI_EXP_LNKCTL_RL
> bit in config space of PCIe Bridge in case PCIe Bridge is capable of higher
> speed than 2.5 GT/s and higher speed is already allowed. When PCIe Bridge
> has accessible LNKCTL2 register then kernel tries to force target link
> speed via PCI_EXP_LNKCTL2_TLS* bits to 2.5 GT/s. After this change it is
> possible to trigger PCI_EXP_LNKCTL_RL bit without causing issues on
> problematic Atheros QCA98xx cards.
>
> Currently only PCIe ASPM kernel code triggers this PCI_EXP_LNKCTL_RL bit,
> so quirk check is added only into pcie/aspm.c file.
>
> Signed-off-by: Pali Rohár <pali@kernel.org>
> Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Tested-by: Marek Behún <kabel@kernel.org>
> Link: https://lore.kernel.org/linux-pci/87h7l8axqp.fsf@toke.dk/
> Cc: stable@vger.kernel.org # c80851f6ce63a ("PCI: Add
> PCI_EXP_LNKCTL2_TLS* macros")

Thanks!

Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges
  2021-03-26 12:43 [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges Pali Rohár
  2021-03-26 18:13 ` Toke Høiland-Jørgensen
@ 2021-03-27  0:14 ` Krzysztof Wilczyński
  2021-03-27 13:29   ` Pali Rohár
  2021-04-27 10:55 ` [PATCH v2] PCI: Disallow retraining link for Atheros " Pali Rohár
  2021-05-05 16:33 ` [PATCH v3] " Pali Rohár
  3 siblings, 1 reply; 24+ messages in thread
From: Krzysztof Wilczyński @ 2021-03-27  0:14 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, vtolkm, Rob Herring, Ilias Apalodimas,
	Thomas Petazzoni, Jason Cooper, linux-pci, ath10k, linux-kernel

Hi Pali,

Thank you for sending the patch over!

[...]
> +static int pcie_change_tls_to_gen1(struct pci_dev *parent)

Just a nitpick, so feel free to ignore it.  I would just call the
variable "dev" as we pass a pointer to a particular device, but it does
not matter as much, so I am leaving this to you.

[...]
> +	if (ret == 0) {

You prefer this style over "if (!ret)"?  Just asking in the view of the
style that seem to be preferred in the code base at the moment.

> +		/* Verify that new value was really set */
> +		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
> +		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
> +			ret = -EINVAL;

I am wondering about this verification - did you have a case where the
device would not properly set its capability, or accept the write and do
nothing?

> +	if (ret != 0)

I think "if (ret)" would be fine to use here, unless you prefer being
more explicit.  See my question about style above.

>  static bool pcie_retrain_link(struct pcie_link_state *link)
>  {
>  	struct pci_dev *parent = link->pdev;
>  	unsigned long end_jiffies;
>  	u16 reg16;
> +	u32 reg32;
> +
> +		/* Check if link is capable of higher speed than 2.5 GT/s and needs quirk */
> +		pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
> +		if ((reg32 & PCI_EXP_LNKCAP_SLS) > PCI_EXP_LNKCAP_SLS_2_5GB) {

I wonder if moving this check to pcie_change_tls_to_gen1() would make
more sense?  It would then make this function a little cleaner.  What do
you think?

[...]
> +static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
> +{
> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET | PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
> +}
[...]

I know that the style has been changed to allow 100 characters width and
that checkpatch.pl now also does not warn about line length, as per
commit bdc48fa11e46 ("checkpatch/coding-style: deprecate 80-column
warning"), but I think Bjorn still prefers 80 characters, thus this line
above might have to be aligned.

Krzysztof

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges
  2021-03-27  0:14 ` Krzysztof Wilczyński
@ 2021-03-27 13:29   ` Pali Rohár
  2021-03-27 14:42     ` Marek Behún
  0 siblings, 1 reply; 24+ messages in thread
From: Pali Rohár @ 2021-03-27 13:29 UTC (permalink / raw)
  To: Krzysztof Wilczyński
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, vtolkm, Rob Herring, Ilias Apalodimas,
	Thomas Petazzoni, Jason Cooper, linux-pci, ath10k, linux-kernel

Hello!

On Saturday 27 March 2021 01:14:10 Krzysztof Wilczyński wrote:
> Hi Pali,
> 
> Thank you for sending the patch over!
> 
> [...]
> > +static int pcie_change_tls_to_gen1(struct pci_dev *parent)
> 
> Just a nitpick, so feel free to ignore it.  I would just call the
> variable "dev" as we pass a pointer to a particular device, but it does
> not matter as much, so I am leaving this to you.

I called it 'parent' because it is called 'parent' also in caller
function. Link consists of two devices, so 'dev' could be ambiguous.

> [...]
> > +	if (ret == 0) {
> 
> You prefer this style over "if (!ret)"?  Just asking in the view of the
> style that seem to be preferred in the code base at the moment.

I can change this to 'if (!ret)' if needed, no problem.

I use 'if (!val)' mostly for boolean and pointer variables. If variable
can contain more integer values then I lot of times I use '=='.

> > +		/* Verify that new value was really set */
> > +		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
> > +		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
> > +			ret = -EINVAL;
> 
> I am wondering about this verification - did you have a case where the
> device would not properly set its capability, or accept the write and do
> nothing?

I do not know any real device which is doing this thing.

But this issue can happen with kernel's PCIe emulated root bridge:
drivers/pci/pci-bridge-emul.c

Drivers which are using this emulated root bridge (and both pci-mvebu.c
and pci-aardvark.c are using it!) do not have to implement callback for
every read and every write operation of every register.

Note that both pci-mvebu.c and pci-aardvark.c currently does not
implement access to HW register PCI_EXP_LNKCTL2. So currently it is not
possible to set PCI_EXP_LNKCTL2_TLS_2_5GT. And above code correctly
fails and disallow ASPM code to retrain link.

I have some WIP patches which implement LNKCAP2, LNKCTL2 and LNKSTA2
read/write callbacks on emulated bridge for pci-mvebu.c and
pci-aardvark.c. And I have tested that with those WIP patches ASPM code
can correctly switch link to 2.5GT/s and enable ASPM.

> > +	if (ret != 0)
> 
> I think "if (ret)" would be fine to use here, unless you prefer being
> more explicit.  See my question about style above.
> 
> >  static bool pcie_retrain_link(struct pcie_link_state *link)
> >  {
> >  	struct pci_dev *parent = link->pdev;
> >  	unsigned long end_jiffies;
> >  	u16 reg16;
> > +	u32 reg32;
> > +
> > +		/* Check if link is capable of higher speed than 2.5 GT/s and needs quirk */
> > +		pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
> > +		if ((reg32 & PCI_EXP_LNKCAP_SLS) > PCI_EXP_LNKCAP_SLS_2_5GB) {
> 
> I wonder if moving this check to pcie_change_tls_to_gen1() would make
> more sense?  It would then make this function a little cleaner.  What do
> you think?

Ok, no problem. But then function needs to be renamed. Any idea how
should be this function called?

> [...]
> > +static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
> > +{
> > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET | PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
> > +}
> [...]
> 
> I know that the style has been changed to allow 100 characters width and
> that checkpatch.pl now also does not warn about line length, as per
> commit bdc48fa11e46 ("checkpatch/coding-style: deprecate 80-column
> warning"), but I think Bjorn still prefers 80 characters, thus this line
> above might have to be aligned.

Ok! If needed I can reformat patch to 80 characters per line.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges
  2021-03-27 13:29   ` Pali Rohár
@ 2021-03-27 14:42     ` Marek Behún
  2021-03-27 17:33       ` Pali Rohár
  0 siblings, 1 reply; 24+ messages in thread
From: Marek Behún @ 2021-03-27 14:42 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Krzysztof Wilczyński, Bjorn Helgaas, Kalle Valo,
	Toke Høiland-Jørgensen, Marek Behún, vtolkm,
	Rob Herring, Ilias Apalodimas, Thomas Petazzoni, Jason Cooper,
	linux-pci, ath10k, linux-kernel

On Sat, 27 Mar 2021 14:29:59 +0100
Pali Rohár <pali@kernel.org> wrote:

> I can change this to 'if (!ret)' if needed, no problem.
> 
> I use 'if (!val)' mostly for boolean and pointer variables. If
> variable can contain more integer values then I lot of times I use
> '=='.

Comparing integer varibales with explicit literals is sensible, but
if a function returning integer returns 0 on success and negative value
on error, Linux kernel has a tradition of using just
  if (!ret)
or 
  if (ret)

Marek

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges
  2021-03-27 14:42     ` Marek Behún
@ 2021-03-27 17:33       ` Pali Rohár
  0 siblings, 0 replies; 24+ messages in thread
From: Pali Rohár @ 2021-03-27 17:33 UTC (permalink / raw)
  To: Marek Behún
  Cc: Krzysztof Wilczyński, Bjorn Helgaas, Kalle Valo,
	Toke Høiland-Jørgensen, Marek Behún, vtolkm,
	Rob Herring, Ilias Apalodimas, Thomas Petazzoni, linux-pci,
	ath10k, linux-kernel

On Saturday 27 March 2021 15:42:13 Marek Behún wrote:
> On Sat, 27 Mar 2021 14:29:59 +0100
> Pali Rohár <pali@kernel.org> wrote:
> 
> > I can change this to 'if (!ret)' if needed, no problem.
> > 
> > I use 'if (!val)' mostly for boolean and pointer variables. If
> > variable can contain more integer values then I lot of times I use
> > '=='.
> 
> Comparing integer varibales with explicit literals is sensible, but
> if a function returning integer returns 0 on success and negative value
> on error, Linux kernel has a tradition of using just
>   if (!ret)
> or 
>   if (ret)
> 
> Marek

Ok, I will change it.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-03-26 12:43 [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges Pali Rohár
  2021-03-26 18:13 ` Toke Høiland-Jørgensen
  2021-03-27  0:14 ` Krzysztof Wilczyński
@ 2021-04-27 10:55 ` Pali Rohár
  2021-04-30 11:41   ` Pali Rohár
  2021-05-05 16:33 ` [PATCH v3] " Pali Rohár
  3 siblings, 1 reply; 24+ messages in thread
From: Pali Rohár @ 2021-04-27 10:55 UTC (permalink / raw)
  To: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński
  Cc: vtolkm, Rob Herring, Ilias Apalodimas, Thomas Petazzoni,
	linux-pci, ath10k, linux-wireless, linux-kernel

Atheros AR9xxx and QCA98xx chips do not behave not only after a bus reset
but also after doing retrain link when PCIe bridge is not in GEN1 mode at
2.5 GT/s speed.

QCA9880 and QCA9890 chips throw a Link Down event and completely disappear
from the bus and their config space is not accessible again.

AR9390 chip throws a Link Down event followed by Link Up event, config
space is accessible again, but contains nonsense values. PCI device ID is
0xABCD which indicates HW bug that chip itself was not able to read values
from internal EEPROM/OTP.

AR9287 chip throws also Link Down and Link Up events, also has accessible
config space and moreover its config space contains correct values. But
ath9k driver cannot initialize card from this state as it is unable to
access HW registers. This also indicates that chip iself was not able to
read values from internal EEPROM/OTP.

This issue related to PCI device ID 0xABCD and reading internal EEPROM/OTP
was previously discussed at ath9k-devel mailing list in following thread:

https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html

After experiments we come up with workaround that Retrain link can be
called only when using GEN1 PCIe bridge or when PCIe bridge has forced link
speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.

This issue was reproduced with more cards: Compex WLE900VX (QCA9880 based /
device ID 0x003c), Compex WLE200NX (AR9287 based / device ID 0x002e),
"noname" card (QCA9890 based / device ID 0x003c) and Wistron NKR-DNXAH1
(AR9390 based / device ID 0x0030) on Armada 385 with pci-mvebu.c driver and
also on Armada 3720 with pci-aardvark.c driver.

To workaround this issue, this change introduces a new PCI quirk called
PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 which is enabled for all
Atheros chips with PCI_DEV_FLAGS_NO_BUS_RESET quirk, plus also for Atheros
chip AR9287.

When this quirk is set then kernel disallows triggering PCI_EXP_LNKCTL_RL
bit in config space of PCIe Bridge in case PCIe Bridge is capable of higher
speed than 2.5 GT/s and higher speed is already allowed. When PCIe Bridge
has accessible LNKCTL2 register then kernel tries to force target link
speed via PCI_EXP_LNKCTL2_TLS* bits to 2.5 GT/s. After this change it is
possible to trigger PCI_EXP_LNKCTL_RL bit without causing issues on
problematic Atheros cards.

Currently only PCIe ASPM kernel code triggers this PCI_EXP_LNKCTL_RL bit,
so quirk check is added only into pcie/aspm.c file.

Signed-off-by: Pali Rohár <pali@kernel.org>
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Marek Behún <kabel@kernel.org>
BugLink: https://lore.kernel.org/linux-pci/87h7l8axqp.fsf@toke.dk/
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=84821
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=192441
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209833
Cc: stable@vger.kernel.org # c80851f6ce63a ("PCI: Add PCI_EXP_LNKCTL2_TLS* macros")

---
Changes since v1:
* Move whole quirk code into pcie_downgrade_link_to_gen1() function
* Reformat to 80 chars per line where possible
* Add quirk also for cards with AR9287 chip (PCI ID 0x002e)
* Extend commit message description and add information about 0xABCD
---
 drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
 drivers/pci/quirks.c    | 37 ++++++++++++++++++++++++++--------
 include/linux/pci.h     |  2 ++
 3 files changed, 75 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index ac0557a305af..729b0389562b 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -192,12 +192,56 @@ static void pcie_clkpm_cap_init(struct pcie_link_state *link, int blacklist)
 	link->clkpm_disable = blacklist ? 1 : 0;
 }
 
+static int pcie_downgrade_link_to_gen1(struct pci_dev *parent)
+{
+	u16 reg16;
+	u32 reg32;
+	int ret;
+
+	/* Check if link is capable of higher speed than 2.5 GT/s */
+	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
+	if ((reg32 & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB)
+		return 0;
+
+	/* Check if link speed can be downgraded to 2.5 GT/s */
+	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
+	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
+		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* Force link speed to 2.5 GT/s */
+	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
+						 PCI_EXP_LNKCTL2_TLS,
+						 PCI_EXP_LNKCTL2_TLS_2_5GT);
+	if (!ret) {
+		/* Verify that new value was really set */
+		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
+		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
+			ret = -EINVAL;
+	}
+
+	if (ret) {
+		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
+		return ret;
+	}
+
+	pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
+	return 0;
+}
+
 static bool pcie_retrain_link(struct pcie_link_state *link)
 {
 	struct pci_dev *parent = link->pdev;
 	unsigned long end_jiffies;
 	u16 reg16;
 
+	if ((link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) &&
+	    pcie_downgrade_link_to_gen1(parent)) {
+		pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
+		return false;
+	}
+
 	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
 	reg16 |= PCI_EXP_LNKCTL_RL;
 	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 653660e3ba9e..68c5e8f4ff8c 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3553,23 +3553,44 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
 			mellanox_check_broken_intx_masking);
 
-static void quirk_no_bus_reset(struct pci_dev *dev)
+static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
 {
-	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
+	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET |
+			  PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
 }
 
 /*
- * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
+ * Atheros AR9xxx and QCA98xx chips do not behave after a bus reset and also
+ * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
  * The device will throw a Link Down error on AER-capable systems and
  * regardless of AER, config space of the device is never accessible again
  * and typically causes the system to hang or reset when access is attempted.
+ * Or if config space is accessible again then it contains only dummy values
+ * like fixed PCI device ID 0xABCD or values not initialized at all.
+ * Retrain link can be called only when using GEN1 PCIe bridge or when
+ * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
+ * To reset these cards it is required to do PCIe Warm Reset via PERST# pin.
  * https://lore.kernel.org/r/20140923210318.498dacbd@dualc.maya.org/
+ * https://lore.kernel.org/r/87h7l8axqp.fsf@toke.dk/
+ * https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
  */
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x002e,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c,
+			 quirk_no_bus_reset_and_no_retrain_link);
+
+static void quirk_no_bus_reset(struct pci_dev *dev)
+{
+	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
+}
 
 /*
  * Root port on some Cavium CN8xxx chips do not successfully complete a bus
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 86c799c97b77..fdbf7254e4ab 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -227,6 +227,8 @@ enum pci_dev_flags {
 	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
 	/* Don't use Relaxed Ordering for TLPs directed at this device */
 	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
+	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
+	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),
 };
 
 enum pci_irq_reroute_variant {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-04-27 10:55 ` [PATCH v2] PCI: Disallow retraining link for Atheros " Pali Rohár
@ 2021-04-30 11:41   ` Pali Rohár
  0 siblings, 0 replies; 24+ messages in thread
From: Pali Rohár @ 2021-04-30 11:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński
  Cc: vtolkm, Rob Herring, Ilias Apalodimas, Thomas Petazzoni,
	linux-pci, ath10k, linux-wireless, linux-kernel

Now I have tested another card ath10k card QCNFA435 (QCA9377 168c:0042)
and after link retraining it cause Synchronous external abort and AER
Uncorrected Non-Fatal error when trying to load ath10k_pci.ko driver.

# insmod /ath10k_pci.ko 
[   45.818017] Internal error: synchronous external abort: 96000210 [#1] SMP
[   45.825268] Modules linked in: ath10k_pci(+)
[   45.834695] CPU: 1 PID: 4050 Comm: insmod Not tainted 5.12.0-dirty #934
[   45.861149] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
[   45.875585] pc : ath10k_pci_wake_wait+0x48/0xf8 [ath10k_pci]
[   45.887066] lr : ath10k_pci_wake.part.26+0xe0/0x110 [ath10k_pci]
[   45.901414] sp : ffffffc014c93820 
[   45.910833] x29: ffffffc014c93820 x28: ffffffc008cc68e0 
[   45.923663] x27: ffffff8000bf20c8 x26: ffffff800e578db8 
[   45.941605] x25: ffffffc008cc62b8 x24: ffffff8000bf2000 
[   45.957488] x23: ffffff800e578db8 x22: 000000000000752f 
[   45.975432] x21: ffffff800e571f20 x20: 0000000000000000 
[   45.995525] x19: 0000000000000005 x18: 0000000000000000 
[   46.011407] x17: 00000000000010f8 x16: 0000000000006400 
[   46.030872] x15: 0000000000000010 x14: 3030373030642065 
[   46.073937] x13: 00000000e8200000 x12: ffffffc010000000 
[   46.079418] x11: ffffffc0117bf000 x10: 00000000e8200000 
[   46.084900] x9 : ffffffc008cc2808 x8 : ffffffc011e00000 
[   46.090380] x7 : ffffff801ffff478 x6 : ffffffc011e00000 
[   46.095861] x5 : ffffffc008cc66a0 x4 : 0000000000200000 
[   46.101342] x3 : 0000000000000000 x2 : 0000000000000001 
[   46.106822] x1 : ffffffc011c80000 x0 : 0000000000080000 
[   46.112304] Call trace:
[   46.114825]  ath10k_pci_wake_wait+0x48/0xf8 [ath10k_pci]
[   46.120324]  ath10k_pci_wake.part.26+0xe0/0x110 [ath10k_pci]
[   46.126173]  ath10k_bus_pci_write32+0x78/0xd0 [ath10k_pci]
[   46.131841]  ath10k_ce_deinit_pipe+0x58/0x218
[   46.136343]  ath10k_pci_probe+0x1fc/0x8a8 [ath10k_pci]
[   46.141653]  pci_device_probe+0xbc/0x170
[   46.145703]  really_probe+0x1bc/0x414
[   46.149482]  driver_probe_device.part.15+0xcc/0x108
[   46.154513]  device_driver_attach+0x80/0x88
[   46.158831]  __driver_attach+0x64/0x120
[   46.162788]  bus_for_each_dev+0x68/0xa0
[   46.166746]  driver_attach+0x28/0x30
[   46.170433]  bus_add_driver+0x188/0x1e8
[   46.174390]  driver_register+0x68/0x118
[   46.178347]  __pci_register_driver+0x48/0x50
[   46.182751]  ath10k_pci_init+0x2c/0x1000 [ath10k_pci]
[   46.187972]  do_one_initcall+0x50/0x218
[   46.191930]  do_init_module+0x5c/0x240
[   46.195798]  load_module+0x1f94/0x24d0
[   46.199662]  __do_sys_init_module+0x120/0x1b8
[   46.204155]  __arm64_sys_init_module+0x20/0x28
[   46.208738]  el0_svc_common.constprop.3+0x90/0x120
[   46.213682]  do_el0_svc+0x74/0x90
[   46.217101]  el0_svc+0x20/0x30
[   46.220251]  el0_sync_handler+0x88/0xb0
[   46.224209]  el0_sync+0x13c/0x140
[   46.227632] Code: f94216a0 f9400ee1 b9405800 8b000021 (b9400021) 
[   46.233918] ---[ end trace 3619825164cb9b73 ]---
Segmentation fault
[   46.238818] pcieport 0000:00:00.0: AER: Uncorrected (Non-Fatal) error received: 0000:01:00.0
[   46.400457] ath10k_pci 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   46.412426] ath10k_pci 0000:01:00.0:   device [168c:0042] error status/mask=00100000/00400000
[   46.421248] ath10k_pci 0000:01:00.0:    [20] UnsupReq               (First)
[   46.428771] ath10k_pci 0000:01:00.0: AER:   TLP Header: 40008001 0000020f e8080004 00000000
[   46.588115] pcieport 0000:00:00.0: AER: broadcast error_detected message

Adding this card into this quirk list fixes above issues for this card
too. I will send a V3 patch with 168c:0042 in quirk list.

On Tuesday 27 April 2021 12:55:25 Pali Rohár wrote:
> Atheros AR9xxx and QCA98xx chips do not behave not only after a bus reset
> but also after doing retrain link when PCIe bridge is not in GEN1 mode at
> 2.5 GT/s speed.
> 
> QCA9880 and QCA9890 chips throw a Link Down event and completely disappear
> from the bus and their config space is not accessible again.
> 
> AR9390 chip throws a Link Down event followed by Link Up event, config
> space is accessible again, but contains nonsense values. PCI device ID is
> 0xABCD which indicates HW bug that chip itself was not able to read values
> from internal EEPROM/OTP.
> 
> AR9287 chip throws also Link Down and Link Up events, also has accessible
> config space and moreover its config space contains correct values. But
> ath9k driver cannot initialize card from this state as it is unable to
> access HW registers. This also indicates that chip iself was not able to
> read values from internal EEPROM/OTP.
> 
> This issue related to PCI device ID 0xABCD and reading internal EEPROM/OTP
> was previously discussed at ath9k-devel mailing list in following thread:
> 
> https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> 
> After experiments we come up with workaround that Retrain link can be
> called only when using GEN1 PCIe bridge or when PCIe bridge has forced link
> speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
> 
> This issue was reproduced with more cards: Compex WLE900VX (QCA9880 based /
> device ID 0x003c), Compex WLE200NX (AR9287 based / device ID 0x002e),
> "noname" card (QCA9890 based / device ID 0x003c) and Wistron NKR-DNXAH1
> (AR9390 based / device ID 0x0030) on Armada 385 with pci-mvebu.c driver and
> also on Armada 3720 with pci-aardvark.c driver.
> 
> To workaround this issue, this change introduces a new PCI quirk called
> PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 which is enabled for all
> Atheros chips with PCI_DEV_FLAGS_NO_BUS_RESET quirk, plus also for Atheros
> chip AR9287.
> 
> When this quirk is set then kernel disallows triggering PCI_EXP_LNKCTL_RL
> bit in config space of PCIe Bridge in case PCIe Bridge is capable of higher
> speed than 2.5 GT/s and higher speed is already allowed. When PCIe Bridge
> has accessible LNKCTL2 register then kernel tries to force target link
> speed via PCI_EXP_LNKCTL2_TLS* bits to 2.5 GT/s. After this change it is
> possible to trigger PCI_EXP_LNKCTL_RL bit without causing issues on
> problematic Atheros cards.
> 
> Currently only PCIe ASPM kernel code triggers this PCI_EXP_LNKCTL_RL bit,
> so quirk check is added only into pcie/aspm.c file.
> 
> Signed-off-by: Pali Rohár <pali@kernel.org>
> Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Tested-by: Marek Behún <kabel@kernel.org>
> BugLink: https://lore.kernel.org/linux-pci/87h7l8axqp.fsf@toke.dk/
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=84821
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=192441
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209833
> Cc: stable@vger.kernel.org # c80851f6ce63a ("PCI: Add PCI_EXP_LNKCTL2_TLS* macros")
> 
> ---
> Changes since v1:
> * Move whole quirk code into pcie_downgrade_link_to_gen1() function
> * Reformat to 80 chars per line where possible
> * Add quirk also for cards with AR9287 chip (PCI ID 0x002e)
> * Extend commit message description and add information about 0xABCD
> ---
>  drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
>  drivers/pci/quirks.c    | 37 ++++++++++++++++++++++++++--------
>  include/linux/pci.h     |  2 ++
>  3 files changed, 75 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index ac0557a305af..729b0389562b 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -192,12 +192,56 @@ static void pcie_clkpm_cap_init(struct pcie_link_state *link, int blacklist)
>  	link->clkpm_disable = blacklist ? 1 : 0;
>  }
>  
> +static int pcie_downgrade_link_to_gen1(struct pci_dev *parent)
> +{
> +	u16 reg16;
> +	u32 reg32;
> +	int ret;
> +
> +	/* Check if link is capable of higher speed than 2.5 GT/s */
> +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
> +	if ((reg32 & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB)
> +		return 0;
> +
> +	/* Check if link speed can be downgraded to 2.5 GT/s */
> +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
> +	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
> +		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/* Force link speed to 2.5 GT/s */
> +	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
> +						 PCI_EXP_LNKCTL2_TLS,
> +						 PCI_EXP_LNKCTL2_TLS_2_5GT);
> +	if (!ret) {
> +		/* Verify that new value was really set */
> +		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
> +		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
> +			ret = -EINVAL;
> +	}
> +
> +	if (ret) {
> +		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
> +		return ret;
> +	}
> +
> +	pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
> +	return 0;
> +}
> +
>  static bool pcie_retrain_link(struct pcie_link_state *link)
>  {
>  	struct pci_dev *parent = link->pdev;
>  	unsigned long end_jiffies;
>  	u16 reg16;
>  
> +	if ((link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) &&
> +	    pcie_downgrade_link_to_gen1(parent)) {
> +		pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
> +		return false;
> +	}
> +
>  	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
>  	reg16 |= PCI_EXP_LNKCTL_RL;
>  	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 653660e3ba9e..68c5e8f4ff8c 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3553,23 +3553,44 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
>  			mellanox_check_broken_intx_masking);
>  
> -static void quirk_no_bus_reset(struct pci_dev *dev)
> +static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
>  {
> -	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET |
> +			  PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
>  }
>  
>  /*
> - * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
> + * Atheros AR9xxx and QCA98xx chips do not behave after a bus reset and also
> + * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
>   * The device will throw a Link Down error on AER-capable systems and
>   * regardless of AER, config space of the device is never accessible again
>   * and typically causes the system to hang or reset when access is attempted.
> + * Or if config space is accessible again then it contains only dummy values
> + * like fixed PCI device ID 0xABCD or values not initialized at all.
> + * Retrain link can be called only when using GEN1 PCIe bridge or when
> + * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
> + * To reset these cards it is required to do PCIe Warm Reset via PERST# pin.
>   * https://lore.kernel.org/r/20140923210318.498dacbd@dualc.maya.org/
> + * https://lore.kernel.org/r/87h7l8axqp.fsf@toke.dk/
> + * https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
>   */
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x002e,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +
> +static void quirk_no_bus_reset(struct pci_dev *dev)
> +{
> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +}
>  
>  /*
>   * Root port on some Cavium CN8xxx chips do not successfully complete a bus
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 86c799c97b77..fdbf7254e4ab 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -227,6 +227,8 @@ enum pci_dev_flags {
>  	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
>  	/* Don't use Relaxed Ordering for TLPs directed at this device */
>  	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> +	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
> +	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),
>  };
>  
>  enum pci_irq_reroute_variant {
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-03-26 12:43 [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges Pali Rohár
                   ` (2 preceding siblings ...)
  2021-04-27 10:55 ` [PATCH v2] PCI: Disallow retraining link for Atheros " Pali Rohár
@ 2021-05-05 16:33 ` Pali Rohár
  2021-05-11 20:39   ` Pali Rohár
                     ` (3 more replies)
  3 siblings, 4 replies; 24+ messages in thread
From: Pali Rohár @ 2021-05-05 16:33 UTC (permalink / raw)
  To: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński
  Cc: vtolkm, Rob Herring, Ilias Apalodimas, Thomas Petazzoni,
	linux-pci, ath10k, linux-wireless, linux-kernel

Atheros AR9xxx and QCA9xxx chips have behaviour issues not only after a
bus reset, but also after doing retrain link, if PCIe bridge is not in
GEN1 mode (at 2.5 GT/s speed):

- QCA9880 and QCA9890 chips throw a Link Down event and completely
  disappear from the bus and their config space is not accessible
  afterwards.

- QCA9377 chip throws a Link Down event followed by Link Up event, the
  config space is accessible and PCI device ID is correct. But trying to
  access chip's I/O space causes Uncorrected (Non-Fatal) AER error,
  followed by Synchronous external abort 96000210 and Segmentation fault
  of insmod while loading ath10k_pci.ko module.

- AR9390 chip throws a Link Down event followed by Link Up event, config
  space is accessible, but contains nonsense values. PCI device ID is
  0xABCD which indicates HW bug that chip itself was not able to read
  values from internal EEPROM/OTP.

- AR9287 chip throws also Link Down and Link Up events, also has
  accessible config space containing correct values. But ath9k driver
  fails to initialize card from this state as it is unable to access HW
  registers. This also indicates that the chip iself is not able to read
  values from internal EEPROM/OTP.

These issues related to PCI device ID 0xABCD and to reading internal
EEPROM/OTP were previously discussed at ath9k-devel mailing list in
following thread:

  https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html

After experiments we've come up with a solution: it seems that Retrain
link can be called only when using GEN1 PCIe bridge or when PCIe bridge
link speed is forced to 2.5 GT/s. Applying this workaround fixes all
mentioned cards.

This issue was reproduced with more cards:
- Compex WLE900VX (QCA9880 based / device ID 0x003c)
- QCNFA435 (QCA9377 based / device ID 0x0042)
- Compex WLE200NX (AR9287 based / device ID 0x002e)
- "noname" card (QCA9890 based / device ID 0x003c)
- Wistron NKR-DNXAH1 (AR9390 based / device ID 0x0030)
on Armada 385 with pci-mvebu.c driver and also on Armada 3720 with
pci-aardvark.c driver.

To workaround this issue, this change introduces a new PCI quirk called
PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1, which is enabled for all
Atheros chips with PCI_DEV_FLAGS_NO_BUS_RESET quirk, and also for Atheros
chip AR9287.

When this quirk is set, kernel disallows triggering PCI_EXP_LNKCTL_RL
bit in config space of PCIe Bridge in the case when PCIe Bridge is
capable of higher speed than 2.5 GT/s and this higher speed is already
allowed. When PCIe Bridge has accessible LNKCTL2 register, we try to
force target link speed to 2.5 GT/s. After this change it is possible
to trigger PCI_EXP_LNKCTL_RL bit without issues.

Currently only PCIe ASPM kernel code triggers this PCI_EXP_LNKCTL_RL bit,
so quirk check is added only into pcie/aspm.c file.

Signed-off-by: Pali Rohár <pali@kernel.org>
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Marek Behún <kabel@kernel.org>
BugLink: https://lore.kernel.org/linux-pci/87h7l8axqp.fsf@toke.dk/
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=84821
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=192441
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209833
Cc: stable@vger.kernel.org # c80851f6ce63a ("PCI: Add PCI_EXP_LNKCTL2_TLS* macros")

---
Changes since v1:
* Move whole quirk code into pcie_downgrade_link_to_gen1() function
* Reformat to 80 chars per line where possible
* Add quirk also for cards with AR9287 chip (PCI ID 0x002e)
* Extend commit message description and add information about 0xABCD

Changes since v2:
* Add quirk also for Atheros QCA9377 chip
---
 drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
 drivers/pci/quirks.c    | 39 ++++++++++++++++++++++++++++--------
 include/linux/pci.h     |  2 ++
 3 files changed, 77 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index ac0557a305af..729b0389562b 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -192,12 +192,56 @@ static void pcie_clkpm_cap_init(struct pcie_link_state *link, int blacklist)
 	link->clkpm_disable = blacklist ? 1 : 0;
 }
 
+static int pcie_downgrade_link_to_gen1(struct pci_dev *parent)
+{
+	u16 reg16;
+	u32 reg32;
+	int ret;
+
+	/* Check if link is capable of higher speed than 2.5 GT/s */
+	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
+	if ((reg32 & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB)
+		return 0;
+
+	/* Check if link speed can be downgraded to 2.5 GT/s */
+	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
+	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
+		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* Force link speed to 2.5 GT/s */
+	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
+						 PCI_EXP_LNKCTL2_TLS,
+						 PCI_EXP_LNKCTL2_TLS_2_5GT);
+	if (!ret) {
+		/* Verify that new value was really set */
+		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
+		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
+			ret = -EINVAL;
+	}
+
+	if (ret) {
+		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
+		return ret;
+	}
+
+	pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
+	return 0;
+}
+
 static bool pcie_retrain_link(struct pcie_link_state *link)
 {
 	struct pci_dev *parent = link->pdev;
 	unsigned long end_jiffies;
 	u16 reg16;
 
+	if ((link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) &&
+	    pcie_downgrade_link_to_gen1(parent)) {
+		pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
+		return false;
+	}
+
 	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
 	reg16 |= PCI_EXP_LNKCTL_RL;
 	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 653660e3ba9e..4999ad9d08b8 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3553,23 +3553,46 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
 			mellanox_check_broken_intx_masking);
 
-static void quirk_no_bus_reset(struct pci_dev *dev)
+static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
 {
-	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
+	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET |
+			  PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
 }
 
 /*
- * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
+ * Atheros AR9xxx and QCA9xxx chips do not behave after a bus reset and also
+ * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
  * The device will throw a Link Down error on AER-capable systems and
  * regardless of AER, config space of the device is never accessible again
  * and typically causes the system to hang or reset when access is attempted.
+ * Or if config space is accessible again then it contains only dummy values
+ * like fixed PCI device ID 0xABCD or values not initialized at all.
+ * Retrain link can be called only when using GEN1 PCIe bridge or when
+ * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
+ * To reset these cards it is required to do PCIe Warm Reset via PERST# pin.
  * https://lore.kernel.org/r/20140923210318.498dacbd@dualc.maya.org/
+ * https://lore.kernel.org/r/87h7l8axqp.fsf@toke.dk/
+ * https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
  */
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x002e,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c,
+			 quirk_no_bus_reset_and_no_retrain_link);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0042,
+			 quirk_no_bus_reset_and_no_retrain_link);
+
+static void quirk_no_bus_reset(struct pci_dev *dev)
+{
+	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
+}
 
 /*
  * Root port on some Cavium CN8xxx chips do not successfully complete a bus
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 86c799c97b77..fdbf7254e4ab 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -227,6 +227,8 @@ enum pci_dev_flags {
 	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
 	/* Don't use Relaxed Ordering for TLPs directed at this device */
 	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
+	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
+	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),
 };
 
 enum pci_irq_reroute_variant {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-05-05 16:33 ` [PATCH v3] " Pali Rohár
@ 2021-05-11 20:39   ` Pali Rohár
  2021-05-28  0:08     ` Pali Rohár
  2021-06-01 11:28   ` Krzysztof Wilczyński
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Pali Rohár @ 2021-05-11 20:39 UTC (permalink / raw)
  To: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński
  Cc: vtolkm, Rob Herring, Ilias Apalodimas, Thomas Petazzoni,
	linux-pci, ath10k, linux-wireless, linux-kernel

Hello! I would like to remind this fix. Is there something else needed
to change?

On Wednesday 05 May 2021 18:33:57 Pali Rohár wrote:
> Atheros AR9xxx and QCA9xxx chips have behaviour issues not only after a
> bus reset, but also after doing retrain link, if PCIe bridge is not in
> GEN1 mode (at 2.5 GT/s speed):
> 
> - QCA9880 and QCA9890 chips throw a Link Down event and completely
>   disappear from the bus and their config space is not accessible
>   afterwards.
> 
> - QCA9377 chip throws a Link Down event followed by Link Up event, the
>   config space is accessible and PCI device ID is correct. But trying to
>   access chip's I/O space causes Uncorrected (Non-Fatal) AER error,
>   followed by Synchronous external abort 96000210 and Segmentation fault
>   of insmod while loading ath10k_pci.ko module.
> 
> - AR9390 chip throws a Link Down event followed by Link Up event, config
>   space is accessible, but contains nonsense values. PCI device ID is
>   0xABCD which indicates HW bug that chip itself was not able to read
>   values from internal EEPROM/OTP.
> 
> - AR9287 chip throws also Link Down and Link Up events, also has
>   accessible config space containing correct values. But ath9k driver
>   fails to initialize card from this state as it is unable to access HW
>   registers. This also indicates that the chip iself is not able to read
>   values from internal EEPROM/OTP.
> 
> These issues related to PCI device ID 0xABCD and to reading internal
> EEPROM/OTP were previously discussed at ath9k-devel mailing list in
> following thread:
> 
>   https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> 
> After experiments we've come up with a solution: it seems that Retrain
> link can be called only when using GEN1 PCIe bridge or when PCIe bridge
> link speed is forced to 2.5 GT/s. Applying this workaround fixes all
> mentioned cards.
> 
> This issue was reproduced with more cards:
> - Compex WLE900VX (QCA9880 based / device ID 0x003c)
> - QCNFA435 (QCA9377 based / device ID 0x0042)
> - Compex WLE200NX (AR9287 based / device ID 0x002e)
> - "noname" card (QCA9890 based / device ID 0x003c)
> - Wistron NKR-DNXAH1 (AR9390 based / device ID 0x0030)
> on Armada 385 with pci-mvebu.c driver and also on Armada 3720 with
> pci-aardvark.c driver.
> 
> To workaround this issue, this change introduces a new PCI quirk called
> PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1, which is enabled for all
> Atheros chips with PCI_DEV_FLAGS_NO_BUS_RESET quirk, and also for Atheros
> chip AR9287.
> 
> When this quirk is set, kernel disallows triggering PCI_EXP_LNKCTL_RL
> bit in config space of PCIe Bridge in the case when PCIe Bridge is
> capable of higher speed than 2.5 GT/s and this higher speed is already
> allowed. When PCIe Bridge has accessible LNKCTL2 register, we try to
> force target link speed to 2.5 GT/s. After this change it is possible
> to trigger PCI_EXP_LNKCTL_RL bit without issues.
> 
> Currently only PCIe ASPM kernel code triggers this PCI_EXP_LNKCTL_RL bit,
> so quirk check is added only into pcie/aspm.c file.
> 
> Signed-off-by: Pali Rohár <pali@kernel.org>
> Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Tested-by: Marek Behún <kabel@kernel.org>
> BugLink: https://lore.kernel.org/linux-pci/87h7l8axqp.fsf@toke.dk/
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=84821
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=192441
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209833
> Cc: stable@vger.kernel.org # c80851f6ce63a ("PCI: Add PCI_EXP_LNKCTL2_TLS* macros")
> 
> ---
> Changes since v1:
> * Move whole quirk code into pcie_downgrade_link_to_gen1() function
> * Reformat to 80 chars per line where possible
> * Add quirk also for cards with AR9287 chip (PCI ID 0x002e)
> * Extend commit message description and add information about 0xABCD
> 
> Changes since v2:
> * Add quirk also for Atheros QCA9377 chip
> ---
>  drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
>  drivers/pci/quirks.c    | 39 ++++++++++++++++++++++++++++--------
>  include/linux/pci.h     |  2 ++
>  3 files changed, 77 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index ac0557a305af..729b0389562b 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -192,12 +192,56 @@ static void pcie_clkpm_cap_init(struct pcie_link_state *link, int blacklist)
>  	link->clkpm_disable = blacklist ? 1 : 0;
>  }
>  
> +static int pcie_downgrade_link_to_gen1(struct pci_dev *parent)
> +{
> +	u16 reg16;
> +	u32 reg32;
> +	int ret;
> +
> +	/* Check if link is capable of higher speed than 2.5 GT/s */
> +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
> +	if ((reg32 & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB)
> +		return 0;
> +
> +	/* Check if link speed can be downgraded to 2.5 GT/s */
> +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
> +	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
> +		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/* Force link speed to 2.5 GT/s */
> +	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
> +						 PCI_EXP_LNKCTL2_TLS,
> +						 PCI_EXP_LNKCTL2_TLS_2_5GT);
> +	if (!ret) {
> +		/* Verify that new value was really set */
> +		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
> +		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
> +			ret = -EINVAL;
> +	}
> +
> +	if (ret) {
> +		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
> +		return ret;
> +	}
> +
> +	pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
> +	return 0;
> +}
> +
>  static bool pcie_retrain_link(struct pcie_link_state *link)
>  {
>  	struct pci_dev *parent = link->pdev;
>  	unsigned long end_jiffies;
>  	u16 reg16;
>  
> +	if ((link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) &&
> +	    pcie_downgrade_link_to_gen1(parent)) {
> +		pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
> +		return false;
> +	}
> +
>  	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
>  	reg16 |= PCI_EXP_LNKCTL_RL;
>  	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 653660e3ba9e..4999ad9d08b8 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3553,23 +3553,46 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
>  			mellanox_check_broken_intx_masking);
>  
> -static void quirk_no_bus_reset(struct pci_dev *dev)
> +static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
>  {
> -	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET |
> +			  PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
>  }
>  
>  /*
> - * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
> + * Atheros AR9xxx and QCA9xxx chips do not behave after a bus reset and also
> + * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
>   * The device will throw a Link Down error on AER-capable systems and
>   * regardless of AER, config space of the device is never accessible again
>   * and typically causes the system to hang or reset when access is attempted.
> + * Or if config space is accessible again then it contains only dummy values
> + * like fixed PCI device ID 0xABCD or values not initialized at all.
> + * Retrain link can be called only when using GEN1 PCIe bridge or when
> + * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
> + * To reset these cards it is required to do PCIe Warm Reset via PERST# pin.
>   * https://lore.kernel.org/r/20140923210318.498dacbd@dualc.maya.org/
> + * https://lore.kernel.org/r/87h7l8axqp.fsf@toke.dk/
> + * https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
>   */
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x002e,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0042,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +
> +static void quirk_no_bus_reset(struct pci_dev *dev)
> +{
> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +}
>  
>  /*
>   * Root port on some Cavium CN8xxx chips do not successfully complete a bus
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 86c799c97b77..fdbf7254e4ab 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -227,6 +227,8 @@ enum pci_dev_flags {
>  	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
>  	/* Don't use Relaxed Ordering for TLPs directed at this device */
>  	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> +	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
> +	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),
>  };
>  
>  enum pci_irq_reroute_variant {
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-05-11 20:39   ` Pali Rohár
@ 2021-05-28  0:08     ` Pali Rohár
  0 siblings, 0 replies; 24+ messages in thread
From: Pali Rohár @ 2021-05-28  0:08 UTC (permalink / raw)
  To: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński
  Cc: vtolkm, Rob Herring, Ilias Apalodimas, Thomas Petazzoni,
	linux-pci, ath10k, linux-wireless, linux-kernel

On Tuesday 11 May 2021 22:39:50 Pali Rohár wrote:
> Hello! I would like to remind this fix. Is there something else needed
> to change?

PING

> On Wednesday 05 May 2021 18:33:57 Pali Rohár wrote:
> > Atheros AR9xxx and QCA9xxx chips have behaviour issues not only after a
> > bus reset, but also after doing retrain link, if PCIe bridge is not in
> > GEN1 mode (at 2.5 GT/s speed):
> > 
> > - QCA9880 and QCA9890 chips throw a Link Down event and completely
> >   disappear from the bus and their config space is not accessible
> >   afterwards.
> > 
> > - QCA9377 chip throws a Link Down event followed by Link Up event, the
> >   config space is accessible and PCI device ID is correct. But trying to
> >   access chip's I/O space causes Uncorrected (Non-Fatal) AER error,
> >   followed by Synchronous external abort 96000210 and Segmentation fault
> >   of insmod while loading ath10k_pci.ko module.
> > 
> > - AR9390 chip throws a Link Down event followed by Link Up event, config
> >   space is accessible, but contains nonsense values. PCI device ID is
> >   0xABCD which indicates HW bug that chip itself was not able to read
> >   values from internal EEPROM/OTP.
> > 
> > - AR9287 chip throws also Link Down and Link Up events, also has
> >   accessible config space containing correct values. But ath9k driver
> >   fails to initialize card from this state as it is unable to access HW
> >   registers. This also indicates that the chip iself is not able to read
> >   values from internal EEPROM/OTP.
> > 
> > These issues related to PCI device ID 0xABCD and to reading internal
> > EEPROM/OTP were previously discussed at ath9k-devel mailing list in
> > following thread:
> > 
> >   https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> > 
> > After experiments we've come up with a solution: it seems that Retrain
> > link can be called only when using GEN1 PCIe bridge or when PCIe bridge
> > link speed is forced to 2.5 GT/s. Applying this workaround fixes all
> > mentioned cards.
> > 
> > This issue was reproduced with more cards:
> > - Compex WLE900VX (QCA9880 based / device ID 0x003c)
> > - QCNFA435 (QCA9377 based / device ID 0x0042)
> > - Compex WLE200NX (AR9287 based / device ID 0x002e)
> > - "noname" card (QCA9890 based / device ID 0x003c)
> > - Wistron NKR-DNXAH1 (AR9390 based / device ID 0x0030)
> > on Armada 385 with pci-mvebu.c driver and also on Armada 3720 with
> > pci-aardvark.c driver.
> > 
> > To workaround this issue, this change introduces a new PCI quirk called
> > PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1, which is enabled for all
> > Atheros chips with PCI_DEV_FLAGS_NO_BUS_RESET quirk, and also for Atheros
> > chip AR9287.
> > 
> > When this quirk is set, kernel disallows triggering PCI_EXP_LNKCTL_RL
> > bit in config space of PCIe Bridge in the case when PCIe Bridge is
> > capable of higher speed than 2.5 GT/s and this higher speed is already
> > allowed. When PCIe Bridge has accessible LNKCTL2 register, we try to
> > force target link speed to 2.5 GT/s. After this change it is possible
> > to trigger PCI_EXP_LNKCTL_RL bit without issues.
> > 
> > Currently only PCIe ASPM kernel code triggers this PCI_EXP_LNKCTL_RL bit,
> > so quirk check is added only into pcie/aspm.c file.
> > 
> > Signed-off-by: Pali Rohár <pali@kernel.org>
> > Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > Tested-by: Marek Behún <kabel@kernel.org>
> > BugLink: https://lore.kernel.org/linux-pci/87h7l8axqp.fsf@toke.dk/
> > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=84821
> > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=192441
> > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209833
> > Cc: stable@vger.kernel.org # c80851f6ce63a ("PCI: Add PCI_EXP_LNKCTL2_TLS* macros")
> > 
> > ---
> > Changes since v1:
> > * Move whole quirk code into pcie_downgrade_link_to_gen1() function
> > * Reformat to 80 chars per line where possible
> > * Add quirk also for cards with AR9287 chip (PCI ID 0x002e)
> > * Extend commit message description and add information about 0xABCD
> > 
> > Changes since v2:
> > * Add quirk also for Atheros QCA9377 chip
> > ---
> >  drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
> >  drivers/pci/quirks.c    | 39 ++++++++++++++++++++++++++++--------
> >  include/linux/pci.h     |  2 ++
> >  3 files changed, 77 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > index ac0557a305af..729b0389562b 100644
> > --- a/drivers/pci/pcie/aspm.c
> > +++ b/drivers/pci/pcie/aspm.c
> > @@ -192,12 +192,56 @@ static void pcie_clkpm_cap_init(struct pcie_link_state *link, int blacklist)
> >  	link->clkpm_disable = blacklist ? 1 : 0;
> >  }
> >  
> > +static int pcie_downgrade_link_to_gen1(struct pci_dev *parent)
> > +{
> > +	u16 reg16;
> > +	u32 reg32;
> > +	int ret;
> > +
> > +	/* Check if link is capable of higher speed than 2.5 GT/s */
> > +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
> > +	if ((reg32 & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB)
> > +		return 0;
> > +
> > +	/* Check if link speed can be downgraded to 2.5 GT/s */
> > +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
> > +	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
> > +		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	/* Force link speed to 2.5 GT/s */
> > +	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
> > +						 PCI_EXP_LNKCTL2_TLS,
> > +						 PCI_EXP_LNKCTL2_TLS_2_5GT);
> > +	if (!ret) {
> > +		/* Verify that new value was really set */
> > +		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
> > +		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
> > +			ret = -EINVAL;
> > +	}
> > +
> > +	if (ret) {
> > +		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
> > +		return ret;
> > +	}
> > +
> > +	pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
> > +	return 0;
> > +}
> > +
> >  static bool pcie_retrain_link(struct pcie_link_state *link)
> >  {
> >  	struct pci_dev *parent = link->pdev;
> >  	unsigned long end_jiffies;
> >  	u16 reg16;
> >  
> > +	if ((link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) &&
> > +	    pcie_downgrade_link_to_gen1(parent)) {
> > +		pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
> > +		return false;
> > +	}
> > +
> >  	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
> >  	reg16 |= PCI_EXP_LNKCTL_RL;
> >  	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 653660e3ba9e..4999ad9d08b8 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3553,23 +3553,46 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
> >  			mellanox_check_broken_intx_masking);
> >  
> > -static void quirk_no_bus_reset(struct pci_dev *dev)
> > +static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
> >  {
> > -	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET |
> > +			  PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
> >  }
> >  
> >  /*
> > - * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
> > + * Atheros AR9xxx and QCA9xxx chips do not behave after a bus reset and also
> > + * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
> >   * The device will throw a Link Down error on AER-capable systems and
> >   * regardless of AER, config space of the device is never accessible again
> >   * and typically causes the system to hang or reset when access is attempted.
> > + * Or if config space is accessible again then it contains only dummy values
> > + * like fixed PCI device ID 0xABCD or values not initialized at all.
> > + * Retrain link can be called only when using GEN1 PCIe bridge or when
> > + * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
> > + * To reset these cards it is required to do PCIe Warm Reset via PERST# pin.
> >   * https://lore.kernel.org/r/20140923210318.498dacbd@dualc.maya.org/
> > + * https://lore.kernel.org/r/87h7l8axqp.fsf@toke.dk/
> > + * https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> >   */
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x002e,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0042,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +
> > +static void quirk_no_bus_reset(struct pci_dev *dev)
> > +{
> > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > +}
> >  
> >  /*
> >   * Root port on some Cavium CN8xxx chips do not successfully complete a bus
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 86c799c97b77..fdbf7254e4ab 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -227,6 +227,8 @@ enum pci_dev_flags {
> >  	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
> >  	/* Don't use Relaxed Ordering for TLPs directed at this device */
> >  	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> > +	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
> > +	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),
> >  };
> >  
> >  enum pci_irq_reroute_variant {
> > -- 
> > 2.20.1
> > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-05-05 16:33 ` [PATCH v3] " Pali Rohár
  2021-05-11 20:39   ` Pali Rohár
@ 2021-06-01 11:28   ` Krzysztof Wilczyński
  2021-06-01 20:05   ` Bjorn Helgaas
  2021-10-05 19:43   ` Jannis
  3 siblings, 0 replies; 24+ messages in thread
From: Krzysztof Wilczyński @ 2021-06-01 11:28 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, vtolkm, Rob Herring, Ilias Apalodimas,
	Thomas Petazzoni, linux-pci, ath10k, linux-wireless,
	linux-kernel

Hi Pali,

Thank you for working on this and fixing the problem, and also thank you
goes to Marek and Toke for testing!  Much appreciated!

[...]
> - AR9287 chip throws also Link Down and Link Up events, also has
>   accessible config space containing correct values. But ath9k driver
>   fails to initialize card from this state as it is unable to access HW
>   registers. This also indicates that the chip iself is not able to read

A typo here - it would be "itself" in the above.  But this is not worth
sending v4, and I am sure that Bjorn or Lorenzo could fix this in-place
when merging.

[...]
> Signed-off-by: Pali Rohár <pali@kernel.org>
> Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Tested-by: Marek Behún <kabel@kernel.org>
[...]

Thank you everyone!

Reviewed-by: Krzysztof Wilczyński <kw@linux.com>

	Krzysztof

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-05-05 16:33 ` [PATCH v3] " Pali Rohár
  2021-05-11 20:39   ` Pali Rohár
  2021-06-01 11:28   ` Krzysztof Wilczyński
@ 2021-06-01 20:05   ` Bjorn Helgaas
  2021-06-01 21:18     ` Pali Rohár
  2021-10-05 19:43   ` Jannis
  3 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2021-06-01 20:05 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Wed, May 05, 2021 at 06:33:57PM +0200, Pali Rohár wrote:
> Atheros AR9xxx and QCA9xxx chips have behaviour issues not only after a
> bus reset, but also after doing retrain link, if PCIe bridge is not in
> GEN1 mode (at 2.5 GT/s speed):
> 
> - QCA9880 and QCA9890 chips throw a Link Down event and completely
>   disappear from the bus and their config space is not accessible
>   afterwards.
> 
> - QCA9377 chip throws a Link Down event followed by Link Up event, the
>   config space is accessible and PCI device ID is correct. But trying to
>   access chip's I/O space causes Uncorrected (Non-Fatal) AER error,
>   followed by Synchronous external abort 96000210 and Segmentation fault
>   of insmod while loading ath10k_pci.ko module.
> 
> - AR9390 chip throws a Link Down event followed by Link Up event, config
>   space is accessible, but contains nonsense values. PCI device ID is
>   0xABCD which indicates HW bug that chip itself was not able to read
>   values from internal EEPROM/OTP.
> 
> - AR9287 chip throws also Link Down and Link Up events, also has
>   accessible config space containing correct values. But ath9k driver
>   fails to initialize card from this state as it is unable to access HW
>   registers. This also indicates that the chip iself is not able to read
>   values from internal EEPROM/OTP.
> 
> These issues related to PCI device ID 0xABCD and to reading internal
> EEPROM/OTP were previously discussed at ath9k-devel mailing list in
> following thread:
> 
>   https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> 
> After experiments we've come up with a solution: it seems that Retrain
> link can be called only when using GEN1 PCIe bridge or when PCIe bridge
> link speed is forced to 2.5 GT/s. Applying this workaround fixes all
> mentioned cards.

I *assume* this means the device was running at > 2.5 GT/s in the
first place, and when aspm.c retrains the link to configure the common
clock, we downgrade to 2.5 GT/s, so the device is now slower than it
used to be?

Is that slower speed acceptable?  Is it better to be slower with ASPM,
or faster without ASPM?  Or maybe it could be faster *with* ASPM if we
avoided the common clock config and the retrain?  I think that would
give up some of the benefit of ASPM, but maybe it would still be
worthwhile?

If the device was running at > 2.5 GT/s to begin with, obviously there
is *some* way to get there.  This patch implies that the hardware
automatically trained to a higher rate after power-on (which I think
is what PCIe hardware is *supposed* to do) and something prevents that
from succeeding when we retrain, or maybe BIOS did something different
than what Linux is doing, or ... something else?

Maybe the device can only retrain to 2.5 GT/s or from the current
speed to a higher speed.  This sort of experimentation could probably
be done with setpci.

> This issue was reproduced with more cards:
> - Compex WLE900VX (QCA9880 based / device ID 0x003c)
> - QCNFA435 (QCA9377 based / device ID 0x0042)
> - Compex WLE200NX (AR9287 based / device ID 0x002e)
> - "noname" card (QCA9890 based / device ID 0x003c)
> - Wistron NKR-DNXAH1 (AR9390 based / device ID 0x0030)
> on Armada 385 with pci-mvebu.c driver and also on Armada 3720 with
> pci-aardvark.c driver.
> 
> To workaround this issue, this change introduces a new PCI quirk called
> PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1, which is enabled for all
> Atheros chips with PCI_DEV_FLAGS_NO_BUS_RESET quirk, and also for Atheros
> chip AR9287.
> 
> When this quirk is set, kernel disallows triggering PCI_EXP_LNKCTL_RL
> bit in config space of PCIe Bridge in the case when PCIe Bridge is
> capable of higher speed than 2.5 GT/s and this higher speed is already
> allowed. When PCIe Bridge has accessible LNKCTL2 register, we try to
> force target link speed to 2.5 GT/s. After this change it is possible
> to trigger PCI_EXP_LNKCTL_RL bit without issues.

This basically feels like a "it hurts when I do X, so stop doing X"
patch.  We don't really know what's wrong; we've just determined
experimentally how to avoid it.

> Currently only PCIe ASPM kernel code triggers this PCI_EXP_LNKCTL_RL bit,
> so quirk check is added only into pcie/aspm.c file.
> 
> Signed-off-by: Pali Rohár <pali@kernel.org>
> Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Tested-by: Marek Behún <kabel@kernel.org>
> BugLink: https://lore.kernel.org/linux-pci/87h7l8axqp.fsf@toke.dk/
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=84821
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=192441
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209833
> Cc: stable@vger.kernel.org # c80851f6ce63a ("PCI: Add PCI_EXP_LNKCTL2_TLS* macros")
> 
> ---
> Changes since v1:
> * Move whole quirk code into pcie_downgrade_link_to_gen1() function
> * Reformat to 80 chars per line where possible
> * Add quirk also for cards with AR9287 chip (PCI ID 0x002e)
> * Extend commit message description and add information about 0xABCD
> 
> Changes since v2:
> * Add quirk also for Atheros QCA9377 chip
> ---
>  drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
>  drivers/pci/quirks.c    | 39 ++++++++++++++++++++++++++++--------
>  include/linux/pci.h     |  2 ++
>  3 files changed, 77 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index ac0557a305af..729b0389562b 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -192,12 +192,56 @@ static void pcie_clkpm_cap_init(struct pcie_link_state *link, int blacklist)
>  	link->clkpm_disable = blacklist ? 1 : 0;
>  }
>  
> +static int pcie_downgrade_link_to_gen1(struct pci_dev *parent)
> +{
> +	u16 reg16;
> +	u32 reg32;
> +	int ret;
> +
> +	/* Check if link is capable of higher speed than 2.5 GT/s */
> +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
> +	if ((reg32 & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB)
> +		return 0;

I guess this means "if the link is already at 2.5 GT/s, no need to do
anything."  Right?

> +	/* Check if link speed can be downgraded to 2.5 GT/s */
> +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
> +	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
> +		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
> +		return -EOPNOTSUPP;
> +	}

Why is this check needed?  Per PCIe r5.0, sec 8.2.1, all devices must
support 2.5 GT/s.

> +	/* Force link speed to 2.5 GT/s */
> +	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
> +						 PCI_EXP_LNKCTL2_TLS,
> +						 PCI_EXP_LNKCTL2_TLS_2_5GT);
> +	if (!ret) {
> +		/* Verify that new value was really set */
> +		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
> +		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
> +			ret = -EINVAL;
> +	}
> +
> +	if (ret) {
> +		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
> +		return ret;
> +	}
> +
> +	pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
> +	return 0;
> +}
> +
>  static bool pcie_retrain_link(struct pcie_link_state *link)
>  {
>  	struct pci_dev *parent = link->pdev;
>  	unsigned long end_jiffies;
>  	u16 reg16;
>  
> +	if ((link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) &&
> +	    pcie_downgrade_link_to_gen1(parent)) {

I assume (correct me if I'm wrong) that this would work equally well
if we set the *endpoint's* target link speed to 2.5 GT/s instead of
the upstream bridge's?  I think the log messages would make more sense
then, since the problem is really with the endpoint, not the parent.

> +		pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
> +		return false;
> +	}
> +
>  	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
>  	reg16 |= PCI_EXP_LNKCTL_RL;
>  	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 653660e3ba9e..4999ad9d08b8 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3553,23 +3553,46 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
>  			mellanox_check_broken_intx_masking);
>  
> -static void quirk_no_bus_reset(struct pci_dev *dev)
> +static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
>  {
> -	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET |
> +			  PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
>  }
>  
>  /*
> - * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
> + * Atheros AR9xxx and QCA9xxx chips do not behave after a bus reset and also
> + * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
>   * The device will throw a Link Down error on AER-capable systems and
>   * regardless of AER, config space of the device is never accessible again
>   * and typically causes the system to hang or reset when access is attempted.
> + * Or if config space is accessible again then it contains only dummy values
> + * like fixed PCI device ID 0xABCD or values not initialized at all.
> + * Retrain link can be called only when using GEN1 PCIe bridge or when
> + * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
> + * To reset these cards it is required to do PCIe Warm Reset via PERST# pin.
>   * https://lore.kernel.org/r/20140923210318.498dacbd@dualc.maya.org/
> + * https://lore.kernel.org/r/87h7l8axqp.fsf@toke.dk/
> + * https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
>   */
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x002e,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0042,
> +			 quirk_no_bus_reset_and_no_retrain_link);
> +
> +static void quirk_no_bus_reset(struct pci_dev *dev)
> +{
> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> +}
>  
>  /*
>   * Root port on some Cavium CN8xxx chips do not successfully complete a bus
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 86c799c97b77..fdbf7254e4ab 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -227,6 +227,8 @@ enum pci_dev_flags {
>  	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
>  	/* Don't use Relaxed Ordering for TLPs directed at this device */
>  	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> +	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
> +	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),

I know this is entangled with the existing PCI_DEV_FLAGS_NO_BUS_RESET,
but unless there's a better reason to use pci_dev_flags, I'd prefer a
new "unsigned retrain_gen1:1" or similar bit.  

Whatever you do, I'd like to avoid the double negative of "*no*
retrain when *not* gen1."

It does make me wonder whether the bus reset would work on these
devices if we set the target link speed back down to 2.5 GT/s.

>  };
>  
>  enum pci_irq_reroute_variant {
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-01 20:05   ` Bjorn Helgaas
@ 2021-06-01 21:18     ` Pali Rohár
  2021-06-02  0:00       ` Bjorn Helgaas
  0 siblings, 1 reply; 24+ messages in thread
From: Pali Rohár @ 2021-06-01 21:18 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

Hello!

On Tuesday 01 June 2021 15:05:49 Bjorn Helgaas wrote:
> On Wed, May 05, 2021 at 06:33:57PM +0200, Pali Rohár wrote:
> > Atheros AR9xxx and QCA9xxx chips have behaviour issues not only after a
> > bus reset, but also after doing retrain link, if PCIe bridge is not in
> > GEN1 mode (at 2.5 GT/s speed):
> > 
> > - QCA9880 and QCA9890 chips throw a Link Down event and completely
> >   disappear from the bus and their config space is not accessible
> >   afterwards.
> > 
> > - QCA9377 chip throws a Link Down event followed by Link Up event, the
> >   config space is accessible and PCI device ID is correct. But trying to
> >   access chip's I/O space causes Uncorrected (Non-Fatal) AER error,
> >   followed by Synchronous external abort 96000210 and Segmentation fault
> >   of insmod while loading ath10k_pci.ko module.
> > 
> > - AR9390 chip throws a Link Down event followed by Link Up event, config
> >   space is accessible, but contains nonsense values. PCI device ID is
> >   0xABCD which indicates HW bug that chip itself was not able to read
> >   values from internal EEPROM/OTP.
> > 
> > - AR9287 chip throws also Link Down and Link Up events, also has
> >   accessible config space containing correct values. But ath9k driver
> >   fails to initialize card from this state as it is unable to access HW
> >   registers. This also indicates that the chip iself is not able to read
> >   values from internal EEPROM/OTP.
> > 
> > These issues related to PCI device ID 0xABCD and to reading internal
> > EEPROM/OTP were previously discussed at ath9k-devel mailing list in
> > following thread:
> > 
> >   https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> > 
> > After experiments we've come up with a solution: it seems that Retrain
> > link can be called only when using GEN1 PCIe bridge or when PCIe bridge
> > link speed is forced to 2.5 GT/s. Applying this workaround fixes all
> > mentioned cards.
> 
> I *assume* this means the device was running at > 2.5 GT/s in the
> first place,

No. All these Atheros chips are 2.5 GT/s only. It looks like that if
PCIe Bridge has initial value 5 GT/s (or higher) in PCI_EXP_LNKCAP2
register and link retraining is activated, something happen which cause
these Atheros chips to "crash". Looks like that Root Bridge tries to
change link speed from 2.5 GT/s to 5 GT/s (which is not supported by all
these Atheros chips).

> and when aspm.c retrains the link to configure the common
> clock, we downgrade to 2.5 GT/s, so the device is now slower than it
> used to be?
> 
> Is that slower speed acceptable?  Is it better to be slower with ASPM,
> or faster without ASPM?  Or maybe it could be faster *with* ASPM if we
> avoided the common clock config and the retrain?  I think that would
> give up some of the benefit of ASPM, but maybe it would still be
> worthwhile?

There is no slow down with these chips as all these which I tested and
written into description are 2.5 GT/s -only.

> If the device was running at > 2.5 GT/s to begin with, obviously there
> is *some* way to get there.

Root Bridge in PCI_EXP_LNKSTA reports that they are running at 2.5 GT/s
even on begin.

> This patch implies that the hardware
> automatically trained to a higher rate after power-on (which I think
> is what PCIe hardware is *supposed* to do) and something prevents that
> from succeeding when we retrain, or maybe BIOS did something different
> than what Linux is doing, or ... something else?

Tested platforms was also without BIOS and without any other firmware
which touched PCIe.

> Maybe the device can only retrain to 2.5 GT/s or from the current
> speed to a higher speed.  This sort of experimentation could probably
> be done with setpci.
> 
> > This issue was reproduced with more cards:
> > - Compex WLE900VX (QCA9880 based / device ID 0x003c)
> > - QCNFA435 (QCA9377 based / device ID 0x0042)
> > - Compex WLE200NX (AR9287 based / device ID 0x002e)
> > - "noname" card (QCA9890 based / device ID 0x003c)
> > - Wistron NKR-DNXAH1 (AR9390 based / device ID 0x0030)
> > on Armada 385 with pci-mvebu.c driver and also on Armada 3720 with
> > pci-aardvark.c driver.
> > 
> > To workaround this issue, this change introduces a new PCI quirk called
> > PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1, which is enabled for all
> > Atheros chips with PCI_DEV_FLAGS_NO_BUS_RESET quirk, and also for Atheros
> > chip AR9287.
> > 
> > When this quirk is set, kernel disallows triggering PCI_EXP_LNKCTL_RL
> > bit in config space of PCIe Bridge in the case when PCIe Bridge is
> > capable of higher speed than 2.5 GT/s and this higher speed is already
> > allowed. When PCIe Bridge has accessible LNKCTL2 register, we try to
> > force target link speed to 2.5 GT/s. After this change it is possible
> > to trigger PCI_EXP_LNKCTL_RL bit without issues.
> 
> This basically feels like a "it hurts when I do X, so stop doing X"
> patch.  We don't really know what's wrong; we've just determined
> experimentally how to avoid it.

Yes.

> > Currently only PCIe ASPM kernel code triggers this PCI_EXP_LNKCTL_RL bit,
> > so quirk check is added only into pcie/aspm.c file.
> > 
> > Signed-off-by: Pali Rohár <pali@kernel.org>
> > Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > Tested-by: Marek Behún <kabel@kernel.org>
> > BugLink: https://lore.kernel.org/linux-pci/87h7l8axqp.fsf@toke.dk/
> > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=84821
> > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=192441
> > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209833
> > Cc: stable@vger.kernel.org # c80851f6ce63a ("PCI: Add PCI_EXP_LNKCTL2_TLS* macros")
> > 
> > ---
> > Changes since v1:
> > * Move whole quirk code into pcie_downgrade_link_to_gen1() function
> > * Reformat to 80 chars per line where possible
> > * Add quirk also for cards with AR9287 chip (PCI ID 0x002e)
> > * Extend commit message description and add information about 0xABCD
> > 
> > Changes since v2:
> > * Add quirk also for Atheros QCA9377 chip
> > ---
> >  drivers/pci/pcie/aspm.c | 44 +++++++++++++++++++++++++++++++++++++++++
> >  drivers/pci/quirks.c    | 39 ++++++++++++++++++++++++++++--------
> >  include/linux/pci.h     |  2 ++
> >  3 files changed, 77 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > index ac0557a305af..729b0389562b 100644
> > --- a/drivers/pci/pcie/aspm.c
> > +++ b/drivers/pci/pcie/aspm.c
> > @@ -192,12 +192,56 @@ static void pcie_clkpm_cap_init(struct pcie_link_state *link, int blacklist)
> >  	link->clkpm_disable = blacklist ? 1 : 0;
> >  }
> >  
> > +static int pcie_downgrade_link_to_gen1(struct pci_dev *parent)
> > +{
> > +	u16 reg16;
> > +	u32 reg32;
> > +	int ret;
> > +
> > +	/* Check if link is capable of higher speed than 2.5 GT/s */
> > +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
> > +	if ((reg32 & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB)
> > +		return 0;
> 
> I guess this means "if the link is already at 2.5 GT/s, no need to do
> anything."  Right?

PCI_EXP_LNKCAP_SLS is maximal supported speed by Bridge. So if bridge
does not support higher speed, we do not have to do anything.

> > +	/* Check if link speed can be downgraded to 2.5 GT/s */
> > +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
> > +	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
> > +		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
> > +		return -EOPNOTSUPP;
> > +	}
> 
> Why is this check needed?  Per PCIe r5.0, sec 8.2.1, all devices must
> support 2.5 GT/s.

Because older PCIe devices does not have to support PCI_EXP_LNKCAP2
register (in which cause they returns zero). And this applies also for
pci-bridge-emul.c driver. So this check is needed at least for devices
which use pci-bridge-emul.c driver.

> > +	/* Force link speed to 2.5 GT/s */
> > +	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
> > +						 PCI_EXP_LNKCTL2_TLS,
> > +						 PCI_EXP_LNKCTL2_TLS_2_5GT);
> > +	if (!ret) {
> > +		/* Verify that new value was really set */
> > +		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
> > +		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
> > +			ret = -EINVAL;
> > +	}
> > +
> > +	if (ret) {
> > +		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
> > +		return ret;
> > +	}
> > +
> > +	pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
> > +	return 0;
> > +}
> > +
> >  static bool pcie_retrain_link(struct pcie_link_state *link)
> >  {
> >  	struct pci_dev *parent = link->pdev;
> >  	unsigned long end_jiffies;
> >  	u16 reg16;
> >  
> > +	if ((link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) &&
> > +	    pcie_downgrade_link_to_gen1(parent)) {
> 
> I assume (correct me if I'm wrong) that this would work equally well
> if we set the *endpoint's* target link speed to 2.5 GT/s instead of
> the upstream bridge's?

I think not. Issue is really when Bridge-end of the link supports higher
than 2.5 GT/s speed this end tries to increase speed. As device-end of
the link supports only 2.5 GT/s there is nothing to change to higher
speed from device-end point of view.

> I think the log messages would make more sense
> then, since the problem is really with the endpoint, not the parent.

So... buggy is device (child) end of the link and only bridge (parent)
end of the link can workaround it. And if bridge end is not capable
(e.g. because of pci-bridge-emul.c) then it is problem of bridge
(parent) end.

> > +		pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
> > +		return false;
> > +	}
> > +
> >  	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
> >  	reg16 |= PCI_EXP_LNKCTL_RL;
> >  	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 653660e3ba9e..4999ad9d08b8 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3553,23 +3553,46 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
> >  			mellanox_check_broken_intx_masking);
> >  
> > -static void quirk_no_bus_reset(struct pci_dev *dev)
> > +static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
> >  {
> > -	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET |
> > +			  PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
> >  }
> >  
> >  /*
> > - * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
> > + * Atheros AR9xxx and QCA9xxx chips do not behave after a bus reset and also
> > + * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
> >   * The device will throw a Link Down error on AER-capable systems and
> >   * regardless of AER, config space of the device is never accessible again
> >   * and typically causes the system to hang or reset when access is attempted.
> > + * Or if config space is accessible again then it contains only dummy values
> > + * like fixed PCI device ID 0xABCD or values not initialized at all.
> > + * Retrain link can be called only when using GEN1 PCIe bridge or when
> > + * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
> > + * To reset these cards it is required to do PCIe Warm Reset via PERST# pin.
> >   * https://lore.kernel.org/r/20140923210318.498dacbd@dualc.maya.org/
> > + * https://lore.kernel.org/r/87h7l8axqp.fsf@toke.dk/
> > + * https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> >   */
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x002e,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0042,
> > +			 quirk_no_bus_reset_and_no_retrain_link);
> > +
> > +static void quirk_no_bus_reset(struct pci_dev *dev)
> > +{
> > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > +}
> >  
> >  /*
> >   * Root port on some Cavium CN8xxx chips do not successfully complete a bus
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 86c799c97b77..fdbf7254e4ab 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -227,6 +227,8 @@ enum pci_dev_flags {
> >  	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
> >  	/* Don't use Relaxed Ordering for TLPs directed at this device */
> >  	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> > +	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
> > +	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),
> 
> I know this is entangled with the existing PCI_DEV_FLAGS_NO_BUS_RESET,
> but unless there's a better reason to use pci_dev_flags, I'd prefer a
> new "unsigned retrain_gen1:1" or similar bit.  

Ok! I can change patch...

> Whatever you do, I'd like to avoid the double negative of "*no*
> retrain when *not* gen1."

Do you have a suggestion for this name? Because I do not know how to
call this "quirk" in English, so it describes "disallow link retrain
when link is not at gen1 = 2.5GT/s". Somehow I cannot imagine name
without double negative words.

> It does make me wonder whether the bus reset would work on these
> devices if we set the target link speed back down to 2.5 GT/s.

Tested and does not work. Secondary bus reset (=Hot Reset) is broken
also when link is forced to 2.5 GT/s. It looks like that when
PCI_EXP_LNKCTL2_TLS is not set to 2.5 GT/s when setting
PCI_EXP_LNKCTL_RL it results in the same effect / issue like calling
secondary bus reset.

> >  };
> >  
> >  enum pci_irq_reroute_variant {
> > -- 
> > 2.20.1
> > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-01 21:18     ` Pali Rohár
@ 2021-06-02  0:00       ` Bjorn Helgaas
  2021-06-02 12:08         ` Pali Rohár
  0 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2021-06-02  0:00 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Tue, Jun 01, 2021 at 11:18:39PM +0200, Pali Rohár wrote:
> On Tuesday 01 June 2021 15:05:49 Bjorn Helgaas wrote:
> > On Wed, May 05, 2021 at 06:33:57PM +0200, Pali Rohár wrote:
> > > Atheros AR9xxx and QCA9xxx chips have behaviour issues not only after a
> > > bus reset, but also after doing retrain link, if PCIe bridge is not in
> > > GEN1 mode (at 2.5 GT/s speed):
> > > 
> > > - QCA9880 and QCA9890 chips throw a Link Down event and completely
> > >   disappear from the bus and their config space is not accessible
> > >   afterwards.
> > > 
> > > - QCA9377 chip throws a Link Down event followed by Link Up event, the
> > >   config space is accessible and PCI device ID is correct. But trying to
> > >   access chip's I/O space causes Uncorrected (Non-Fatal) AER error,
> > >   followed by Synchronous external abort 96000210 and Segmentation fault
> > >   of insmod while loading ath10k_pci.ko module.
> > > 
> > > - AR9390 chip throws a Link Down event followed by Link Up event, config
> > >   space is accessible, but contains nonsense values. PCI device ID is
> > >   0xABCD which indicates HW bug that chip itself was not able to read
> > >   values from internal EEPROM/OTP.
> > > 
> > > - AR9287 chip throws also Link Down and Link Up events, also has
> > >   accessible config space containing correct values. But ath9k driver
> > >   fails to initialize card from this state as it is unable to access HW
> > >   registers. This also indicates that the chip iself is not able to read
> > >   values from internal EEPROM/OTP.
> > > 
> > > These issues related to PCI device ID 0xABCD and to reading internal
> > > EEPROM/OTP were previously discussed at ath9k-devel mailing list in
> > > following thread:
> > > 
> > >   https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> > > 
> > > After experiments we've come up with a solution: it seems that Retrain
> > > link can be called only when using GEN1 PCIe bridge or when PCIe bridge
> > > link speed is forced to 2.5 GT/s. Applying this workaround fixes all
> > > mentioned cards.
> > 
> > I *assume* this means the device was running at > 2.5 GT/s in the
> > first place,
> 
> No. All these Atheros chips are 2.5 GT/s only. It looks like that if
> PCIe Bridge has initial value 5 GT/s (or higher) in PCI_EXP_LNKCAP2
> register and link retraining is activated, something happen which cause
> these Atheros chips to "crash". Looks like that Root Bridge tries to
> change link speed from 2.5 GT/s to 5 GT/s (which is not supported by all
> these Atheros chips).

Oh, perfect.  Then I guess all we need is to restrict these devices to
2.5 GT/s.  And we can just ignore all my rambling about higher speeds
below, so I'll elide them.

> > ...

Except this:

> > This patch implies that the hardware automatically trained to a
> > higher rate after power-on (which I think is what PCIe hardware is
> > *supposed* to do) and something prevents that from succeeding when
> > we retrain, or maybe BIOS did something different than what Linux
> > is doing, or ... something else?

> Tested platforms was also without BIOS and without any other firmware
> which touched PCIe.

The fact that the link came up automatically without any firmware or
software at all is very interesting.  The retrain path is actually
different from a hardware point of view: the power-on path through
LTSSM would normally be Detect, Polling, Configuration, L0; the
retrain path would be L0, Recovery, L0.  So I guess it isn't *too*
surprising that the power-on path could work even if the retrain path
is broken.

I wonder if setting, then clearing, the bridge's Link Disable bit
would work, since that would start again with the LTSSM Detect state,
just like power-on.  But I don't think that would help with this
ASPM/Common Clock issue because I think the link disable would look
like a hot reset to the endpoint, and it would clear the Common Clock
Configuration bit.

So backing up a loooong ways, how much value is there in doing this
retrain at all?  AFAICT the only reason we do it is because we think
the Common Clock Configuration is inconsistent, and we tried to fix
something, and we have to retrain the link to get the devices to
update their L0s and L1 exit latencies.  I guess it's the Slock Clock
(PCI_EXP_LNKSTA_SLC) bits that determines all this, right?  Do you
know those?

I wonder if this could be restructured as a generic quirk in quirks.c
that simply set the bridge's TLS to 2.5 GT/s during enumeration.  Or
would the retrain fail even in that case?

> > > +static int pcie_downgrade_link_to_gen1(struct pci_dev *parent)
> > > +{
> > > +	u16 reg16;
> > > +	u32 reg32;
> > > +	int ret;
> > > +
> > > +	/* Check if link is capable of higher speed than 2.5 GT/s */
> > > +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
> > > +	if ((reg32 & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB)
> > > +		return 0;
> > 
> > I guess this means "if the link is already at 2.5 GT/s, no need to do
> > anything."  Right?
> 
> PCI_EXP_LNKCAP_SLS is maximal supported speed by Bridge. So if bridge
> does not support higher speed, we do not have to do anything.
> 
> > > +	/* Check if link speed can be downgraded to 2.5 GT/s */
> > > +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
> > > +	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
> > > +		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
> > > +		return -EOPNOTSUPP;
> > > +	}
> > 
> > Why is this check needed?  Per PCIe r5.0, sec 8.2.1, all devices must
> > support 2.5 GT/s.
> 
> Because older PCIe devices does not have to support PCI_EXP_LNKCAP2
> register (in which cause they returns zero). And this applies also for
> pci-bridge-emul.c driver. So this check is needed at least for devices
> which use pci-bridge-emul.c driver.

Ugh.  So this depends on the fact that pcie_capability_read_dword()
sets "*val = 0" if PCI_EXP_LNKCAP2 is not implemented.  I have a
half-baked idea that we should be doing "*val = ~0" instead because
that's what we normally get for *hardware* registers that are
implemented.

> > > +	/* Force link speed to 2.5 GT/s */
> > > +	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
> > > +						 PCI_EXP_LNKCTL2_TLS,
> > > +						 PCI_EXP_LNKCTL2_TLS_2_5GT);
> > > +	if (!ret) {
> > > +		/* Verify that new value was really set */
> > > +		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
> > > +		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
> > > +			ret = -EINVAL;
> > > +	}
> > > +
> > > +	if (ret) {
> > > +		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
> > > +		return ret;
> > > +	}
> > > +
> > > +	pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
> > > +	return 0;
> > > +}
> > > +
> > >  static bool pcie_retrain_link(struct pcie_link_state *link)
> > >  {
> > >  	struct pci_dev *parent = link->pdev;
> > >  	unsigned long end_jiffies;
> > >  	u16 reg16;
> > >  
> > > +	if ((link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) &&
> > > +	    pcie_downgrade_link_to_gen1(parent)) {
> > 
> > I assume (correct me if I'm wrong) that this would work equally well
> > if we set the *endpoint's* target link speed to 2.5 GT/s instead of
> > the upstream bridge's?
> 
> I think not. Issue is really when Bridge-end of the link supports higher
> than 2.5 GT/s speed this end tries to increase speed. As device-end of
> the link supports only 2.5 GT/s there is nothing to change to higher
> speed from device-end point of view.

Based on PCIe r5.0, sec 4.2.6, when the link is trained, both devices
start at 2.5 GT/s.  Then they exchange TS Ordered Sets, which
advertise all the data rates supported by each side.  After the link
is operating in L0 at 2.5 GT/s, either side can initiate a change to
the highest supported common data rate.

A device can't initiate a change until it knows the rates supported by
both sides.  So either:

  - The bridge initiated a change because it knows *it* supports a
    higher rate and the Atheros device incorrectly advertised a higher
    rate), or 

  - The Atheros device initiated a change because the bridge
    advertised a higher rate and the Atheros device incorrectly thinks
    that *it* supports a higher rate.

Setting Target Link Speed in the bridge should prevent either case
because it restricts the rates advertised by the bridge in its
training sequences (per sec 7.5.3.19).

Interesting -- also per 7.5.3.19, Target Link Speed is permitted to
have no effect for Upstream Ports (like the one in the Atheros
device).  Probably makes sense since that port is not reachable until
the link is already operating.

Bottom line, I think you do have to do this with the bridge, not the
Atheros device.

> > I think the log messages would make more sense
> > then, since the problem is really with the endpoint, not the parent.
> 
> So... buggy is device (child) end of the link and only bridge (parent)
> end of the link can workaround it. And if bridge end is not capable
> (e.g. because of pci-bridge-emul.c) then it is problem of bridge
> (parent) end.
> 
> > > +		pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
> > > +		return false;
> > > +	}
> > > +
> > >  	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
> > >  	reg16 |= PCI_EXP_LNKCTL_RL;
> > >  	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 653660e3ba9e..4999ad9d08b8 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -3553,23 +3553,46 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
> > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
> > >  			mellanox_check_broken_intx_masking);
> > >  
> > > -static void quirk_no_bus_reset(struct pci_dev *dev)
> > > +static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
> > >  {
> > > -	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET |
> > > +			  PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
> > >  }
> > >  
> > >  /*
> > > - * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
> > > + * Atheros AR9xxx and QCA9xxx chips do not behave after a bus reset and also
> > > + * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
> > >   * The device will throw a Link Down error on AER-capable systems and
> > >   * regardless of AER, config space of the device is never accessible again
> > >   * and typically causes the system to hang or reset when access is attempted.
> > > + * Or if config space is accessible again then it contains only dummy values
> > > + * like fixed PCI device ID 0xABCD or values not initialized at all.
> > > + * Retrain link can be called only when using GEN1 PCIe bridge or when
> > > + * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
> > > + * To reset these cards it is required to do PCIe Warm Reset via PERST# pin.
> > >   * https://lore.kernel.org/r/20140923210318.498dacbd@dualc.maya.org/
> > > + * https://lore.kernel.org/r/87h7l8axqp.fsf@toke.dk/
> > > + * https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> > >   */
> > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
> > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
> > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
> > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
> > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x002e,
> > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030,
> > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032,
> > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033,
> > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034,
> > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c,
> > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0042,
> > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > +
> > > +static void quirk_no_bus_reset(struct pci_dev *dev)
> > > +{
> > > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > > +}
> > >  
> > >  /*
> > >   * Root port on some Cavium CN8xxx chips do not successfully complete a bus
> > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > index 86c799c97b77..fdbf7254e4ab 100644
> > > --- a/include/linux/pci.h
> > > +++ b/include/linux/pci.h
> > > @@ -227,6 +227,8 @@ enum pci_dev_flags {
> > >  	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
> > >  	/* Don't use Relaxed Ordering for TLPs directed at this device */
> > >  	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> > > +	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
> > > +	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),
> > 
> > I know this is entangled with the existing PCI_DEV_FLAGS_NO_BUS_RESET,
> > but unless there's a better reason to use pci_dev_flags, I'd prefer a
> > new "unsigned retrain_gen1:1" or similar bit.  
> 
> Ok! I can change patch...
> 
> > Whatever you do, I'd like to avoid the double negative of "*no*
> > retrain when *not* gen1."
> 
> Do you have a suggestion for this name? Because I do not know how to
> call this "quirk" in English, so it describes "disallow link retrain
> when link is not at gen1 = 2.5GT/s". Somehow I cannot imagine name
> without double negative words.
> 
> > It does make me wonder whether the bus reset would work on these
> > devices if we set the target link speed back down to 2.5 GT/s.
> 
> Tested and does not work. Secondary bus reset (=Hot Reset) is broken
> also when link is forced to 2.5 GT/s. It looks like that when
> PCI_EXP_LNKCTL2_TLS is not set to 2.5 GT/s when setting
> PCI_EXP_LNKCTL_RL it results in the same effect / issue like calling
> secondary bus reset.
> 
> > >  };
> > >  
> > >  enum pci_irq_reroute_variant {
> > > -- 
> > > 2.20.1
> > > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-02  0:00       ` Bjorn Helgaas
@ 2021-06-02 12:08         ` Pali Rohár
  2021-06-02 15:55           ` Bjorn Helgaas
  0 siblings, 1 reply; 24+ messages in thread
From: Pali Rohár @ 2021-06-02 12:08 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> On Tue, Jun 01, 2021 at 11:18:39PM +0200, Pali Rohár wrote:
> > On Tuesday 01 June 2021 15:05:49 Bjorn Helgaas wrote:
> > > On Wed, May 05, 2021 at 06:33:57PM +0200, Pali Rohár wrote:
> > > > Atheros AR9xxx and QCA9xxx chips have behaviour issues not only after a
> > > > bus reset, but also after doing retrain link, if PCIe bridge is not in
> > > > GEN1 mode (at 2.5 GT/s speed):
> > > > 
> > > > - QCA9880 and QCA9890 chips throw a Link Down event and completely
> > > >   disappear from the bus and their config space is not accessible
> > > >   afterwards.
> > > > 
> > > > - QCA9377 chip throws a Link Down event followed by Link Up event, the
> > > >   config space is accessible and PCI device ID is correct. But trying to
> > > >   access chip's I/O space causes Uncorrected (Non-Fatal) AER error,
> > > >   followed by Synchronous external abort 96000210 and Segmentation fault
> > > >   of insmod while loading ath10k_pci.ko module.
> > > > 
> > > > - AR9390 chip throws a Link Down event followed by Link Up event, config
> > > >   space is accessible, but contains nonsense values. PCI device ID is
> > > >   0xABCD which indicates HW bug that chip itself was not able to read
> > > >   values from internal EEPROM/OTP.
> > > > 
> > > > - AR9287 chip throws also Link Down and Link Up events, also has
> > > >   accessible config space containing correct values. But ath9k driver
> > > >   fails to initialize card from this state as it is unable to access HW
> > > >   registers. This also indicates that the chip iself is not able to read
> > > >   values from internal EEPROM/OTP.
> > > > 
> > > > These issues related to PCI device ID 0xABCD and to reading internal
> > > > EEPROM/OTP were previously discussed at ath9k-devel mailing list in
> > > > following thread:
> > > > 
> > > >   https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> > > > 
> > > > After experiments we've come up with a solution: it seems that Retrain
> > > > link can be called only when using GEN1 PCIe bridge or when PCIe bridge
> > > > link speed is forced to 2.5 GT/s. Applying this workaround fixes all
> > > > mentioned cards.
> > > 
> > > I *assume* this means the device was running at > 2.5 GT/s in the
> > > first place,
> > 
> > No. All these Atheros chips are 2.5 GT/s only. It looks like that if
> > PCIe Bridge has initial value 5 GT/s (or higher) in PCI_EXP_LNKCAP2
> > register and link retraining is activated, something happen which cause
> > these Atheros chips to "crash". Looks like that Root Bridge tries to
> > change link speed from 2.5 GT/s to 5 GT/s (which is not supported by all
> > these Atheros chips).
> 
> Oh, perfect.  Then I guess all we need is to restrict these devices to
> 2.5 GT/s.  And we can just ignore all my rambling about higher speeds
> below, so I'll elide them.

Yes, all these tests shows that these Atheros chips are stable only when
link is operating at 2.5 GT/s.

> > > ...
> 
> Except this:
> 
> > > This patch implies that the hardware automatically trained to a
> > > higher rate after power-on (which I think is what PCIe hardware is
> > > *supposed* to do) and something prevents that from succeeding when
> > > we retrain, or maybe BIOS did something different than what Linux
> > > is doing, or ... something else?
> 
> > Tested platforms was also without BIOS and without any other firmware
> > which touched PCIe.
> 
> The fact that the link came up automatically without any firmware or
> software at all is very interesting.  The retrain path is actually
> different from a hardware point of view: the power-on path through
> LTSSM would normally be Detect, Polling, Configuration, L0; the
> retrain path would be L0, Recovery, L0.  So I guess it isn't *too*
> surprising that the power-on path could work even if the retrain path
> is broken.

Yes, this is truth. In my opinion these Atheros chips are trying to do
some kind of init / reset procedure when either entering or leaving
Recovery state. And because there is known bug that Hot Reset should be
avoided, it looks like that Hot Reset is just one from more options how
to trigger this bug.

> I wonder if setting, then clearing, the bridge's Link Disable bit
> would work, since that would start again with the LTSSM Detect state,
> just like power-on.

Tested and it does not work. Same effect as Hot Reset. This really looks
like OTP/EEPROM related issue which was already described, that doing
(something) related to reset too fast cause internal chip to not finish
reading OTP/EEPROM data needed to correctly initialize PCIe part of
card.

> But I don't think that would help with this
> ASPM/Common Clock issue because I think the link disable would look
> like a hot reset to the endpoint, and it would clear the Common Clock
> Configuration bit.
> 
> So backing up a loooong ways, how much value is there in doing this
> retrain at all?  AFAICT the only reason we do it is because we think
> the Common Clock Configuration is inconsistent, and we tried to fix
> something, and we have to retrain the link to get the devices to
> update their L0s and L1 exit latencies.  I guess it's the Slock Clock
> (PCI_EXP_LNKSTA_SLC) bits that determines all this, right?  Do you
> know those?

I understood that changing ASPM bits require to retrain link. But
personally, I really do not understand how these power management stuff
like ASPM is working. And because I'm reading lot of times that ASPM is
broken (some people are blaming kernel, other x86 BIOS bugs, others also
PCIe cards). This bug was "hidden" for a long time as lot of users are
using OpenWRT kernels and I have not found any discussion or bug report
where OpenWRT people document something related to this issue. Looks
like that they just disabled ASPM, which makes this bug hidden for a
long time.

But result is that tested cards with this patch are stable also when
ASPM is enabled.

> I wonder if this could be restructured as a generic quirk in quirks.c
> that simply set the bridge's TLS to 2.5 GT/s during enumeration.  Or
> would the retrain fail even in that case?

If I understand it correctly then PCIe link is already up when kernel
starts enumeration. So setting Bridge TLS to 2.5 GT/s does not change
anything here.

Moreover it would have side effect that cards which are already set to
5+ GT/s would be downgraded to 2.5 GT/s during enumeration and for
increasing speed would be needed another round of "enumeration" to set a
new TLS and retrain link again. As TLS affects link only after link goes
into Recovery state.

So this would just complicate card enumeration and settings.

Moreover here we are dealing with specific OTP/EEPROM bug in Atheros
chips, which was confirmed that exists. As I wrote in previous email, I
was told that semi-official workaround is do Warm Reset or Cold Reset
with turning power off from card. Which on most platforms / boards is
not possible.

> > > > +static int pcie_downgrade_link_to_gen1(struct pci_dev *parent)
> > > > +{
> > > > +	u16 reg16;
> > > > +	u32 reg32;
> > > > +	int ret;
> > > > +
> > > > +	/* Check if link is capable of higher speed than 2.5 GT/s */
> > > > +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP, &reg32);
> > > > +	if ((reg32 & PCI_EXP_LNKCAP_SLS) <= PCI_EXP_LNKCAP_SLS_2_5GB)
> > > > +		return 0;
> > > 
> > > I guess this means "if the link is already at 2.5 GT/s, no need to do
> > > anything."  Right?
> > 
> > PCI_EXP_LNKCAP_SLS is maximal supported speed by Bridge. So if bridge
> > does not support higher speed, we do not have to do anything.
> > 
> > > > +	/* Check if link speed can be downgraded to 2.5 GT/s */
> > > > +	pcie_capability_read_dword(parent, PCI_EXP_LNKCAP2, &reg32);
> > > > +	if (!(reg32 & PCI_EXP_LNKCAP2_SLS_2_5GB)) {
> > > > +		pci_err(parent, "ASPM: Bridge does not support changing Link Speed to 2.5 GT/s\n");
> > > > +		return -EOPNOTSUPP;
> > > > +	}
> > > 
> > > Why is this check needed?  Per PCIe r5.0, sec 8.2.1, all devices must
> > > support 2.5 GT/s.
> > 
> > Because older PCIe devices does not have to support PCI_EXP_LNKCAP2
> > register (in which cause they returns zero). And this applies also for
> > pci-bridge-emul.c driver. So this check is needed at least for devices
> > which use pci-bridge-emul.c driver.
> 
> Ugh.  So this depends on the fact that pcie_capability_read_dword()
> sets "*val = 0" if PCI_EXP_LNKCAP2 is not implemented.

Yes.

> I have a
> half-baked idea that we should be doing "*val = ~0" instead because
> that's what we normally get for *hardware* registers that are
> implemented.

Really? Is not rather zero returned when some particular bits are not
implemented in PCIe registers?

Returning all-ones may confuse lspci as it would think that all bits are
supported = all speeds are supported, including 5 GT/s, 8 GT/s, ...

> > > > +	/* Force link speed to 2.5 GT/s */
> > > > +	ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
> > > > +						 PCI_EXP_LNKCTL2_TLS,
> > > > +						 PCI_EXP_LNKCTL2_TLS_2_5GT);
> > > > +	if (!ret) {
> > > > +		/* Verify that new value was really set */
> > > > +		pcie_capability_read_word(parent, PCI_EXP_LNKCTL2, &reg16);
> > > > +		if ((reg16 & PCI_EXP_LNKCTL2_TLS) != PCI_EXP_LNKCTL2_TLS_2_5GT)
> > > > +			ret = -EINVAL;
> > > > +	}
> > > > +
> > > > +	if (ret) {
> > > > +		pci_err(parent, "ASPM: Changing Target Link Speed to 2.5 GT/s failed: %d\n", ret);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > > +	pci_info(parent, "ASPM: Target Link Speed changed to 2.5 GT/s due to quirk\n");
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  static bool pcie_retrain_link(struct pcie_link_state *link)
> > > >  {
> > > >  	struct pci_dev *parent = link->pdev;
> > > >  	unsigned long end_jiffies;
> > > >  	u16 reg16;
> > > >  
> > > > +	if ((link->downstream->dev_flags & PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1) &&
> > > > +	    pcie_downgrade_link_to_gen1(parent)) {
> > > 
> > > I assume (correct me if I'm wrong) that this would work equally well
> > > if we set the *endpoint's* target link speed to 2.5 GT/s instead of
> > > the upstream bridge's?
> > 
> > I think not. Issue is really when Bridge-end of the link supports higher
> > than 2.5 GT/s speed this end tries to increase speed. As device-end of
> > the link supports only 2.5 GT/s there is nothing to change to higher
> > speed from device-end point of view.
> 
> Based on PCIe r5.0, sec 4.2.6, when the link is trained, both devices
> start at 2.5 GT/s.  Then they exchange TS Ordered Sets, which
> advertise all the data rates supported by each side.  After the link
> is operating in L0 at 2.5 GT/s, either side can initiate a change to
> the highest supported common data rate.
> 
> A device can't initiate a change until it knows the rates supported by
> both sides.  So either:
> 
>   - The bridge initiated a change because it knows *it* supports a
>     higher rate and the Atheros device incorrectly advertised a higher
>     rate), or 
> 
>   - The Atheros device initiated a change because the bridge
>     advertised a higher rate and the Atheros device incorrectly thinks
>     that *it* supports a higher rate.
> 
> Setting Target Link Speed in the bridge should prevent either case
> because it restricts the rates advertised by the bridge in its
> training sequences (per sec 7.5.3.19).
> 
> Interesting -- also per 7.5.3.19, Target Link Speed is permitted to
> have no effect for Upstream Ports (like the one in the Atheros
> device).  Probably makes sense since that port is not reachable until
> the link is already operating.
> 
> Bottom line, I think you do have to do this with the bridge, not the
> Atheros device.

Yes! And code is doing it already with bridge, not device. In parent
variable is stored pointer to bridge. Just flag is stored in device part
as buggy is device.

> > > I think the log messages would make more sense
> > > then, since the problem is really with the endpoint, not the parent.
> > 
> > So... buggy is device (child) end of the link and only bridge (parent)
> > end of the link can workaround it. And if bridge end is not capable
> > (e.g. because of pci-bridge-emul.c) then it is problem of bridge
> > (parent) end.
> > 
> > > > +		pci_err(parent, "ASPM: Retrain Link at higher speed is disallowed by quirk\n");
> > > > +		return false;
> > > > +	}
> > > > +
> > > >  	pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
> > > >  	reg16 |= PCI_EXP_LNKCTL_RL;
> > > >  	pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);
> > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > index 653660e3ba9e..4999ad9d08b8 100644
> > > > --- a/drivers/pci/quirks.c
> > > > +++ b/drivers/pci/quirks.c
> > > > @@ -3553,23 +3553,46 @@ static void mellanox_check_broken_intx_masking(struct pci_dev *pdev)
> > > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_MELLANOX, PCI_ANY_ID,
> > > >  			mellanox_check_broken_intx_masking);
> > > >  
> > > > -static void quirk_no_bus_reset(struct pci_dev *dev)
> > > > +static void quirk_no_bus_reset_and_no_retrain_link(struct pci_dev *dev)
> > > >  {
> > > > -	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > > > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET |
> > > > +			  PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1;
> > > >  }
> > > >  
> > > >  /*
> > > > - * Some Atheros AR9xxx and QCA988x chips do not behave after a bus reset.
> > > > + * Atheros AR9xxx and QCA9xxx chips do not behave after a bus reset and also
> > > > + * after retrain link when PCIe bridge is not in GEN1 mode at 2.5 GT/s speed.
> > > >   * The device will throw a Link Down error on AER-capable systems and
> > > >   * regardless of AER, config space of the device is never accessible again
> > > >   * and typically causes the system to hang or reset when access is attempted.
> > > > + * Or if config space is accessible again then it contains only dummy values
> > > > + * like fixed PCI device ID 0xABCD or values not initialized at all.
> > > > + * Retrain link can be called only when using GEN1 PCIe bridge or when
> > > > + * PCIe bridge has forced link speed to 2.5 GT/s via PCI_EXP_LNKCTL2 register.
> > > > + * To reset these cards it is required to do PCIe Warm Reset via PERST# pin.
> > > >   * https://lore.kernel.org/r/20140923210318.498dacbd@dualc.maya.org/
> > > > + * https://lore.kernel.org/r/87h7l8axqp.fsf@toke.dk/
> > > > + * https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> > > >   */
> > > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
> > > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032, quirk_no_bus_reset);
> > > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c, quirk_no_bus_reset);
> > > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033, quirk_no_bus_reset);
> > > > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034, quirk_no_bus_reset);
> > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x002e,
> > > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030,
> > > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0032,
> > > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0033,
> > > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0034,
> > > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x003c,
> > > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0042,
> > > > +			 quirk_no_bus_reset_and_no_retrain_link);
> > > > +
> > > > +static void quirk_no_bus_reset(struct pci_dev *dev)
> > > > +{
> > > > +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
> > > > +}
> > > >  
> > > >  /*
> > > >   * Root port on some Cavium CN8xxx chips do not successfully complete a bus
> > > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > > index 86c799c97b77..fdbf7254e4ab 100644
> > > > --- a/include/linux/pci.h
> > > > +++ b/include/linux/pci.h
> > > > @@ -227,6 +227,8 @@ enum pci_dev_flags {
> > > >  	PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
> > > >  	/* Don't use Relaxed Ordering for TLPs directed at this device */
> > > >  	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> > > > +	/* Don't Retrain Link for device when bridge is not in GEN1 mode */
> > > > +	PCI_DEV_FLAGS_NO_RETRAIN_LINK_WHEN_NOT_GEN1 = (__force pci_dev_flags_t) (1 << 12),
> > > 
> > > I know this is entangled with the existing PCI_DEV_FLAGS_NO_BUS_RESET,
> > > but unless there's a better reason to use pci_dev_flags, I'd prefer a
> > > new "unsigned retrain_gen1:1" or similar bit.  
> > 
> > Ok! I can change patch...
> > 
> > > Whatever you do, I'd like to avoid the double negative of "*no*
> > > retrain when *not* gen1."
> > 
> > Do you have a suggestion for this name? Because I do not know how to
> > call this "quirk" in English, so it describes "disallow link retrain
> > when link is not at gen1 = 2.5GT/s". Somehow I cannot imagine name
> > without double negative words.
> > 
> > > It does make me wonder whether the bus reset would work on these
> > > devices if we set the target link speed back down to 2.5 GT/s.
> > 
> > Tested and does not work. Secondary bus reset (=Hot Reset) is broken
> > also when link is forced to 2.5 GT/s. It looks like that when
> > PCI_EXP_LNKCTL2_TLS is not set to 2.5 GT/s when setting
> > PCI_EXP_LNKCTL_RL it results in the same effect / issue like calling
> > secondary bus reset.
> > 
> > > >  };
> > > >  
> > > >  enum pci_irq_reroute_variant {
> > > > -- 
> > > > 2.20.1
> > > > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-02 12:08         ` Pali Rohár
@ 2021-06-02 15:55           ` Bjorn Helgaas
  2021-06-02 19:03             ` Pali Rohár
  0 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2021-06-02 15:55 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Wed, Jun 02, 2021 at 02:08:16PM +0200, Pali Rohár wrote:
> On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:

> > I wonder if this could be restructured as a generic quirk in quirks.c
> > that simply set the bridge's TLS to 2.5 GT/s during enumeration.  Or
> > would the retrain fail even in that case?
> 
> If I understand it correctly then PCIe link is already up when kernel
> starts enumeration. So setting Bridge TLS to 2.5 GT/s does not change
> anything here.
> 
> Moreover it would have side effect that cards which are already set to
> 5+ GT/s would be downgraded to 2.5 GT/s during enumeration and for
> increasing speed would be needed another round of "enumeration" to set a
> new TLS and retrain link again. As TLS affects link only after link goes
> into Recovery state.
> 
> So this would just complicate card enumeration and settings.

The current quirk complicates the ASPM code.  I'm hoping that if we
set the bridge's Target Link Speed during enumeration, the link
retrain will "just work" without complicating the ASPM code.

An enumeration quirk wouldn't have to set the bridge's TLS to 2.5
GT/s; the quirk would be attached to specific endpoint devices and
could set the bridge's TLS to whatever the endpoint supports.

> Moreover here we are dealing with specific OTP/EEPROM bug in Atheros
> chips, which was confirmed that exists. As I wrote in previous email, I
> was told that semi-official workaround is do Warm Reset or Cold Reset
> with turning power off from card. Which on most platforms / boards is
> not possible.

If there's a specific bug with a real root-cause analysis, please cite
it.  The threads mentioned in the current commit log are basically
informed speculation.

Bjorn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-02 15:55           ` Bjorn Helgaas
@ 2021-06-02 19:03             ` Pali Rohár
  2021-06-16 21:38               ` Bjorn Helgaas
  2021-06-21 14:39               ` Pali Rohár
  0 siblings, 2 replies; 24+ messages in thread
From: Pali Rohár @ 2021-06-02 19:03 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Wednesday 02 June 2021 10:55:59 Bjorn Helgaas wrote:
> On Wed, Jun 02, 2021 at 02:08:16PM +0200, Pali Rohár wrote:
> > On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> 
> > > I wonder if this could be restructured as a generic quirk in quirks.c
> > > that simply set the bridge's TLS to 2.5 GT/s during enumeration.  Or
> > > would the retrain fail even in that case?
> > 
> > If I understand it correctly then PCIe link is already up when kernel
> > starts enumeration. So setting Bridge TLS to 2.5 GT/s does not change
> > anything here.
> > 
> > Moreover it would have side effect that cards which are already set to
> > 5+ GT/s would be downgraded to 2.5 GT/s during enumeration and for
> > increasing speed would be needed another round of "enumeration" to set a
> > new TLS and retrain link again. As TLS affects link only after link goes
> > into Recovery state.
> > 
> > So this would just complicate card enumeration and settings.
> 
> The current quirk complicates the ASPM code.  I'm hoping that if we
> set the bridge's Target Link Speed during enumeration, the link
> retrain will "just work" without complicating the ASPM code.
> 
> An enumeration quirk wouldn't have to set the bridge's TLS to 2.5
> GT/s; the quirk would be attached to specific endpoint devices and
> could set the bridge's TLS to whatever the endpoint supports.

Now I see what you mean. Yes, I agree this is a good idea and can
simplify code. Quirk is not related to ASPM code and basically has
nothing with it, just I put it into aspm.c because this is the only
place where link retraining was activated.

But with this proposal there is one issue. Some kernel drivers already
overwrite PCI_EXP_LNKCTL2_TLS value. So if PCI enumeration code set some
value into PCI_EXP_LNKCTL2_TLS bits then drivers can change it and once
ASPM will try to retrain link this may cause this issue.

> > Moreover here we are dealing with specific OTP/EEPROM bug in Atheros
> > chips, which was confirmed that exists. As I wrote in previous email, I
> > was told that semi-official workaround is do Warm Reset or Cold Reset
> > with turning power off from card. Which on most platforms / boards is
> > not possible.
> 
> If there's a specific bug with a real root-cause analysis, please cite
> it.  The threads mentioned in the current commit log are basically
> informed speculation.

I had (private) discussion with Adrian Chadd about ABCD device id issue.
I hope that nobody is against if I put there summary and important parts
about secondary bus reset (=hot reset):


The reason for abcd is because:
* the MAC has hardware that upon cold reset, will read EEPROM/OTP
  values for things like PCIe and other register defaults, and squirt
  them into the MAC/PHY/etc registers
* the default values for the PCIe bus pre-AR9300 were 0x168c:0xff<id>,
  where <id> is the normal chip ID
* the default values for the PCIe bus POST-AR9300 were 0x168c:0xabcd,
  where they're always that regardless of the chip family
* so yeah, all you know with 0x168c:0xabcd is there's an atheros
  device there, but not WHICH it is.

* the bug is that the reset line isn't held low for long enough, or it's
  bounced twice in quick succession, before the MAC has time to program
  in the defaults from EEPROM/OTP and it doesn't do it a second time.

* the MAC has hardware that upon cold reset, will read EEPROM/OTP
  values for things like PCIe and other register defaults, and squirt
  them into the MAC/PHY/etc registers

* need to use the external reset line OR try using D3, not D3hot
  (I assume that "external reset line" means PERST# - PCIe Warm Reset
  and "D3, not D3hot" means D3cold)


And now my experiments: Disabling and Enabling link via root bridge has
exactly same syndromes as hot reset on all tested cards. See that
different chips (pre-AR9300 and post-AR9300) have slightly different
behavior and it matches all my experiments (I wrote test details in
commit message). And doing link retrain when root bridge has non-2.5GT/s
value in PCI_EXP_LNKCTL2_TLS has also same effect as hot reset.
So based on same results from my experiments all these actions
(disabling link, hot reset and link retrain) have common issue.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-02 19:03             ` Pali Rohár
@ 2021-06-16 21:38               ` Bjorn Helgaas
  2021-06-21 14:28                 ` Pali Rohár
  2021-06-21 14:39               ` Pali Rohár
  1 sibling, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2021-06-16 21:38 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Wed, Jun 02, 2021 at 09:03:02PM +0200, Pali Rohár wrote:
> On Wednesday 02 June 2021 10:55:59 Bjorn Helgaas wrote:
> > On Wed, Jun 02, 2021 at 02:08:16PM +0200, Pali Rohár wrote:
> > > On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> > 
> > > > I wonder if this could be restructured as a generic quirk in quirks.c
> > > > that simply set the bridge's TLS to 2.5 GT/s during enumeration.  Or
> > > > would the retrain fail even in that case?
> > > 
> > > If I understand it correctly then PCIe link is already up when kernel
> > > starts enumeration. So setting Bridge TLS to 2.5 GT/s does not change
> > > anything here.
> > > 
> > > Moreover it would have side effect that cards which are already set to
> > > 5+ GT/s would be downgraded to 2.5 GT/s during enumeration and for
> > > increasing speed would be needed another round of "enumeration" to set a
> > > new TLS and retrain link again. As TLS affects link only after link goes
> > > into Recovery state.
> > > 
> > > So this would just complicate card enumeration and settings.
> > 
> > The current quirk complicates the ASPM code.  I'm hoping that if we
> > set the bridge's Target Link Speed during enumeration, the link
> > retrain will "just work" without complicating the ASPM code.
> > 
> > An enumeration quirk wouldn't have to set the bridge's TLS to 2.5
> > GT/s; the quirk would be attached to specific endpoint devices and
> > could set the bridge's TLS to whatever the endpoint supports.
> 
> Now I see what you mean. Yes, I agree this is a good idea and can
> simplify code. Quirk is not related to ASPM code and basically has
> nothing with it, just I put it into aspm.c because this is the only
> place where link retraining was activated.
> 
> But with this proposal there is one issue. Some kernel drivers already
> overwrite PCI_EXP_LNKCTL2_TLS value. So if PCI enumeration code set some
> value into PCI_EXP_LNKCTL2_TLS bits then drivers can change it and once
> ASPM will try to retrain link this may cause this issue.

I guess you mean the amdgpu, radeon, and hfi1 drivers.  They really
shouldn't be mucking with that stuff anyway.  But they do and are
unlikely to change because we don't have any good alternative.

One way around that would be to add some quirk code to
pcie_capability_write_word().  Ugly, but we do have something sort of
similar in pcie_capability_read_word() already.

Bjorn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-16 21:38               ` Bjorn Helgaas
@ 2021-06-21 14:28                 ` Pali Rohár
  2021-06-25 20:19                   ` Bjorn Helgaas
  0 siblings, 1 reply; 24+ messages in thread
From: Pali Rohár @ 2021-06-21 14:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Wednesday 16 June 2021 16:38:19 Bjorn Helgaas wrote:
> On Wed, Jun 02, 2021 at 09:03:02PM +0200, Pali Rohár wrote:
> > On Wednesday 02 June 2021 10:55:59 Bjorn Helgaas wrote:
> > > On Wed, Jun 02, 2021 at 02:08:16PM +0200, Pali Rohár wrote:
> > > > On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> > > 
> > > > > I wonder if this could be restructured as a generic quirk in quirks.c
> > > > > that simply set the bridge's TLS to 2.5 GT/s during enumeration.  Or
> > > > > would the retrain fail even in that case?
> > > > 
> > > > If I understand it correctly then PCIe link is already up when kernel
> > > > starts enumeration. So setting Bridge TLS to 2.5 GT/s does not change
> > > > anything here.
> > > > 
> > > > Moreover it would have side effect that cards which are already set to
> > > > 5+ GT/s would be downgraded to 2.5 GT/s during enumeration and for
> > > > increasing speed would be needed another round of "enumeration" to set a
> > > > new TLS and retrain link again. As TLS affects link only after link goes
> > > > into Recovery state.
> > > > 
> > > > So this would just complicate card enumeration and settings.
> > > 
> > > The current quirk complicates the ASPM code.  I'm hoping that if we
> > > set the bridge's Target Link Speed during enumeration, the link
> > > retrain will "just work" without complicating the ASPM code.
> > > 
> > > An enumeration quirk wouldn't have to set the bridge's TLS to 2.5
> > > GT/s; the quirk would be attached to specific endpoint devices and
> > > could set the bridge's TLS to whatever the endpoint supports.
> > 
> > Now I see what you mean. Yes, I agree this is a good idea and can
> > simplify code. Quirk is not related to ASPM code and basically has
> > nothing with it, just I put it into aspm.c because this is the only
> > place where link retraining was activated.
> > 
> > But with this proposal there is one issue. Some kernel drivers already
> > overwrite PCI_EXP_LNKCTL2_TLS value. So if PCI enumeration code set some
> > value into PCI_EXP_LNKCTL2_TLS bits then drivers can change it and once
> > ASPM will try to retrain link this may cause this issue.
> 
> I guess you mean the amdgpu, radeon, and hfi1 drivers.  They really
> shouldn't be mucking with that stuff anyway.  But they do and are
> unlikely to change because we don't have any good alternative.

Yea, these are examples of such drivers... Maybe it is a good idea to
ask those people why changing PCI_EXP_LNKCTL2_TLS is needed. As these
drivers are often derived from codebase of shared multisystem drivers or
from common documentation, it is possible that original source has this
code as a workaround or common pattern used in other operating systems,
not related to linux...

> One way around that would be to add some quirk code to
> pcie_capability_write_word().  Ugly, but we do have something sort of
> similar in pcie_capability_read_word() already.

Bjorn, do you really want such ugly hack in pcie_capability_write_word?
It is common code used and called from lot of places so it may affect
whole system if in future somebody changes it again...

Or we can let it as is, say that those drivers which are doing it are
buggy and for future try to reduce code which touches registers
PCI_EXP_LNKCTL2_TLS. Good code review or some checkpatch.pl warnings may
prevent introduction of other code which will do it.

> 
> Bjorn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-02 19:03             ` Pali Rohár
  2021-06-16 21:38               ` Bjorn Helgaas
@ 2021-06-21 14:39               ` Pali Rohár
  1 sibling, 0 replies; 24+ messages in thread
From: Pali Rohár @ 2021-06-21 14:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Wednesday 02 June 2021 21:03:02 Pali Rohár wrote:
> On Wednesday 02 June 2021 10:55:59 Bjorn Helgaas wrote:
> > On Wed, Jun 02, 2021 at 02:08:16PM +0200, Pali Rohár wrote:
> > > On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> > 
> > > > I wonder if this could be restructured as a generic quirk in quirks.c
> > > > that simply set the bridge's TLS to 2.5 GT/s during enumeration.  Or
> > > > would the retrain fail even in that case?
> > > 
> > > If I understand it correctly then PCIe link is already up when kernel
> > > starts enumeration. So setting Bridge TLS to 2.5 GT/s does not change
> > > anything here.
> > > 
> > > Moreover it would have side effect that cards which are already set to
> > > 5+ GT/s would be downgraded to 2.5 GT/s during enumeration and for
> > > increasing speed would be needed another round of "enumeration" to set a
> > > new TLS and retrain link again. As TLS affects link only after link goes
> > > into Recovery state.
> > > 
> > > So this would just complicate card enumeration and settings.
> > 
> > The current quirk complicates the ASPM code.  I'm hoping that if we
> > set the bridge's Target Link Speed during enumeration, the link
> > retrain will "just work" without complicating the ASPM code.
> > 
> > An enumeration quirk wouldn't have to set the bridge's TLS to 2.5
> > GT/s; the quirk would be attached to specific endpoint devices and
> > could set the bridge's TLS to whatever the endpoint supports.
> 
> Now I see what you mean. Yes, I agree this is a good idea and can
> simplify code. Quirk is not related to ASPM code and basically has
> nothing with it, just I put it into aspm.c because this is the only
> place where link retraining was activated.
> 
> But with this proposal there is one issue. Some kernel drivers already
> overwrite PCI_EXP_LNKCTL2_TLS value. So if PCI enumeration code set some
> value into PCI_EXP_LNKCTL2_TLS bits then drivers can change it and once
> ASPM will try to retrain link this may cause this issue.

And I see there another issue which does not allow to fully move code
from aspm.c file. Bridge does not have to support changing value in
PCI_EXP_LNKCTL2_TLS register or does not support setting 2.5 GT/s bits.
So logic for these checks needs to be in code which tries to retrain
link, in our case aspm.c.

And yes, there are bridges which do not support this functionality. It
applies also for PCI bridge implemented / emulated by kernel driver
drivers/pci/pci-bridge-emul.c.

So what we can do is just to move code which sets PCI_EXP_LNKCTL2_TLS
bits, not code which reads them and verifies that bridge is in correct
state.

Current quirk code has already code path when it forbid link retraining
and therefore forbid enabling ASPM when "bad bridge" (e.g. that emulated
by kernel) is in use.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-21 14:28                 ` Pali Rohár
@ 2021-06-25 20:19                   ` Bjorn Helgaas
  2021-06-26 14:38                     ` Pali Rohár
  0 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2021-06-25 20:19 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Mon, Jun 21, 2021 at 04:28:55PM +0200, Pali Rohár wrote:
> On Wednesday 16 June 2021 16:38:19 Bjorn Helgaas wrote:
> > On Wed, Jun 02, 2021 at 09:03:02PM +0200, Pali Rohár wrote:
> > > On Wednesday 02 June 2021 10:55:59 Bjorn Helgaas wrote:
> > > > On Wed, Jun 02, 2021 at 02:08:16PM +0200, Pali Rohár wrote:
> > > > > On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> > > > 
> > > > > > I wonder if this could be restructured as a generic quirk
> > > > > > in quirks.c that simply set the bridge's TLS to 2.5 GT/s
> > > > > > during enumeration.  Or would the retrain fail even in
> > > > > > that case?
> > > > > 
> > > > > If I understand it correctly then PCIe link is already up
> > > > > when kernel starts enumeration. So setting Bridge TLS to 2.5
> > > > > GT/s does not change anything here.
> > > > > 
> > > > > Moreover it would have side effect that cards which are
> > > > > already set to 5+ GT/s would be downgraded to 2.5 GT/s
> > > > > during enumeration and for increasing speed would be needed
> > > > > another round of "enumeration" to set a new TLS and retrain
> > > > > link again. As TLS affects link only after link goes into
> > > > > Recovery state.
> > > > > 
> > > > > So this would just complicate card enumeration and settings.
> > > > 
> > > > The current quirk complicates the ASPM code.  I'm hoping that
> > > > if we set the bridge's Target Link Speed during enumeration,
> > > > the link retrain will "just work" without complicating the
> > > > ASPM code.
> > > > 
> > > > An enumeration quirk wouldn't have to set the bridge's TLS to
> > > > 2.5 GT/s; the quirk would be attached to specific endpoint
> > > > devices and could set the bridge's TLS to whatever the
> > > > endpoint supports.
> > > 
> > > Now I see what you mean. Yes, I agree this is a good idea and
> > > can simplify code. Quirk is not related to ASPM code and
> > > basically has nothing with it, just I put it into aspm.c because
> > > this is the only place where link retraining was activated.
> > > 
> > > But with this proposal there is one issue. Some kernel drivers
> > > already overwrite PCI_EXP_LNKCTL2_TLS value. So if PCI
> > > enumeration code set some value into PCI_EXP_LNKCTL2_TLS bits
> > > then drivers can change it and once ASPM will try to retrain
> > > link this may cause this issue.
> > 
> > I guess you mean the amdgpu, radeon, and hfi1 drivers.  They
> > really shouldn't be mucking with that stuff anyway.  But they do
> > and are unlikely to change because we don't have any good
> > alternative.
> 
> Yea, these are examples of such drivers... Maybe it is a good idea
> to ask those people why changing PCI_EXP_LNKCTL2_TLS is needed. As
> these drivers are often derived from codebase of shared multisystem
> drivers or from common documentation, it is possible that original
> source has this code as a workaround or common pattern used in other
> operating systems, not related to linux...
> 
> > One way around that would be to add some quirk code to
> > pcie_capability_write_word().  Ugly, but we do have something sort
> > of similar in pcie_capability_read_word() already.
> 
> Bjorn, do you really want such ugly hack in
> pcie_capability_write_word?  It is common code used and called from
> lot of places so it may affect whole system if in future somebody
> changes it again...

I don't know which is uglier, a quirk in pcie_capability_write_word()
or a quirk in aspm.c that has nothing to do with ASPM.  They're both
ugly :)

FWIW, in pcie_capability_write_word() I would envision not a check for
Atheros, but rather something like a "dev->max_target_link_speed" that
could be set by an Atheros quirk.  It does get uglier if we want to
restrict the bridge's link speed via a quirk, then unrestrict it when
the endpoint is unplugged.

I know pcie_downgrade_link_to_gen1() only returns failure for corner
cases that "should not occur," but I don't like the fact that it's
possible to change Common Clock Configuration without doing the
retrain.  That would leave us with incorrect ASPM exit latencies,
which is really hard to debug.

Here's the relevant text in the spec (PCIe r5.0):

  7.5.3.6 Link Capabilities

    L0s Exit Latency - This field indicates the L0s exit latency for
    the given PCI Express Link. The value reported indicates the
    length of time this Port requires to complete transition from L0s
    to L0. ...

    Note that exit latencies may be influenced by PCI Express
    reference clock configuration depending upon whether a component
    uses a common or separate reference clock.

  7.5.3.6 Link Control
    Common Clock Configuration - When Set, this bit indicates that
    this component and the component at the opposite end of this Link
    are operating with a distributed common reference clock. ...

    After changing the value in this bit in both components on a Link,
    software must trigger the Link to retrain by writing a 1b to the
    Retrain Link bit of the Downstream Port.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-06-25 20:19                   ` Bjorn Helgaas
@ 2021-06-26 14:38                     ` Pali Rohár
  0 siblings, 0 replies; 24+ messages in thread
From: Pali Rohár @ 2021-06-26 14:38 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Kalle Valo, Toke Høiland-Jørgensen,
	Marek Behún, Krzysztof Wilczyński, vtolkm, Rob Herring,
	Ilias Apalodimas, Thomas Petazzoni, linux-pci, ath10k,
	linux-wireless, linux-kernel

On Friday 25 June 2021 15:19:36 Bjorn Helgaas wrote:
> On Mon, Jun 21, 2021 at 04:28:55PM +0200, Pali Rohár wrote:
> > On Wednesday 16 June 2021 16:38:19 Bjorn Helgaas wrote:
> > > On Wed, Jun 02, 2021 at 09:03:02PM +0200, Pali Rohár wrote:
> > > > On Wednesday 02 June 2021 10:55:59 Bjorn Helgaas wrote:
> > > > > On Wed, Jun 02, 2021 at 02:08:16PM +0200, Pali Rohár wrote:
> > > > > > On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> > > > > 
> > > > > > > I wonder if this could be restructured as a generic quirk
> > > > > > > in quirks.c that simply set the bridge's TLS to 2.5 GT/s
> > > > > > > during enumeration.  Or would the retrain fail even in
> > > > > > > that case?
> > > > > > 
> > > > > > If I understand it correctly then PCIe link is already up
> > > > > > when kernel starts enumeration. So setting Bridge TLS to 2.5
> > > > > > GT/s does not change anything here.
> > > > > > 
> > > > > > Moreover it would have side effect that cards which are
> > > > > > already set to 5+ GT/s would be downgraded to 2.5 GT/s
> > > > > > during enumeration and for increasing speed would be needed
> > > > > > another round of "enumeration" to set a new TLS and retrain
> > > > > > link again. As TLS affects link only after link goes into
> > > > > > Recovery state.
> > > > > > 
> > > > > > So this would just complicate card enumeration and settings.
> > > > > 
> > > > > The current quirk complicates the ASPM code.  I'm hoping that
> > > > > if we set the bridge's Target Link Speed during enumeration,
> > > > > the link retrain will "just work" without complicating the
> > > > > ASPM code.
> > > > > 
> > > > > An enumeration quirk wouldn't have to set the bridge's TLS to
> > > > > 2.5 GT/s; the quirk would be attached to specific endpoint
> > > > > devices and could set the bridge's TLS to whatever the
> > > > > endpoint supports.
> > > > 
> > > > Now I see what you mean. Yes, I agree this is a good idea and
> > > > can simplify code. Quirk is not related to ASPM code and
> > > > basically has nothing with it, just I put it into aspm.c because
> > > > this is the only place where link retraining was activated.
> > > > 
> > > > But with this proposal there is one issue. Some kernel drivers
> > > > already overwrite PCI_EXP_LNKCTL2_TLS value. So if PCI
> > > > enumeration code set some value into PCI_EXP_LNKCTL2_TLS bits
> > > > then drivers can change it and once ASPM will try to retrain
> > > > link this may cause this issue.
> > > 
> > > I guess you mean the amdgpu, radeon, and hfi1 drivers.  They
> > > really shouldn't be mucking with that stuff anyway.  But they do
> > > and are unlikely to change because we don't have any good
> > > alternative.
> > 
> > Yea, these are examples of such drivers... Maybe it is a good idea
> > to ask those people why changing PCI_EXP_LNKCTL2_TLS is needed. As
> > these drivers are often derived from codebase of shared multisystem
> > drivers or from common documentation, it is possible that original
> > source has this code as a workaround or common pattern used in other
> > operating systems, not related to linux...
> > 
> > > One way around that would be to add some quirk code to
> > > pcie_capability_write_word().  Ugly, but we do have something sort
> > > of similar in pcie_capability_read_word() already.
> > 
> > Bjorn, do you really want such ugly hack in
> > pcie_capability_write_word?  It is common code used and called from
> > lot of places so it may affect whole system if in future somebody
> > changes it again...
> 
> I don't know which is uglier, a quirk in pcie_capability_write_word()
> or a quirk in aspm.c that has nothing to do with ASPM.  They're both
> ugly :)

Ok :-)

> FWIW, in pcie_capability_write_word() I would envision not a check for
> Atheros, but rather something like a "dev->max_target_link_speed" that
> could be set by an Atheros quirk.  It does get uglier if we want to
> restrict the bridge's link speed via a quirk, then unrestrict it when
> the endpoint is unplugged.
> 
> I know pcie_downgrade_link_to_gen1() only returns failure for corner
> cases that "should not occur,"

It is not only corner case. It happens _always_ with at least two
pci drivers.

As I wrote in other email due to this issue, some quirk code which
allows / disallows link retraining is required in aspm.c file.

> but I don't like the fact that it's
> possible to change Common Clock Configuration without doing the
> retrain.  That would leave us with incorrect ASPM exit latencies,
> which is really hard to debug.

I see... Any idea how to solve this issue?

> Here's the relevant text in the spec (PCIe r5.0):
> 
>   7.5.3.6 Link Capabilities
> 
>     L0s Exit Latency - This field indicates the L0s exit latency for
>     the given PCI Express Link. The value reported indicates the
>     length of time this Port requires to complete transition from L0s
>     to L0. ...
> 
>     Note that exit latencies may be influenced by PCI Express
>     reference clock configuration depending upon whether a component
>     uses a common or separate reference clock.
> 
>   7.5.3.6 Link Control
>     Common Clock Configuration - When Set, this bit indicates that
>     this component and the component at the opposite end of this Link
>     are operating with a distributed common reference clock. ...
> 
>     After changing the value in this bit in both components on a Link,
>     software must trigger the Link to retrain by writing a 1b to the
>     Retrain Link bit of the Downstream Port.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
  2021-05-05 16:33 ` [PATCH v3] " Pali Rohár
                     ` (2 preceding siblings ...)
  2021-06-01 20:05   ` Bjorn Helgaas
@ 2021-10-05 19:43   ` Jannis
  3 siblings, 0 replies; 24+ messages in thread
From: Jannis @ 2021-10-05 19:43 UTC (permalink / raw)
  To: pali
  Cc: ath10k, bhelgaas, ilias.apalodimas, kabel, kvalo, kw,
	linux-kernel, linux-pci, linux-wireless, robh, thomas.petazzoni,
	toke, vtolkm

Hello!

Just tested this patch on the SolidRun Clearfog Pro with QCA988x based 
WLE600VX wifi card. Fixes the PCI issues, works with no directly visible 
side effects on (at least) kernel 5.10.y and 5.14.y.

Tested-by: Jannis Finkler <jannis@imserv.org>



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-10-05 19:49 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-26 12:43 [PATCH] PCI: Disallow retraining link for Atheros QCA98xx chips on non-Gen1 PCIe bridges Pali Rohár
2021-03-26 18:13 ` Toke Høiland-Jørgensen
2021-03-27  0:14 ` Krzysztof Wilczyński
2021-03-27 13:29   ` Pali Rohár
2021-03-27 14:42     ` Marek Behún
2021-03-27 17:33       ` Pali Rohár
2021-04-27 10:55 ` [PATCH v2] PCI: Disallow retraining link for Atheros " Pali Rohár
2021-04-30 11:41   ` Pali Rohár
2021-05-05 16:33 ` [PATCH v3] " Pali Rohár
2021-05-11 20:39   ` Pali Rohár
2021-05-28  0:08     ` Pali Rohár
2021-06-01 11:28   ` Krzysztof Wilczyński
2021-06-01 20:05   ` Bjorn Helgaas
2021-06-01 21:18     ` Pali Rohár
2021-06-02  0:00       ` Bjorn Helgaas
2021-06-02 12:08         ` Pali Rohár
2021-06-02 15:55           ` Bjorn Helgaas
2021-06-02 19:03             ` Pali Rohár
2021-06-16 21:38               ` Bjorn Helgaas
2021-06-21 14:28                 ` Pali Rohár
2021-06-25 20:19                   ` Bjorn Helgaas
2021-06-26 14:38                     ` Pali Rohár
2021-06-21 14:39               ` Pali Rohár
2021-10-05 19:43   ` Jannis

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).