All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] at803x: don't power-down SGMII link
@ 2016-10-24 10:40 Zefir Kurtisi
  2016-10-24 10:40 ` [PATCH 1/2] Revert "at803x: fix suspend/resume for SGMII link" Zefir Kurtisi
                   ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Zefir Kurtisi @ 2016-10-24 10:40 UTC (permalink / raw)
  To: netdev; +Cc: timur, andrew, f.fainelli

In a device where the ar8031 operates in SGMII mode, we
observed that after a suspend-resume cycle in very rare
cases the copper side autonegotiation secceeds but the
SGMII side fails to come up.

As a work-around, a patch was provided that on suspend and
resume powers the SGMII link down and up along with the
copper side. This fixed the observed failure, but
introduced a regression Timur Tabi observed: once the SGMII
is powered down, the PHY is inaccessible by the CPU and
with that e.g. can't be re-initialized after suspend.

Since the original issue could not be reproduced by others,
this series provides an alternative handling:
* the first patch reverts the prvious fix that powers down
  SGMII
* the second patch adds double-checking for the observed
  failure condition

Zefir Kurtisi (2):
  Revert "at803x: fix suspend/resume for SGMII link"
  at803x: double check SGMII side autoneg

 drivers/net/phy/at803x.c | 65 +++++++++++++++++++++++++++++-------------------
 1 file changed, 39 insertions(+), 26 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/2] Revert "at803x: fix suspend/resume for SGMII link"
  2016-10-24 10:40 [PATCH 0/2] at803x: don't power-down SGMII link Zefir Kurtisi
@ 2016-10-24 10:40 ` Zefir Kurtisi
  2016-10-27 20:05   ` David Miller
  2016-10-24 10:40 ` [PATCH 2/2] at803x: double check SGMII side autoneg Zefir Kurtisi
  2016-10-25 17:31 ` [PATCH 0/2] at803x: don't power-down SGMII link Timur Tabi
  2 siblings, 1 reply; 46+ messages in thread
From: Zefir Kurtisi @ 2016-10-24 10:40 UTC (permalink / raw)
  To: netdev; +Cc: timur, andrew, f.fainelli

This reverts commit 98267311fe3b334ae7c107fa0e2413adcf3ba735.

Suspending the SGMII alongside the copper side
made the at803x inaccessable while powered down,
e.g. it can't be re-probed after suspend.

Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
---
 drivers/net/phy/at803x.c | 26 --------------------------
 1 file changed, 26 deletions(-)

diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
index f279a89..cf74d10 100644
--- a/drivers/net/phy/at803x.c
+++ b/drivers/net/phy/at803x.c
@@ -52,9 +52,6 @@
 #define AT803X_DEBUG_REG_5			0x05
 #define AT803X_DEBUG_TX_CLK_DLY_EN		BIT(8)
 
-#define AT803X_REG_CHIP_CONFIG			0x1f
-#define AT803X_BT_BX_REG_SEL			0x8000
-
 #define ATH8030_PHY_ID 0x004dd076
 #define ATH8031_PHY_ID 0x004dd074
 #define ATH8035_PHY_ID 0x004dd072
@@ -209,7 +206,6 @@ static int at803x_suspend(struct phy_device *phydev)
 {
 	int value;
 	int wol_enabled;
-	int ccr;
 
 	mutex_lock(&phydev->lock);
 
@@ -225,16 +221,6 @@ static int at803x_suspend(struct phy_device *phydev)
 
 	phy_write(phydev, MII_BMCR, value);
 
-	if (phydev->interface != PHY_INTERFACE_MODE_SGMII)
-		goto done;
-
-	/* also power-down SGMII interface */
-	ccr = phy_read(phydev, AT803X_REG_CHIP_CONFIG);
-	phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr & ~AT803X_BT_BX_REG_SEL);
-	phy_write(phydev, MII_BMCR, phy_read(phydev, MII_BMCR) | BMCR_PDOWN);
-	phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr | AT803X_BT_BX_REG_SEL);
-
-done:
 	mutex_unlock(&phydev->lock);
 
 	return 0;
@@ -243,7 +229,6 @@ static int at803x_suspend(struct phy_device *phydev)
 static int at803x_resume(struct phy_device *phydev)
 {
 	int value;
-	int ccr;
 
 	mutex_lock(&phydev->lock);
 
@@ -251,17 +236,6 @@ static int at803x_resume(struct phy_device *phydev)
 	value &= ~(BMCR_PDOWN | BMCR_ISOLATE);
 	phy_write(phydev, MII_BMCR, value);
 
-	if (phydev->interface != PHY_INTERFACE_MODE_SGMII)
-		goto done;
-
-	/* also power-up SGMII interface */
-	ccr = phy_read(phydev, AT803X_REG_CHIP_CONFIG);
-	phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr & ~AT803X_BT_BX_REG_SEL);
-	value = phy_read(phydev, MII_BMCR) & ~(BMCR_PDOWN | BMCR_ISOLATE);
-	phy_write(phydev, MII_BMCR, value);
-	phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr | AT803X_BT_BX_REG_SEL);
-
-done:
 	mutex_unlock(&phydev->lock);
 
 	return 0;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/2] at803x: double check SGMII side autoneg
  2016-10-24 10:40 [PATCH 0/2] at803x: don't power-down SGMII link Zefir Kurtisi
  2016-10-24 10:40 ` [PATCH 1/2] Revert "at803x: fix suspend/resume for SGMII link" Zefir Kurtisi
@ 2016-10-24 10:40 ` Zefir Kurtisi
  2016-10-27 20:05   ` David Miller
                     ` (3 more replies)
  2016-10-25 17:31 ` [PATCH 0/2] at803x: don't power-down SGMII link Timur Tabi
  2 siblings, 4 replies; 46+ messages in thread
From: Zefir Kurtisi @ 2016-10-24 10:40 UTC (permalink / raw)
  To: netdev; +Cc: timur, andrew, f.fainelli

In SGMII mode, we observed an autonegotiation issue
after power-down-up cycles where the copper side
reports successful link establishment but the
SGMII side's link is down.

This happened in a setup where the at8031 is
connected over SGMII to a eTSEC (fsl gianfar),
but so far could not be reproduced with other
Ethernet device / driver combinations.

This commit adds a wrapper function for at8031
that in case of operating in SGMII mode double
checks SGMII link state when generic aneg_done()
succeeds. It prints a warning on failure but
intentionally does not try to recover from this
state. As a result, if you ever see a warning
'803x_aneg_done: SGMII link is not ok' you will
end up having an Ethernet link up but won't get
any data through. This should not happen, if it
does, please contact the module maintainer.

Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
---
 drivers/net/phy/at803x.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
index cf74d10..a52b560 100644
--- a/drivers/net/phy/at803x.c
+++ b/drivers/net/phy/at803x.c
@@ -42,10 +42,18 @@
 #define AT803X_MMD_ACCESS_CONTROL		0x0D
 #define AT803X_MMD_ACCESS_CONTROL_DATA		0x0E
 #define AT803X_FUNC_DATA			0x4003
+#define AT803X_REG_CHIP_CONFIG			0x1f
+#define AT803X_BT_BX_REG_SEL			0x8000
 
 #define AT803X_DEBUG_ADDR			0x1D
 #define AT803X_DEBUG_DATA			0x1E
 
+#define AT803X_MODE_CFG_MASK			0x0F
+#define AT803X_MODE_CFG_SGMII			0x01
+
+#define AT803X_PSSR			0x11	/*PHY-Specific Status Register*/
+#define AT803X_PSSR_MR_AN_COMPLETE	0x0200
+
 #define AT803X_DEBUG_REG_0			0x00
 #define AT803X_DEBUG_RX_CLK_DLY_EN		BIT(15)
 
@@ -355,6 +363,36 @@ static void at803x_link_change_notify(struct phy_device *phydev)
 	}
 }
 
+static int at803x_aneg_done(struct phy_device *phydev)
+{
+	int ccr;
+
+	int aneg_done = genphy_aneg_done(phydev);
+	if (aneg_done != BMSR_ANEGCOMPLETE)
+		return aneg_done;
+
+	/*
+	 * in SGMII mode, if copper side autoneg is successful,
+	 * also check SGMII side autoneg result
+	 */
+	ccr = phy_read(phydev, AT803X_REG_CHIP_CONFIG);
+	if ((ccr & AT803X_MODE_CFG_MASK) != AT803X_MODE_CFG_SGMII)
+		return aneg_done;
+
+	/* switch to SGMII/fiber page */
+	phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr & ~AT803X_BT_BX_REG_SEL);
+
+	/* check if the SGMII link is OK. */
+	if (!(phy_read(phydev, AT803X_PSSR) & AT803X_PSSR_MR_AN_COMPLETE)) {
+		pr_warn("803x_aneg_done: SGMII link is not ok\n");
+		aneg_done = 0;
+	}
+	/* switch back to copper page */
+	phy_write(phydev, AT803X_REG_CHIP_CONFIG, ccr | AT803X_BT_BX_REG_SEL);
+
+	return aneg_done;
+}
+
 static struct phy_driver at803x_driver[] = {
 {
 	/* ATHEROS 8035 */
@@ -406,6 +444,7 @@ static struct phy_driver at803x_driver[] = {
 	.flags			= PHY_HAS_INTERRUPT,
 	.config_aneg		= genphy_config_aneg,
 	.read_status		= genphy_read_status,
+	.aneg_done		= at803x_aneg_done,
 	.ack_interrupt		= &at803x_ack_interrupt,
 	.config_intr		= &at803x_config_intr,
 } };
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/2] at803x: don't power-down SGMII link
  2016-10-24 10:40 [PATCH 0/2] at803x: don't power-down SGMII link Zefir Kurtisi
  2016-10-24 10:40 ` [PATCH 1/2] Revert "at803x: fix suspend/resume for SGMII link" Zefir Kurtisi
  2016-10-24 10:40 ` [PATCH 2/2] at803x: double check SGMII side autoneg Zefir Kurtisi
@ 2016-10-25 17:31 ` Timur Tabi
  2016-10-27  8:05   ` Zefir Kurtisi
  2 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2016-10-25 17:31 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev; +Cc: andrew, f.fainelli

Zefir Kurtisi wrote:
> In a device where the ar8031 operates in SGMII mode, we
> observed that after a suspend-resume cycle in very rare
> cases the copper side autonegotiation secceeds but the
> SGMII side fails to come up.
>
> As a work-around, a patch was provided that on suspend and
> resume powers the SGMII link down and up along with the
> copper side. This fixed the observed failure, but
> introduced a regression Timur Tabi observed: once the SGMII
> is powered down, the PHY is inaccessible by the CPU and
> with that e.g. can't be re-initialized after suspend.
>
> Since the original issue could not be reproduced by others,
> this series provides an alternative handling:
> * the first patch reverts the prvious fix that powers down
>    SGMII
> * the second patch adds double-checking for the observed
>    failure condition
>
> Zefir Kurtisi (2):
>    Revert "at803x: fix suspend/resume for SGMII link"
>    at803x: double check SGMII side autoneg

Tested-by: Timur Tabi <timur@codeaurora.org>

With these patches, the problem I was seeing no longer occurs, and the 
new code does not appear to break anything.  As before, I still have 
never seen the original problem, but this patchset seems to work for 
both of us.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/2] at803x: don't power-down SGMII link
  2016-10-25 17:31 ` [PATCH 0/2] at803x: don't power-down SGMII link Timur Tabi
@ 2016-10-27  8:05   ` Zefir Kurtisi
  0 siblings, 0 replies; 46+ messages in thread
From: Zefir Kurtisi @ 2016-10-27  8:05 UTC (permalink / raw)
  To: Timur Tabi, netdev; +Cc: andrew, f.fainelli

On 10/25/2016 07:31 PM, Timur Tabi wrote:
> Zefir Kurtisi wrote:
>> In a device where the ar8031 operates in SGMII mode, we
>> observed that after a suspend-resume cycle in very rare
>> cases the copper side autonegotiation secceeds but the
>> SGMII side fails to come up.
>>
>> As a work-around, a patch was provided that on suspend and
>> resume powers the SGMII link down and up along with the
>> copper side. This fixed the observed failure, but
>> introduced a regression Timur Tabi observed: once the SGMII
>> is powered down, the PHY is inaccessible by the CPU and
>> with that e.g. can't be re-initialized after suspend.
>>
>> Since the original issue could not be reproduced by others,
>> this series provides an alternative handling:
>> * the first patch reverts the prvious fix that powers down
>>    SGMII
>> * the second patch adds double-checking for the observed
>>    failure condition
>>
>> Zefir Kurtisi (2):
>>    Revert "at803x: fix suspend/resume for SGMII link"
>>    at803x: double check SGMII side autoneg
> 
> Tested-by: Timur Tabi <timur@codeaurora.org>
> 
> With these patches, the problem I was seeing no longer occurs, and the new code
> does not appear to break anything.  As before, I still have never seen the
> original problem, but this patchset seems to work for both of us.
> 
Thanks for testing.

I could make the double checking configurable and disabled by default, but this
would render it useless to detect the potential issue in the field.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2016-10-24 10:40 ` [PATCH 2/2] at803x: double check SGMII side autoneg Zefir Kurtisi
@ 2016-10-27 20:05   ` David Miller
  2016-10-28 22:24   ` Timur Tabi
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: David Miller @ 2016-10-27 20:05 UTC (permalink / raw)
  To: zefir.kurtisi; +Cc: netdev, timur, andrew, f.fainelli

From: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Date: Mon, 24 Oct 2016 12:40:54 +0200

> In SGMII mode, we observed an autonegotiation issue
> after power-down-up cycles where the copper side
> reports successful link establishment but the
> SGMII side's link is down.
> 
> This happened in a setup where the at8031 is
> connected over SGMII to a eTSEC (fsl gianfar),
> but so far could not be reproduced with other
> Ethernet device / driver combinations.
> 
> This commit adds a wrapper function for at8031
> that in case of operating in SGMII mode double
> checks SGMII link state when generic aneg_done()
> succeeds. It prints a warning on failure but
> intentionally does not try to recover from this
> state. As a result, if you ever see a warning
> '803x_aneg_done: SGMII link is not ok' you will
> end up having an Ethernet link up but won't get
> any data through. This should not happen, if it
> does, please contact the module maintainer.
> 
> Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>

Applied.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] Revert "at803x: fix suspend/resume for SGMII link"
  2016-10-24 10:40 ` [PATCH 1/2] Revert "at803x: fix suspend/resume for SGMII link" Zefir Kurtisi
@ 2016-10-27 20:05   ` David Miller
  0 siblings, 0 replies; 46+ messages in thread
From: David Miller @ 2016-10-27 20:05 UTC (permalink / raw)
  To: zefir.kurtisi; +Cc: netdev, timur, andrew, f.fainelli

From: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Date: Mon, 24 Oct 2016 12:40:53 +0200

> This reverts commit 98267311fe3b334ae7c107fa0e2413adcf3ba735.
> 
> Suspending the SGMII alongside the copper side
> made the at803x inaccessable while powered down,
> e.g. it can't be re-probed after suspend.
> 
> Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>

Applied.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2016-10-24 10:40 ` [PATCH 2/2] at803x: double check SGMII side autoneg Zefir Kurtisi
  2016-10-27 20:05   ` David Miller
@ 2016-10-28 22:24   ` Timur Tabi
  2016-11-01 11:13     ` Zefir Kurtisi
  2017-01-17 23:32   ` Timur Tabi
  2017-05-22 20:12   ` Timur Tabi
  3 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2016-10-28 22:24 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev

Zefir Kurtisi wrote:
> +	/* check if the SGMII link is OK. */
> +	if (!(phy_read(phydev, AT803X_PSSR) & AT803X_PSSR_MR_AN_COMPLETE)) {
> +		pr_warn("803x_aneg_done: SGMII link is not ok\n");
> +		aneg_done = 0;

I see this message appear sometimes when bring up the interface via 
ifup.  However, contrary to your patch description, everything seems to 
work:

$ iperf3 -c 192.168.1.1 -t 3600
Connecting to host 192.168.1.1, port 5201
[  4] local 192.168.1.2 port 52640 connected to 192.168.1.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   109 MBytes   909 Mbits/sec    0    485 KBytes
[  4]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    622 KBytes

I wonder if you're impacted with all of the pause frame problems I'm having.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2016-10-28 22:24   ` Timur Tabi
@ 2016-11-01 11:13     ` Zefir Kurtisi
  0 siblings, 0 replies; 46+ messages in thread
From: Zefir Kurtisi @ 2016-11-01 11:13 UTC (permalink / raw)
  To: Timur Tabi, netdev

On 10/29/2016 12:24 AM, Timur Tabi wrote:
> Zefir Kurtisi wrote:
>> +    /* check if the SGMII link is OK. */
>> +    if (!(phy_read(phydev, AT803X_PSSR) & AT803X_PSSR_MR_AN_COMPLETE)) {
>> +        pr_warn("803x_aneg_done: SGMII link is not ok\n");
>> +        aneg_done = 0;
> 
> I see this message appear sometimes when bring up the interface via ifup. 
> However, contrary to your patch description, everything seems to work:
> 
Right so, seeing the message and still having the SGMII link up and working is
what happens most. But randomly (in our setup it is in the order of 1%) we see
that SGMII remains down after a suspend-resume cycle. This message is a required
but not sufficient condition.

> $ iperf3 -c 192.168.1.1 -t 3600
> Connecting to host 192.168.1.1, port 5201
> [  4] local 192.168.1.2 port 52640 connected to 192.168.1.1 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.00   sec   109 MBytes   909 Mbits/sec    0    485 KBytes
> [  4]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    622 KBytes
> 
> I wonder if you're impacted with all of the pause frame problems I'm having.
> 

I doubt, since resetting SGMII link alone (via toggling MII_BMCR on fibre page)
brings back the SGMII communication, I assume it to be below the MAC layer.

Are you aware of any related erratas for the at803x? I am not and therefore left
alone with my guesswork.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2016-10-24 10:40 ` [PATCH 2/2] at803x: double check SGMII side autoneg Zefir Kurtisi
  2016-10-27 20:05   ` David Miller
  2016-10-28 22:24   ` Timur Tabi
@ 2017-01-17 23:32   ` Timur Tabi
  2017-01-18 11:00     ` Zefir Kurtisi
  2017-05-22 20:12   ` Timur Tabi
  3 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-01-17 23:32 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev; +Cc: andrew, f.fainelli

On 10/24/2016 05:40 AM, Zefir Kurtisi wrote:
> As a result, if you ever see a warning
> '803x_aneg_done: SGMII link is not ok' you will
> end up having an Ethernet link up but won't get
> any data through. This should not happen, if it
> does, please contact the module maintainer.

I am now seeing this:

ubuntu@ubuntu:~$ ifup eth1
ubuntu@ubuntu:~$ [  588.687689] 803x_aneg_done: SGMII link is not ok
[  588.694909] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow 
control rx/tx
[  588.703985] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow 
control rx/tx

ubuntu@ubuntu:~$ ping 192.168.3.1
PING 192.168.3.1 (192.168.3.1) 56(84) bytes of data.
64 bytes from 192.168.3.1: icmp_seq=1 ttl=64 time=0.502 ms
64 bytes from 192.168.3.1: icmp_seq=2 ttl=64 time=0.244 ms
64 bytes from 192.168.3.1: icmp_seq=3 ttl=64 time=0.220 ms
^C
--- 192.168.3.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2107ms
rtt min/avg/max/mdev = 0.220/0.322/0.502/0.127 ms

So I do get the "SGMII link is not ok" message, but my connection is fine.  I 
don't know why the link-up message is displayed twice.  It's only displayed 
once if I use the genphy driver instead of the at803x driver.

I'm going to debug the at803x to see what it does that causes the double 
link-up message.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-01-17 23:32   ` Timur Tabi
@ 2017-01-18 11:00     ` Zefir Kurtisi
  2017-01-18 13:13       ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Zefir Kurtisi @ 2017-01-18 11:00 UTC (permalink / raw)
  To: Timur Tabi, netdev; +Cc: andrew, f.fainelli

On 01/18/2017 12:32 AM, Timur Tabi wrote:
> On 10/24/2016 05:40 AM, Zefir Kurtisi wrote:
>> As a result, if you ever see a warning
>> '803x_aneg_done: SGMII link is not ok' you will
>> end up having an Ethernet link up but won't get
>> any data through. This should not happen, if it
>> does, please contact the module maintainer.
> 
> I am now seeing this:
> 
> ubuntu@ubuntu:~$ ifup eth1
> ubuntu@ubuntu:~$ [  588.687689] 803x_aneg_done: SGMII link is not ok
> [  588.694909] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow control
> rx/tx
> [  588.703985] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow control
> rx/tx
> 
> ubuntu@ubuntu:~$ ping 192.168.3.1
> PING 192.168.3.1 (192.168.3.1) 56(84) bytes of data.
> 64 bytes from 192.168.3.1: icmp_seq=1 ttl=64 time=0.502 ms
> 64 bytes from 192.168.3.1: icmp_seq=2 ttl=64 time=0.244 ms
> 64 bytes from 192.168.3.1: icmp_seq=3 ttl=64 time=0.220 ms
> ^C
> --- 192.168.3.1 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 2107ms
> rtt min/avg/max/mdev = 0.220/0.322/0.502/0.127 ms
> 
> So I do get the "SGMII link is not ok" message, but my connection is fine.  I
> don't know why the link-up message is displayed twice.  It's only displayed once
> if I use the genphy driver instead of the at803x driver.
> 
> I'm going to debug the at803x to see what it does that causes the double link-up
> message.
> 

The fact that you see the warning means external autoneg completes before the
SGMII side in best case or SGMII link remains down in worst case. To prevent this,
I am using a private variant of the at8031 driver that ensures that the SGMII
autoneg is never restarted. If you ever end up with a dead link, feel free to test
with the related functions below.


Cheers,
Zefir

---
static int nt_at8031_no_soft_reset(struct phy_device *phydev)
{
	return 0;
}

/*
 * Powering the chip down occasionally causes SGMII link loss, which in turn
 * causes the connection to gianfar to remain down.
 *
 * To prevent permanent link loss, instead of power down just isolate pins.
 */
static int nt_at8031_suspend(struct phy_device *phydev)
{
	mutex_lock(&phydev->lock);
	phy_write(phydev, MII_BMCR, phy_read(phydev, MII_BMCR) | BMCR_ISOLATE);
	mutex_unlock(&phydev->lock);
	return 0;
}

static int nt_at8031_resume(struct phy_device *phydev)
{
	mutex_lock(&phydev->lock);
	phy_write(phydev, MII_BMCR, phy_read(phydev, MII_BMCR) & ~BMCR_ISOLATE);
	mutex_unlock(&phydev->lock);
	return 0;
}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-01-18 11:00     ` Zefir Kurtisi
@ 2017-01-18 13:13       ` Timur Tabi
  2017-01-18 13:53         ` Zefir Kurtisi
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-01-18 13:13 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev; +Cc: andrew, f.fainelli

Zefir Kurtisi wrote:
> The fact that you see the warning means external autoneg completes before the
> SGMII side in best case or SGMII link remains down in worst case.

So I'm no expert on this.  Are you saying that I might possibly be doing 
things backwards in my driver?  That is, I should be configuring the 
SGMII side of the link before I start autonegotiation?

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-01-18 13:13       ` Timur Tabi
@ 2017-01-18 13:53         ` Zefir Kurtisi
  2017-01-18 15:02           ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Zefir Kurtisi @ 2017-01-18 13:53 UTC (permalink / raw)
  To: Timur Tabi, netdev; +Cc: andrew, f.fainelli

On 01/18/2017 02:13 PM, Timur Tabi wrote:
> Zefir Kurtisi wrote:
>> The fact that you see the warning means external autoneg completes before the
>> SGMII side in best case or SGMII link remains down in worst case.
> 
> So I'm no expert on this.  Are you saying that I might possibly be doing things
> backwards in my driver?  That is, I should be configuring the SGMII side of the
> link before I start autonegotiation?
> 
No, not necessarily. The SGMII link default configuration is set such that you do
not have to bother at all.

That is, if in your case you see the warning popping up but the link always
regains connection, then it is an ignorable false positive.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-01-18 13:53         ` Zefir Kurtisi
@ 2017-01-18 15:02           ` Timur Tabi
  2017-01-19  9:43             ` Zefir Kurtisi
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-01-18 15:02 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev; +Cc: andrew, f.fainelli

On 01/18/2017 07:53 AM, Zefir Kurtisi wrote:
> No, not necessarily. The SGMII link default configuration is set such that you do
> not have to bother at all.

Does the SGMII link need to make the speed and duplex of the external link?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c#n50

Here is where I can configure the SGMII link to match whatever phydev says 
the external link is.  This is code that was handed down to me.  I've never 
really understood what the purpose was.  Shouldn't I just be able to 
configure the SGMII link to 1Gbs full-duplex, regardless of what the external 
link is set to?

> That is, if in your case you see the warning popping up but the link always
> regains connection, then it is an ignorable false positive.

Unfortunately, I can't really ignore it because genphy does this:

$ ifup eth1
[ 1034.754276] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow 
control rx/tx

But the at803x driver does this:

$ ifup eth1
[ 1020.507728] 803x_aneg_done: SGMII link is not ok
[ 1020.513517] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow 
control rx/tx
[ 1020.522839] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow 
control rx/tx

Customers are going to complain.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-01-18 15:02           ` Timur Tabi
@ 2017-01-19  9:43             ` Zefir Kurtisi
  2017-01-19 18:01               ` Florian Fainelli
  2017-01-20  2:38               ` Timur Tabi
  0 siblings, 2 replies; 46+ messages in thread
From: Zefir Kurtisi @ 2017-01-19  9:43 UTC (permalink / raw)
  To: Timur Tabi, netdev; +Cc: andrew, f.fainelli

On 01/18/2017 04:02 PM, Timur Tabi wrote:
> On 01/18/2017 07:53 AM, Zefir Kurtisi wrote:
>> No, not necessarily. The SGMII link default configuration is set such that you do
>> not have to bother at all.
> 
> Does the SGMII link need to make the speed and duplex of the external link?
> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c#n50
> 
> 
> Here is where I can configure the SGMII link to match whatever phydev says the
> external link is.  This is code that was handed down to me.  I've never really
> understood what the purpose was.  Shouldn't I just be able to configure the SGMII
> link to 1Gbs full-duplex, regardless of what the external link is set to?
> 

This is because the SGMII link is a transparent interface to the upper layer with
no means to inform the other side of the link type in the lower layer.

It always operates at 675MHz, which with two lines gives 1.25Gbps, which at 10/8
coding gives exactly 1Gbps net data rate. If the at803x's copper side
autonegotiates to 1Gbps, the bits traversing over the SGMII match the copper side
1:1. In case the copper side autonegotiates to e.g. 100Mbps, each bit at the
copper side on the SGMII bus is replicated and sent 10x times - or 100x times in
case of 10Mbps. The MAC side of the ETH needs to be aware of how the SGMII data
has to be interpreted, and this is why you have to set the bits you are referring to.

>> That is, if in your case you see the warning popping up but the link always
>> regains connection, then it is an ignorable false positive.
> 
> Unfortunately, I can't really ignore it because genphy does this:
> 
> $ ifup eth1
> [ 1034.754276] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow control
> rx/tx
> 
> But the at803x driver does this:
> 
> $ ifup eth1
> [ 1020.507728] 803x_aneg_done: SGMII link is not ok
> [ 1020.513517] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow control
> rx/tx
> [ 1020.522839] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow control
> rx/tx
> 
> Customers are going to complain.
> 
Yes, the double link-status message is annoying - I was referring to the other
message.

To track down who is causing the additional message, I would proceed with
following technique that helped me dig down a similar problem: since you control
the events in question and there is no risk of flooding the kernel log, in the top
of phy.c::phy_print_status() add a dump_stack() call. In the debug log ensure that
all of the traces end up in the same phydev->adjust_link() callback handler (in
your case emac_adjust_link()).

In the gianfar's handler there is an additional check whether the link really
changed before phy_print_status() is called, in emac_adjust_link() this happens
unconditionally - maybe that is the problem you are facing.

@Florian: is it safe to assume that when phydev->adjust_link() is called there was
in fact a link change (link, duplex, pause), or does the handler have to double
check for a change? I see some ETH drivers maintain a private old state instance
for that, while others don't.


Cheers,
Zefir

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-01-19  9:43             ` Zefir Kurtisi
@ 2017-01-19 18:01               ` Florian Fainelli
  2017-01-20  2:38               ` Timur Tabi
  1 sibling, 0 replies; 46+ messages in thread
From: Florian Fainelli @ 2017-01-19 18:01 UTC (permalink / raw)
  To: Zefir Kurtisi, Timur Tabi, netdev; +Cc: andrew



On 01/19/2017 01:43 AM, Zefir Kurtisi wrote:
> On 01/18/2017 04:02 PM, Timur Tabi wrote:
>> On 01/18/2017 07:53 AM, Zefir Kurtisi wrote:
>>> No, not necessarily. The SGMII link default configuration is set such that you do
>>> not have to bother at all.
>>
>> Does the SGMII link need to make the speed and duplex of the external link?
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c#n50
>>
>>
>> Here is where I can configure the SGMII link to match whatever phydev says the
>> external link is.  This is code that was handed down to me.  I've never really
>> understood what the purpose was.  Shouldn't I just be able to configure the SGMII
>> link to 1Gbs full-duplex, regardless of what the external link is set to?
>>
> 
> This is because the SGMII link is a transparent interface to the upper layer with
> no means to inform the other side of the link type in the lower layer.
> 
> It always operates at 675MHz, which with two lines gives 1.25Gbps, which at 10/8
> coding gives exactly 1Gbps net data rate. If the at803x's copper side
> autonegotiates to 1Gbps, the bits traversing over the SGMII match the copper side
> 1:1. In case the copper side autonegotiates to e.g. 100Mbps, each bit at the
> copper side on the SGMII bus is replicated and sent 10x times - or 100x times in
> case of 10Mbps. The MAC side of the ETH needs to be aware of how the SGMII data
> has to be interpreted, and this is why you have to set the bits you are referring to.
> 
>>> That is, if in your case you see the warning popping up but the link always
>>> regains connection, then it is an ignorable false positive.
>>
>> Unfortunately, I can't really ignore it because genphy does this:
>>
>> $ ifup eth1
>> [ 1034.754276] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow control
>> rx/tx
>>
>> But the at803x driver does this:
>>
>> $ ifup eth1
>> [ 1020.507728] 803x_aneg_done: SGMII link is not ok
>> [ 1020.513517] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow control
>> rx/tx
>> [ 1020.522839] qcom-emac QCOM8070:00 eth1: Link is Up - 1Gbps/Full - flow control
>> rx/tx
>>
>> Customers are going to complain.
>>
> Yes, the double link-status message is annoying - I was referring to the other
> message.
> 
> To track down who is causing the additional message, I would proceed with
> following technique that helped me dig down a similar problem: since you control
> the events in question and there is no risk of flooding the kernel log, in the top
> of phy.c::phy_print_status() add a dump_stack() call. In the debug log ensure that
> all of the traces end up in the same phydev->adjust_link() callback handler (in
> your case emac_adjust_link()).
> 
> In the gianfar's handler there is an additional check whether the link really
> changed before phy_print_status() is called, in emac_adjust_link() this happens
> unconditionally - maybe that is the problem you are facing.
> 
> @Florian: is it safe to assume that when phydev->adjust_link() is called there was
> in fact a link change (link, duplex, pause), or does the handler have to double
> check for a change? I see some ETH drivers maintain a private old state instance
> for that, while others don't.

Yes, it is not just safe, it is a contract. The reason most drivers
cache the values from one run on adjust_link to the other is because you
don't want to needlessly read/write registers if nothing has changed, or
just a subset of link parameters did change.

NB: at some point I was considering bringing this caching into the core
PHY library to save some housekeeping code in drivers...
-- 
Florian

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-01-19  9:43             ` Zefir Kurtisi
  2017-01-19 18:01               ` Florian Fainelli
@ 2017-01-20  2:38               ` Timur Tabi
  2017-01-20 15:31                 ` Zefir Kurtisi
  1 sibling, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-01-20  2:38 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev; +Cc: andrew, f.fainelli

Zefir Kurtisi wrote:
> It always operates at 675MHz, which with two lines gives 1.25Gbps,
> which at 10/8 coding gives exactly 1Gbps net data rate. If the
> at803x's copper side autonegotiates to 1Gbps, the bits traversing
> over the SGMII match the copper side 1:1. In case the copper side
> autonegotiates to e.g. 100Mbps, each bit at the copper side on the
> SGMII bus is replicated and sent 10x times - or 100x times in case of
> 10Mbps. The MAC side of the ETH needs to be aware of how the SGMII
> data has to be interpreted, and this is why you have to set the bits
> you are referring to.

So does this mean that the SGMII link should not be autonegotiated? I 
currently have this code:

     if (phydev->autoneg == AUTONEG_ENABLE) {
         val &= ~(FORCE_AN_RX_CFG | FORCE_AN_TX_CFG);
         val |= AN_ENABLE;
         writel(val, phy->base + EMAC_SGMII_PHY_AUTONEG_CFG2);
     } else {
         ...

So if the external PHY is set to autonegotiate, then the SGMII block is 
set to also negotiate.  Now that I think about it, this seems wrong. 
And in fact, I'm not sure how it works.  It seems that the this only 
makes sense if the SGMII block is configured to act as the only PHY. 
This is an option that the hardware supports but my driver does not.  So 
perhaps I should remove this part, and just do the rest:


	u32 speed_cfg;

	switch (phydev->speed) {
	case SPEED_10:
		speed_cfg = SPDMODE_10;
		break;
	case SPEED_100:
		speed_cfg = SPDMODE_100;
		break;
	case SPEED_1000:
		speed_cfg = SPDMODE_1000;
		break;
	default:
		return -EINVAL;
	}

	if (phydev->duplex == DUPLEX_FULL)
		speed_cfg |= DUPLEX_MODE;

	val &= ~AN_ENABLE;
	writel(speed_cfg, phy->base + EMAC_SGMII_PHY_SPEED_CFG1);
	writel(val, phy->base + EMAC_SGMII_PHY_AUTONEG_CFG2);

Should I be doing this all the time?

> To track down who is causing the additional message, I would proceed
> with following technique that helped me dig down a similar problem:
> since you control the events in question and there is no risk of
> flooding the kernel log, in the top of phy.c::phy_print_status() add
> a dump_stack() call. In the debug log ensure that all of the traces
> end up in the same phydev->adjust_link() callback handler (in your
> case emac_adjust_link()).

That's a good idea, thanks.

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-01-20  2:38               ` Timur Tabi
@ 2017-01-20 15:31                 ` Zefir Kurtisi
  0 siblings, 0 replies; 46+ messages in thread
From: Zefir Kurtisi @ 2017-01-20 15:31 UTC (permalink / raw)
  To: Timur Tabi, netdev; +Cc: andrew, f.fainelli

On 01/20/2017 03:38 AM, Timur Tabi wrote:
> Zefir Kurtisi wrote:
>> It always operates at 675MHz, which with two lines gives 1.25Gbps,
>> which at 10/8 coding gives exactly 1Gbps net data rate. If the
>> at803x's copper side autonegotiates to 1Gbps, the bits traversing
>> over the SGMII match the copper side 1:1. In case the copper side
>> autonegotiates to e.g. 100Mbps, each bit at the copper side on the
>> SGMII bus is replicated and sent 10x times - or 100x times in case of
>> 10Mbps. The MAC side of the ETH needs to be aware of how the SGMII
>> data has to be interpreted, and this is why you have to set the bits
>> you are referring to.
> 
> So does this mean that the SGMII link should not be autonegotiated? I currently
> have this code:
> 
>     if (phydev->autoneg == AUTONEG_ENABLE) {
>         val &= ~(FORCE_AN_RX_CFG | FORCE_AN_TX_CFG);
>         val |= AN_ENABLE;
>         writel(val, phy->base + EMAC_SGMII_PHY_AUTONEG_CFG2);
>     } else {
>         ...
> 
> So if the external PHY is set to autonegotiate, then the SGMII block is set to
> also negotiate.  Now that I think about it, this seems wrong. And in fact, I'm not
> sure how it works.  It seems that the this only makes sense if the SGMII block is
> configured to act as the only PHY. This is an option that the hardware supports
> but my driver does not.  So perhaps I should remove this part, and just do the rest:
> 
> 
>     u32 speed_cfg;
> 
>     switch (phydev->speed) {
>     case SPEED_10:
>         speed_cfg = SPDMODE_10;
>         break;
>     case SPEED_100:
>         speed_cfg = SPDMODE_100;
>         break;
>     case SPEED_1000:
>         speed_cfg = SPDMODE_1000;
>         break;
>     default:
>         return -EINVAL;
>     }
> 
>     if (phydev->duplex == DUPLEX_FULL)
>         speed_cfg |= DUPLEX_MODE;
> 
>     val &= ~AN_ENABLE;
>     writel(speed_cfg, phy->base + EMAC_SGMII_PHY_SPEED_CFG1);
>     writel(val, phy->base + EMAC_SGMII_PHY_AUTONEG_CFG2);
> 
> Should I be doing this all the time?
> 
Hm, that might be product dependent. From my experience with the gianfar/at8031
this is what I can share: there is a known flaw in some revisions of the eTSEC
that causes random failures in SGMII autonegotiation. As a workaround FSL proposes
to use fixed speed. I spent lots of time getting this working, but in the end it
turned out it does not. Setting fixed Gbps speed on both ends of the link is
reported to be ok (link failure bits clean, page transmitted ok), but no data goes
through the link.

>From a source I can't remember atm I was told that autonegotiation is mandatory
for SGMII, therefore I believe there is no need to change the autoneg bits for
SGMII in your case. I'd assume that resetting the link before setting the speed
value should be enough, i.e. the second write to EMAC_SGMII_PHY_AUTONEG_CFG2 might
be useless. YMMV

>> To track down who is causing the additional message, I would proceed
>> with following technique that helped me dig down a similar problem:
>> since you control the events in question and there is no risk of
>> flooding the kernel log, in the top of phy.c::phy_print_status() add
>> a dump_stack() call. In the debug log ensure that all of the traces
>> end up in the same phydev->adjust_link() callback handler (in your
>> case emac_adjust_link()).
> 
> That's a good idea, thanks.
> 
As Florian responded, it seems you need to double check for real changes before
printing the link status to get rid of the double printed notifications.

Good Luck.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2016-10-24 10:40 ` [PATCH 2/2] at803x: double check SGMII side autoneg Zefir Kurtisi
                     ` (2 preceding siblings ...)
  2017-01-17 23:32   ` Timur Tabi
@ 2017-05-22 20:12   ` Timur Tabi
  2017-05-22 21:02     ` Andrew Lunn
  2017-05-22 21:09     ` Andrew Lunn
  3 siblings, 2 replies; 46+ messages in thread
From: Timur Tabi @ 2017-05-22 20:12 UTC (permalink / raw)
  To: Zefir Kurtisi, netdev; +Cc: andrew, f.fainelli, David Miller, Manoj Iyer, jhugo

On 10/24/2016 05:40 AM, Zefir Kurtisi wrote:
> This commit adds a wrapper function for at8031
> that in case of operating in SGMII mode double
> checks SGMII link state when generic aneg_done()
> succeeds. It prints a warning on failure but
> intentionally does not try to recover from this
> state. As a result, if you ever see a warning
> '803x_aneg_done: SGMII link is not ok' you will
> end up having an Ethernet link up but won't get
> any data through. This should not happen, if it
> does, please contact the module maintainer.

I'm getting bitten by this one again.  We're now have several systems that
are reporting the link failure ("803x_aneg_done: SGMII link is not ok"), and
the interface comes up but is not functional.  I believe this is expected.

The problem, however, is not because of the link failure.  Instead, the
problem is this:

> +	/* check if the SGMII link is OK. */
> +	if (!(phy_read(phydev, AT803X_PSSR) & AT803X_PSSR_MR_AN_COMPLETE)) {
> +		pr_warn("803x_aneg_done: SGMII link is not ok\n");
> +		aneg_done = 0;

Returning zero is what breaks the interface.  If I comment-out this last
line, so that at803x_aneg_done() returns BMSR_ANEGCOMPLETE instead, then
everything works.

The documentation for phy_aneg_done() says this:

 * Description: Return the auto-negotiation status from this @phydev
 * Returns > 0 on success or < 0 on error. 0 means that auto-negotiation
 * is still pending.

So I think there are two issues here:

1. What exactly is supposed to happen when phy_aneg_done() returns a zero?
On our system, returning a zero results in a broken link, even though there
are no errors reported.  I just can't send any packets.

2. I'm preparing a patch that adds a command-line parameter to at803x that
makes this code conditional.  If you specify the parameter ("linkcheck")
then it will check the link and return 0 on failure.  Otherwise, it will
return whether genphy_aneg_done() returns.  The question is, should it still
print the message?

What I cannot determine is whether or not the link is actually okay.  It
appears to me that the driver says the link is not ok, but in truth it
actually is, and maybe the whole at803x_aneg_done() function based on a
false premise.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-22 20:12   ` Timur Tabi
@ 2017-05-22 21:02     ` Andrew Lunn
  2017-05-22 21:10       ` Florian Fainelli
  2017-05-22 21:09     ` Andrew Lunn
  1 sibling, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2017-05-22 21:02 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Zefir Kurtisi, netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

> 2. I'm preparing a patch that adds a command-line parameter to at803x that
> makes this code conditional.

FYI:

A patch with a command line argument, i think you actually mean a
module argument, is very likely to be rejected.

  Andrew

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-22 20:12   ` Timur Tabi
  2017-05-22 21:02     ` Andrew Lunn
@ 2017-05-22 21:09     ` Andrew Lunn
  2017-05-22 21:29       ` Timur Tabi
  1 sibling, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2017-05-22 21:09 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Zefir Kurtisi, netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

On Mon, May 22, 2017 at 03:12:03PM -0500, Timur Tabi wrote:
> On 10/24/2016 05:40 AM, Zefir Kurtisi wrote:
> > This commit adds a wrapper function for at8031
> > that in case of operating in SGMII mode double
> > checks SGMII link state when generic aneg_done()
> > succeeds. It prints a warning on failure but
> > intentionally does not try to recover from this
> > state. As a result, if you ever see a warning
> > '803x_aneg_done: SGMII link is not ok' you will
> > end up having an Ethernet link up but won't get
> > any data through. This should not happen, if it
> > does, please contact the module maintainer.
> 
> I'm getting bitten by this one again.  We're now have several systems that
> are reporting the link failure ("803x_aneg_done: SGMII link is not ok"), and
> the interface comes up but is not functional.  I believe this is expected.
> 
> The problem, however, is not because of the link failure.  Instead, the
> problem is this:
> 
> > +	/* check if the SGMII link is OK. */
> > +	if (!(phy_read(phydev, AT803X_PSSR) & AT803X_PSSR_MR_AN_COMPLETE)) {
> > +		pr_warn("803x_aneg_done: SGMII link is not ok\n");
> > +		aneg_done = 0;
> 
> Returning zero is what breaks the interface.  If I comment-out this last
> line, so that at803x_aneg_done() returns BMSR_ANEGCOMPLETE instead, then
> everything works.

Are you using interrupts? Or polling?

If polling, it should come back again 1 second later and see if
auto-neg has completed. Hopefully the SGMII side comes up eventually.

If you are using interrupts, you need another interrupt when the SGMII
side comes up, otherwise i think the state machine is stuck waiting.

     Andrew

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-22 21:02     ` Andrew Lunn
@ 2017-05-22 21:10       ` Florian Fainelli
  2017-05-22 21:19         ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Florian Fainelli @ 2017-05-22 21:10 UTC (permalink / raw)
  To: Andrew Lunn, Timur Tabi
  Cc: Zefir Kurtisi, netdev, David Miller, Manoj Iyer, jhugo

On 05/22/2017 02:02 PM, Andrew Lunn wrote:
>> 2. I'm preparing a patch that adds a command-line parameter to at803x that
>> makes this code conditional.
> 
> FYI:
> 
> A patch with a command line argument, i think you actually mean a
> module argument, is very likely to be rejected.

Even a module argument would be rejected. If you need platform/SoC
specific behavior propagated down to the PHY driver, several options exist:

- pass an agreed upon value for phy_flags to of_phy_connect() see
drivers/net/ethernet/broadcom/tg3.c and
drivers/net/ethernet/broadcom/genet/bcmgenet.c for instance and update
the driver to act on that "flags" see drivers/net/phy/broadcom.c and
drivers/net/phy/bcm7xxx.c

- register a PHY fixup which is specific to the board/SoC, and have the
PHY fixup do whatever is necessary for your platform (like setting
specific registers)

Preference goes for the first solution, but phy_flags is just a 32-bits
integer, so you could run out of bits.
-- 
Florian

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-22 21:10       ` Florian Fainelli
@ 2017-05-22 21:19         ` Timur Tabi
  2017-05-22 21:50           ` Florian Fainelli
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-05-22 21:19 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn
  Cc: Zefir Kurtisi, netdev, David Miller, Manoj Iyer, jhugo

On 05/22/2017 04:10 PM, Florian Fainelli wrote:
> Even a module argument would be rejected. If you need platform/SoC
> specific behavior propagated down to the PHY driver, several options exist:
> 
> - pass an agreed upon value for phy_flags to of_phy_connect() see
> drivers/net/ethernet/broadcom/tg3.c and
> drivers/net/ethernet/broadcom/genet/bcmgenet.c for instance and update
> the driver to act on that "flags" see drivers/net/phy/broadcom.c and
> drivers/net/phy/bcm7xxx.c

Will this work on ACPI systems as well?  I call phy_connect_direct() instead
of of_phy_connect().  I see some drivers set phydev->dev_flags before
calling phy_connect_direct().

My concern is that this problem occurs only on an at8031 chip, so having my
network driver passing an at8031-specific flag seems out of place.  What
happens if, on some other board, a different PHY is used, and that flag
means something else?

> - register a PHY fixup which is specific to the board/SoC, and have the
> PHY fixup do whatever is necessary for your platform (like setting
> specific registers)

Do you have an example of that?

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-22 21:09     ` Andrew Lunn
@ 2017-05-22 21:29       ` Timur Tabi
  2017-05-22 21:32         ` Andrew Lunn
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-05-22 21:29 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Zefir Kurtisi, netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

On 05/22/2017 04:09 PM, Andrew Lunn wrote:
> Are you using interrupts? Or polling?

adpt->phydev->irq = PHY_IGNORE_INTERRUPT;
ret = phy_connect_direct(netdev, adpt->phydev, emac_adjust_link,
			 PHY_INTERFACE_MODE_SGMII);

Technically it's polling, except that it's my NIC's hardware that is polling
the MDIO bus, and then generating an interrupt when there's a link state change.

When the link state changes, I call phy_mac_interrupt()

if (status & ISR_GPHY_LINK)
	phy_mac_interrupt(adpt->phydev, !!(status & GPHY_LINK_UP_INT));

> If polling, it should come back again 1 second later and see if
> auto-neg has completed. Hopefully the SGMII side comes up eventually.
> 
> If you are using interrupts, you need another interrupt when the SGMII
> side comes up, otherwise i think the state machine is stuck waiting.

I'll have to test this, but what do I do if I don't get another interrupt?
I have a suspicion that the link is actually okay, and that the error is bogus.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-22 21:29       ` Timur Tabi
@ 2017-05-22 21:32         ` Andrew Lunn
  2017-05-23 15:54           ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2017-05-22 21:32 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Zefir Kurtisi, netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

> I'll have to test this, but what do I do if I don't get another interrupt?

It probably means interrupts cannot be used. Poll it.

	   Andrew

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-22 21:19         ` Timur Tabi
@ 2017-05-22 21:50           ` Florian Fainelli
  0 siblings, 0 replies; 46+ messages in thread
From: Florian Fainelli @ 2017-05-22 21:50 UTC (permalink / raw)
  To: Timur Tabi, Andrew Lunn
  Cc: Zefir Kurtisi, netdev, David Miller, Manoj Iyer, jhugo

On 05/22/2017 02:19 PM, Timur Tabi wrote:
> On 05/22/2017 04:10 PM, Florian Fainelli wrote:
>> Even a module argument would be rejected. If you need platform/SoC
>> specific behavior propagated down to the PHY driver, several options exist:
>>
>> - pass an agreed upon value for phy_flags to of_phy_connect() see
>> drivers/net/ethernet/broadcom/tg3.c and
>> drivers/net/ethernet/broadcom/genet/bcmgenet.c for instance and update
>> the driver to act on that "flags" see drivers/net/phy/broadcom.c and
>> drivers/net/phy/bcm7xxx.c
> 
> Will this work on ACPI systems as well?

Provided you get a reference on the PHY dvice first, yes.

>  I call phy_connect_direct() instead
> of of_phy_connect().  I see some drivers set phydev->dev_flags before
> calling phy_connect_direct().

Setting phydev->dev_flags before calling phy_connect_direct() is okay
and will work here. The key thing is that you will need to get a PHY
device reference before, which would already be populated in your MDIO
bus's mdio_map array, see e.g: mdiobus_get_phy().

You can also set phydev->dev_flags *after* calling phy_connect_direct().
The only reason why it should be done before, is to guarantee that
phydrv::config_init would *see* the correct value there. If you need it
at a later time, like in config_aneg() or aneg_done(), setting it later
*might work*, but you'd have to trigger a auto-negotiation restart to
avoid races between phy_connect_direct() returning, and
phydrv::config_aneg() being called.

> 
> My concern is that this problem occurs only on an at8031 chip, so having my
> network driver passing an at8031-specific flag seems out of place.  What
> happens if, on some other board, a different PHY is used, and that flag
> means something else?

There needs to be an agreement on what the flags bits mean, and this
needs to be in a shared header file that other network drivers can
reference and where they can allocate their own bits if existing
functionality is not covered. The allocation of such flags is centered
around the perspective of the PHY driver.

Of course, this only works if you have a strict mapping between the
Ethernet MAC and the PHY, and if your MAC only uses the same PHY device
driver. If that's not the case, you need to make sure you don't pass a
phy_flags with bits set that could influence the behavior of another PHY
driver that would also try to do something with these phy_flags. Fairly
positive you can figure this out from your Ethernet MAC driver.

> 
>> - register a PHY fixup which is specific to the board/SoC, and have the
>> PHY fixup do whatever is necessary for your platform (like setting
>> specific registers)
> 
> Do you have an example of that?

You could have grepped for phy_register_fixup() but the best example I
can come up with is:

drivers/net/usb/lan78xx.c

and there are a lot more in arch/{arm,powerpc}/:

arch/arm/mach-davinci/board-dm644x-evm.c:
phy_register_fixup_for_uid(LXT971_PHY_ID, LXT971_PHY_MASK,
arch/arm/mach-imx/mach-imx6q.c:
phy_register_fixup_for_uid(PHY_ID_KSZ9021, MICREL_PHY_ID_MASK,
arch/arm/mach-imx/mach-imx6q.c:
phy_register_fixup_for_uid(PHY_ID_KSZ9031, MICREL_PHY_ID_MASK,
arch/arm/mach-imx/mach-imx6q.c:
phy_register_fixup_for_uid(PHY_ID_AR8031, 0xffffffef,
arch/arm/mach-imx/mach-imx6q.c:
phy_register_fixup_for_uid(PHY_ID_AR8035, 0xffffffef,
arch/arm/mach-imx/mach-imx6sx.c:
phy_register_fixup_for_uid(PHY_ID_AR8031, 0xffffffff,
arch/arm/mach-imx/mach-imx6ul.c:
phy_register_fixup_for_uid(PHY_ID_KSZ8081, MICREL_PHY_ID_MASK,
arch/arm/mach-imx/mach-imx7d.c:
phy_register_fixup_for_uid(PHY_ID_AR8031, 0xffffffff,
arch/arm/mach-imx/mach-imx7d.c:
phy_register_fixup_for_uid(PHY_ID_BCM54220, 0xffffffff,
arch/arm/mach-mxs/mach-mxs.c:
phy_register_fixup_for_uid(PHY_ID_KSZ8051, MICREL_PHY_ID_MASK,
arch/arm/mach-orion5x/dns323-setup.c:
phy_register_fixup_for_uid(MARVELL_PHY_ID_88E1118,
arch/powerpc/platforms/85xx/mpc85xx_mds.c:
phy_register_fixup_for_id(phy_id, mpc8568_fixup_125_clock);
arch/powerpc/platforms/85xx/mpc85xx_mds.c:
phy_register_fixup_for_id(phy_id, mpc8568_mds_phy_fixups);
arch/powerpc/platforms/85xx/mpc85xx_mds.c:
phy_register_fixup_for_id(phy_id, mpc8568_mds_phy_fixups);
-- 
Florian

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-22 21:32         ` Andrew Lunn
@ 2017-05-23 15:54           ` Timur Tabi
  2017-05-23 16:07             ` Andrew Lunn
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-05-23 15:54 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Zefir Kurtisi, netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

On 05/22/2017 04:32 PM, Andrew Lunn wrote:
>> I'll have to test this, but what do I do if I don't get another interrupt?
> It probably means interrupts cannot be used. Poll it.

I will test that to see what happens, but I believe the real problem is that
the at803x driver is lying when it says that the link is not okay.  I think
the link is okay, and that's why I'm not getting any more interrupts.  I
don't think I should have to drop interrupt support in my MAC driver because
one specific PHY driver is broken.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-23 15:54           ` Timur Tabi
@ 2017-05-23 16:07             ` Andrew Lunn
  2017-05-23 16:33               ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2017-05-23 16:07 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Zefir Kurtisi, netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

On Tue, May 23, 2017 at 10:54:57AM -0500, Timur Tabi wrote:
> On 05/22/2017 04:32 PM, Andrew Lunn wrote:
> >> I'll have to test this, but what do I do if I don't get another interrupt?
> > It probably means interrupts cannot be used. Poll it.
> 
> I will test that to see what happens, but I believe the real problem is that
> the at803x driver is lying when it says that the link is not okay.  I think
> the link is okay, and that's why I'm not getting any more interrupts.  I
> don't think I should have to drop interrupt support in my MAC driver because
> one specific PHY driver is broken.

If it turns out the PHY hardware is broken, the phy driver itself can
force it back to polling by setting phydev->irq to PHY_POLL in its
probe() function.

	Andrew

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-23 16:07             ` Andrew Lunn
@ 2017-05-23 16:33               ` Timur Tabi
  2017-05-24  7:18                 ` Matthias May
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-05-23 16:33 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Zefir Kurtisi, netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

On 05/23/2017 11:07 AM, Andrew Lunn wrote:
>> > I will test that to see what happens, but I believe the real problem is that
>> > the at803x driver is lying when it says that the link is not okay.  I think
>> > the link is okay, and that's why I'm not getting any more interrupts.  I
>> > don't think I should have to drop interrupt support in my MAC driver because
>> > one specific PHY driver is broken.
> If it turns out the PHY hardware is broken, the phy driver itself can
> force it back to polling by setting phydev->irq to PHY_POLL in its
> probe() function.

I don't think the hardware is broken, I think the driver is broken.  The
patch that sets aneg_done to 0 should be reverted or restricted somehow.

Even the developer of the patch admits that if the warning message is
displayed, the link will appear to be up, but no packets will go through.
Perhaps that's because the driver is returning 0 instead of BMSR_ANEGCOMPLETE?

Would it be okay for the PHY driver to query a property from the device tree
directly (e.g. "qca,check-sgmii-link"), and if present, only then implement
the sgmii link check?  So in at803x_probe(), I would do something like this:

	if (device_property_read_bool(&phydev->mdio.dev,
					"qca,check-sgmii-link")
		priv->check_sgmii_link = true;

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-23 16:33               ` Timur Tabi
@ 2017-05-24  7:18                 ` Matthias May
  2017-05-24 13:29                   ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Matthias May @ 2017-05-24  7:18 UTC (permalink / raw)
  To: Timur Tabi, Andrew Lunn
  Cc: Zefir Kurtisi, netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

On 23/05/17 18:33, Timur Tabi wrote:
> On 05/23/2017 11:07 AM, Andrew Lunn wrote:
>>>> I will test that to see what happens, but I believe the real problem is that
>>>> the at803x driver is lying when it says that the link is not okay.  I think
>>>> the link is okay, and that's why I'm not getting any more interrupts.  I
>>>> don't think I should have to drop interrupt support in my MAC driver because
>>>> one specific PHY driver is broken.
>> If it turns out the PHY hardware is broken, the phy driver itself can
>> force it back to polling by setting phydev->irq to PHY_POLL in its
>> probe() function.
> 
> I don't think the hardware is broken, I think the driver is broken.  The
> patch that sets aneg_done to 0 should be reverted or restricted somehow.
> 
> Even the developer of the patch admits that if the warning message is
> displayed, the link will appear to be up, but no packets will go through.
> Perhaps that's because the driver is returning 0 instead of BMSR_ANEGCOMPLETE?
> 
> Would it be okay for the PHY driver to query a property from the device tree
> directly (e.g. "qca,check-sgmii-link"), and if present, only then implement
> the sgmii link check?  So in at803x_probe(), I would do something like this:
> 
> 	if (device_property_read_bool(&phydev->mdio.dev,
> 					"qca,check-sgmii-link")
> 		priv->check_sgmii_link = true;
> 

Zefir is currently on holiday and will probably get these emails when he gets back around 1. June

I think you missunderstand what "the link appears up but no packets go through means".
There are 2 pages which each reflect a side of the PHY:
* copper side
* SGMII side
Until the patch only the copper side was ever considered when reporting the state of the link.
This lead to the situation that the copper side was up (and the link reported up), but the SGMII side didn't come up.
It could well be a bad combination of CPU and the AR8031/33.
We know of at least 1 CPU (Freescale P1010) which has erratas for certain revisions regarding this.
If the SGMII side doesn't come, well no packets can go through.

With the patch: When the copper side is seen as up, it also checks if aneg of the SGMII link is done.
As far as i know SGMII can not be run without aneg, since it is always Gbit with aneg mandatory.
If SGMII aneg is not done, then, well aneg is not done and thus 0 is returned.

Internally we have this patch extended so we don't only report that aneg is not done but also reset the link.
Eventually aneg on the SGMII side can be completed and the link comes up.


Why do you think that frames are able to go through when aneg is reported as not done by the PHY?
Since aneg is mandatory for SGMII this can as well be seen as "link not up", not?

BR
Matthias

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24  7:18                 ` Matthias May
@ 2017-05-24 13:29                   ` Timur Tabi
  2017-05-24 13:40                     ` Andrew Lunn
  2017-06-01 11:45                     ` Zefir Kurtisi
  0 siblings, 2 replies; 46+ messages in thread
From: Timur Tabi @ 2017-05-24 13:29 UTC (permalink / raw)
  To: Matthias May, Andrew Lunn
  Cc: Zefir Kurtisi, netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

On 5/24/17 2:18 AM, Matthias May wrote:
> With the patch: When the copper side is seen as up, it also checks if aneg of the SGMII link is done.
> As far as i know SGMII can not be run without aneg, since it is always Gbit with aneg mandatory.
> If SGMII aneg is not done, then, well aneg is not done and thus 0 is returned.
> 
> Internally we have this patch extended so we don't only report that aneg is not done but also reset the link.
> Eventually aneg on the SGMII side can be completed and the link comes up.

I would really like to test this patch.

> Why do you think that frames are able to go through when aneg is reported as not done by the PHY?

I have two theories:

1. The warning message is bogus.  The link actually is okay, but the 
driver thinks that it isn't.

2. The link is not okay, but it automatically fixes itself soon after 
the at803x_aneg_done() finishes.

> Since aneg is mandatory for SGMII this can as well be seen as "link not up", not?

The problem is that even though the link is up, the driver has returned 
"0", so the kernel thinks that autonegotiation has not finished. 
at803x_aneg_done() is never called again, and so I think the kernel is 
disabling the interface is some secret way.

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 13:29                   ` Timur Tabi
@ 2017-05-24 13:40                     ` Andrew Lunn
  2017-05-24 13:48                       ` Timur Tabi
  2017-06-01 11:45                     ` Zefir Kurtisi
  1 sibling, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2017-05-24 13:40 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Matthias May, Zefir Kurtisi, netdev, f.fainelli, David Miller,
	Manoj Iyer, jhugo

> >Since aneg is mandatory for SGMII this can as well be seen as "link not up", not?
> 
> The problem is that even though the link is up, the driver has
> returned "0", so the kernel thinks that autonegotiation has not
> finished.

You need to prove this, that the link is not up. Any by link, we mean
both the copper and the SGMII link. 

> at803x_aneg_done() is never called again, and so I think
> the kernel is disabling the interface is some secret way.

Well, the driver has told the core that the link is not up. So the
kernel is waiting for another interrupt indicating the link has gone
up. And probably, this second interrupt never happens.

And it is not disabling the interface. Since the PHY is still down,
the core has not called netif_carrier_on().

    Andrew

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 13:40                     ` Andrew Lunn
@ 2017-05-24 13:48                       ` Timur Tabi
  2017-05-24 14:09                         ` Andrew Lunn
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-05-24 13:48 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Matthias May, Zefir Kurtisi, netdev, f.fainelli, David Miller,
	Manoj Iyer, jhugo

On 5/24/17 8:40 AM, Andrew Lunn wrote:

> You need to prove this, that the link is not up. Any by link, we mean
> both the copper and the SGMII link.

I can post the log of my iperf run showing that the, when 
at803x_aneg_done() returns zero, no packets can go through.  And then 
after I change at803x_aneg_done() so that it returns BMSR_ANEGCOMPLETE, 
then packets do go through.  Is that the proof you're looking for?

The current work-around that we're using internally is to blacklist the 
at803x driver.  This forces the kernel to use the genphy driver instead. 
  Everything works when we do this.

>> at803x_aneg_done() is never called again, and so I think
>> the kernel is disabling the interface is some secret way.
> 
> Well, the driver has told the core that the link is not up. So the
> kernel is waiting for another interrupt indicating the link has gone
> up. And probably, this second interrupt never happens.

Exxactly.  That's because the link IS up, and so there is no opportunity 
to receive another interrupt.

> And it is not disabling the interface. Since the PHY is still down,
> the core has not called netif_carrier_on().

Ok, I should have said "not enabled" instead of "disabled".  Thanks.

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 13:48                       ` Timur Tabi
@ 2017-05-24 14:09                         ` Andrew Lunn
  2017-05-24 18:58                           ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2017-05-24 14:09 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Matthias May, Zefir Kurtisi, netdev, f.fainelli, David Miller,
	Manoj Iyer, jhugo

On Wed, May 24, 2017 at 08:48:04AM -0500, Timur Tabi wrote:
> On 5/24/17 8:40 AM, Andrew Lunn wrote:
> 
> >You need to prove this, that the link is not up. Any by link, we mean
> >both the copper and the SGMII link.
> 
> I can post the log of my iperf run showing that the, when
> at803x_aneg_done() returns zero, no packets can go through.  And
> then after I change at803x_aneg_done() so that it returns
> BMSR_ANEGCOMPLETE, then packets do go through.  Is that the proof
> you're looking for?

No. I would like to see the status of the copper side and the status
of the SGMII side, at the point at803x_aneg_done() is called.

If the copper side is up, but the SGMII side is down, returning 0 is
correct.

> Exxactly.  That's because the link IS up, and so there is no
> opportunity to receive another interrupt.

It could be, the copper side is up, but the SGMII side is down, at the
point at803x_aneg_done() is called. So it is correctly returning
0. Sometime later the SGMII side goes up, but there is not a second
interrupt. Hence the phy core does not know that the full, 2 stage MAC
to PHY to peer PHY link is now up.

	Andrew

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 14:09                         ` Andrew Lunn
@ 2017-05-24 18:58                           ` Timur Tabi
  2017-05-24 19:34                             ` Andrew Lunn
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-05-24 18:58 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Matthias May, Zefir Kurtisi, netdev, f.fainelli, David Miller,
	Manoj Iyer, jhugo

On 05/24/2017 09:09 AM, Andrew Lunn wrote:
> It could be, the copper side is up, but the SGMII side is down, at the
> point at803x_aneg_done() is called. So it is correctly returning
> 0. Sometime later the SGMII side goes up, but there is not a second
> interrupt. Hence the phy core does not know that the full, 2 stage MAC
> to PHY to peer PHY link is now up.

Ok, I'm going to debug this some more.  It turns out that the MAC side of
the SGMII link can send an interrupt when it thinks that auto-negotiation is
done.  I might be able to use this.

What function should my MAC driver call when it wants the phy core to call
at803x_aneg_done again to see if autonegotiation is done?

Also, is there a way for the MAC driver to know that at803x_aneg_done()
previously returned 0, and that it needs to tell the phy core to check again?

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 18:58                           ` Timur Tabi
@ 2017-05-24 19:34                             ` Andrew Lunn
  2017-05-24 20:57                               ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2017-05-24 19:34 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Matthias May, Zefir Kurtisi, netdev, f.fainelli, David Miller,
	Manoj Iyer, jhugo

On Wed, May 24, 2017 at 01:58:09PM -0500, Timur Tabi wrote:
> On 05/24/2017 09:09 AM, Andrew Lunn wrote:
> > It could be, the copper side is up, but the SGMII side is down, at the
> > point at803x_aneg_done() is called. So it is correctly returning
> > 0. Sometime later the SGMII side goes up, but there is not a second
> > interrupt. Hence the phy core does not know that the full, 2 stage MAC
> > to PHY to peer PHY link is now up.
> 
> Ok, I'm going to debug this some more.  It turns out that the MAC side of
> the SGMII link can send an interrupt when it thinks that auto-negotiation is
> done.  I might be able to use this.

You can use this for your board. But it still leaves the phy driver
broken for everybody else.
 
> What function should my MAC driver call when it wants the phy core to call
> at803x_aneg_done again to see if autonegotiation is done?

You want to trigger the PHY state machine. There is only one exported
API call to do this, phy_mac_interrupt(). But you are supposed to pass
the new link state. And your MAC driver has no idea of that, it does
not know if the copper side of the PHY is up.

So it might be better if you export phy_trigger_machine().

> Also, is there a way for the MAC driver to know that at803x_aneg_done()
> previously returned 0, and that it needs to tell the phy core to check again?

Not that i know of. The MAC layer is not supposed to be messing around
in the PHY layer. However, just triggering the PHY state machine
should be enough.

   Andrew

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 19:34                             ` Andrew Lunn
@ 2017-05-24 20:57                               ` Timur Tabi
  2017-05-24 21:15                                 ` Andrew Lunn
  2017-05-24 21:19                                 ` Florian Fainelli
  0 siblings, 2 replies; 46+ messages in thread
From: Timur Tabi @ 2017-05-24 20:57 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Matthias May, Zefir Kurtisi, netdev, f.fainelli, David Miller,
	Manoj Iyer, jhugo

On 05/24/2017 02:34 PM, Andrew Lunn wrote:
>> Ok, I'm going to debug this some more.  It turns out that the MAC side of
>> the SGMII link can send an interrupt when it thinks that auto-negotiation is
>> done.  I might be able to use this.
>
> You can use this for your board. But it still leaves the phy driver
> broken for everybody else.

Wait, I thought you said the at803x driver was not broken, since it 
returns 0 when the SGMII side of the link hasn't finished auto-negotiating?

>> What function should my MAC driver call when it wants the phy core to call
>> at803x_aneg_done again to see if autonegotiation is done?
>
> You want to trigger the PHY state machine. There is only one exported
> API call to do this, phy_mac_interrupt(). But you are supposed to pass
> the new link state. And your MAC driver has no idea of that, it does
> not know if the copper side of the PHY is up.

My NIC has a feature called autopolling where it takes over the MDIO bus 
and regularly polls the link state.  When it detects that the link state 
has changed, it generates a MAC interrupt.  This is when I call 
phy_mac_interrupt() normally.

I think I can use the SGMII interrupt to also call phy_mac_interrupt(). 
The problem is the from the copper side, the link is already up, so if I 
call phy_mac_interrupt() again and say the link is up, the phy core is 
going to ignore that.

> So it might be better if you export phy_trigger_machine().

I'll test that, but that does seem a bit hackish.

>> Also, is there a way for the MAC driver to know that at803x_aneg_done()
>> previously returned 0, and that it needs to tell the phy core to check again?
>
> Not that i know of. The MAC layer is not supposed to be messing around
> in the PHY layer. However, just triggering the PHY state machine
> should be enough.

Can you tell my how PHY_HAS_INTERRUPT is supposed to work?  How does the 
PHY send an interrupt?

I'm starting to think that my NIC's autopolling feature is not 
compatible with phylib, and that I should use polling mode.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 20:57                               ` Timur Tabi
@ 2017-05-24 21:15                                 ` Andrew Lunn
  2017-05-24 21:20                                   ` Timur Tabi
  2017-05-24 21:19                                 ` Florian Fainelli
  1 sibling, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2017-05-24 21:15 UTC (permalink / raw)
  To: Timur Tabi
  Cc: Matthias May, Zefir Kurtisi, netdev, f.fainelli, David Miller,
	Manoj Iyer, jhugo

On Wed, May 24, 2017 at 03:57:06PM -0500, Timur Tabi wrote:
> On 05/24/2017 02:34 PM, Andrew Lunn wrote:
> >>Ok, I'm going to debug this some more.  It turns out that the MAC side of
> >>the SGMII link can send an interrupt when it thinks that auto-negotiation is
> >>done.  I might be able to use this.
> >
> >You can use this for your board. But it still leaves the phy driver
> >broken for everybody else.
> 
> Wait, I thought you said the at803x driver was not broken, since it
> returns 0 when the SGMII side of the link hasn't finished
> auto-negotiating?

It is correct so far. But to work, it needs to interrupt again once
the SGMII side has come up. Only then have we link.

> My NIC has a feature called autopolling where it takes over the MDIO
> bus and regularly polls the link state.  When it detects that the
> link state has changed, it generates a MAC interrupt.  This is when
> I call phy_mac_interrupt() normally.

Unfortunately, you need to keep this feature turned off. It will not
respect the phydev mutex. It has no idea what page has been currently
selected. It probably has no way to flip the page and see if the SGMII
link is up. etc.

> Can you tell my how PHY_HAS_INTERRUPT is supposed to work?  How does
> the PHY send an interrupt?

Generally, the PHY interrupt pin is connected to a GPIO. You then use
the GPIO as an interrupt source. So it has an interrupt number. Put
that in phydev->irq, eg using the interrupts property in device tree.
The core will register an interrupt handler, and enable the
interrupt. When it receives an interrupt, it calls the phy driver to
service the interrupt.

	Andrew

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 20:57                               ` Timur Tabi
  2017-05-24 21:15                                 ` Andrew Lunn
@ 2017-05-24 21:19                                 ` Florian Fainelli
  1 sibling, 0 replies; 46+ messages in thread
From: Florian Fainelli @ 2017-05-24 21:19 UTC (permalink / raw)
  To: Timur Tabi, Andrew Lunn
  Cc: Matthias May, Zefir Kurtisi, netdev, David Miller, Manoj Iyer, jhugo

On 05/24/2017 01:57 PM, Timur Tabi wrote:
> On 05/24/2017 02:34 PM, Andrew Lunn wrote:
>>> Ok, I'm going to debug this some more.  It turns out that the MAC
>>> side of
>>> the SGMII link can send an interrupt when it thinks that
>>> auto-negotiation is
>>> done.  I might be able to use this.
>>
>> You can use this for your board. But it still leaves the phy driver
>> broken for everybody else.
> 
> Wait, I thought you said the at803x driver was not broken, since it
> returns 0 when the SGMII side of the link hasn't finished auto-negotiating?
> 
>>> What function should my MAC driver call when it wants the phy core to
>>> call
>>> at803x_aneg_done again to see if autonegotiation is done?
>>
>> You want to trigger the PHY state machine. There is only one exported
>> API call to do this, phy_mac_interrupt(). But you are supposed to pass
>> the new link state. And your MAC driver has no idea of that, it does
>> not know if the copper side of the PHY is up.
> 
> My NIC has a feature called autopolling where it takes over the MDIO bus
> and regularly polls the link state.  When it detects that the link state
> has changed, it generates a MAC interrupt.  This is when I call
> phy_mac_interrupt() normally.
> 
> I think I can use the SGMII interrupt to also call phy_mac_interrupt().
> The problem is the from the copper side, the link is already up, so if I
> call phy_mac_interrupt() again and say the link is up, the phy core is
> going to ignore that.

The question is what the HW is auto-polling on? Most HW implementations
that I have seen either auto-poll and assume that BMSR.LINKSTATUS will
reflect what they want to, or they even let you specify which register
and bit(s) to monitor for link status. So which one is it here?

As Andrew already responded, this clashes with any reads/writes that the
PHY library could do, unless your HW is smart enough to serialize MDIO
reads/writes?

> 
>> So it might be better if you export phy_trigger_machine().
> 
> I'll test that, but that does seem a bit hackish.
> 
>>> Also, is there a way for the MAC driver to know that at803x_aneg_done()
>>> previously returned 0, and that it needs to tell the phy core to
>>> check again?
>>
>> Not that i know of. The MAC layer is not supposed to be messing around
>> in the PHY layer. However, just triggering the PHY state machine
>> should be enough.
> 
> Can you tell my how PHY_HAS_INTERRUPT is supposed to work?  How does the
> PHY send an interrupt?

PHY_HAS_INTERRUPT indicates that you have a dedicated interrupt line for
your PHY device and that you have the ability to implement a custom
interrupt handling callback in the PHY driver (ack_interrupt) that lets
you deal with all possible sorts of the interrupts that the PHY could
generate.

> 
> I'm starting to think that my NIC's autopolling feature is not
> compatible with phylib, and that I should use polling mode.

That's pretty much what Andrew told you about 5 emails ago...
-- 
Florian

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 21:15                                 ` Andrew Lunn
@ 2017-05-24 21:20                                   ` Timur Tabi
  2017-05-24 21:28                                     ` Florian Fainelli
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-05-24 21:20 UTC (permalink / raw)
  To: Andrew Lunn, Timur Tabi
  Cc: Matthias May, Zefir Kurtisi, netdev, f.fainelli, David Miller,
	Manoj Iyer, jhugo

On 05/24/2017 04:15 PM, Andrew Lunn wrote:
>> My NIC has a feature called autopolling where it takes over the MDIO
>> bus and regularly polls the link state.  When it detects that the
>> link state has changed, it generates a MAC interrupt.  This is when
>> I call phy_mac_interrupt() normally.

> Unfortunately, you need to keep this feature turned off. It will not
> respect the phydev mutex. It has no idea what page has been currently
> selected. It probably has no way to flip the page and see if the SGMII
> link is up. etc.

phydev mutex?  And what do you mean by page?

I forgot one detail.  Every time you do an MDIO read/write, it 
temporarily disables the feature.  Although, I think that's not relevant 
to your point.

Disabling this feature and switching from PHY_IGNORE_INTERRUPT to 
PHY_POLL might fix everything.  I will try it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 21:20                                   ` Timur Tabi
@ 2017-05-24 21:28                                     ` Florian Fainelli
  2017-05-24 21:32                                       ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Florian Fainelli @ 2017-05-24 21:28 UTC (permalink / raw)
  To: Timur Tabi, Andrew Lunn
  Cc: Matthias May, Zefir Kurtisi, netdev, David Miller, Manoj Iyer, jhugo

On 05/24/2017 02:20 PM, Timur Tabi wrote:
> On 05/24/2017 04:15 PM, Andrew Lunn wrote:
>>> My NIC has a feature called autopolling where it takes over the MDIO
>>> bus and regularly polls the link state.  When it detects that the
>>> link state has changed, it generates a MAC interrupt.  This is when
>>> I call phy_mac_interrupt() normally.
> 
>> Unfortunately, you need to keep this feature turned off. It will not
>> respect the phydev mutex. It has no idea what page has been currently
>> selected. It probably has no way to flip the page and see if the SGMII
>> link is up. etc.
> 
> phydev mutex?  And what do you mean by page?

Yes phydev->lock which is used to serialize the state machine state changes.

Most PHYs have many more registers than the 15 standard exposed
directly, and so you need indirect reads/writes to access these
registers, which typically involve switching a particular page, doing
the indirect register access, and then flipping the page back. If you
interrupt that scheme one way or another, your reads and writes are all
messed up.

> 
> I forgot one detail.  Every time you do an MDIO read/write, it
> temporarily disables the feature.  Although, I think that's not relevant
> to your point.

Is that done by the HW itself, or is this under SW control exclusively.

> 
> Disabling this feature and switching from PHY_IGNORE_INTERRUPT to
> PHY_POLL might fix everything.  I will try it.
> 

Humm yes, that seems like a worthwhile exercise at least.
-- 
Florian

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 21:28                                     ` Florian Fainelli
@ 2017-05-24 21:32                                       ` Timur Tabi
  2017-05-24 21:36                                         ` Florian Fainelli
  0 siblings, 1 reply; 46+ messages in thread
From: Timur Tabi @ 2017-05-24 21:32 UTC (permalink / raw)
  To: Florian Fainelli, Timur Tabi, Andrew Lunn
  Cc: Matthias May, Zefir Kurtisi, netdev, David Miller, Manoj Iyer, jhugo

On 05/24/2017 04:28 PM, Florian Fainelli wrote:

> Yes phydev->lock which is used to serialize the state machine state changes.
>
> Most PHYs have many more registers than the 15 standard exposed
> directly, and so you need indirect reads/writes to access these
> registers, which typically involve switching a particular page, doing
> the indirect register access, and then flipping the page back. If you
> interrupt that scheme one way or another, your reads and writes are all
> messed up.

Ah, and the at803x is a device like that.

At worst, the autopoll feature could read a register from the wrong 
page, and think that the link state has changed when it hasn't.  But 
that's still bad, and all my problems do revolve around link states.

>> I forgot one detail.  Every time you do an MDIO read/write, it
>> temporarily disables the feature.  Although, I think that's not relevant
>> to your point.
>
> Is that done by the HW itself, or is this under SW control exclusively.

Software.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 21:32                                       ` Timur Tabi
@ 2017-05-24 21:36                                         ` Florian Fainelli
  2017-05-24 22:03                                           ` Timur Tabi
  0 siblings, 1 reply; 46+ messages in thread
From: Florian Fainelli @ 2017-05-24 21:36 UTC (permalink / raw)
  To: Timur Tabi, Andrew Lunn
  Cc: Matthias May, Zefir Kurtisi, netdev, David Miller, Manoj Iyer, jhugo

On 05/24/2017 02:32 PM, Timur Tabi wrote:
> On 05/24/2017 04:28 PM, Florian Fainelli wrote:
> 
>> Yes phydev->lock which is used to serialize the state machine state
>> changes.
>>
>> Most PHYs have many more registers than the 15 standard exposed
>> directly, and so you need indirect reads/writes to access these
>> registers, which typically involve switching a particular page, doing
>> the indirect register access, and then flipping the page back. If you
>> interrupt that scheme one way or another, your reads and writes are all
>> messed up.
> 
> Ah, and the at803x is a device like that.
> 
> At worst, the autopoll feature could read a register from the wrong
> page, and think that the link state has changed when it hasn't.  But
> that's still bad, and all my problems do revolve around link states.

Absolutely, just like it's not clear whether autopoll does read
BMSR.LINKSTAT or another at803x specific register? Also how is
BMSR.LINKSTAT defined on at803x when there is SGMII + Copper involved?

> 
>>> I forgot one detail.  Every time you do an MDIO read/write, it
>>> temporarily disables the feature.  Although, I think that's not relevant
>>> to your point.
>>
>> Is that done by the HW itself, or is this under SW control exclusively.
> 
> Software.

OK, and there is no way you can run into the following race condition:

CPU				HW
MDIO read intent
				polling starts
disable HW autopoll
				polling continues
MDIO read is done
				MDIO read done
				polling stops
MDIO read value returned


if you disable autopolling in HW this is guaranteed to immediately stop
by the time the register value is seen in HW and your I/O read/write
returns?
-- 
Florian

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 21:36                                         ` Florian Fainelli
@ 2017-05-24 22:03                                           ` Timur Tabi
  0 siblings, 0 replies; 46+ messages in thread
From: Timur Tabi @ 2017-05-24 22:03 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn
  Cc: Matthias May, Zefir Kurtisi, netdev, David Miller, Manoj Iyer, jhugo

On 05/24/2017 04:36 PM, Florian Fainelli wrote:
> OK, and there is no way you can run into the following race condition:
>
> CPU				HW
> MDIO read intent
> 				polling starts
> disable HW autopoll
> 				polling continues

Disabling of the HW autopoll waits for the poll to actually stop before 
continuing.  You can see the code here:

http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/qualcomm/emac/emac-phy.c#L102

> MDIO read is done
> 				MDIO read done
> 				polling stops
> MDIO read value returned
>
>
> if you disable autopolling in HW this is guaranteed to immediately stop
> by the time the register value is seen in HW and your I/O read/write
> returns?

It doesn't immediately stop, but the emac_phy_mdio_autopoll_disable() 
function waits for the MDIO bus to not be busy.  But low-level details 
of this feature are not documented, so who knows what it does exactly? 
The original code that used this feature only supported one PHY and 
never expected there to be any asynchronous MDIO transactios.

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-05-24 13:29                   ` Timur Tabi
  2017-05-24 13:40                     ` Andrew Lunn
@ 2017-06-01 11:45                     ` Zefir Kurtisi
  2017-06-01 14:48                       ` Timur Tabi
  1 sibling, 1 reply; 46+ messages in thread
From: Zefir Kurtisi @ 2017-06-01 11:45 UTC (permalink / raw)
  To: Timur Tabi, Matthias May, Andrew Lunn
  Cc: netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

Hello Timur, all,

sorry for being late to chime in here, as Matthias said I have been AFK while this
discussion took place.

The related patch I committed is meant to double check that the SGMII side is done
aneg when copper side is. Usually it does, but this expectation admittedly might
fail in interrupt mode. What seems to happen is: after a power-down-up cycle the
at803x restarts aneg at copper and SGMII sides. In polling mode, if you run into
the case where copper is up but the SGMII is not, you'll see the message and in
(one of) the next poll cycle(s) SGMII should be ready too and the situation
recovers. In interrupt mode, obviously once you hit into the partially done aneg,
you won't recover since no additional link-change interrupt is going to happen.

Wit that, Tamur is right claiming the double-check breaks the driver in interrupt
mode. I could think of several work-arounds to fix it, e.g.
1) check if there is an SGMII link-change interrupt source and enable it
2) when we run into partial aneg completion, temporary switch to polling mode and
back to interrupt mode when at803x_aneg_done() succeeds


Alas, we need to rewind back to whether these double-checks are required at all.
As Matthias explained, there is at least one combination of devices that have
documented issues establishing the SGMII link - in our case it was the at8031
attached to a P1010's eTSEC (errata A-004187, fixed with chip rev. 2.01). Chances
are high that our company is the only one using this exact early version of
problematic devices. Therefore, and since the problem is at ETH side, we moved our
local workarounds into the gianfar driver who on demand restarts SGMII to recover
from the stale link.

I guess we need to decide whether we generally need to handle permanent aneg
failures on the SGMII link. If we expect that it must not fail (like we assumed
until we saw it failing), I agree with Timur and support reverting of the related
commit f62265b53e. If otherwise we want this potential failure to be handled
correctly, things become arbitrary complex. Essentially, we need to handle such
PHYs as a combination of their two sides (copper + SGMII) as virtual sub-PHYs. The
phylib might support that in a future version, but for now this seems like a lot
of work required to handle a rare problem.



tl;dr: ACK with reverting f62265b53e


Cheers,
Zefir


On 05/24/2017 03:29 PM, Timur Tabi wrote:
> On 5/24/17 2:18 AM, Matthias May wrote:
>> With the patch: When the copper side is seen as up, it also checks if aneg of
>> the SGMII link is done.
>> As far as i know SGMII can not be run without aneg, since it is always Gbit with
>> aneg mandatory.
>> If SGMII aneg is not done, then, well aneg is not done and thus 0 is returned.
>>
>> Internally we have this patch extended so we don't only report that aneg is not
>> done but also reset the link.
>> Eventually aneg on the SGMII side can be completed and the link comes up.
> 
> I would really like to test this patch.
> 
>> Why do you think that frames are able to go through when aneg is reported as not
>> done by the PHY?
> 
> I have two theories:
> 
> 1. The warning message is bogus.  The link actually is okay, but the driver thinks
> that it isn't.
> 
> 2. The link is not okay, but it automatically fixes itself soon after the
> at803x_aneg_done() finishes.
> 
>> Since aneg is mandatory for SGMII this can as well be seen as "link not up", not?
> 
> The problem is that even though the link is up, the driver has returned "0", so
> the kernel thinks that autonegotiation has not finished. at803x_aneg_done() is
> never called again, and so I think the kernel is disabling the interface is some
> secret way.
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] at803x: double check SGMII side autoneg
  2017-06-01 11:45                     ` Zefir Kurtisi
@ 2017-06-01 14:48                       ` Timur Tabi
  0 siblings, 0 replies; 46+ messages in thread
From: Timur Tabi @ 2017-06-01 14:48 UTC (permalink / raw)
  To: Zefir Kurtisi, Matthias May, Andrew Lunn
  Cc: netdev, f.fainelli, David Miller, Manoj Iyer, jhugo

On 06/01/2017 06:45 AM, Zefir Kurtisi wrote:
> I guess we need to decide whether we generally need to handle permanent aneg
> failures on the SGMII link. If we expect that it must not fail (like we assumed
> until we saw it failing), I agree with Timur and support reverting of the related
> commit f62265b53e. If otherwise we want this potential failure to be handled
> correctly, things become arbitrary complex. Essentially, we need to handle such
> PHYs as a combination of their two sides (copper + SGMII) as virtual sub-PHYs. The
> phylib might support that in a future version, but for now this seems like a lot
> of work required to handle a rare problem.

I'm about to post a patch that removes interrupt support from the EMAC 
driver and relies on software polling of the PHY.  With this patch, we 
don't see the "link is not okay" message from that at803x driver any more.

The link state is generally more reliable now, even when the at803x 
driver doesn't complain.

My theory is that the hardware polling of the PHY is just too 
aggressive.  I think it continuously reads the PHY status register at 
maximum speed and immediately issues an interrupt when the PHY says that 
it's up.

So I think we're okay with leaving the at803x driver as-is, since we 
appear to be no longer getting any false failures.

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2017-06-01 14:48 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-24 10:40 [PATCH 0/2] at803x: don't power-down SGMII link Zefir Kurtisi
2016-10-24 10:40 ` [PATCH 1/2] Revert "at803x: fix suspend/resume for SGMII link" Zefir Kurtisi
2016-10-27 20:05   ` David Miller
2016-10-24 10:40 ` [PATCH 2/2] at803x: double check SGMII side autoneg Zefir Kurtisi
2016-10-27 20:05   ` David Miller
2016-10-28 22:24   ` Timur Tabi
2016-11-01 11:13     ` Zefir Kurtisi
2017-01-17 23:32   ` Timur Tabi
2017-01-18 11:00     ` Zefir Kurtisi
2017-01-18 13:13       ` Timur Tabi
2017-01-18 13:53         ` Zefir Kurtisi
2017-01-18 15:02           ` Timur Tabi
2017-01-19  9:43             ` Zefir Kurtisi
2017-01-19 18:01               ` Florian Fainelli
2017-01-20  2:38               ` Timur Tabi
2017-01-20 15:31                 ` Zefir Kurtisi
2017-05-22 20:12   ` Timur Tabi
2017-05-22 21:02     ` Andrew Lunn
2017-05-22 21:10       ` Florian Fainelli
2017-05-22 21:19         ` Timur Tabi
2017-05-22 21:50           ` Florian Fainelli
2017-05-22 21:09     ` Andrew Lunn
2017-05-22 21:29       ` Timur Tabi
2017-05-22 21:32         ` Andrew Lunn
2017-05-23 15:54           ` Timur Tabi
2017-05-23 16:07             ` Andrew Lunn
2017-05-23 16:33               ` Timur Tabi
2017-05-24  7:18                 ` Matthias May
2017-05-24 13:29                   ` Timur Tabi
2017-05-24 13:40                     ` Andrew Lunn
2017-05-24 13:48                       ` Timur Tabi
2017-05-24 14:09                         ` Andrew Lunn
2017-05-24 18:58                           ` Timur Tabi
2017-05-24 19:34                             ` Andrew Lunn
2017-05-24 20:57                               ` Timur Tabi
2017-05-24 21:15                                 ` Andrew Lunn
2017-05-24 21:20                                   ` Timur Tabi
2017-05-24 21:28                                     ` Florian Fainelli
2017-05-24 21:32                                       ` Timur Tabi
2017-05-24 21:36                                         ` Florian Fainelli
2017-05-24 22:03                                           ` Timur Tabi
2017-05-24 21:19                                 ` Florian Fainelli
2017-06-01 11:45                     ` Zefir Kurtisi
2017-06-01 14:48                       ` Timur Tabi
2016-10-25 17:31 ` [PATCH 0/2] at803x: don't power-down SGMII link Timur Tabi
2016-10-27  8:05   ` Zefir Kurtisi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.