netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] e1000e: Power cycle phy on PM resume
@ 2020-09-23  7:47 Kai-Heng Feng
  2020-09-23 12:17 ` Andrew Lunn
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-23  7:47 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: Kai-Heng Feng, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

We are seeing the following error after S3 resume:
[  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
[  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
...
[  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error

Since we don't know what platform firmware may do to the phy, so let's
power cycle the phy upon system resume to resolve the issue.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 664e8ccc88d2..c2a87a408102 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
 	    !e1000e_check_me(hw->adapter->pdev->device))
 		e1000e_s0ix_exit_flow(adapter);
 
+	e1000_power_down_phy(adapter);
+
 	rc = __e1000_resume(pdev);
 	if (rc)
 		return rc;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] e1000e: Power cycle phy on PM resume
  2020-09-23  7:47 [PATCH] e1000e: Power cycle phy on PM resume Kai-Heng Feng
@ 2020-09-23 12:17 ` Andrew Lunn
  2020-09-23 14:44   ` Kai-Heng Feng
  2020-09-23 13:28 ` [Intel-wired-lan] " Paul Menzel
  2020-09-24 15:09 ` [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit Kai-Heng Feng
  2 siblings, 1 reply; 31+ messages in thread
From: Andrew Lunn @ 2020-09-23 12:17 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: jeffrey.t.kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

On Wed, Sep 23, 2020 at 03:47:51PM +0800, Kai-Heng Feng wrote:
> We are seeing the following error after S3 resume:
> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> ...
> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> 
> Since we don't know what platform firmware may do to the phy, so let's
> power cycle the phy upon system resume to resolve the issue.
> 
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 664e8ccc88d2..c2a87a408102 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>  	    !e1000e_check_me(hw->adapter->pdev->device))
>  		e1000e_s0ix_exit_flow(adapter);
>  
> +	e1000_power_down_phy(adapter);
> +

static void e1000_power_down_phy(struct e1000_adapter *adapter)
{
	struct e1000_hw *hw = &adapter->hw;

	/* Power down the PHY so no link is implied when interface is down *
	 * The PHY cannot be powered down if any of the following is true *
	 * (a) WoL is enabled
	 * (b) AMT is active
	 * (c) SoL/IDER session is active
	 */
	if (!adapter->wol && hw->mac_type >= e1000_82540 &&
	   hw->media_type == e1000_media_type_copper) {

Could it be coming out of S3 because it just received a WoL?

It seems unlikely that it is the MII_CR_POWER_DOWN which is helping,
since that is an MDIO write itself. Do you actually know how this call
to e1000_power_down_phy() fixes the issues?

   Andrew

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH] e1000e: Power cycle phy on PM resume
  2020-09-23  7:47 [PATCH] e1000e: Power cycle phy on PM resume Kai-Heng Feng
  2020-09-23 12:17 ` Andrew Lunn
@ 2020-09-23 13:28 ` Paul Menzel
  2020-09-23 14:46   ` Kai-Heng Feng
  2020-09-24 15:09 ` [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit Kai-Heng Feng
  2 siblings, 1 reply; 31+ messages in thread
From: Paul Menzel @ 2020-09-23 13:28 UTC (permalink / raw)
  To: Kai-Heng Feng, Jeff Kirsher
  Cc: netdev, linux-kernel, intel-wired-lan, Jakub Kicinski, David S. Miller

Dear Kai-Heng,


Am 23.09.20 um 09:47 schrieb Kai-Heng Feng:
> We are seeing the following error after S3 resume:
> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> ...
> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> 
> Since we don't know what platform firmware may do to the phy, so let's
> power cycle the phy upon system resume to resolve the issue.

Is there a bug report or list thread for this issue?

> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>   drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 664e8ccc88d2..c2a87a408102 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>   	    !e1000e_check_me(hw->adapter->pdev->device))
>   		e1000e_s0ix_exit_flow(adapter);
>   
> +	e1000_power_down_phy(adapter);
> +
>   	rc = __e1000_resume(pdev);
>   	if (rc)
>   		return rc;

How much does this increase the resume time?


Kind regards,

Paul


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] e1000e: Power cycle phy on PM resume
  2020-09-23 12:17 ` Andrew Lunn
@ 2020-09-23 14:44   ` Kai-Heng Feng
  2020-09-23 15:37     ` Andrew Lunn
  0 siblings, 1 reply; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-23 14:44 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: jeffrey.t.kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

Hi Andrew,

> On Sep 23, 2020, at 20:17, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> On Wed, Sep 23, 2020 at 03:47:51PM +0800, Kai-Heng Feng wrote:
>> We are seeing the following error after S3 resume:
>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>> ...
>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>> 
>> Since we don't know what platform firmware may do to the phy, so let's
>> power cycle the phy upon system resume to resolve the issue.
>> 
>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>> ---
>> drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>> 1 file changed, 2 insertions(+)
>> 
>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>> index 664e8ccc88d2..c2a87a408102 100644
>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>> 	    !e1000e_check_me(hw->adapter->pdev->device))
>> 		e1000e_s0ix_exit_flow(adapter);
>> 
>> +	e1000_power_down_phy(adapter);
>> +
> 
> static void e1000_power_down_phy(struct e1000_adapter *adapter)
> {
> 	struct e1000_hw *hw = &adapter->hw;
> 
> 	/* Power down the PHY so no link is implied when interface is down *
> 	 * The PHY cannot be powered down if any of the following is true *
> 	 * (a) WoL is enabled
> 	 * (b) AMT is active
> 	 * (c) SoL/IDER session is active
> 	 */
> 	if (!adapter->wol && hw->mac_type >= e1000_82540 &&
> 	   hw->media_type == e1000_media_type_copper) {

Looks like the the function comes from e1000, drivers/net/ethernet/intel/e1000/e1000_main.c.
However, this patch is for e1000e, so the function with same name is different.

> 
> Could it be coming out of S3 because it just received a WoL?

No, the issue can be reproduced by pressing keyboard or rtcwake.

> 
> It seems unlikely that it is the MII_CR_POWER_DOWN which is helping,
> since that is an MDIO write itself. Do you actually know how this call
> to e1000_power_down_phy() fixes the issues?

I don't know from hardware's perspective, but I think the comment on e1000_power_down_phy_copper() can give us some insight:

/**
 * e1000_power_down_phy_copper - Restore copper link in case of PHY power down
 * @hw: pointer to the HW structure
 *
 * In the case of a PHY power down to save power, or to turn off link during a
 * driver unload, or wake on lan is not enabled, restore the link to previous
 * settings.                       
 **/
void e1000_power_down_phy_copper(struct e1000_hw *hw)
{
        u16 mii_reg = 0;

        /* The PHY will retain its settings across a power down/up cycle */
        e1e_rphy(hw, MII_BMCR, &mii_reg);
        mii_reg |= BMCR_PDOWN;
        e1e_wphy(hw, MII_BMCR, mii_reg);
        usleep_range(1000, 2000);
}

Kai-Heng

> 
>   Andrew


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH] e1000e: Power cycle phy on PM resume
  2020-09-23 13:28 ` [Intel-wired-lan] " Paul Menzel
@ 2020-09-23 14:46   ` Kai-Heng Feng
  2020-09-23 15:02     ` Paul Menzel
  0 siblings, 1 reply; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-23 14:46 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Jeff Kirsher, open list:NETWORKING DRIVERS, open list,
	moderated list:INTEL ETHERNET DRIVERS, Jakub Kicinski,
	David S. Miller

Hi Paul,

> On Sep 23, 2020, at 21:28, Paul Menzel <pmenzel@molgen.mpg.de> wrote:
> 
> Dear Kai-Heng,
> 
> 
> Am 23.09.20 um 09:47 schrieb Kai-Heng Feng:
>> We are seeing the following error after S3 resume:
>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>> ...
>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>> Since we don't know what platform firmware may do to the phy, so let's
>> power cycle the phy upon system resume to resolve the issue.
> 
> Is there a bug report or list thread for this issue?

No. That's why I sent a patch to start discussion :)

> 
>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>> ---
>>  drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>>  1 file changed, 2 insertions(+)
>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>> index 664e8ccc88d2..c2a87a408102 100644
>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>>  	    !e1000e_check_me(hw->adapter->pdev->device))
>>  		e1000e_s0ix_exit_flow(adapter);
>>  +	e1000_power_down_phy(adapter);
>> +
>>  	rc = __e1000_resume(pdev);
>>  	if (rc)
>>  		return rc;
> 
> How much does this increase the resume time?

I didn't measure it. Because for me it's more important to have a working device.

Does it have a noticeable impact on your system's resume time?

Kai-Heng

> 
> 
> Kind regards,
> 
> Paul
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH] e1000e: Power cycle phy on PM resume
  2020-09-23 14:46   ` Kai-Heng Feng
@ 2020-09-23 15:02     ` Paul Menzel
  2020-09-23 19:28       ` Andrew Lunn
  0 siblings, 1 reply; 31+ messages in thread
From: Paul Menzel @ 2020-09-23 15:02 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Jeff Kirsher, netdev, linux-kernel, intel-wired-lan,
	Jakub Kicinski, David S. Miller

Dear Kai-Heng,


Am 23.09.20 um 16:46 schrieb Kai-Heng Feng:

>> On Sep 23, 2020, at 21:28, Paul Menzel wrote:

>> Am 23.09.20 um 09:47 schrieb Kai-Heng Feng:
>>> We are seeing the following error after S3 resume:
>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>>> ...
>>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>>> Since we don't know what platform firmware may do to the phy, so let's
>>> power cycle the phy upon system resume to resolve the issue.
>>
>> Is there a bug report or list thread for this issue?
> 
> No. That's why I sent a patch to start discussion :)

Then please add on what systems that is.

>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>> ---
>>>   drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>>> index 664e8ccc88d2..c2a87a408102 100644
>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>>> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>>>   	    !e1000e_check_me(hw->adapter->pdev->device))
>>>   		e1000e_s0ix_exit_flow(adapter);
>>>   +	e1000_power_down_phy(adapter);
>>> +
>>>   	rc = __e1000_resume(pdev);
>>>   	if (rc)
>>>   		return rc;
>>
>> How much does this increase the resume time?
> 
> I didn't measure it. Because for me it's more important to have a working device.
> 
> Does it have a noticeable impact on your system's resume time?

I am not able to test the patch right now. But resume time is important 
to me. As I do not have the problem, nothing should be changed for my 
system (Dell Latitude E7250).

     00:19.0 Ethernet controller [0200]: Intel Corporation Ethernet 
Connection (3) I218-LM [8086:15a2] (rev 03)


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] e1000e: Power cycle phy on PM resume
  2020-09-23 14:44   ` Kai-Heng Feng
@ 2020-09-23 15:37     ` Andrew Lunn
  2020-09-24 12:50       ` Kai-Heng Feng
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Lunn @ 2020-09-23 15:37 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: jeffrey.t.kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

On Wed, Sep 23, 2020 at 10:44:10PM +0800, Kai-Heng Feng wrote:
> Hi Andrew,
> 
> > On Sep 23, 2020, at 20:17, Andrew Lunn <andrew@lunn.ch> wrote:
> > 
> > On Wed, Sep 23, 2020 at 03:47:51PM +0800, Kai-Heng Feng wrote:
> >> We are seeing the following error after S3 resume:
> >> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> >> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> >> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> >> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> >> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> >> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> >> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> >> ...
> >> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> >> 
> >> Since we don't know what platform firmware may do to the phy, so let's
> >> power cycle the phy upon system resume to resolve the issue.
> >> 
> >> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> >> ---
> >> drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
> >> 1 file changed, 2 insertions(+)
> >> 
> >> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> >> index 664e8ccc88d2..c2a87a408102 100644
> >> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> >> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> >> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
> >> 	    !e1000e_check_me(hw->adapter->pdev->device))
> >> 		e1000e_s0ix_exit_flow(adapter);
> >> 
> >> +	e1000_power_down_phy(adapter);
> >> +
> > 
> > static void e1000_power_down_phy(struct e1000_adapter *adapter)
> > {
> > 	struct e1000_hw *hw = &adapter->hw;
> > 
> > 	/* Power down the PHY so no link is implied when interface is down *
> > 	 * The PHY cannot be powered down if any of the following is true *
> > 	 * (a) WoL is enabled
> > 	 * (b) AMT is active
> > 	 * (c) SoL/IDER session is active
> > 	 */
> > 	if (!adapter->wol && hw->mac_type >= e1000_82540 &&
> > 	   hw->media_type == e1000_media_type_copper) {
> 
> Looks like the the function comes from e1000, drivers/net/ethernet/intel/e1000/e1000_main.c.
> However, this patch is for e1000e, so the function with same name is different.

Ah! Sorry. Missed that. Also it is not nice there are two functions in
the kernel with the same name.

> > Could it be coming out of S3 because it just received a WoL?
> 
> No, the issue can be reproduced by pressing keyboard or rtcwake.
 
Not relevant now, since i was looking at the wrong function. But i was
meaning the call is a NOP in the case WoL caused the wake up. So if
the issues can also happen after WoL, your fix is not going to fix it.

> > It seems unlikely that it is the MII_CR_POWER_DOWN which is helping,
> > since that is an MDIO write itself. Do you actually know how this call
> > to e1000_power_down_phy() fixes the issues?
> 

> I don't know from hardware's perspective, but I think the comment on
> e1000_power_down_phy_copper() can give us some insight:

And there is only one function called e1000_power_down_phy_copper()
:-)

> 
> /**
>  * e1000_power_down_phy_copper - Restore copper link in case of PHY power down
>  * @hw: pointer to the HW structure
>  *
>  * In the case of a PHY power down to save power, or to turn off link during a
>  * driver unload, or wake on lan is not enabled, restore the link to previous
>  * settings.                       
>  **/
> void e1000_power_down_phy_copper(struct e1000_hw *hw)
> {
>         u16 mii_reg = 0;
> 
>         /* The PHY will retain its settings across a power down/up cycle */
>         e1e_rphy(hw, MII_BMCR, &mii_reg);
>         mii_reg |= BMCR_PDOWN;
>         e1e_wphy(hw, MII_BMCR, mii_reg);
>         usleep_range(1000, 2000);
> }

I don't really see how this explains this:

> >> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> >> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete

https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000e/phy.c#L181

So first off, the comments are all cut/paste from
e1000e_read_phy_reg_mdic(). It would be nice to s/read/write/g in that
function.

So it sets up the transaction and starts it. MDIO is a serial bus with
no acknowledgements. You clock out around 64 bits, and hope the PHY
receives it. The time it takes to send those 64 bits is fixed by the
bus speed, typically 2.5MHz.

So the driver polls waiting for the hardware to say the bits have been
sent. And this is timing out. How long that takes has nothing to do
with the PHY, or what state it is in. Powering down the PHY has no
effect on the MDIO bus master, and how long it takes to shift those
bits out. Which is why i don't think this patch is correct. This is
probably an MDIO bus issue, not a PHY issue.

Try dumping the value of MDIC in the good/bad case before the
transaction starts.

	 Andrew

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH] e1000e: Power cycle phy on PM resume
  2020-09-23 15:02     ` Paul Menzel
@ 2020-09-23 19:28       ` Andrew Lunn
  2020-09-24 13:02         ` Paul Menzel
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Lunn @ 2020-09-23 19:28 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Kai-Heng Feng, Jeff Kirsher, netdev, linux-kernel,
	intel-wired-lan, Jakub Kicinski, David S. Miller

> > > How much does this increase the resume time?

Define resume time? Until you get the display manager unlock screen?
Or do you need working networking?

It takes around 1.5 seconds for auto negotiation to get a link. I know
it takes me longer than that to move my fingers to the keyboard and
type in my password to unlock the screen. So by the time you actually
get to see your desktop, you should have link.

I've no idea about how the e1000e driver does link negotiation. But
powering the PHY off means there is going to be a negotiation sometime
later. But if you don't turn it off, the driver might be able to avoid
doing an autoneg if the PHY has already done one when it got powered
up.

      Andrew

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] e1000e: Power cycle phy on PM resume
  2020-09-23 15:37     ` Andrew Lunn
@ 2020-09-24 12:50       ` Kai-Heng Feng
  0 siblings, 0 replies; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-24 12:50 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jeff Kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

Hi Andrew,

> On Sep 23, 2020, at 23:37, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> On Wed, Sep 23, 2020 at 10:44:10PM +0800, Kai-Heng Feng wrote:
>> Hi Andrew,
>> 
>>> On Sep 23, 2020, at 20:17, Andrew Lunn <andrew@lunn.ch> wrote:
>>> 
>>> On Wed, Sep 23, 2020 at 03:47:51PM +0800, Kai-Heng Feng wrote:
>>>> We are seeing the following error after S3 resume:
>>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>>>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>>>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>>>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>>>> ...
>>>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>>>> 
>>>> Since we don't know what platform firmware may do to the phy, so let's
>>>> power cycle the phy upon system resume to resolve the issue.
>>>> 
>>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>>> ---
>>>> drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>> 
>>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>> index 664e8ccc88d2..c2a87a408102 100644
>>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>> @@ -6968,6 +6968,8 @@ static __maybe_unused int e1000e_pm_resume(struct device *dev)
>>>> 	    !e1000e_check_me(hw->adapter->pdev->device))
>>>> 		e1000e_s0ix_exit_flow(adapter);
>>>> 
>>>> +	e1000_power_down_phy(adapter);
>>>> +
>>> 
>>> static void e1000_power_down_phy(struct e1000_adapter *adapter)
>>> {
>>> 	struct e1000_hw *hw = &adapter->hw;
>>> 
>>> 	/* Power down the PHY so no link is implied when interface is down *
>>> 	 * The PHY cannot be powered down if any of the following is true *
>>> 	 * (a) WoL is enabled
>>> 	 * (b) AMT is active
>>> 	 * (c) SoL/IDER session is active
>>> 	 */
>>> 	if (!adapter->wol && hw->mac_type >= e1000_82540 &&
>>> 	   hw->media_type == e1000_media_type_copper) {
>> 
>> Looks like the the function comes from e1000, drivers/net/ethernet/intel/e1000/e1000_main.c.
>> However, this patch is for e1000e, so the function with same name is different.
> 
> Ah! Sorry. Missed that. Also it is not nice there are two functions in
> the kernel with the same name.
> 
>>> Could it be coming out of S3 because it just received a WoL?
>> 
>> No, the issue can be reproduced by pressing keyboard or rtcwake.
> 
> Not relevant now, since i was looking at the wrong function. But i was
> meaning the call is a NOP in the case WoL caused the wake up. So if
> the issues can also happen after WoL, your fix is not going to fix it.
> 
>>> It seems unlikely that it is the MII_CR_POWER_DOWN which is helping,
>>> since that is an MDIO write itself. Do you actually know how this call
>>> to e1000_power_down_phy() fixes the issues?
>> 
> 
>> I don't know from hardware's perspective, but I think the comment on
>> e1000_power_down_phy_copper() can give us some insight:
> 
> And there is only one function called e1000_power_down_phy_copper()
> :-)
> 
>> 
>> /**
>> * e1000_power_down_phy_copper - Restore copper link in case of PHY power down
>> * @hw: pointer to the HW structure
>> *
>> * In the case of a PHY power down to save power, or to turn off link during a
>> * driver unload, or wake on lan is not enabled, restore the link to previous
>> * settings.                       
>> **/
>> void e1000_power_down_phy_copper(struct e1000_hw *hw)
>> {
>>        u16 mii_reg = 0;
>> 
>>        /* The PHY will retain its settings across a power down/up cycle */
>>        e1e_rphy(hw, MII_BMCR, &mii_reg);
>>        mii_reg |= BMCR_PDOWN;
>>        e1e_wphy(hw, MII_BMCR, mii_reg);
>>        usleep_range(1000, 2000);
>> }
> 
> I don't really see how this explains this:
> 
>>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> 
> https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000e/phy.c#L181
> 
> So first off, the comments are all cut/paste from
> e1000e_read_phy_reg_mdic(). It would be nice to s/read/write/g in that
> function.

Ah yes...

> 
> So it sets up the transaction and starts it. MDIO is a serial bus with
> no acknowledgements. You clock out around 64 bits, and hope the PHY
> receives it. The time it takes to send those 64 bits is fixed by the
> bus speed, typically 2.5MHz.

Thanks for the info.

> 
> So the driver polls waiting for the hardware to say the bits have been
> sent. And this is timing out. How long that takes has nothing to do
> with the PHY, or what state it is in. Powering down the PHY has no
> effect on the MDIO bus master, and how long it takes to shift those
> bits out. Which is why i don't think this patch is correct. This is
> probably an MDIO bus issue, not a PHY issue.

Thanks for pointing out the possible root cause.
Indeed this looks like an MDIO issue so this patch is completely wrong.

I'll send a V2, thanks.

Kai-Heng

> 
> Try dumping the value of MDIC in the good/bad case before the
> transaction starts.
> 
> 	 Andrew


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH] e1000e: Power cycle phy on PM resume
  2020-09-23 19:28       ` Andrew Lunn
@ 2020-09-24 13:02         ` Paul Menzel
  0 siblings, 0 replies; 31+ messages in thread
From: Paul Menzel @ 2020-09-24 13:02 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Kai-Heng Feng, Jeff Kirsher, netdev, linux-kernel,
	intel-wired-lan, Jakub Kicinski, David S. Miller

Dear Andrew,


Am 23.09.20 um 21:28 schrieb Andrew Lunn:
>>>> How much does this increase the resume time?
> 
> Define resume time? Until you get the display manager unlock screen?
> Or do you need working networking?

Until network is functional again. Currently, the speed negotiation 
alone takes three(?) seconds, so making it even longer is unacceptable. 
(You wrote it below.)

> It takes around 1.5 seconds for auto negotiation to get a link. I know
> it takes me longer than that to move my fingers to the keyboard and
> type in my password to unlock the screen. So by the time you actually
> get to see your desktop, you should have link.

Not here.

> I've no idea about how the e1000e driver does link negotiation. But
> powering the PHY off means there is going to be a negotiation sometime
> later. But if you don't turn it off, the driver might be able to avoid
> doing an autoneg if the PHY has already done one when it got powered
> up.

Indeed.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-23  7:47 [PATCH] e1000e: Power cycle phy on PM resume Kai-Heng Feng
  2020-09-23 12:17 ` Andrew Lunn
  2020-09-23 13:28 ` [Intel-wired-lan] " Paul Menzel
@ 2020-09-24 15:09 ` Kai-Heng Feng
  2020-09-24 15:32   ` [Intel-wired-lan] " Paul Menzel
                     ` (2 more replies)
  2 siblings, 3 replies; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-24 15:09 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: andrew, Kai-Heng Feng, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

We are seeing the following error after S3 resume:
[  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
[  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
...
[  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error

As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
increase polling iteration can resolve the issue.

While at it, also move the delay to the end of loop, to potentially save
50 us.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
v2:
 - Increase polling iteration instead of powering down the phy.

 drivers/net/ethernet/intel/e1000e/phy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
index e11c877595fb..72968a01164b 100644
--- a/drivers/net/ethernet/intel/e1000e/phy.c
+++ b/drivers/net/ethernet/intel/e1000e/phy.c
@@ -203,11 +203,12 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
 	 * Increasing the time out as testing showed failures with
 	 * the lower time out
 	 */
-	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
-		udelay(50);
+	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
 		mdic = er32(MDIC);
 		if (mdic & E1000_MDIC_READY)
 			break;
+
+		udelay(50);
 	}
 	if (!(mdic & E1000_MDIC_READY)) {
 		e_dbg("MDI Write did not complete\n");
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-24 15:09 ` [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit Kai-Heng Feng
@ 2020-09-24 15:32   ` Paul Menzel
  2020-09-24 16:04     ` Andrew Lunn
  2020-09-24 15:53   ` Andrew Lunn
  2020-09-24 16:45   ` [PATCH v3] " Kai-Heng Feng
  2 siblings, 1 reply; 31+ messages in thread
From: Paul Menzel @ 2020-09-24 15:32 UTC (permalink / raw)
  To: Kai-Heng Feng, Jeff Kirsher
  Cc: Andrew Lunn, netdev, linux-kernel, intel-wired-lan,
	Jakub Kicinski, David S. Miller

Dear Kai-Heng,


Thank you for sending version 2.

Am 24.09.20 um 17:09 schrieb Kai-Heng Feng:
> We are seeing the following error after S3 resume:

I’d be great if you added the system and used hardware, you are seeing 
this with.

> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete

A follow-up patch, should extend the message to include the timeout value.

 > MDI Write did not complete did not complete in … seconds.

According to the Linux timestamps it’s 98 ms, which makes sense, as (640 
* 3 * 50 μs = 96 ms).

What crappy hardware is this, that it takes longer than 100 ms?

> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> ...
> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> 
> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
> increase polling iteration can resolve the issue.

Please explicitly state, what the current timeout value is, and what it 
is increased to.

     640 * 3 * 50 μs = 96 ms
     640 * 10 * 50 μs = 320 ms

The macro definition also misses the unit.

     /* SerDes Control */
     #define E1000_GEN_POLL_TIMEOUT          640

How did you determine, that tenfold that value is good. And not 
eightfold, for example? Please give the exact value (Linux log message 
timestamps should be enough), what the hardware needs now.

As a commit message summary, I suggest:

 > e1000e: Increase MDIC ready bit polling timeout from 96 ms to 320 ms

> While at it, also move the delay to the end of loop, to potentially save
> 50 us.
> 
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
> v2:
>   - Increase polling iteration instead of powering down the phy.
> 
>   drivers/net/ethernet/intel/e1000e/phy.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
> index e11c877595fb..72968a01164b 100644
> --- a/drivers/net/ethernet/intel/e1000e/phy.c
> +++ b/drivers/net/ethernet/intel/e1000e/phy.c
> @@ -203,11 +203,12 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
>   	 * Increasing the time out as testing showed failures with
>   	 * the lower time out
>   	 */
> -	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
> -		udelay(50);
> +	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
>   		mdic = er32(MDIC);
>   		if (mdic & E1000_MDIC_READY)
>   			break;
> +
> +		udelay(50);
>   	}
>   	if (!(mdic & E1000_MDIC_READY)) {
>   		e_dbg("MDI Write did not complete\n");
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-24 15:09 ` [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit Kai-Heng Feng
  2020-09-24 15:32   ` [Intel-wired-lan] " Paul Menzel
@ 2020-09-24 15:53   ` Andrew Lunn
  2020-09-24 16:04     ` Kai-Heng Feng
  2020-09-24 16:45   ` [PATCH v3] " Kai-Heng Feng
  2 siblings, 1 reply; 31+ messages in thread
From: Andrew Lunn @ 2020-09-24 15:53 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: jeffrey.t.kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

On Thu, Sep 24, 2020 at 11:09:58PM +0800, Kai-Heng Feng wrote:
> We are seeing the following error after S3 resume:
> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> ...
> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> 
> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
> increase polling iteration can resolve the issue.
> 
> While at it, also move the delay to the end of loop, to potentially save
> 50 us.

You are unlikely to save any time. 64 bits at 2.5MHz is 25.6uS. So it
is very unlikely doing a read directly after setting is going is going
to have E1000_MDIC_READY set. So this change likely causes an addition
read on MDIC. Did you profile this at all, for the normal case?

I also don't fully understand the fix. You are now looping up to 6400
times, each with a delay of 50uS. So that is around 12800 times more
than it actually needs to transfer the 64 bits! I've no idea how this
hardware works, but my guess would be, something is wrong with the
clock setup?

     Andrew

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-24 15:53   ` Andrew Lunn
@ 2020-09-24 16:04     ` Kai-Heng Feng
  2020-09-25  8:50       ` David Laight
  0 siblings, 1 reply; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-24 16:04 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jeff Kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

Hi Andrew,

> On Sep 24, 2020, at 23:53, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> On Thu, Sep 24, 2020 at 11:09:58PM +0800, Kai-Heng Feng wrote:
>> We are seeing the following error after S3 resume:
>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>> ...
>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>> 
>> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
>> increase polling iteration can resolve the issue.
>> 
>> While at it, also move the delay to the end of loop, to potentially save
>> 50 us.
> 
> You are unlikely to save any time. 64 bits at 2.5MHz is 25.6uS. So it
> is very unlikely doing a read directly after setting is going is going
> to have E1000_MDIC_READY set. So this change likely causes an addition
> read on MDIC. Did you profile this at all, for the normal case?

You are right, it's actually may add an additional read.
Let me send a v3.

> 
> I also don't fully understand the fix. You are now looping up to 6400
> times, each with a delay of 50uS. So that is around 12800 times more
> than it actually needs to transfer the 64 bits! I've no idea how this
> hardware works, but my guess would be, something is wrong with the
> clock setup?

It's probably caused by Intel ME. This is not something new, you can find many polling codes in e1000e driver are for ME, especially after S3 resume.

Unless Intel is willing to open up ME, being patient and wait for a longer while is the best approach we got.

Kai-Heng

> 
>     Andrew


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-24 15:32   ` [Intel-wired-lan] " Paul Menzel
@ 2020-09-24 16:04     ` Andrew Lunn
  0 siblings, 0 replies; 31+ messages in thread
From: Andrew Lunn @ 2020-09-24 16:04 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Kai-Heng Feng, Jeff Kirsher, netdev, linux-kernel,
	intel-wired-lan, Jakub Kicinski, David S. Miller

On Thu, Sep 24, 2020 at 05:32:12PM +0200, Paul Menzel wrote:
> Dear Kai-Heng,
> 
> 
> Thank you for sending version 2.
> 
> Am 24.09.20 um 17:09 schrieb Kai-Heng Feng:
> > We are seeing the following error after S3 resume:
> 
> I’d be great if you added the system and used hardware, you are seeing this
> with.
> 
> > [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> > [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> 
> A follow-up patch, should extend the message to include the timeout value.
> 
> > MDI Write did not complete did not complete in … seconds.
> 
> According to the Linux timestamps it’s 98 ms, which makes sense, as (640 * 3
> * 50 μs = 96 ms).
> 
> What crappy hardware is this, that it takes longer than 100 ms?

I'm speculating, but i guess this happens with just the first couple
of transfers after power up. After that, it probably takes a single
loop. It would be good to see some profile data for this. Completely
different MDIO driver and implementation, but this patch might give
some ideas how to do the profiling:

https://github.com/lunn/linux/commit/76c7810a7e2c1b1e28a7a95d08dd440a8f48a516

Look at the debugfs and num_loops/us parts.

     Andrew

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-24 15:09 ` [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit Kai-Heng Feng
  2020-09-24 15:32   ` [Intel-wired-lan] " Paul Menzel
  2020-09-24 15:53   ` Andrew Lunn
@ 2020-09-24 16:45   ` Kai-Heng Feng
  2020-09-24 19:57     ` Andrew Lunn
                       ` (2 more replies)
  2 siblings, 3 replies; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-24 16:45 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: andrew, Kai-Heng Feng, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

We are seeing the following error after S3 resume:
[  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
[  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
...
[  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error

As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
increase polling iteration can resolve the issue.

The root cause is quite likely Intel ME, since it's a blackbox to the
kernel so the only approach we can take is to be patient and wait
longer.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
v3:
 - Moving delay to end of loop doesn't save anytime, move it back.
 - Point out this is quitely likely caused by Intel ME.

v2:
 - Increase polling iteration instead of powering down the phy.

 drivers/net/ethernet/intel/e1000e/phy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
index e11c877595fb..e6d4acd90937 100644
--- a/drivers/net/ethernet/intel/e1000e/phy.c
+++ b/drivers/net/ethernet/intel/e1000e/phy.c
@@ -203,7 +203,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
 	 * Increasing the time out as testing showed failures with
 	 * the lower time out
 	 */
-	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
+	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
 		udelay(50);
 		mdic = er32(MDIC);
 		if (mdic & E1000_MDIC_READY)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-24 16:45   ` [PATCH v3] " Kai-Heng Feng
@ 2020-09-24 19:57     ` Andrew Lunn
  2020-09-25  3:57       ` Kai-Heng Feng
  2020-09-25  5:16     ` [Intel-wired-lan] " Paul Menzel
  2020-09-28  8:36     ` [PATCH v4] e1000e: Increase polling timeout on " Kai-Heng Feng
  2 siblings, 1 reply; 31+ messages in thread
From: Andrew Lunn @ 2020-09-24 19:57 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: jeffrey.t.kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

On Fri, Sep 25, 2020 at 12:45:42AM +0800, Kai-Heng Feng wrote:
> We are seeing the following error after S3 resume:
> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> ...
> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> 
> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
> increase polling iteration can resolve the issue.
> 
> The root cause is quite likely Intel ME, since it's a blackbox to the
> kernel so the only approach we can take is to be patient and wait
> longer.

Please could you explain how you see Intel ME being responsible for
this. I'm not convinced.

      Andrew

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-24 19:57     ` Andrew Lunn
@ 2020-09-25  3:57       ` Kai-Heng Feng
  0 siblings, 0 replies; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-25  3:57 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jeff Kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list



> On Sep 25, 2020, at 03:57, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> On Fri, Sep 25, 2020 at 12:45:42AM +0800, Kai-Heng Feng wrote:
>> We are seeing the following error after S3 resume:
>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>> ...
>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>> 
>> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
>> increase polling iteration can resolve the issue.
>> 
>> The root cause is quite likely Intel ME, since it's a blackbox to the
>> kernel so the only approach we can take is to be patient and wait
>> longer.
> 
> Please could you explain how you see Intel ME being responsible for
> this. I'm not convinced.

Some other occurrences:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d17c7868b2f8e329dcee4ecd2f5d16cfc9b26ac8
https://lore.kernel.org/netdev/20200323191639.48826-1-aaron.ma@canonical.com/

Of course we need an ACK from Intel this one is also related to ME.

Kai-Heng

> 
>      Andrew


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v3] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-24 16:45   ` [PATCH v3] " Kai-Heng Feng
  2020-09-24 19:57     ` Andrew Lunn
@ 2020-09-25  5:16     ` Paul Menzel
  2020-09-28  8:36     ` [PATCH v4] e1000e: Increase polling timeout on " Kai-Heng Feng
  2 siblings, 0 replies; 31+ messages in thread
From: Paul Menzel @ 2020-09-25  5:16 UTC (permalink / raw)
  To: Kai-Heng Feng, Jeff Kirsher
  Cc: Andrew Lunn, netdev, linux-kernel, intel-wired-lan,
	Jakub Kicinski, David S. Miller

Dear Kai-Heng,


Thank you for patch version 3.

Am 24.09.20 um 18:45 schrieb Kai-Heng Feng:
> We are seeing the following error after S3 resume:
> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> ...
> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> 
> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
> increase polling iteration can resolve the issue.
> 
> The root cause is quite likely Intel ME, since it's a blackbox to the
> kernel so the only approach we can take is to be patient and wait
> longer.
> 
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
> v3:
>   - Moving delay to end of loop doesn't save anytime, move it back.
>   - Point out this is quitely likely caused by Intel ME.

quietly

You seem to have missed my comments regarding patch version 3. It’d be 
great if you improved the commit message with my suggestions.

Without knowing what hardware this happened on, nobody, even later 
getting the hardware, can reproduce the your results. If you say the ME 
is involved, please also document the ME firmware version, which is used 
here.

> v2:
>   - Increase polling iteration instead of powering down the phy.
> 
>   drivers/net/ethernet/intel/e1000e/phy.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
> index e11c877595fb..e6d4acd90937 100644
> --- a/drivers/net/ethernet/intel/e1000e/phy.c
> +++ b/drivers/net/ethernet/intel/e1000e/phy.c
> @@ -203,7 +203,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
>   	 * Increasing the time out as testing showed failures with
>   	 * the lower time out
>   	 */
> -	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
> +	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
>   		udelay(50);
>   		mdic = er32(MDIC);
>   		if (mdic & E1000_MDIC_READY)

In the PCI subsystem, a warning is shown, when something takes more then 
100 ms. As you increase it to over 320 ms, a warning should be printed 
to talk to the firmware folks, when it passes 100 ms.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-24 16:04     ` Kai-Heng Feng
@ 2020-09-25  8:50       ` David Laight
  2020-09-25 13:29         ` Andrew Lunn
  0 siblings, 1 reply; 31+ messages in thread
From: David Laight @ 2020-09-25  8:50 UTC (permalink / raw)
  To: 'Kai-Heng Feng', Andrew Lunn
  Cc: Jeff Kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

From: Kai-Heng Feng
> Sent: 24 September 2020 17:04
...
> > I also don't fully understand the fix. You are now looping up to 6400
> > times, each with a delay of 50uS. So that is around 12800 times more
> > than it actually needs to transfer the 64 bits! I've no idea how this
> > hardware works, but my guess would be, something is wrong with the
> > clock setup?
> 
> It's probably caused by Intel ME. This is not something new, you can find many polling codes in e1000e
> driver are for ME, especially after S3 resume.
> 
> Unless Intel is willing to open up ME, being patient and wait for a longer while is the best approach
> we got.

There is some really broken code in the e1000e driver that affect my
Ivy bridge platform were it is trying to avoid hardware bugs in
the ME interface.

It seems that before EVERY write to a MAC register it must check
that the ME isn't using the interface - and spin until it isn't.
This causes massive delays in the TX path because it includes
the write that tells the MAC engine about a new packet.

The code is completely broken though.
Interrupts and processes switches can happen between the
test for the ME being idle and the actual write.

AFAICT the only reliable way to get ethernet access on such
systems is to use a different ethernet interface.

Also, from what I remember, the broken workaround is missing
from some of the setup code paths.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-25  8:50       ` David Laight
@ 2020-09-25 13:29         ` Andrew Lunn
  2020-09-26 10:08           ` David Laight
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Lunn @ 2020-09-25 13:29 UTC (permalink / raw)
  To: David Laight
  Cc: 'Kai-Heng Feng',
	Jeff Kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

On Fri, Sep 25, 2020 at 08:50:30AM +0000, David Laight wrote:
> From: Kai-Heng Feng
> > Sent: 24 September 2020 17:04
> ...
> > > I also don't fully understand the fix. You are now looping up to 6400
> > > times, each with a delay of 50uS. So that is around 12800 times more
> > > than it actually needs to transfer the 64 bits! I've no idea how this
> > > hardware works, but my guess would be, something is wrong with the
> > > clock setup?
> > 
> > It's probably caused by Intel ME. This is not something new, you can find many polling codes in e1000e
> > driver are for ME, especially after S3 resume.
> > 
> > Unless Intel is willing to open up ME, being patient and wait for a longer while is the best approach
> > we got.
> 
> There is some really broken code in the e1000e driver that affect my
> Ivy bridge platform were it is trying to avoid hardware bugs in
> the ME interface.
> 
> It seems that before EVERY write to a MAC register it must check
> that the ME isn't using the interface - and spin until it isn't.
> This causes massive delays in the TX path because it includes
> the write that tells the MAC engine about a new packet.

Hi David

Thanks for the information. This however does not really explain the
issue.

The code busy loops waiting for the MDIO transaction to complete. If
read/writes to the MAC are getting blocked, that just means less
iterations of the loop are needed, not more, since the time to
complete the transaction should be fixed.

If ME really is to blame, it means ME is completely hijacking the
hardware? Stopping the clocks? Maybe doing its own MDIO transactions?
How can you write a PHY driver if something else is also programming
the PHY.

We don't understand what is going on here. We are just papering over
the cracks. The commit message should say this, that the change fixes
the symptoms but probably not the cause.

    Andrew

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit
  2020-09-25 13:29         ` Andrew Lunn
@ 2020-09-26 10:08           ` David Laight
  0 siblings, 0 replies; 31+ messages in thread
From: David Laight @ 2020-09-26 10:08 UTC (permalink / raw)
  To: 'Andrew Lunn'
  Cc: 'Kai-Heng Feng',
	Jeff Kirsher, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

From: Andrew Lunn
> Sent: 25 September 2020 14:29
> On Fri, Sep 25, 2020 at 08:50:30AM +0000, David Laight wrote:
> > From: Kai-Heng Feng
> > > Sent: 24 September 2020 17:04
> > ...
> > > > I also don't fully understand the fix. You are now looping up to 6400
> > > > times, each with a delay of 50uS. So that is around 12800 times more
> > > > than it actually needs to transfer the 64 bits! I've no idea how this
> > > > hardware works, but my guess would be, something is wrong with the
> > > > clock setup?
> > >
> > > It's probably caused by Intel ME. This is not something new, you can find many polling codes in
> e1000e
> > > driver are for ME, especially after S3 resume.
> > >
> > > Unless Intel is willing to open up ME, being patient and wait for a longer while is the best
> approach
> > > we got.
> >
> > There is some really broken code in the e1000e driver that affect my
> > Ivy bridge platform were it is trying to avoid hardware bugs in
> > the ME interface.
> >
> > It seems that before EVERY write to a MAC register it must check
> > that the ME isn't using the interface - and spin until it isn't.
> > This causes massive delays in the TX path because it includes
> > the write that tells the MAC engine about a new packet.
> 
> Hi David
> 
> Thanks for the information. This however does not really explain the
> issue.
> 
> The code busy loops waiting for the MDIO transaction to complete. If
> read/writes to the MAC are getting blocked, that just means less
> iterations of the loop are needed, not more, since the time to
> complete the transaction should be fixed.
> 
> If ME really is to blame, it means ME is completely hijacking the
> hardware? Stopping the clocks? Maybe doing its own MDIO transactions?
> How can you write a PHY driver if something else is also programming
> the PHY.
> 
> We don't understand what is going on here. We are just papering over
> the cracks. The commit message should say this, that the change fixes
> the symptoms but probably not the cause.

You may not have the same broken hardware as I have...

From what I could infer from the code and guess from the behaviour
I got the impression that if the ME was accessing any of the MAC
registers it was likely that writes from the kernel just got discarded.

I got the impression that a bug in the hardware was being worked
around by the ME setting a status bit before and access, waiting
a bit for the kernel to finish anything it was doing, then
doing its access and clearing the bit.

The kernel keeps having to wait for the bit to be clear.
These delays were long; sub ms - but far longer than
the rest of the code path for sending a packet.
But the code didn't check/disable pre-emption or interrupts
so the check was actually broken.
(If I removed it completely my system wouldn't boot!)

Thing is I don't want the ME.
I don't need the ME on that system.
The ME might be a security hole.
The ME breaks my system.
But I can't disable it at all.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v4] e1000e: Increase polling timeout on MDIC ready bit
  2020-09-24 16:45   ` [PATCH v3] " Kai-Heng Feng
  2020-09-24 19:57     ` Andrew Lunn
  2020-09-25  5:16     ` [Intel-wired-lan] " Paul Menzel
@ 2020-09-28  8:36     ` Kai-Heng Feng
  2020-09-29 13:08       ` [Intel-wired-lan] " Neftin, Sasha
  2 siblings, 1 reply; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-28  8:36 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: andrew, Kai-Heng Feng, David S. Miller, Jakub Kicinski,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list

We are seeing the following error after S3 resume:
[  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
[  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
[  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
[  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
...
[  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error

As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
increase polling iteration can resolve the issue.

This patch only papers over the symptom, as we don't really know the
root cause of the issue. The most possible culprit is Intel ME, which
may do its own things that conflict with software.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
v4:
 - States that this patch just papers over the symptom.

v3:
 - Moving delay to end of loop doesn't save anytime, move it back.
 - Point out this is quitely likely caused by Intel ME.

v2:
 - Increase polling iteration instead of powering down the phy.

 drivers/net/ethernet/intel/e1000e/phy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
index e11c877595fb..e6d4acd90937 100644
--- a/drivers/net/ethernet/intel/e1000e/phy.c
+++ b/drivers/net/ethernet/intel/e1000e/phy.c
@@ -203,7 +203,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
 	 * Increasing the time out as testing showed failures with
 	 * the lower time out
 	 */
-	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
+	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
 		udelay(50);
 		mdic = er32(MDIC);
 		if (mdic & E1000_MDIC_READY)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v4] e1000e: Increase polling timeout on MDIC ready bit
  2020-09-28  8:36     ` [PATCH v4] e1000e: Increase polling timeout on " Kai-Heng Feng
@ 2020-09-29 13:08       ` Neftin, Sasha
  2020-09-29 13:31         ` Kai-Heng Feng
  0 siblings, 1 reply; 31+ messages in thread
From: Neftin, Sasha @ 2020-09-29 13:08 UTC (permalink / raw)
  To: Kai-Heng Feng, jeffrey.t.kirsher
  Cc: andrew, open list:NETWORKING DRIVERS, open list,
	moderated list:INTEL ETHERNET DRIVERS, Jakub Kicinski,
	David S. Miller, Lifshits, Vitaly, Nguyen, Anthony L

On 9/28/2020 11:36, Kai-Heng Feng wrote:
> We are seeing the following error after S3 resume:
> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
> ...
> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
> 
> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
> increase polling iteration can resolve the issue.
> 
> This patch only papers over the symptom, as we don't really know the
> root cause of the issue. The most possible culprit is Intel ME, which
> may do its own things that conflict with software.
> 
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
> v4:
>   - States that this patch just papers over the symptom.
> 
> v3:
>   - Moving delay to end of loop doesn't save anytime, move it back.
>   - Point out this is quitely likely caused by Intel ME.
> 
> v2:
>   - Increase polling iteration instead of powering down the phy.
> 
>   drivers/net/ethernet/intel/e1000e/phy.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
> index e11c877595fb..e6d4acd90937 100644
> --- a/drivers/net/ethernet/intel/e1000e/phy.c
> +++ b/drivers/net/ethernet/intel/e1000e/phy.c
> @@ -203,7 +203,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
>   	 * Increasing the time out as testing showed failures with
>   	 * the lower time out
>   	 */
> -	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
> +	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
As we discussed (many threads) - AMT/ME systems not supported on Linux 
as properly. I do not think increasing polling iteration will solve the 
problem. Rather mask it.
I prefer you check option to disable ME vi BIOS on your system.
>   		udelay(50);
>   		mdic = er32(MDIC);
>   		if (mdic & E1000_MDIC_READY)
> 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v4] e1000e: Increase polling timeout on MDIC ready bit
  2020-09-29 13:08       ` [Intel-wired-lan] " Neftin, Sasha
@ 2020-09-29 13:31         ` Kai-Heng Feng
  2020-09-29 13:46           ` Neftin, Sasha
  0 siblings, 1 reply; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-29 13:31 UTC (permalink / raw)
  To: Neftin, Sasha
  Cc: Jeff Kirsher, Andrew Lunn, open list:NETWORKING DRIVERS,
	open list, moderated list:INTEL ETHERNET DRIVERS, Jakub Kicinski,
	David S. Miller, Lifshits, Vitaly, Nguyen, Anthony L

Hi Sasha,

> On Sep 29, 2020, at 21:08, Neftin, Sasha <sasha.neftin@intel.com> wrote:
> 
> On 9/28/2020 11:36, Kai-Heng Feng wrote:
>> We are seeing the following error after S3 resume:
>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>> ...
>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
>> increase polling iteration can resolve the issue.
>> This patch only papers over the symptom, as we don't really know the
>> root cause of the issue. The most possible culprit is Intel ME, which
>> may do its own things that conflict with software.
>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>> ---
>> v4:
>>  - States that this patch just papers over the symptom.
>> v3:
>>  - Moving delay to end of loop doesn't save anytime, move it back.
>>  - Point out this is quitely likely caused by Intel ME.
>> v2:
>>  - Increase polling iteration instead of powering down the phy.
>>  drivers/net/ethernet/intel/e1000e/phy.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
>> index e11c877595fb..e6d4acd90937 100644
>> --- a/drivers/net/ethernet/intel/e1000e/phy.c
>> +++ b/drivers/net/ethernet/intel/e1000e/phy.c
>> @@ -203,7 +203,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
>>  	 * Increasing the time out as testing showed failures with
>>  	 * the lower time out
>>  	 */
>> -	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
>> +	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
> As we discussed (many threads) - AMT/ME systems not supported on Linux as properly. I do not think increasing polling iteration will solve the problem. Rather mask it.

I am aware of the status quo of no proper support on Intel ME. 

> I prefer you check option to disable ME vi BIOS on your system.

We can't ask user to change the BIOS to accommodate Linux. So before a proper solution comes out, masking the problem is good enough for me.
Until then, I'll carry it as a downstream distro patch.

Kai-Heng

>>  		udelay(50);
>>  		mdic = er32(MDIC);
>>  		if (mdic & E1000_MDIC_READY)
> Thanks,
> Sasha


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v4] e1000e: Increase polling timeout on MDIC ready bit
  2020-09-29 13:31         ` Kai-Heng Feng
@ 2020-09-29 13:46           ` Neftin, Sasha
  2020-09-29 15:08             ` Kai-Heng Feng
  0 siblings, 1 reply; 31+ messages in thread
From: Neftin, Sasha @ 2020-09-29 13:46 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Jeff Kirsher, Andrew Lunn, open list:NETWORKING DRIVERS,
	open list, moderated list:INTEL ETHERNET DRIVERS, Jakub Kicinski,
	David S. Miller, Lifshits, Vitaly, Nguyen, Anthony L

Hello Kai-Heng,
On 9/29/2020 16:31, Kai-Heng Feng wrote:
> Hi Sasha,
> 
>> On Sep 29, 2020, at 21:08, Neftin, Sasha <sasha.neftin@intel.com> wrote:
>>
>> On 9/28/2020 11:36, Kai-Heng Feng wrote:
>>> We are seeing the following error after S3 resume:
>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>>> ...
>>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>>> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
>>> increase polling iteration can resolve the issue.
>>> This patch only papers over the symptom, as we don't really know the
>>> root cause of the issue. The most possible culprit is Intel ME, which
>>> may do its own things that conflict with software.
>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>> ---
>>> v4:
>>>   - States that this patch just papers over the symptom.
>>> v3:
>>>   - Moving delay to end of loop doesn't save anytime, move it back.
>>>   - Point out this is quitely likely caused by Intel ME.
>>> v2:
>>>   - Increase polling iteration instead of powering down the phy.
>>>   drivers/net/ethernet/intel/e1000e/phy.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>> diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
>>> index e11c877595fb..e6d4acd90937 100644
>>> --- a/drivers/net/ethernet/intel/e1000e/phy.c
>>> +++ b/drivers/net/ethernet/intel/e1000e/phy.c
>>> @@ -203,7 +203,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
>>>   	 * Increasing the time out as testing showed failures with
>>>   	 * the lower time out
>>>   	 */
>>> -	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
>>> +	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
>> As we discussed (many threads) - AMT/ME systems not supported on Linux as properly. I do not think increasing polling iteration will solve the problem. Rather mask it.
> 
> I am aware of the status quo of no proper support on Intel ME.
> 
>> I prefer you check option to disable ME vi BIOS on your system.
> 
> We can't ask user to change the BIOS to accommodate Linux. So before a proper solution comes out, masking the problem is good enough for me.
> Until then, I'll carry it as a downstream distro patch.
What will you do with system that even after increasing polling time 
will run into HW error?
> 
> Kai-Heng
> 
>>>   		udelay(50);
>>>   		mdic = er32(MDIC);
>>>   		if (mdic & E1000_MDIC_READY)
>> Thanks,
>> Sasha
> 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v4] e1000e: Increase polling timeout on MDIC ready bit
  2020-09-29 13:46           ` Neftin, Sasha
@ 2020-09-29 15:08             ` Kai-Heng Feng
  2020-09-29 15:11               ` David Laight
  2020-09-30  6:54               ` Vitaly Lifshits
  0 siblings, 2 replies; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-29 15:08 UTC (permalink / raw)
  To: Neftin, Sasha
  Cc: Jeff Kirsher, Andrew Lunn, open list:NETWORKING DRIVERS,
	open list, moderated list:INTEL ETHERNET DRIVERS, Jakub Kicinski,
	David S. Miller, Lifshits, Vitaly, Nguyen, Anthony L



> On Sep 29, 2020, at 21:46, Neftin, Sasha <sasha.neftin@intel.com> wrote:
> 
> Hello Kai-Heng,
> On 9/29/2020 16:31, Kai-Heng Feng wrote:
>> Hi Sasha,
>>> On Sep 29, 2020, at 21:08, Neftin, Sasha <sasha.neftin@intel.com> wrote:
>>> 
>>> On 9/28/2020 11:36, Kai-Heng Feng wrote:
>>>> We are seeing the following error after S3 resume:
>>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>>>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>>>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>>>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>>>> ...
>>>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>>>> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
>>>> increase polling iteration can resolve the issue.
>>>> This patch only papers over the symptom, as we don't really know the
>>>> root cause of the issue. The most possible culprit is Intel ME, which
>>>> may do its own things that conflict with software.
>>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>>> ---
>>>> v4:
>>>>  - States that this patch just papers over the symptom.
>>>> v3:
>>>>  - Moving delay to end of loop doesn't save anytime, move it back.
>>>>  - Point out this is quitely likely caused by Intel ME.
>>>> v2:
>>>>  - Increase polling iteration instead of powering down the phy.
>>>>  drivers/net/ethernet/intel/e1000e/phy.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>> diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
>>>> index e11c877595fb..e6d4acd90937 100644
>>>> --- a/drivers/net/ethernet/intel/e1000e/phy.c
>>>> +++ b/drivers/net/ethernet/intel/e1000e/phy.c
>>>> @@ -203,7 +203,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
>>>>  	 * Increasing the time out as testing showed failures with
>>>>  	 * the lower time out
>>>>  	 */
>>>> -	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
>>>> +	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
>>> As we discussed (many threads) - AMT/ME systems not supported on Linux as properly. I do not think increasing polling iteration will solve the problem. Rather mask it.
>> I am aware of the status quo of no proper support on Intel ME.
>>> I prefer you check option to disable ME vi BIOS on your system.
>> We can't ask user to change the BIOS to accommodate Linux. So before a proper solution comes out, masking the problem is good enough for me.
>> Until then, I'll carry it as a downstream distro patch.
> What will you do with system that even after increasing polling time will run into HW error?

Hope we finally have proper ME support under Linux?

Kai-Heng

>> Kai-Heng
>>>>  		udelay(50);
>>>>  		mdic = er32(MDIC);
>>>>  		if (mdic & E1000_MDIC_READY)
>>> Thanks,
>>> Sasha
> Thanks,
> Sasha


^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [Intel-wired-lan] [PATCH v4] e1000e: Increase polling timeout on MDIC ready bit
  2020-09-29 15:08             ` Kai-Heng Feng
@ 2020-09-29 15:11               ` David Laight
  2020-09-29 15:12                 ` Kai-Heng Feng
  2020-09-30  6:54               ` Vitaly Lifshits
  1 sibling, 1 reply; 31+ messages in thread
From: David Laight @ 2020-09-29 15:11 UTC (permalink / raw)
  To: 'Kai-Heng Feng', Neftin, Sasha
  Cc: Jeff Kirsher, Andrew Lunn, open list:NETWORKING DRIVERS,
	open list, moderated list:INTEL ETHERNET DRIVERS, Jakub Kicinski,
	David S. Miller, Lifshits, Vitaly, Nguyen, Anthony L

> Hope we finally have proper ME support under Linux?

How about a way to disable it.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v4] e1000e: Increase polling timeout on MDIC ready bit
  2020-09-29 15:11               ` David Laight
@ 2020-09-29 15:12                 ` Kai-Heng Feng
  0 siblings, 0 replies; 31+ messages in thread
From: Kai-Heng Feng @ 2020-09-29 15:12 UTC (permalink / raw)
  To: David Laight
  Cc: Neftin, Sasha, Jeff Kirsher, Andrew Lunn,
	open list:NETWORKING DRIVERS, open list,
	moderated list:INTEL ETHERNET DRIVERS, Jakub Kicinski,
	David S. Miller, Lifshits, Vitaly, Nguyen, Anthony L



> On Sep 29, 2020, at 23:11, David Laight <David.Laight@ACULAB.COM> wrote:
> 
>> Hope we finally have proper ME support under Linux?
> 
> How about a way to disable it.

This will do, too :)

Kai-Heng

> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v4] e1000e: Increase polling timeout on MDIC ready bit
  2020-09-29 15:08             ` Kai-Heng Feng
  2020-09-29 15:11               ` David Laight
@ 2020-09-30  6:54               ` Vitaly Lifshits
  2020-10-05  6:23                 ` Kai-Heng Feng
  1 sibling, 1 reply; 31+ messages in thread
From: Vitaly Lifshits @ 2020-09-30  6:54 UTC (permalink / raw)
  To: Kai-Heng Feng, Neftin, Sasha
  Cc: Jeff Kirsher, Andrew Lunn, open list:NETWORKING DRIVERS,
	open list, moderated list:INTEL ETHERNET DRIVERS, Jakub Kicinski,
	David S. Miller, Nguyen, Anthony L

On 9/29/2020 18:08, Kai-Heng Feng wrote:

Hello Kai-Heng,
> 
> 
>> On Sep 29, 2020, at 21:46, Neftin, Sasha <sasha.neftin@intel.com> wrote:
>>
>> Hello Kai-Heng,
>> On 9/29/2020 16:31, Kai-Heng Feng wrote:
>>> Hi Sasha,
>>>> On Sep 29, 2020, at 21:08, Neftin, Sasha <sasha.neftin@intel.com> wrote:
>>>>
>>>> On 9/28/2020 11:36, Kai-Heng Feng wrote:
>>>>> We are seeing the following error after S3 resume:
>>>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>>>>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>>>>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>>>>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>>>>> ...
>>>>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>>>>> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
>>>>> increase polling iteration can resolve the issue.
>>>>> This patch only papers over the symptom, as we don't really know the
>>>>> root cause of the issue. The most possible culprit is Intel ME, which
>>>>> may do its own things that conflict with software.
>>>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>>>> ---
>>>>> v4:
>>>>>   - States that this patch just papers over the symptom.
>>>>> v3:
>>>>>   - Moving delay to end of loop doesn't save anytime, move it back.
>>>>>   - Point out this is quitely likely caused by Intel ME.
>>>>> v2:
>>>>>   - Increase polling iteration instead of powering down the phy.
>>>>>   drivers/net/ethernet/intel/e1000e/phy.c | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>> diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
>>>>> index e11c877595fb..e6d4acd90937 100644
>>>>> --- a/drivers/net/ethernet/intel/e1000e/phy.c
>>>>> +++ b/drivers/net/ethernet/intel/e1000e/phy.c
>>>>> @@ -203,7 +203,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
>>>>>   	 * Increasing the time out as testing showed failures with
>>>>>   	 * the lower time out
>>>>>   	 */
>>>>> -	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
>>>>> +	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
>>>> As we discussed (many threads) - AMT/ME systems not supported on Linux as properly. I do not think increasing polling iteration will solve the problem. Rather mask it.
>>> I am aware of the status quo of no proper support on Intel ME.
>>>> I prefer you check option to disable ME vi BIOS on your system.
>>> We can't ask user to change the BIOS to accommodate Linux. So before a proper solution comes out, masking the problem is good enough for me.
>>> Until then, I'll carry it as a downstream distro patch.
>> What will you do with system that even after increasing polling time will run into HW error?
> 
> Hope we finally have proper ME support under Linux?
> 
> Kai-Heng
> 
>>> Kai-Heng
>>>>>   		udelay(50);
>>>>>   		mdic = er32(MDIC);
>>>>>   		if (mdic & E1000_MDIC_READY)
>>>> Thanks,
>>>> Sasha
>> Thanks,
>> Sasha
> 

On which device ID/platform do you see the issue? What is the Firmware 
version on your platform? What is the ME firmware version that you have?

I am asking these questions, since I know there is supposed to be a fix 
in the firmware to many issues that are related to ME and device 
interoperability.

Thanks,

Vitaly

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-wired-lan] [PATCH v4] e1000e: Increase polling timeout on MDIC ready bit
  2020-09-30  6:54               ` Vitaly Lifshits
@ 2020-10-05  6:23                 ` Kai-Heng Feng
  0 siblings, 0 replies; 31+ messages in thread
From: Kai-Heng Feng @ 2020-10-05  6:23 UTC (permalink / raw)
  To: Vitaly Lifshits
  Cc: Neftin, Sasha, Jeff Kirsher, Andrew Lunn,
	open list:NETWORKING DRIVERS, open list,
	moderated list:INTEL ETHERNET DRIVERS, Jakub Kicinski,
	David S. Miller, Nguyen, Anthony L

Hi Vitaly,

> On Sep 30, 2020, at 14:54, Vitaly Lifshits <vitaly.lifshits@intel.com> wrote:
> 
> On 9/29/2020 18:08, Kai-Heng Feng wrote:
> 
> Hello Kai-Heng,
>>> On Sep 29, 2020, at 21:46, Neftin, Sasha <sasha.neftin@intel.com> wrote:
>>> 
>>> Hello Kai-Heng,
>>> On 9/29/2020 16:31, Kai-Heng Feng wrote:
>>>> Hi Sasha,
>>>>> On Sep 29, 2020, at 21:08, Neftin, Sasha <sasha.neftin@intel.com> wrote:
>>>>> 
>>>>> On 9/28/2020 11:36, Kai-Heng Feng wrote:
>>>>>> We are seeing the following error after S3 resume:
>>>>>> [  704.746874] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>>>> [  704.844232] e1000e 0000:00:1f.6 eno1: MDI Write did not complete
>>>>>> [  704.902817] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>>>> [  704.903075] e1000e 0000:00:1f.6 eno1: reading PHY page 769 (or 0x6020 shifted) reg 0x17
>>>>>> [  704.903281] e1000e 0000:00:1f.6 eno1: Setting page 0x6020
>>>>>> [  704.903486] e1000e 0000:00:1f.6 eno1: writing PHY page 769 (or 0x6020 shifted) reg 0x17
>>>>>> [  704.943155] e1000e 0000:00:1f.6 eno1: MDI Error
>>>>>> ...
>>>>>> [  705.108161] e1000e 0000:00:1f.6 eno1: Hardware Error
>>>>>> As Andrew Lunn pointed out, MDIO has nothing to do with phy, and indeed
>>>>>> increase polling iteration can resolve the issue.
>>>>>> This patch only papers over the symptom, as we don't really know the
>>>>>> root cause of the issue. The most possible culprit is Intel ME, which
>>>>>> may do its own things that conflict with software.
>>>>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
>>>>>> ---
>>>>>> v4:
>>>>>>  - States that this patch just papers over the symptom.
>>>>>> v3:
>>>>>>  - Moving delay to end of loop doesn't save anytime, move it back.
>>>>>>  - Point out this is quitely likely caused by Intel ME.
>>>>>> v2:
>>>>>>  - Increase polling iteration instead of powering down the phy.
>>>>>>  drivers/net/ethernet/intel/e1000e/phy.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>> diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
>>>>>> index e11c877595fb..e6d4acd90937 100644
>>>>>> --- a/drivers/net/ethernet/intel/e1000e/phy.c
>>>>>> +++ b/drivers/net/ethernet/intel/e1000e/phy.c
>>>>>> @@ -203,7 +203,7 @@ s32 e1000e_write_phy_reg_mdic(struct e1000_hw *hw, u32 offset, u16 data)
>>>>>>  	 * Increasing the time out as testing showed failures with
>>>>>>  	 * the lower time out
>>>>>>  	 */
>>>>>> -	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 3); i++) {
>>>>>> +	for (i = 0; i < (E1000_GEN_POLL_TIMEOUT * 10); i++) {
>>>>> As we discussed (many threads) - AMT/ME systems not supported on Linux as properly. I do not think increasing polling iteration will solve the problem. Rather mask it.
>>>> I am aware of the status quo of no proper support on Intel ME.
>>>>> I prefer you check option to disable ME vi BIOS on your system.
>>>> We can't ask user to change the BIOS to accommodate Linux. So before a proper solution comes out, masking the problem is good enough for me.
>>>> Until then, I'll carry it as a downstream distro patch.
>>> What will you do with system that even after increasing polling time will run into HW error?
>> Hope we finally have proper ME support under Linux?
>> Kai-Heng
>>>> Kai-Heng
>>>>>>  		udelay(50);
>>>>>>  		mdic = er32(MDIC);
>>>>>>  		if (mdic & E1000_MDIC_READY)
>>>>> Thanks,
>>>>> Sasha
>>> Thanks,
>>> Sasha
> 
> On which device ID/platform do you see the issue?

HP Z4 G4 Workstation,
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-LM [8086:15b7]

> What is the Firmware version on your platform?

BIOS version: P61 v02.59


> What is the ME firmware version that you have?

ME version: 11.11.50.1422
ME mode: AMT disable

Kai-Heng

> 
> I am asking these questions, since I know there is supposed to be a fix in the firmware to many issues that are related to ME and device interoperability.
> 
> Thanks,
> 
> Vitaly


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-10-05  6:24 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-23  7:47 [PATCH] e1000e: Power cycle phy on PM resume Kai-Heng Feng
2020-09-23 12:17 ` Andrew Lunn
2020-09-23 14:44   ` Kai-Heng Feng
2020-09-23 15:37     ` Andrew Lunn
2020-09-24 12:50       ` Kai-Heng Feng
2020-09-23 13:28 ` [Intel-wired-lan] " Paul Menzel
2020-09-23 14:46   ` Kai-Heng Feng
2020-09-23 15:02     ` Paul Menzel
2020-09-23 19:28       ` Andrew Lunn
2020-09-24 13:02         ` Paul Menzel
2020-09-24 15:09 ` [PATCH v2] e1000e: Increase iteration on polling MDIC ready bit Kai-Heng Feng
2020-09-24 15:32   ` [Intel-wired-lan] " Paul Menzel
2020-09-24 16:04     ` Andrew Lunn
2020-09-24 15:53   ` Andrew Lunn
2020-09-24 16:04     ` Kai-Heng Feng
2020-09-25  8:50       ` David Laight
2020-09-25 13:29         ` Andrew Lunn
2020-09-26 10:08           ` David Laight
2020-09-24 16:45   ` [PATCH v3] " Kai-Heng Feng
2020-09-24 19:57     ` Andrew Lunn
2020-09-25  3:57       ` Kai-Heng Feng
2020-09-25  5:16     ` [Intel-wired-lan] " Paul Menzel
2020-09-28  8:36     ` [PATCH v4] e1000e: Increase polling timeout on " Kai-Heng Feng
2020-09-29 13:08       ` [Intel-wired-lan] " Neftin, Sasha
2020-09-29 13:31         ` Kai-Heng Feng
2020-09-29 13:46           ` Neftin, Sasha
2020-09-29 15:08             ` Kai-Heng Feng
2020-09-29 15:11               ` David Laight
2020-09-29 15:12                 ` Kai-Heng Feng
2020-09-30  6:54               ` Vitaly Lifshits
2020-10-05  6:23                 ` Kai-Heng Feng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).