netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] phy: added a PHY_BUSY state into phy_state_machine
@ 2019-07-07 22:32 kwangdo.yi
  2019-07-08  3:07 ` Florian Fainelli
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: kwangdo.yi @ 2019-07-07 22:32 UTC (permalink / raw)
  To: netdev; +Cc: kwangdo.yi

When mdio driver polling the phy state in the phy_state_machine,
sometimes it results in -ETIMEDOUT and link is down. But the phy
is still alive and just didn't meet the polling deadline. 
Closing the phy link in this case seems too radical. Failing to 
meet the deadline happens very rarely. When stress test runs for 
tens of hours with multiple target boards (Xilinx Zynq7000 with
marvell 88E1512 PHY, Xilinx custom emac IP), it happens. This 
patch gives another chance to the phy_state_machine when polling 
timeout happens. Only two consecutive failing the deadline is 
treated as the real phy halt and close the connection.


Signed-off-by: kwangdo.yi <kwangdo.yi@gmail.com>
---
 drivers/net/phy/phy.c | 6 ++++++
 include/linux/phy.h   | 1 +
 2 files changed, 7 insertions(+)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index e888542..9e8138b 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -919,7 +919,13 @@ void phy_state_machine(struct work_struct *work)
 		break;
 	case PHY_NOLINK:
 	case PHY_RUNNING:
+	case PHY_BUSY:
 		err = phy_check_link_status(phydev);
+		if (err == -ETIMEDOUT && old_state == PHY_RUNNING) {
+			phy->state = PHY_BUSY;
+			err = 0;
+
+		}
 		break;
 	case PHY_FORCING:
 		err = genphy_update_link(phydev);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 6424586..4a49401 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -313,6 +313,7 @@ enum phy_state {
 	PHY_RUNNING,
 	PHY_NOLINK,
 	PHY_FORCING,
+	PHY_BUSY,
 };
 
 /**
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] phy: added a PHY_BUSY state into phy_state_machine
  2019-07-07 22:32 [PATCH] phy: added a PHY_BUSY state into phy_state_machine kwangdo.yi
@ 2019-07-08  3:07 ` Florian Fainelli
  2019-07-08  6:03   ` Heiner Kallweit
  2019-07-09  3:16   ` kwangdo yi
  2019-07-08  4:42 ` Andrew Lunn
  2019-07-09  1:58 ` kbuild test robot
  2 siblings, 2 replies; 8+ messages in thread
From: Florian Fainelli @ 2019-07-08  3:07 UTC (permalink / raw)
  To: kwangdo.yi, netdev, Andrew Lunn, Heiner Kallweit

+Andrew, Heiner (please CC PHY library maintainers).

On 7/7/2019 3:32 PM, kwangdo.yi wrote:
> When mdio driver polling the phy state in the phy_state_machine,
> sometimes it results in -ETIMEDOUT and link is down. But the phy
> is still alive and just didn't meet the polling deadline. 
> Closing the phy link in this case seems too radical. Failing to 
> meet the deadline happens very rarely. When stress test runs for 
> tens of hours with multiple target boards (Xilinx Zynq7000 with
> marvell 88E1512 PHY, Xilinx custom emac IP), it happens. This 
> patch gives another chance to the phy_state_machine when polling 
> timeout happens. Only two consecutive failing the deadline is 
> treated as the real phy halt and close the connection.

How about simply increasing the MDIO polling timeout in the Xilinx EMAC
driver instead? Or if the PHY is where the timeout needs to be
increased, allow the PHY device drivers to advertise min/max timeouts
such that the MDIO bus layer can use that information?

> 
> 
> Signed-off-by: kwangdo.yi <kwangdo.yi@gmail.com>
> ---
>  drivers/net/phy/phy.c | 6 ++++++
>  include/linux/phy.h   | 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index e888542..9e8138b 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -919,7 +919,13 @@ void phy_state_machine(struct work_struct *work)
>  		break;
>  	case PHY_NOLINK:
>  	case PHY_RUNNING:
> +	case PHY_BUSY:
>  		err = phy_check_link_status(phydev);
> +		if (err == -ETIMEDOUT && old_state == PHY_RUNNING) {
> +			phy->state = PHY_BUSY;
> +			err = 0;
> +
> +		}
>  		break;
>  	case PHY_FORCING:
>  		err = genphy_update_link(phydev);
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 6424586..4a49401 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -313,6 +313,7 @@ enum phy_state {
>  	PHY_RUNNING,
>  	PHY_NOLINK,
>  	PHY_FORCING,
> +	PHY_BUSY,
>  };
>  
>  /**
> 

-- 
Florian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] phy: added a PHY_BUSY state into phy_state_machine
  2019-07-07 22:32 [PATCH] phy: added a PHY_BUSY state into phy_state_machine kwangdo.yi
  2019-07-08  3:07 ` Florian Fainelli
@ 2019-07-08  4:42 ` Andrew Lunn
  2019-07-09  1:58 ` kbuild test robot
  2 siblings, 0 replies; 8+ messages in thread
From: Andrew Lunn @ 2019-07-08  4:42 UTC (permalink / raw)
  To: kwangdo.yi; +Cc: netdev

On Sun, Jul 07, 2019 at 06:32:12PM -0400, kwangdo.yi wrote:
> When mdio driver polling the phy state in the phy_state_machine,
> sometimes it results in -ETIMEDOUT and link is down. But the phy
> is still alive and just didn't meet the polling deadline. 
> Closing the phy link in this case seems too radical. Failing to 
> meet the deadline happens very rarely. When stress test runs for 
> tens of hours with multiple target boards (Xilinx Zynq7000 with
> marvell 88E1512 PHY, Xilinx custom emac IP), it happens. This 
> patch gives another chance to the phy_state_machine when polling 
> timeout happens. Only two consecutive failing the deadline is 
> treated as the real phy halt and close the connection.

Hi Kwangdo

I agree with Florian here. This does not seem like a PHY problem. It
is an MDIO bus problem. ETIMEDOUT is only returned from
xemaclite_mdio_wait().

What value are using for HZ? If you have 1000, jiffies + 2 could well
be too short.

   Andrew

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] phy: added a PHY_BUSY state into phy_state_machine
  2019-07-08  3:07 ` Florian Fainelli
@ 2019-07-08  6:03   ` Heiner Kallweit
  2019-07-09  3:16   ` kwangdo yi
  1 sibling, 0 replies; 8+ messages in thread
From: Heiner Kallweit @ 2019-07-08  6:03 UTC (permalink / raw)
  To: Florian Fainelli, kwangdo.yi, netdev, Andrew Lunn

On 08.07.2019 05:07, Florian Fainelli wrote:
> +Andrew, Heiner (please CC PHY library maintainers).
> 
> On 7/7/2019 3:32 PM, kwangdo.yi wrote:
>> When mdio driver polling the phy state in the phy_state_machine,
>> sometimes it results in -ETIMEDOUT and link is down. But the phy
>> is still alive and just didn't meet the polling deadline. 
>> Closing the phy link in this case seems too radical. Failing to 
>> meet the deadline happens very rarely. When stress test runs for 
>> tens of hours with multiple target boards (Xilinx Zynq7000 with
>> marvell 88E1512 PHY, Xilinx custom emac IP), it happens. This 
>> patch gives another chance to the phy_state_machine when polling 
>> timeout happens. Only two consecutive failing the deadline is 
>> treated as the real phy halt and close the connection.
> 
In addition to what Florian said already there's at least one
issue apart from the quite hacky approach in general.

Let's say we are in interrupt mode and the timeout happens when
reading the PHY status after a link-down interrupt. When ignoring
the error we miss the transition and phylib will report a wrong
link status.

I also would prefer to first check for the root cause and try to
fix it, before adding hacks to upper layers for ignoring errors.

> How about simply increasing the MDIO polling timeout in the Xilinx EMAC
> driver instead? Or if the PHY is where the timeout needs to be
> increased, allow the PHY device drivers to advertise min/max timeouts
> such that the MDIO bus layer can use that information?
> 
>>
>>
>> Signed-off-by: kwangdo.yi <kwangdo.yi@gmail.com>
>> ---
>>  drivers/net/phy/phy.c | 6 ++++++
>>  include/linux/phy.h   | 1 +
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index e888542..9e8138b 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -919,7 +919,13 @@ void phy_state_machine(struct work_struct *work)
>>  		break;
>>  	case PHY_NOLINK:
>>  	case PHY_RUNNING:
>> +	case PHY_BUSY:
>>  		err = phy_check_link_status(phydev);
>> +		if (err == -ETIMEDOUT && old_state == PHY_RUNNING) {
>> +			phy->state = PHY_BUSY;
>> +			err = 0;
>> +
>> +		}
>>  		break;
>>  	case PHY_FORCING:
>>  		err = genphy_update_link(phydev);
>> diff --git a/include/linux/phy.h b/include/linux/phy.h
>> index 6424586..4a49401 100644
>> --- a/include/linux/phy.h
>> +++ b/include/linux/phy.h
>> @@ -313,6 +313,7 @@ enum phy_state {
>>  	PHY_RUNNING,
>>  	PHY_NOLINK,
>>  	PHY_FORCING,
>> +	PHY_BUSY,
>>  };
>>  
>>  /**
>>
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] phy: added a PHY_BUSY state into phy_state_machine
  2019-07-07 22:32 [PATCH] phy: added a PHY_BUSY state into phy_state_machine kwangdo.yi
  2019-07-08  3:07 ` Florian Fainelli
  2019-07-08  4:42 ` Andrew Lunn
@ 2019-07-09  1:58 ` kbuild test robot
  2 siblings, 0 replies; 8+ messages in thread
From: kbuild test robot @ 2019-07-09  1:58 UTC (permalink / raw)
  To: kwangdo.yi; +Cc: kbuild-all, netdev, kwangdo.yi

[-- Attachment #1: Type: text/plain, Size: 4357 bytes --]

Hi "kwangdo.yi",

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.2]
[cannot apply to next-20190708]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/kwangdo-yi/phy-added-a-PHY_BUSY-state-into-phy_state_machine/20190709-075228
config: arm-omap2plus_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (GCC) 7.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.4.0 make.cross ARCH=arm 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   drivers/net/phy/phy.c: In function 'phy_state_to_str':
>> drivers/net/phy/phy.c:38:2: warning: enumeration value 'PHY_BUSY' not handled in switch [-Wswitch]
     switch (st) {
     ^~~~~~
   drivers/net/phy/phy.c: In function 'phy_state_machine':
>> drivers/net/phy/phy.c:925:4: error: 'phy' undeclared (first use in this function)
       phy->state = PHY_BUSY;
       ^~~
   drivers/net/phy/phy.c:925:4: note: each undeclared identifier is reported only once for each function it appears in

vim +/phy +925 drivers/net/phy/phy.c

   894	
   895	/**
   896	 * phy_state_machine - Handle the state machine
   897	 * @work: work_struct that describes the work to be done
   898	 */
   899	void phy_state_machine(struct work_struct *work)
   900	{
   901		struct delayed_work *dwork = to_delayed_work(work);
   902		struct phy_device *phydev =
   903				container_of(dwork, struct phy_device, state_queue);
   904		bool needs_aneg = false, do_suspend = false;
   905		enum phy_state old_state;
   906		int err = 0;
   907	
   908		mutex_lock(&phydev->lock);
   909	
   910		old_state = phydev->state;
   911	
   912		switch (phydev->state) {
   913		case PHY_DOWN:
   914		case PHY_READY:
   915			break;
   916		case PHY_UP:
   917			needs_aneg = true;
   918	
   919			break;
   920		case PHY_NOLINK:
   921		case PHY_RUNNING:
   922		case PHY_BUSY:
   923			err = phy_check_link_status(phydev);
   924			if (err == -ETIMEDOUT && old_state == PHY_RUNNING) {
 > 925				phy->state = PHY_BUSY;
   926				err = 0;
   927	
   928			}
   929			break;
   930		case PHY_FORCING:
   931			err = genphy_update_link(phydev);
   932			if (err)
   933				break;
   934	
   935			if (phydev->link) {
   936				phydev->state = PHY_RUNNING;
   937				phy_link_up(phydev);
   938			} else {
   939				if (0 == phydev->link_timeout--)
   940					needs_aneg = true;
   941				phy_link_down(phydev, false);
   942			}
   943			break;
   944		case PHY_HALTED:
   945			if (phydev->link) {
   946				phydev->link = 0;
   947				phy_link_down(phydev, true);
   948				do_suspend = true;
   949			}
   950			break;
   951		}
   952	
   953		mutex_unlock(&phydev->lock);
   954	
   955		if (needs_aneg)
   956			err = phy_start_aneg(phydev);
   957		else if (do_suspend)
   958			phy_suspend(phydev);
   959	
   960		if (err < 0)
   961			phy_error(phydev);
   962	
   963		if (old_state != phydev->state) {
   964			phydev_dbg(phydev, "PHY state change %s -> %s\n",
   965				   phy_state_to_str(old_state),
   966				   phy_state_to_str(phydev->state));
   967			if (phydev->drv && phydev->drv->link_change_notify)
   968				phydev->drv->link_change_notify(phydev);
   969		}
   970	
   971		/* Only re-schedule a PHY state machine change if we are polling the
   972		 * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
   973		 * between states from phy_mac_interrupt().
   974		 *
   975		 * In state PHY_HALTED the PHY gets suspended, so rescheduling the
   976		 * state machine would be pointless and possibly error prone when
   977		 * called from phy_disconnect() synchronously.
   978		 */
   979		mutex_lock(&phydev->lock);
   980		if (phy_polling_mode(phydev) && phy_is_started(phydev))
   981			phy_queue_state_machine(phydev, PHY_STATE_TIME);
   982		mutex_unlock(&phydev->lock);
   983	}
   984	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36418 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] phy: added a PHY_BUSY state into phy_state_machine
  2019-07-08  3:07 ` Florian Fainelli
  2019-07-08  6:03   ` Heiner Kallweit
@ 2019-07-09  3:16   ` kwangdo yi
  2019-07-09  3:22     ` Andrew Lunn
  1 sibling, 1 reply; 8+ messages in thread
From: kwangdo yi @ 2019-07-09  3:16 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, Andrew Lunn, Heiner Kallweit

I simply fixed this issue by increasing the polling time from 20 msec to
60 msec in Xilinx EMAC driver. But the state machine would be in a
better shape if it is capable of handling sub system driver's fake failure.
PHY device driver could advertising the min/max timeouts for its subsystem,
but still some vendor's EMAC driver fails to meet the deadline if this value
is not set properly in PHY driver.

On Sun, Jul 7, 2019 at 11:07 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> +Andrew, Heiner (please CC PHY library maintainers).
>
> On 7/7/2019 3:32 PM, kwangdo.yi wrote:
> > When mdio driver polling the phy state in the phy_state_machine,
> > sometimes it results in -ETIMEDOUT and link is down. But the phy
> > is still alive and just didn't meet the polling deadline.
> > Closing the phy link in this case seems too radical. Failing to
> > meet the deadline happens very rarely. When stress test runs for
> > tens of hours with multiple target boards (Xilinx Zynq7000 with
> > marvell 88E1512 PHY, Xilinx custom emac IP), it happens. This
> > patch gives another chance to the phy_state_machine when polling
> > timeout happens. Only two consecutive failing the deadline is
> > treated as the real phy halt and close the connection.
>
> How about simply increasing the MDIO polling timeout in the Xilinx EMAC
> driver instead? Or if the PHY is where the timeout needs to be
> increased, allow the PHY device drivers to advertise min/max timeouts
> such that the MDIO bus layer can use that information?
>
> >
> >
> > Signed-off-by: kwangdo.yi <kwangdo.yi@gmail.com>
> > ---
> >  drivers/net/phy/phy.c | 6 ++++++
> >  include/linux/phy.h   | 1 +
> >  2 files changed, 7 insertions(+)
> >
> > diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> > index e888542..9e8138b 100644
> > --- a/drivers/net/phy/phy.c
> > +++ b/drivers/net/phy/phy.c
> > @@ -919,7 +919,13 @@ void phy_state_machine(struct work_struct *work)
> >               break;
> >       case PHY_NOLINK:
> >       case PHY_RUNNING:
> > +     case PHY_BUSY:
> >               err = phy_check_link_status(phydev);
> > +             if (err == -ETIMEDOUT && old_state == PHY_RUNNING) {
> > +                     phy->state = PHY_BUSY;
> > +                     err = 0;
> > +
> > +             }
> >               break;
> >       case PHY_FORCING:
> >               err = genphy_update_link(phydev);
> > diff --git a/include/linux/phy.h b/include/linux/phy.h
> > index 6424586..4a49401 100644
> > --- a/include/linux/phy.h
> > +++ b/include/linux/phy.h
> > @@ -313,6 +313,7 @@ enum phy_state {
> >       PHY_RUNNING,
> >       PHY_NOLINK,
> >       PHY_FORCING,
> > +     PHY_BUSY,
> >  };
> >
> >  /**
> >
>
> --
> Florian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] phy: added a PHY_BUSY state into phy_state_machine
  2019-07-09  3:16   ` kwangdo yi
@ 2019-07-09  3:22     ` Andrew Lunn
  2019-07-09  3:31       ` kwangdo yi
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Lunn @ 2019-07-09  3:22 UTC (permalink / raw)
  To: kwangdo yi; +Cc: Florian Fainelli, netdev, Heiner Kallweit

On Mon, Jul 08, 2019 at 11:16:02PM -0400, kwangdo yi wrote:
> I simply fixed this issue by increasing the polling time from 20 msec to
> 60 msec in Xilinx EMAC driver. But the state machine would be in a
> better shape if it is capable of handling sub system driver's fake failure.
> PHY device driver could advertising the min/max timeouts for its subsystem,
> but still some vendor's EMAC driver fails to meet the deadline if this value
> is not set properly in PHY driver.

Hi Kwangdo

That is not how MDIO works. The PHY has two clock cycles to prepare
its response to any request. There is no min/max. This was always an
MDIO bus driver problem, not a PHY problem.

    Andrew

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] phy: added a PHY_BUSY state into phy_state_machine
  2019-07-09  3:22     ` Andrew Lunn
@ 2019-07-09  3:31       ` kwangdo yi
  0 siblings, 0 replies; 8+ messages in thread
From: kwangdo yi @ 2019-07-09  3:31 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, netdev, Heiner Kallweit

On Mon, Jul 8, 2019 at 11:22 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Mon, Jul 08, 2019 at 11:16:02PM -0400, kwangdo yi wrote:
> > I simply fixed this issue by increasing the polling time from 20 msec to
> > 60 msec in Xilinx EMAC driver. But the state machine would be in a
> > better shape if it is capable of handling sub system driver's fake failure.
> > PHY device driver could advertising the min/max timeouts for its subsystem,
> > but still some vendor's EMAC driver fails to meet the deadline if this value
> > is not set properly in PHY driver.
>
> Hi Kwangdo
>
> That is not how MDIO works. The PHY has two clock cycles to prepare
> its response to any request. There is no min/max. This was always an
> MDIO bus driver problem, not a PHY problem.
>
>     Andrew

Hi Andrew,

I don't think PHY driver has a problem, nor EMAC driver has, but if PHY driver
is capable of handling EMAC driver's fake failure, the PHY driver would be in
a better fit. That's the intention of this patch.

But, it seems this timeout needs to be handled in each MDIO driver properly.
Thanks.

Regards,

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-07-09  3:32 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-07 22:32 [PATCH] phy: added a PHY_BUSY state into phy_state_machine kwangdo.yi
2019-07-08  3:07 ` Florian Fainelli
2019-07-08  6:03   ` Heiner Kallweit
2019-07-09  3:16   ` kwangdo yi
2019-07-09  3:22     ` Andrew Lunn
2019-07-09  3:31       ` kwangdo yi
2019-07-08  4:42 ` Andrew Lunn
2019-07-09  1:58 ` kbuild test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).