Potential issue with f5e64032a799 "net: phy: fix resume handling"

All of lore.kernel.org
 help / color / mirror / Atom feed

* Potential issue with f5e64032a799 "net: phy: fix resume handling"
@ 2018-02-03 16:41 Heiner Kallweit
  2018-02-03 20:17 ` Andrew Lunn
  0 siblings, 1 reply; 12+ messages in thread
From: Heiner Kallweit @ 2018-02-03 16:41 UTC (permalink / raw)
  To: Russell King; +Cc: Andrew Lunn, netdev

This commit forces callers of phy_resume() and phy_suspend() to hold
mutex phydev->lock. This was done for calls to phy_resume() and
phy_suspend() in phylib, however there are more callers in network
drivers. I'd assume that these other calls issue a warning now
because of the lock not being held.
So is there something I miss or would this have to be fixed?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"
  2018-02-03 16:41 Potential issue with f5e64032a799 "net: phy: fix resume handling" Heiner Kallweit
@ 2018-02-03 20:17 ` Andrew Lunn
  2018-02-03 23:58   ` Heiner Kallweit
  2018-02-25 13:00   ` Potential issue with f5e64032a799 "net: phy: fix resume handling" Heiner Kallweit
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Lunn @ 2018-02-03 20:17 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Russell King, netdev

On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
> This commit forces callers of phy_resume() and phy_suspend() to hold
> mutex phydev->lock. This was done for calls to phy_resume() and
> phy_suspend() in phylib, however there are more callers in network
> drivers. I'd assume that these other calls issue a warning now
> because of the lock not being held.
> So is there something I miss or would this have to be fixed?

Hi Heiner

This is a good point.

Yes, it looks like some fixes are needed. But what exactly?

The phy state machine will suspend and resume the phy is you call
phy_stop() and phy_start() in the MAC suspend and resume functions.

A few examples:

tc35815_suspend(), ravb_suspend() via ravb_close(), sh_eth_suspend()
via sh_eth_close(), fec_suspend(), mpc52xx_fec_of_suspend() via
mpc52xx_fec_close(), ucc_geth_suspend(), etc...

So i suspect those drivers which call phy_suspend()/phy_resume()
should really be modified to call phy_stop()/phy_start().

hns_nic_config_phy_loopback() is just funky, and probably needs the
help of the hns guys to fix.

dsa_slave_suspend() already does a phy_stop(), so the phy_suspend()
can be removed.

The comments in lpc_eth_open() suggest the phy_resume() is needed, so
locks should be added. socfpga_dwmac_resume() seems to be the same.

    Andrew

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"
  2018-02-03 20:17 ` Andrew Lunn
@ 2018-02-03 23:58   ` Heiner Kallweit
  2018-02-04  2:48     ` Florian Fainelli
  2018-02-25 13:00   ` Potential issue with f5e64032a799 "net: phy: fix resume handling" Heiner Kallweit
  1 sibling, 1 reply; 12+ messages in thread
From: Heiner Kallweit @ 2018-02-03 23:58 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Russell King, netdev

Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
>> This commit forces callers of phy_resume() and phy_suspend() to hold
>> mutex phydev->lock. This was done for calls to phy_resume() and
>> phy_suspend() in phylib, however there are more callers in network
>> drivers. I'd assume that these other calls issue a warning now
>> because of the lock not being held.
>> So is there something I miss or would this have to be fixed?
> 
> Hi Heiner
> 
> This is a good point.
> 
> Yes, it looks like some fixes are needed. But what exactly?
> 
> The phy state machine will suspend and resume the phy is you call
> phy_stop() and phy_start() in the MAC suspend and resume functions.
> 
AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
to PHY_HALTED and (at least if we're not in polling mode) doesn't
call phy_suspend(). Maybe a call to phy_trigger_machine() is
needed like in phy_start() ? Then the state machine would call
phy_suspend(), provided the link is still up.

However, if the link is down already (due to whatever calls
around phy_stop() in the driver) then phy_suspend() wouldn't be
called.

Heiner

> A few examples:
> 
> tc35815_suspend(), ravb_suspend() via ravb_close(), sh_eth_suspend()
> via sh_eth_close(), fec_suspend(), mpc52xx_fec_of_suspend() via
> mpc52xx_fec_close(), ucc_geth_suspend(), etc...
> 
> So i suspect those drivers which call phy_suspend()/phy_resume()
> should really be modified to call phy_stop()/phy_start().
> 
> hns_nic_config_phy_loopback() is just funky, and probably needs the
> help of the hns guys to fix.
> 
> dsa_slave_suspend() already does a phy_stop(), so the phy_suspend()
> can be removed.
> 
> The comments in lpc_eth_open() suggest the phy_resume() is needed, so
> locks should be added. socfpga_dwmac_resume() seems to be the same.
> 
>     Andrew
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"
  2018-02-03 23:58   ` Heiner Kallweit
@ 2018-02-04  2:48     ` Florian Fainelli
  2018-02-05 21:48       ` Heiner Kallweit
  2018-02-07 20:56       ` handling of phy_stop() and phy_stop_machine() in phylib Heiner Kallweit
  0 siblings, 2 replies; 12+ messages in thread
From: Florian Fainelli @ 2018-02-04  2:48 UTC (permalink / raw)
  To: Heiner Kallweit, Andrew Lunn; +Cc: Russell King, netdev



On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
>>> This commit forces callers of phy_resume() and phy_suspend() to hold
>>> mutex phydev->lock. This was done for calls to phy_resume() and
>>> phy_suspend() in phylib, however there are more callers in network
>>> drivers. I'd assume that these other calls issue a warning now
>>> because of the lock not being held.
>>> So is there something I miss or would this have to be fixed?
>>
>> Hi Heiner
>>
>> This is a good point.
>>
>> Yes, it looks like some fixes are needed. But what exactly?
>>
>> The phy state machine will suspend and resume the phy is you call
>> phy_stop() and phy_start() in the MAC suspend and resume functions.
>>
> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
> to PHY_HALTED and (at least if we're not in polling mode) doesn't
> call phy_suspend(). Maybe a call to phy_trigger_machine() is
> needed like in phy_start() ? Then the state machine would call
> phy_suspend(), provided the link is still up.

Right, phy_stop() merely just moves the state machine to PHY_HALTED and
this is actually a great source of problems which I tried to address here:

https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html

because phy_stop() is not a synchronous call, so when it returns the
state machine might still be running (it can take up to a 1 HZ depending
on when you called phy_stop()) and so if you took that as a
synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
you will likely see problems. phy_stop_machine() would provide that
synchronization point, but is not currently exported, despite being a
global symbol. This patch series above is all well and good, except that
Geert reported issues with suspend/resume interactions which I have not
been able to track down.

We should most definitively try to consolidate the different PHY
suspend/resume within the Ethernet MAC suspend/resume implementation and
document exactly what the appropriate behavior must be under the
following circumstances:

- when to call phy_stop() + phy_stop_machine()
- when to call phy_suspend() (if the network interface does do not WoL)
- when to call phy_resume() (if needed, actually, it usually is not)
- when to call phy_start()

I don't unfortunately have the time to code this myself at the moment,
but I will happily review patches if you have the opportunity to do so.

> 
> However, if the link is down already (due to whatever calls
> around phy_stop() in the driver) then phy_suspend() wouldn't be
> called.

Correct, there is an implicit assumption that when the link is down,
there is an opportunity for the Ethernet MAC driver to put things in low
power, and the PHY itself, should be in a lower power mode where only
link/energy detection might be utilizing power. At least this is the theory.

> 
> Heiner
> 
>> A few examples:
>>
>> tc35815_suspend(), ravb_suspend() via ravb_close(), sh_eth_suspend()
>> via sh_eth_close(), fec_suspend(), mpc52xx_fec_of_suspend() via
>> mpc52xx_fec_close(), ucc_geth_suspend(), etc...
>>
>> So i suspect those drivers which call phy_suspend()/phy_resume()
>> should really be modified to call phy_stop()/phy_start().
>>
>> hns_nic_config_phy_loopback() is just funky, and probably needs the
>> help of the hns guys to fix.
>>
>> dsa_slave_suspend() already does a phy_stop(), so the phy_suspend()
>> can be removed.
>>
>> The comments in lpc_eth_open() suggest the phy_resume() is needed, so
>> locks should be added. socfpga_dwmac_resume() seems to be the same.
>>
>>     Andrew
>>
> 

-- 
Florian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"
  2018-02-04  2:48     ` Florian Fainelli
@ 2018-02-05 21:48       ` Heiner Kallweit
  2018-02-06 11:00         ` Russell King - ARM Linux
  2018-02-07 20:56       ` handling of phy_stop() and phy_stop_machine() in phylib Heiner Kallweit
  1 sibling, 1 reply; 12+ messages in thread
From: Heiner Kallweit @ 2018-02-05 21:48 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn; +Cc: Russell King, netdev

Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
> 
> 
> On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
>> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
>>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
>>>> This commit forces callers of phy_resume() and phy_suspend() to hold
>>>> mutex phydev->lock. This was done for calls to phy_resume() and
>>>> phy_suspend() in phylib, however there are more callers in network
>>>> drivers. I'd assume that these other calls issue a warning now
>>>> because of the lock not being held.
>>>> So is there something I miss or would this have to be fixed?
>>>
>>> Hi Heiner
>>>
>>> This is a good point.
>>>
>>> Yes, it looks like some fixes are needed. But what exactly?
>>>
>>> The phy state machine will suspend and resume the phy is you call
>>> phy_stop() and phy_start() in the MAC suspend and resume functions.
>>>
>> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
>> to PHY_HALTED and (at least if we're not in polling mode) doesn't
>> call phy_suspend(). Maybe a call to phy_trigger_machine() is
>> needed like in phy_start() ? Then the state machine would call
>> phy_suspend(), provided the link is still up.
> 
> Right, phy_stop() merely just moves the state machine to PHY_HALTED and
> this is actually a great source of problems which I tried to address here:
> 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
> 
> because phy_stop() is not a synchronous call, so when it returns the
> state machine might still be running (it can take up to a 1 HZ depending
> on when you called phy_stop()) and so if you took that as a
> synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
> you will likely see problems. phy_stop_machine() would provide that
> synchronization point, but is not currently exported, despite being a
> global symbol. This patch series above is all well and good, except that
> Geert reported issues with suspend/resume interactions which I have not
> been able to track down.
> 
> We should most definitively try to consolidate the different PHY
> suspend/resume within the Ethernet MAC suspend/resume implementation and
> document exactly what the appropriate behavior must be under the
> following circumstances:
> 
> - when to call phy_stop() + phy_stop_machine()
> - when to call phy_suspend() (if the network interface does do not WoL)
> - when to call phy_resume() (if needed, actually, it usually is not)
> - when to call phy_start()
> 

I think phy_start() / phy_start_machine() / phy_start_interrupts()
belong together and we may call the latter two functions from phy_start().
Same for stop.

This would mean:
- Remove call to phy_start_interrupts() from phy_connect_direct()
- Call phy_start_machine() and phy_start_interrupts() from phy_start()
- mdio_bus_phy_suspend() calls phy_stop()
Same for stop, plus: phy_error() calls phy_stop().

In this setup a second call to phy_stop() wouldn't hurt because state
is PHY_HALTED already and phy_stop() is a no-op.

A functional change would be that interrupts are disabled during system
suspend (except WoL because we don't suspend the PHY is this case). 

These are first thoughts and therefore it's fine if you totally disagree ..
I didn't test this yet, it's only a "Gedankenexperiment" so far.

When talking about suspend/resume I think we talked about system suspend /
resume. However I think we need to consider also runtime pm.
If a link is down the network driver may decide to runtime-suspend the PHY
(power it down). In case of runtime pm I'd say we need to keep irq and
workqueue active to be able to react if a cable is plugged in and the PHY
wakes up automatically and establishes a link.

Heiner


> I don't unfortunately have the time to code this myself at the moment,
> but I will happily review patches if you have the opportunity to do so.
> 
>>
>> However, if the link is down already (due to whatever calls
>> around phy_stop() in the driver) then phy_suspend() wouldn't be
>> called.
> 
> Correct, there is an implicit assumption that when the link is down,
> there is an opportunity for the Ethernet MAC driver to put things in low
> power, and the PHY itself, should be in a lower power mode where only
> link/energy detection might be utilizing power. At least this is the theory.
> 
>>
>> Heiner
>>
>>> A few examples:
>>>
>>> tc35815_suspend(), ravb_suspend() via ravb_close(), sh_eth_suspend()
>>> via sh_eth_close(), fec_suspend(), mpc52xx_fec_of_suspend() via
>>> mpc52xx_fec_close(), ucc_geth_suspend(), etc...
>>>
>>> So i suspect those drivers which call phy_suspend()/phy_resume()
>>> should really be modified to call phy_stop()/phy_start().
>>>
>>> hns_nic_config_phy_loopback() is just funky, and probably needs the
>>> help of the hns guys to fix.
>>>
>>> dsa_slave_suspend() already does a phy_stop(), so the phy_suspend()
>>> can be removed.
>>>
>>> The comments in lpc_eth_open() suggest the phy_resume() is needed, so
>>> locks should be added. socfpga_dwmac_resume() seems to be the same.
>>>
>>>     Andrew
>>>
>>
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"
  2018-02-05 21:48       ` Heiner Kallweit
@ 2018-02-06 11:00         ` Russell King - ARM Linux
  2018-02-06 12:55           ` Andrew Lunn
  0 siblings, 1 reply; 12+ messages in thread
From: Russell King - ARM Linux @ 2018-02-06 11:00 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Florian Fainelli, Andrew Lunn, netdev

On Mon, Feb 05, 2018 at 10:48:55PM +0100, Heiner Kallweit wrote:
> Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
> > 
> > 
> > On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
> >> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
> >>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
> >>>> This commit forces callers of phy_resume() and phy_suspend() to hold
> >>>> mutex phydev->lock. This was done for calls to phy_resume() and
> >>>> phy_suspend() in phylib, however there are more callers in network
> >>>> drivers. I'd assume that these other calls issue a warning now
> >>>> because of the lock not being held.
> >>>> So is there something I miss or would this have to be fixed?
> >>>
> >>> Hi Heiner
> >>>
> >>> This is a good point.
> >>>
> >>> Yes, it looks like some fixes are needed. But what exactly?
> >>>
> >>> The phy state machine will suspend and resume the phy is you call
> >>> phy_stop() and phy_start() in the MAC suspend and resume functions.
> >>>
> >> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
> >> to PHY_HALTED and (at least if we're not in polling mode) doesn't
> >> call phy_suspend(). Maybe a call to phy_trigger_machine() is
> >> needed like in phy_start() ? Then the state machine would call
> >> phy_suspend(), provided the link is still up.
> > 
> > Right, phy_stop() merely just moves the state machine to PHY_HALTED and
> > this is actually a great source of problems which I tried to address here:
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
> > 
> > because phy_stop() is not a synchronous call, so when it returns the
> > state machine might still be running (it can take up to a 1 HZ depending
> > on when you called phy_stop()) and so if you took that as a
> > synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
> > you will likely see problems. phy_stop_machine() would provide that
> > synchronization point, but is not currently exported, despite being a
> > global symbol. This patch series above is all well and good, except that
> > Geert reported issues with suspend/resume interactions which I have not
> > been able to track down.
> > 
> > We should most definitively try to consolidate the different PHY
> > suspend/resume within the Ethernet MAC suspend/resume implementation and
> > document exactly what the appropriate behavior must be under the
> > following circumstances:
> > 
> > - when to call phy_stop() + phy_stop_machine()
> > - when to call phy_suspend() (if the network interface does do not WoL)
> > - when to call phy_resume() (if needed, actually, it usually is not)
> > - when to call phy_start()
> > 
> 
> I think phy_start() / phy_start_machine() / phy_start_interrupts()
> belong together and we may call the latter two functions from phy_start().
> Same for stop.
> 
> This would mean:
> - Remove call to phy_start_interrupts() from phy_connect_direct()
> - Call phy_start_machine() and phy_start_interrupts() from phy_start()
> - mdio_bus_phy_suspend() calls phy_stop()
> Same for stop, plus: phy_error() calls phy_stop().
> 
> In this setup a second call to phy_stop() wouldn't hurt because state
> is PHY_HALTED already and phy_stop() is a no-op.
> 
> A functional change would be that interrupts are disabled during system
> suspend (except WoL because we don't suspend the PHY is this case). 
> 
> These are first thoughts and therefore it's fine if you totally disagree ..
> I didn't test this yet, it's only a "Gedankenexperiment" so far.
> 
> When talking about suspend/resume I think we talked about system suspend /
> resume. However I think we need to consider also runtime pm.
> If a link is down the network driver may decide to runtime-suspend the PHY
> (power it down). In case of runtime pm I'd say we need to keep irq and
> workqueue active to be able to react if a cable is plugged in and the PHY
> wakes up automatically and establishes a link.

Maybe a better solution now would be to restore phy_resume()'s lock-
taking behaviour, and provide a lockless __phy_resume() which can be
used internally within phylib.  This means drivers using phy_resume()
would see no change.  Maybe something like (untested):

 drivers/net/phy/phy.c        |  2 +-
 drivers/net/phy/phy_device.c | 17 ++++++++++++++---
 include/linux/phy.h          |  1 +
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index f3313a129531..4574d02dce93 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -819,7 +819,7 @@ void phy_start(struct phy_device *phydev)
 		break;
 	case PHY_HALTED:
 		/* if phy was suspended, bring the physical link up again */
-		phy_resume(phydev);
+		__phy_resume(phydev);
 
 		/* make sure interrupts are re-enabled for the PHY */
 		if (phydev->irq != PHY_POLL) {
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index b13eed21c87d..ae0f9306bbdc 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -136,7 +136,7 @@ static int mdio_bus_phy_resume(struct device *dev)
 		goto no_resume;
 
 	mutex_lock(&phydev->lock);
-	ret = phy_resume(phydev);
+	ret = __phy_resume(phydev);
 	mutex_unlock(&phydev->lock);
 	if (ret < 0)
 		return ret;
@@ -1042,7 +1042,7 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
 		goto error;
 
 	mutex_lock(&phydev->lock);
-	phy_resume(phydev);
+	__phy_resume(phydev);
 	mutex_unlock(&phydev->lock);
 	phy_led_triggers_register(phydev);
 
@@ -1172,7 +1172,7 @@ int phy_suspend(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(phy_suspend);
 
-int phy_resume(struct phy_device *phydev)
+int __phy_resume(struct phy_device *phydev)
 {
 	struct phy_driver *phydrv = to_phy_driver(phydev->mdio.dev.driver);
 	int ret = 0;
@@ -1189,6 +1189,17 @@ int phy_resume(struct phy_device *phydev)
 
 	return ret;
 }
+
+int phy_resume(struct phy_device *phydev)
+{
+	int ret;
+
+	mutex_lock(&phydev->lock);
+	ret = __phy-resume(phydev);
+	mutex_unlock(&phydev->lock);
+
+	return ret;
+}
 EXPORT_SYMBOL(phy_resume);
 
 int phy_loopback(struct phy_device *phydev, bool enable)
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 5a0c3e53e7c2..8f82bd64f82d 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -923,6 +923,7 @@ static inline void phy_device_free(struct phy_device *phydev) { }
 void phy_device_remove(struct phy_device *phydev);
 int phy_init_hw(struct phy_device *phydev);
 int phy_suspend(struct phy_device *phydev);
+int __phy_resume(struct phy_device *phydev);
 int phy_resume(struct phy_device *phydev);
 int phy_loopback(struct phy_device *phydev, bool enable);
 struct phy_device *phy_attach(struct net_device *dev, const char *bus_id,


-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"
  2018-02-06 11:00         ` Russell King - ARM Linux
@ 2018-02-06 12:55           ` Andrew Lunn
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Lunn @ 2018-02-06 12:55 UTC (permalink / raw)
  To: Russell King - ARM Linux; +Cc: Heiner Kallweit, Florian Fainelli, netdev

> Maybe a better solution now would be to restore phy_resume()'s lock-
> taking behaviour, and provide a lockless __phy_resume() which can be
> used internally within phylib.  This means drivers using phy_resume()
> would see no change.  Maybe something like (untested):

Hi Russell

I was thinking the same, and have a pretty much identical untested
patch. This gets things 'fixed' and we can then later come back and
look at the overall architecture.

    Andrew

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: handling of phy_stop() and phy_stop_machine() in phylib
  2018-02-04  2:48     ` Florian Fainelli
  2018-02-05 21:48       ` Heiner Kallweit
@ 2018-02-07 20:56       ` Heiner Kallweit
  2018-02-07 21:13         ` Russell King - ARM Linux
  1 sibling, 1 reply; 12+ messages in thread
From: Heiner Kallweit @ 2018-02-07 20:56 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn; +Cc: Russell King, netdev

Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
> 
> 
> On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
>> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
>>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
>>>> This commit forces callers of phy_resume() and phy_suspend() to hold
>>>> mutex phydev->lock. This was done for calls to phy_resume() and
>>>> phy_suspend() in phylib, however there are more callers in network
>>>> drivers. I'd assume that these other calls issue a warning now
>>>> because of the lock not being held.
>>>> So is there something I miss or would this have to be fixed?
>>>
>>> Hi Heiner
>>>
>>> This is a good point.
>>>
>>> Yes, it looks like some fixes are needed. But what exactly?
>>>
>>> The phy state machine will suspend and resume the phy is you call
>>> phy_stop() and phy_start() in the MAC suspend and resume functions.
>>>
>> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
>> to PHY_HALTED and (at least if we're not in polling mode) doesn't
>> call phy_suspend(). Maybe a call to phy_trigger_machine() is
>> needed like in phy_start() ? Then the state machine would call
>> phy_suspend(), provided the link is still up.
> 
> Right, phy_stop() merely just moves the state machine to PHY_HALTED and
> this is actually a great source of problems which I tried to address here:
> 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
> 
> because phy_stop() is not a synchronous call, so when it returns the
> state machine might still be running (it can take up to a 1 HZ depending
> on when you called phy_stop()) and so if you took that as a
> synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
> you will likely see problems. phy_stop_machine() would provide that
> synchronization point, but is not currently exported, despite being a
> global symbol. This patch series above is all well and good, except that
> Geert reported issues with suspend/resume interactions which I have not
> been able to track down.
> 
To not confuse readers I changed the subject of the mail to reflect that
the discussion isn't about the original topic any longer.

It seems to me that (at least one) reason for the issues is that pm
callbacks for the phy device and the network device interfere.
phy device pm (mdio_bus_phy_suspend et al) feels responsible for dealing
with suspending/resuming the PHY, and the network driver pm callbacks
as well.

Maybe, if the network driver takes care, it should inform phy device pm
to do nothing. For this we could add a flag to phydev.mdio.flags.
If the network driver sets this flag then mdio_bus_phy_suspend()
and mdio_bus_phy_resume() could turn into no-ops.
Not totally sure yet about mdio_bus_phy_restore() ..


> We should most definitively try to consolidate the different PHY
> suspend/resume within the Ethernet MAC suspend/resume implementation and
> document exactly what the appropriate behavior must be under the
> following circumstances:
> 
> - when to call phy_stop() + phy_stop_machine()
> - when to call phy_suspend() (if the network interface does do not WoL)
> - when to call phy_resume() (if needed, actually, it usually is not)
> - when to call phy_start()
> 
> I don't unfortunately have the time to code this myself at the moment,
> but I will happily review patches if you have the opportunity to do so.
> 
>>
>> However, if the link is down already (due to whatever calls
>> around phy_stop() in the driver) then phy_suspend() wouldn't be
>> called.
> 
> Correct, there is an implicit assumption that when the link is down,
> there is an opportunity for the Ethernet MAC driver to put things in low
> power, and the PHY itself, should be in a lower power mode where only
> link/energy detection might be utilizing power. At least this is the theory.
> 
>>
>> Heiner
>>
>>> A few examples:
>>>
>>> tc35815_suspend(), ravb_suspend() via ravb_close(), sh_eth_suspend()
>>> via sh_eth_close(), fec_suspend(), mpc52xx_fec_of_suspend() via
>>> mpc52xx_fec_close(), ucc_geth_suspend(), etc...
>>>
>>> So i suspect those drivers which call phy_suspend()/phy_resume()
>>> should really be modified to call phy_stop()/phy_start().
>>>
>>> hns_nic_config_phy_loopback() is just funky, and probably needs the
>>> help of the hns guys to fix.
>>>
>>> dsa_slave_suspend() already does a phy_stop(), so the phy_suspend()
>>> can be removed.
>>>
>>> The comments in lpc_eth_open() suggest the phy_resume() is needed, so
>>> locks should be added. socfpga_dwmac_resume() seems to be the same.
>>>
>>>     Andrew
>>>
>>
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: handling of phy_stop() and phy_stop_machine() in phylib
  2018-02-07 20:56       ` handling of phy_stop() and phy_stop_machine() in phylib Heiner Kallweit
@ 2018-02-07 21:13         ` Russell King - ARM Linux
  2018-02-07 23:03           ` Florian Fainelli
  0 siblings, 1 reply; 12+ messages in thread
From: Russell King - ARM Linux @ 2018-02-07 21:13 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Florian Fainelli, Andrew Lunn, netdev

On Wed, Feb 07, 2018 at 09:56:37PM +0100, Heiner Kallweit wrote:
> Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
> > 
> > 
> > On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
> >> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
> >>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
> >>>> This commit forces callers of phy_resume() and phy_suspend() to hold
> >>>> mutex phydev->lock. This was done for calls to phy_resume() and
> >>>> phy_suspend() in phylib, however there are more callers in network
> >>>> drivers. I'd assume that these other calls issue a warning now
> >>>> because of the lock not being held.
> >>>> So is there something I miss or would this have to be fixed?
> >>>
> >>> Hi Heiner
> >>>
> >>> This is a good point.
> >>>
> >>> Yes, it looks like some fixes are needed. But what exactly?
> >>>
> >>> The phy state machine will suspend and resume the phy is you call
> >>> phy_stop() and phy_start() in the MAC suspend and resume functions.
> >>>
> >> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
> >> to PHY_HALTED and (at least if we're not in polling mode) doesn't
> >> call phy_suspend(). Maybe a call to phy_trigger_machine() is
> >> needed like in phy_start() ? Then the state machine would call
> >> phy_suspend(), provided the link is still up.
> > 
> > Right, phy_stop() merely just moves the state machine to PHY_HALTED and
> > this is actually a great source of problems which I tried to address here:
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
> > 
> > because phy_stop() is not a synchronous call, so when it returns the
> > state machine might still be running (it can take up to a 1 HZ depending
> > on when you called phy_stop()) and so if you took that as a
> > synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
> > you will likely see problems. phy_stop_machine() would provide that
> > synchronization point, but is not currently exported, despite being a
> > global symbol. This patch series above is all well and good, except that
> > Geert reported issues with suspend/resume interactions which I have not
> > been able to track down.
> > 
> To not confuse readers I changed the subject of the mail to reflect that
> the discussion isn't about the original topic any longer.
> 
> It seems to me that (at least one) reason for the issues is that pm
> callbacks for the phy device and the network device interfere.
> phy device pm (mdio_bus_phy_suspend et al) feels responsible for dealing
> with suspending/resuming the PHY, and the network driver pm callbacks
> as well.
> 
> Maybe, if the network driver takes care, it should inform phy device pm
> to do nothing. For this we could add a flag to phydev.mdio.flags.
> If the network driver sets this flag then mdio_bus_phy_suspend()
> and mdio_bus_phy_resume() could turn into no-ops.
> Not totally sure yet about mdio_bus_phy_restore() ..

What if the MDIO bus is handled by a separate device and the MDIO bus
is suspended prior to the network driver, thereby making the PHY
inaccessible?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: handling of phy_stop() and phy_stop_machine() in phylib
  2018-02-07 21:13         ` Russell King - ARM Linux
@ 2018-02-07 23:03           ` Florian Fainelli
  0 siblings, 0 replies; 12+ messages in thread
From: Florian Fainelli @ 2018-02-07 23:03 UTC (permalink / raw)
  To: Russell King - ARM Linux, Heiner Kallweit; +Cc: Andrew Lunn, netdev

On 02/07/2018 01:13 PM, Russell King - ARM Linux wrote:
> On Wed, Feb 07, 2018 at 09:56:37PM +0100, Heiner Kallweit wrote:
>> Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
>>>
>>>
>>> On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
>>>> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
>>>>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
>>>>>> This commit forces callers of phy_resume() and phy_suspend() to hold
>>>>>> mutex phydev->lock. This was done for calls to phy_resume() and
>>>>>> phy_suspend() in phylib, however there are more callers in network
>>>>>> drivers. I'd assume that these other calls issue a warning now
>>>>>> because of the lock not being held.
>>>>>> So is there something I miss or would this have to be fixed?
>>>>>
>>>>> Hi Heiner
>>>>>
>>>>> This is a good point.
>>>>>
>>>>> Yes, it looks like some fixes are needed. But what exactly?
>>>>>
>>>>> The phy state machine will suspend and resume the phy is you call
>>>>> phy_stop() and phy_start() in the MAC suspend and resume functions.
>>>>>
>>>> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
>>>> to PHY_HALTED and (at least if we're not in polling mode) doesn't
>>>> call phy_suspend(). Maybe a call to phy_trigger_machine() is
>>>> needed like in phy_start() ? Then the state machine would call
>>>> phy_suspend(), provided the link is still up.
>>>
>>> Right, phy_stop() merely just moves the state machine to PHY_HALTED and
>>> this is actually a great source of problems which I tried to address here:
>>>
>>> https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
>>>
>>> because phy_stop() is not a synchronous call, so when it returns the
>>> state machine might still be running (it can take up to a 1 HZ depending
>>> on when you called phy_stop()) and so if you took that as a
>>> synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
>>> you will likely see problems. phy_stop_machine() would provide that
>>> synchronization point, but is not currently exported, despite being a
>>> global symbol. This patch series above is all well and good, except that
>>> Geert reported issues with suspend/resume interactions which I have not
>>> been able to track down.
>>>
>> To not confuse readers I changed the subject of the mail to reflect that
>> the discussion isn't about the original topic any longer.
>>
>> It seems to me that (at least one) reason for the issues is that pm
>> callbacks for the phy device and the network device interfere.
>> phy device pm (mdio_bus_phy_suspend et al) feels responsible for dealing
>> with suspending/resuming the PHY, and the network driver pm callbacks
>> as well.
>>
>> Maybe, if the network driver takes care, it should inform phy device pm
>> to do nothing. For this we could add a flag to phydev.mdio.flags.
>> If the network driver sets this flag then mdio_bus_phy_suspend()
>> and mdio_bus_phy_resume() could turn into no-ops.
>> Not totally sure yet about mdio_bus_phy_restore() ..
> 
> What if the MDIO bus is handled by a separate device and the MDIO bus
> is suspended prior to the network driver, thereby making the PHY
> inaccessible?

Indeed. We can know that in the PHY library though, because there is
logic to hold the module reference count, see phy_attach_direct(), we
cannot quite trust whether the Ethernet controller does the right thing
though, as I can think of several ways for things to be done wrong like:
CONFIG_PM is enabled, but the Ethernet driver does not implement any
suspend/resume callback, or does it wrong etc...
-- 
Florian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"
  2018-02-03 20:17 ` Andrew Lunn
  2018-02-03 23:58   ` Heiner Kallweit
@ 2018-02-25 13:00   ` Heiner Kallweit
  2018-02-25 16:38     ` Andrew Lunn
  1 sibling, 1 reply; 12+ messages in thread
From: Heiner Kallweit @ 2018-02-25 13:00 UTC (permalink / raw)
  To: Andrew Lunn, Russell King; +Cc: netdev

Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
>> This commit forces callers of phy_resume() and phy_suspend() to hold
>> mutex phydev->lock. This was done for calls to phy_resume() and
>> phy_suspend() in phylib, however there are more callers in network
>> drivers. I'd assume that these other calls issue a warning now
>> because of the lock not being held.
>> So is there something I miss or would this have to be fixed?
> 
> Hi Heiner
> 
> This is a good point.
> 
> Yes, it looks like some fixes are needed. But what exactly?
> 
The issue with phy_suspend/phy_resume and the changed locking
behavior is still open AFAICS. There was a proposed fix
https://www.mail-archive.com/netdev@vger.kernel.org/msg215455.html
and then the discussion stopped.
I think we need the fix before 4.16 leaves the rc phase.

Heiner


> The phy state machine will suspend and resume the phy is you call
> phy_stop() and phy_start() in the MAC suspend and resume functions.
> 
> A few examples:
> 
> tc35815_suspend(), ravb_suspend() via ravb_close(), sh_eth_suspend()
> via sh_eth_close(), fec_suspend(), mpc52xx_fec_of_suspend() via
> mpc52xx_fec_close(), ucc_geth_suspend(), etc...
> 
> So i suspect those drivers which call phy_suspend()/phy_resume()
> should really be modified to call phy_stop()/phy_start().
> 
> hns_nic_config_phy_loopback() is just funky, and probably needs the
> help of the hns guys to fix.
> 
> dsa_slave_suspend() already does a phy_stop(), so the phy_suspend()
> can be removed.
> 
> The comments in lpc_eth_open() suggest the phy_resume() is needed, so
> locks should be added. socfpga_dwmac_resume() seems to be the same.
> 
>     Andrew
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"
  2018-02-25 13:00   ` Potential issue with f5e64032a799 "net: phy: fix resume handling" Heiner Kallweit
@ 2018-02-25 16:38     ` Andrew Lunn
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Lunn @ 2018-02-25 16:38 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Russell King, netdev

On Sun, Feb 25, 2018 at 02:00:43PM +0100, Heiner Kallweit wrote:
> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
> > On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
> >> This commit forces callers of phy_resume() and phy_suspend() to hold
> >> mutex phydev->lock. This was done for calls to phy_resume() and
> >> phy_suspend() in phylib, however there are more callers in network
> >> drivers. I'd assume that these other calls issue a warning now
> >> because of the lock not being held.
> >> So is there something I miss or would this have to be fixed?
> > 
> > Hi Heiner
> > 
> > This is a good point.
> > 
> > Yes, it looks like some fixes are needed. But what exactly?
> > 
> The issue with phy_suspend/phy_resume and the changed locking
> behavior is still open AFAICS. There was a proposed fix
> https://www.mail-archive.com/netdev@vger.kernel.org/msg215455.html
> and then the discussion stopped.
> I think we need the fix before 4.16 leaves the rc phase.

Hi Heiner

I have a patch i will post later today.

  Andrew

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-02-25 16:38 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-03 16:41 Potential issue with f5e64032a799 "net: phy: fix resume handling" Heiner Kallweit
2018-02-03 20:17 ` Andrew Lunn
2018-02-03 23:58   ` Heiner Kallweit
2018-02-04  2:48     ` Florian Fainelli
2018-02-05 21:48       ` Heiner Kallweit
2018-02-06 11:00         ` Russell King - ARM Linux
2018-02-06 12:55           ` Andrew Lunn
2018-02-07 20:56       ` handling of phy_stop() and phy_stop_machine() in phylib Heiner Kallweit
2018-02-07 21:13         ` Russell King - ARM Linux
2018-02-07 23:03           ` Florian Fainelli
2018-02-25 13:00   ` Potential issue with f5e64032a799 "net: phy: fix resume handling" Heiner Kallweit
2018-02-25 16:38     ` Andrew Lunn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.