netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] net: Fix crashes due to activity during suspend
@ 2017-08-22 18:37 Geert Uytterhoeven
  2017-08-22 18:37 ` [PATCH 1/2] net: phy: Freeze PHY polling before suspending devices Geert Uytterhoeven
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Geert Uytterhoeven @ 2017-08-22 18:37 UTC (permalink / raw)
  To: David S . Miller, Steve Glendinning, Andrew Lunn, Florian Fainelli
  Cc: Lukas Wunner, Rafael J . Wysocki, netdev, linux-pm,
	linux-renesas-soc, linux-kernel, Geert Uytterhoeven

	Hi all,

If an Ethernet device is used while the device is suspended, the system may
crash.

E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
driven by a PM controlled clock.  If the Ethernet registers are accessed
while the clock is not running, the system will crash with an imprecise
external abort.

This patch series fixes two of such crashes:
  1. The first patch prevents the PHY polling state machine from accessing
     PHY registers while a device is suspended,
  2. The second patch prevents the net core from trying to transmit packets
     when an smsc911x device is suspended.

Both crashes can be reproduced on sh73a0/kzm9g and r8a73a4/ape6evm during
s2ram (rarely), or by using pm_test (more likely to trigger):

    # echo 0 > /sys/module/printk/parameters/console_suspend
    # echo platform > /sys/power/pm_test
    # echo mem > /sys/power/state

With this series applied, my test systems survive a loop of 100 test
suspends.

Thanks for your comments!

Geert Uytterhoeven (2):
  net: phy: Freeze PHY polling before suspending devices
  net: smsc911x: Quiten netif during suspend

 drivers/net/ethernet/smsc/smsc911x.c | 15 ++++++++++++++-
 drivers/net/phy/phy.c                | 12 +++++++-----
 2 files changed, 21 insertions(+), 6 deletions(-)

-- 
2.7.4

Gr{oetje,eeting}s,

						Geert

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] net: phy: Freeze PHY polling before suspending devices
  2017-08-22 18:37 [PATCH 0/2] net: Fix crashes due to activity during suspend Geert Uytterhoeven
@ 2017-08-22 18:37 ` Geert Uytterhoeven
  2017-08-22 18:37 ` [PATCH 2/2] net: smsc911x: Quiten netif during suspend Geert Uytterhoeven
  2017-08-22 18:49 ` [PATCH 0/2] net: Fix crashes due to activity " Florian Fainelli
  2 siblings, 0 replies; 10+ messages in thread
From: Geert Uytterhoeven @ 2017-08-22 18:37 UTC (permalink / raw)
  To: David S . Miller, Steve Glendinning, Andrew Lunn, Florian Fainelli
  Cc: Lukas Wunner, Rafael J . Wysocki, netdev, linux-pm,
	linux-renesas-soc, linux-kernel, Geert Uytterhoeven

During system resume, the PHY state machine may be run from the
workqueue while the Ethernet device (and its PHY) are still suspended,
which may lead to a system crash.

E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
driven by a PM controlled clock.  If the Ethernet registers are accessed
while the clock is not running, the system will crash with an imprecise
external abort.

As this is a race condition with a small time window, it is not so easy
to trigger at will.  Using pm_test may increase your chances:

    # echo 0 > /sys/module/printk/parameters/console_suspend
    # echo platform > /sys/power/pm_test
    # echo mem > /sys/power/state

However, since commit 7ad813f208533ceb ("net: phy: Correctly process
PHY_HALTED in phy_stop_machine()"), this is much more likely to happen,
presumably due to the new sync point where phy_state_machine() calls
queue_delayed_work() just before entering system suspend.

To fix this, move PHY polling from the non-freezable to the freezable
power efficient work queue.  This is similar to what was done in commit
ea00353f36b64375 ("PCI: Freeze PME scan before suspending devices").

Stacktrace for posterity:

    PM: Syncing filesystems ... done.
    Freezing user space processes ... (elapsed 0.001 seconds) done.
    OOM killer disabled.
    Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
    PM: suspend devices took 0.010 seconds
    Disabling non-boot CPUs ...
    Enabling non-boot CPUs ...
    Unhandled fault: imprecise external abort (0x1406) at 0x000cf6c8
    pgd = c0004000
    [000cf6c8] *pgd=00000000
    Internal error: : 1406 [#1] SMP ARM
    Modules linked in:
    CPU: 0 PID: 57 Comm: kworker/0:2 Not tainted 4.13.0-rc5-kzm9g-04113-g3a897d275111fd6c #903
    Hardware name: Generic SH73A0 (Flattened Device Tree)
    Workqueue: events_power_efficient phy_state_machine
    task: df6b4800 task.stack: df796000
    PC is at __smsc911x_reg_read+0x1c/0x60
    LR is at smsc911x_mac_read+0x4c/0x118
    pc : [<c03c7e88>]    lr : [<c03c8b88>]    psr: 200d0093
    sp : df797e98  ip : 00000000  fp : 00000001
    r10: 00000000  r9 : df70e380  r8 : 800d0093
    r7 : df631de8  r6 : 00000006  r5 : df631e08  r4 : df631dc0
    r3 : e0903000  r2 : 00000000  r1 : e09030a4  r0 : 00000000
    Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
    Control: 10c5387d  Table: 5ee1404a  DAC: 00000051
    Process kworker/0:2 (pid: 57, stack limit = 0xdf796210)
    Stack: (0xdf797e98 to 0xdf798000)
    7e80:                                                       df631dc0 00000001
    7ea0: 00000001 df631de8 600d0013 c03c8df4 df70e400 00000001 00000001 df70e458
    7ec0: df70e000 c03c7104 c03c6170 df70e000 df70e000 dfbc7bc0 df797f30 c03c6138
    7ee0: df77d900 c03c617c df77d900 df70e32c dfbc7bc0 c03c4218 df77d900 df70e32c
    7f00: dfbc7bc0 df797f30 dfbcb200 00000000 00000000 c013aa0c 00000001 00000000
    7f20: c013a994 00000000 00000000 00000000 c1117f70 00000000 00000000 c07420ec
    7f40: c0904900 df77d900 dfbc7bc0 dfbc7bc0 df796000 dfbc7bf4 c0904900 df77d918
    7f60: 00000008 c013b1e0 df6b4800 df75b500 df77e5c0 00000000 df44dea8 df77d900
    7f80: c013af28 df75b538 00000000 c0140584 df77e5c0 c0140460 00000000 00000000
    7fa0: 00000000 00000000 00000000 c0106ef0 00000000 00000000 00000000 00000000
    7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    7fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
    [<c03c7e88>] (__smsc911x_reg_read) from [<c03c8b88>] (smsc911x_mac_read+0x4c/0x118)
    [<c03c8b88>] (smsc911x_mac_read) from [<c03c8df4>] (smsc911x_mii_read+0x2c/0xa4)
    [<c03c8df4>] (smsc911x_mii_read) from [<c03c7104>] (mdiobus_read+0x58/0x70)
    [<c03c7104>] (mdiobus_read) from [<c03c6138>] (genphy_update_link+0x18/0x50)
    [<c03c6138>] (genphy_update_link) from [<c03c617c>] (genphy_read_status+0xc/0x1cc)
    [<c03c617c>] (genphy_read_status) from [<c03c4218>] (phy_state_machine+0xa8/0x3dc)
    [<c03c4218>] (phy_state_machine) from [<c013aa0c>] (process_one_work+0x240/0x3fc)
    [<c013aa0c>] (process_one_work) from [<c013b1e0>] (worker_thread+0x2b8/0x3f4)
    [<c013b1e0>] (worker_thread) from [<c0140584>] (kthread+0x124/0x144)
    [<c0140584>] (kthread) from [<c0106ef0>] (ret_from_fork+0x14/0x24)
    Code: e5903000 e0831001 e5910000 f57ff04f (e12fff1e)

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
---
 drivers/net/phy/phy.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index dae13f028c84ee17..2055165ca6349ee5 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -554,7 +554,8 @@ EXPORT_SYMBOL(phy_start_aneg);
  */
 void phy_start_machine(struct phy_device *phydev)
 {
-	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, HZ);
+	queue_delayed_work(system_freezable_power_efficient_wq,
+			   &phydev->state_queue, HZ);
 }
 EXPORT_SYMBOL_GPL(phy_start_machine);
 
@@ -574,7 +575,8 @@ void phy_trigger_machine(struct phy_device *phydev, bool sync)
 		cancel_delayed_work_sync(&phydev->state_queue);
 	else
 		cancel_delayed_work(&phydev->state_queue);
-	queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
+	queue_delayed_work(system_freezable_power_efficient_wq,
+			   &phydev->state_queue, 0);
 }
 
 /**
@@ -1081,8 +1083,8 @@ void phy_state_machine(struct work_struct *work)
 	 * between states from phy_mac_interrupt()
 	 */
 	if (phydev->irq == PHY_POLL)
-		queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
-				   PHY_STATE_TIME * HZ);
+		queue_delayed_work(system_freezable_power_efficient_wq,
+				   &phydev->state_queue, PHY_STATE_TIME * HZ);
 }
 
 /**
@@ -1099,7 +1101,7 @@ void phy_mac_interrupt(struct phy_device *phydev, int new_link)
 	phydev->link = new_link;
 
 	/* Trigger a state machine change */
-	queue_work(system_power_efficient_wq, &phydev->phy_queue);
+	queue_work(system_freezable_power_efficient_wq, &phydev->phy_queue);
 }
 EXPORT_SYMBOL(phy_mac_interrupt);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] net: smsc911x: Quiten netif during suspend
  2017-08-22 18:37 [PATCH 0/2] net: Fix crashes due to activity during suspend Geert Uytterhoeven
  2017-08-22 18:37 ` [PATCH 1/2] net: phy: Freeze PHY polling before suspending devices Geert Uytterhoeven
@ 2017-08-22 18:37 ` Geert Uytterhoeven
  2017-08-22 20:17   ` Andrew Lunn
  2017-08-22 18:49 ` [PATCH 0/2] net: Fix crashes due to activity " Florian Fainelli
  2 siblings, 1 reply; 10+ messages in thread
From: Geert Uytterhoeven @ 2017-08-22 18:37 UTC (permalink / raw)
  To: David S . Miller, Steve Glendinning, Andrew Lunn, Florian Fainelli
  Cc: Lukas Wunner, Rafael J . Wysocki, netdev, linux-pm,
	linux-renesas-soc, linux-kernel, Geert Uytterhoeven

If the network interface is kept running during suspend, the net core
may call net_device_ops.ndo_start_xmit() while the Ethernet device is
still suspended, which may lead to a system crash.

E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
driven by a PM controlled clock.  If the Ethernet registers are accessed
while the clock is not running, the system will crash with an imprecise
external abort.

As this is a race condition with a small time window, it is not so easy
to trigger at will.  Using pm_test may increase your chances:

    # echo 0 > /sys/module/printk/parameters/console_suspend
    # echo platform > /sys/power/pm_test
    # echo mem > /sys/power/state

To fix this, make sure the network interface is quitened during suspend.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
---
No stacktrace is provided, as the imprecise external abort is usually
reported from an innocent looking and unrelated function like
__loop_delay(), cpu_idle_poll(), or arch_timer_read_counter_long().
---
 drivers/net/ethernet/smsc/smsc911x.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c
index 0b6a39b003a4e188..012fb66eed8dd618 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -2595,6 +2595,11 @@ static int smsc911x_suspend(struct device *dev)
 	struct net_device *ndev = dev_get_drvdata(dev);
 	struct smsc911x_data *pdata = netdev_priv(ndev);
 
+	if (netif_running(ndev)) {
+		netif_stop_queue(ndev);
+		netif_device_detach(ndev);
+	}
+
 	/* enable wake on LAN, energy detection and the external PME
 	 * signal. */
 	smsc911x_reg_write(pdata, PMT_CTRL,
@@ -2628,7 +2633,15 @@ static int smsc911x_resume(struct device *dev)
 	while (!(smsc911x_reg_read(pdata, PMT_CTRL) & PMT_CTRL_READY_) && --to)
 		udelay(1000);
 
-	return (to == 0) ? -EIO : 0;
+	if (to == 0)
+		return -EIO;
+
+	if (netif_running(ndev)) {
+		netif_device_attach(ndev);
+		netif_start_queue(ndev);
+	}
+
+	return 0;
 }
 
 static const struct dev_pm_ops smsc911x_pm_ops = {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] net: Fix crashes due to activity during suspend
  2017-08-22 18:37 [PATCH 0/2] net: Fix crashes due to activity during suspend Geert Uytterhoeven
  2017-08-22 18:37 ` [PATCH 1/2] net: phy: Freeze PHY polling before suspending devices Geert Uytterhoeven
  2017-08-22 18:37 ` [PATCH 2/2] net: smsc911x: Quiten netif during suspend Geert Uytterhoeven
@ 2017-08-22 18:49 ` Florian Fainelli
  2017-08-22 21:16   ` Geert Uytterhoeven
  2017-08-23 11:45   ` Geert Uytterhoeven
  2 siblings, 2 replies; 10+ messages in thread
From: Florian Fainelli @ 2017-08-22 18:49 UTC (permalink / raw)
  To: Geert Uytterhoeven, David S . Miller, Steve Glendinning, Andrew Lunn
  Cc: Lukas Wunner, Rafael J . Wysocki, netdev, linux-pm,
	linux-renesas-soc, linux-kernel

On 08/22/2017 11:37 AM, Geert Uytterhoeven wrote:
> 	Hi all,
> 
> If an Ethernet device is used while the device is suspended, the system may
> crash.
> 
> E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
> driven by a PM controlled clock.  If the Ethernet registers are accessed
> while the clock is not running, the system will crash with an imprecise
> external abort.
> 
> This patch series fixes two of such crashes:
>   1. The first patch prevents the PHY polling state machine from accessing
>      PHY registers while a device is suspended,
>   2. The second patch prevents the net core from trying to transmit packets
>      when an smsc911x device is suspended.
> 
> Both crashes can be reproduced on sh73a0/kzm9g and r8a73a4/ape6evm during
> s2ram (rarely), or by using pm_test (more likely to trigger):
> 
>     # echo 0 > /sys/module/printk/parameters/console_suspend
>     # echo platform > /sys/power/pm_test
>     # echo mem > /sys/power/state
> 
> With this series applied, my test systems survive a loop of 100 test
> suspends.

It seems to me like part, if not the entire problem is that smsc91xx's
suspend and resume functions are way too simplistic and absolutely do
not manage the PHY during suspend/resume, the PHY state machine is not
even stopped, so of course, this will cause bus errors if you access
those registers.

You are addressing this as part of patch 2, but this seems to me like
this is still a bit incomplete and you'd need at least phy_stop() and/or
phy_suspend() (does a power down of the PHY) and phy_start() and/or
phy_resume() calls to complete the PHY state machine shutdown during
suspend.

Have you tried that?
-- 
Florian

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] net: smsc911x: Quiten netif during suspend
  2017-08-22 18:37 ` [PATCH 2/2] net: smsc911x: Quiten netif during suspend Geert Uytterhoeven
@ 2017-08-22 20:17   ` Andrew Lunn
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Lunn @ 2017-08-22 20:17 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: David S . Miller, Steve Glendinning, Florian Fainelli,
	Lukas Wunner, Rafael J . Wysocki, netdev, linux-pm,
	linux-renesas-soc, linux-kernel

On Tue, Aug 22, 2017 at 08:37:26PM +0200, Geert Uytterhoeven wrote:

Hi Geert

quieten. Has an E in the middle.

Otherwise this patch looks reasonable.

	  Andrew

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] net: Fix crashes due to activity during suspend
  2017-08-22 18:49 ` [PATCH 0/2] net: Fix crashes due to activity " Florian Fainelli
@ 2017-08-22 21:16   ` Geert Uytterhoeven
  2017-08-23 11:45   ` Geert Uytterhoeven
  1 sibling, 0 replies; 10+ messages in thread
From: Geert Uytterhoeven @ 2017-08-22 21:16 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Geert Uytterhoeven, David S . Miller, Steve Glendinning,
	Andrew Lunn, Lukas Wunner, Rafael J . Wysocki, netdev,
	Linux PM list, Linux-Renesas, linux-kernel

Hi Florian,

On Tue, Aug 22, 2017 at 8:49 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 08/22/2017 11:37 AM, Geert Uytterhoeven wrote:
>> If an Ethernet device is used while the device is suspended, the system may
>> crash.
>>
>> E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
>> driven by a PM controlled clock.  If the Ethernet registers are accessed
>> while the clock is not running, the system will crash with an imprecise
>> external abort.
>>
>> This patch series fixes two of such crashes:
>>   1. The first patch prevents the PHY polling state machine from accessing
>>      PHY registers while a device is suspended,
>>   2. The second patch prevents the net core from trying to transmit packets
>>      when an smsc911x device is suspended.
>>
>> Both crashes can be reproduced on sh73a0/kzm9g and r8a73a4/ape6evm during
>> s2ram (rarely), or by using pm_test (more likely to trigger):
>>
>>     # echo 0 > /sys/module/printk/parameters/console_suspend
>>     # echo platform > /sys/power/pm_test
>>     # echo mem > /sys/power/state
>>
>> With this series applied, my test systems survive a loop of 100 test
>> suspends.
>
> It seems to me like part, if not the entire problem is that smsc91xx's
> suspend and resume functions are way too simplistic and absolutely do
> not manage the PHY during suspend/resume, the PHY state machine is not
> even stopped, so of course, this will cause bus errors if you access
> those registers.
>
> You are addressing this as part of patch 2, but this seems to me like
> this is still a bit incomplete and you'd need at least phy_stop() and/or
> phy_suspend() (does a power down of the PHY) and phy_start() and/or
> phy_resume() calls to complete the PHY state machine shutdown during
> suspend.
>
> Have you tried that?

Thank you, I will give that a try!

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] net: Fix crashes due to activity during suspend
  2017-08-22 18:49 ` [PATCH 0/2] net: Fix crashes due to activity " Florian Fainelli
  2017-08-22 21:16   ` Geert Uytterhoeven
@ 2017-08-23 11:45   ` Geert Uytterhoeven
  2017-08-23 17:13     ` Florian Fainelli
  1 sibling, 1 reply; 10+ messages in thread
From: Geert Uytterhoeven @ 2017-08-23 11:45 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Geert Uytterhoeven, David S . Miller, Steve Glendinning,
	Andrew Lunn, Lukas Wunner, Rafael J . Wysocki, netdev,
	Linux PM list, Linux-Renesas, linux-kernel

Hi Florian,

On Tue, Aug 22, 2017 at 8:49 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 08/22/2017 11:37 AM, Geert Uytterhoeven wrote:
>> If an Ethernet device is used while the device is suspended, the system may
>> crash.
>>
>> E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
>> driven by a PM controlled clock.  If the Ethernet registers are accessed
>> while the clock is not running, the system will crash with an imprecise
>> external abort.
>>
>> This patch series fixes two of such crashes:
>>   1. The first patch prevents the PHY polling state machine from accessing
>>      PHY registers while a device is suspended,
>>   2. The second patch prevents the net core from trying to transmit packets
>>      when an smsc911x device is suspended.
>>
>> Both crashes can be reproduced on sh73a0/kzm9g and r8a73a4/ape6evm during
>> s2ram (rarely), or by using pm_test (more likely to trigger):
>>
>>     # echo 0 > /sys/module/printk/parameters/console_suspend
>>     # echo platform > /sys/power/pm_test
>>     # echo mem > /sys/power/state
>>
>> With this series applied, my test systems survive a loop of 100 test
>> suspends.
>
> It seems to me like part, if not the entire problem is that smsc91xx's
> suspend and resume functions are way too simplistic and absolutely do
> not manage the PHY during suspend/resume, the PHY state machine is not
> even stopped, so of course, this will cause bus errors if you access
> those registers.
>
> You are addressing this as part of patch 2, but this seems to me like
> this is still a bit incomplete and you'd need at least phy_stop() and/or
> phy_suspend() (does a power down of the PHY) and phy_start() and/or
> phy_resume() calls to complete the PHY state machine shutdown during
> suspend.
>
> Have you tried that?

Unfortunately that doesn't help.
In state PHY_HALTED, the PHY state machine still calls the .adjust_link()
callback while the device is suspended.

Do you have a clue? This is too far beyond my phy-foo...

Thanks!

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] net: Fix crashes due to activity during suspend
  2017-08-23 11:45   ` Geert Uytterhoeven
@ 2017-08-23 17:13     ` Florian Fainelli
  2017-09-07 13:09       ` Florian Fainelli
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Fainelli @ 2017-08-23 17:13 UTC (permalink / raw)
  To: Geert Uytterhoeven, marc_gonzalez, slash.tmp, andrew
  Cc: Geert Uytterhoeven, David S . Miller, Steve Glendinning,
	Andrew Lunn, Lukas Wunner, Rafael J . Wysocki, netdev,
	Linux PM list, Linux-Renesas, linux-kernel

On 08/23/2017 04:45 AM, Geert Uytterhoeven wrote:
> Hi Florian,
> 
> On Tue, Aug 22, 2017 at 8:49 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>> On 08/22/2017 11:37 AM, Geert Uytterhoeven wrote:
>>> If an Ethernet device is used while the device is suspended, the system may
>>> crash.
>>>
>>> E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
>>> driven by a PM controlled clock.  If the Ethernet registers are accessed
>>> while the clock is not running, the system will crash with an imprecise
>>> external abort.
>>>
>>> This patch series fixes two of such crashes:
>>>   1. The first patch prevents the PHY polling state machine from accessing
>>>      PHY registers while a device is suspended,
>>>   2. The second patch prevents the net core from trying to transmit packets
>>>      when an smsc911x device is suspended.
>>>
>>> Both crashes can be reproduced on sh73a0/kzm9g and r8a73a4/ape6evm during
>>> s2ram (rarely), or by using pm_test (more likely to trigger):
>>>
>>>     # echo 0 > /sys/module/printk/parameters/console_suspend
>>>     # echo platform > /sys/power/pm_test
>>>     # echo mem > /sys/power/state
>>>
>>> With this series applied, my test systems survive a loop of 100 test
>>> suspends.
>>
>> It seems to me like part, if not the entire problem is that smsc91xx's
>> suspend and resume functions are way too simplistic and absolutely do
>> not manage the PHY during suspend/resume, the PHY state machine is not
>> even stopped, so of course, this will cause bus errors if you access
>> those registers.
>>
>> You are addressing this as part of patch 2, but this seems to me like
>> this is still a bit incomplete and you'd need at least phy_stop() and/or
>> phy_suspend() (does a power down of the PHY) and phy_start() and/or
>> phy_resume() calls to complete the PHY state machine shutdown during
>> suspend.
>>
>> Have you tried that?
> 
> Unfortunately that doesn't help.
> In state PHY_HALTED, the PHY state machine still calls the .adjust_link()
> callback while the device is suspended.

Humm that is correct yes.

> 
> Do you have a clue? This is too far beyond my phy-foo...

I was initially contemplating a revert of
7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy: Correctly process
PHY_HALTED in phy_stop_machine()") but this is not the root of the
problem. The problem really is that phy_stop() does not wait for the PHY
state machine to be stopped so you cannot rely on that and past the
function return be offered any guarantees that adjust_link is not called.

We seem to be getting away with that in most drivers because when we see
phydev->link = 0, we either do nothing or actually turn of the HW block.

How about we export phy_stop_machine() to drivers which would provide a
synchronization point that would ensure that no HW accesses are done
past this point?

I am absolutely not clear on the implications of using a freezable
workqueue with respect to the PHY state machine and how devices are
going to wind-up being powered down or not...

Thanks!
-- 
Florian

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] net: Fix crashes due to activity during suspend
  2017-08-23 17:13     ` Florian Fainelli
@ 2017-09-07 13:09       ` Florian Fainelli
  2017-09-13 17:33         ` Geert Uytterhoeven
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Fainelli @ 2017-09-07 13:09 UTC (permalink / raw)
  To: Geert Uytterhoeven, andrew
  Cc: marc_gonzalez, slash.tmp, Geert Uytterhoeven, David S . Miller,
	Steve Glendinning, Lukas Wunner, Rafael J . Wysocki, netdev,
	Linux PM list, Linux-Renesas, linux-kernel



On 08/23/2017 10:13 AM, Florian Fainelli wrote:
> On 08/23/2017 04:45 AM, Geert Uytterhoeven wrote:
>> Hi Florian,
>>
>> On Tue, Aug 22, 2017 at 8:49 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>> On 08/22/2017 11:37 AM, Geert Uytterhoeven wrote:
>>>> If an Ethernet device is used while the device is suspended, the system may
>>>> crash.
>>>>
>>>> E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
>>>> driven by a PM controlled clock.  If the Ethernet registers are accessed
>>>> while the clock is not running, the system will crash with an imprecise
>>>> external abort.
>>>>
>>>> This patch series fixes two of such crashes:
>>>>   1. The first patch prevents the PHY polling state machine from accessing
>>>>      PHY registers while a device is suspended,
>>>>   2. The second patch prevents the net core from trying to transmit packets
>>>>      when an smsc911x device is suspended.
>>>>
>>>> Both crashes can be reproduced on sh73a0/kzm9g and r8a73a4/ape6evm during
>>>> s2ram (rarely), or by using pm_test (more likely to trigger):
>>>>
>>>>     # echo 0 > /sys/module/printk/parameters/console_suspend
>>>>     # echo platform > /sys/power/pm_test
>>>>     # echo mem > /sys/power/state
>>>>
>>>> With this series applied, my test systems survive a loop of 100 test
>>>> suspends.
>>>
>>> It seems to me like part, if not the entire problem is that smsc91xx's
>>> suspend and resume functions are way too simplistic and absolutely do
>>> not manage the PHY during suspend/resume, the PHY state machine is not
>>> even stopped, so of course, this will cause bus errors if you access
>>> those registers.
>>>
>>> You are addressing this as part of patch 2, but this seems to me like
>>> this is still a bit incomplete and you'd need at least phy_stop() and/or
>>> phy_suspend() (does a power down of the PHY) and phy_start() and/or
>>> phy_resume() calls to complete the PHY state machine shutdown during
>>> suspend.
>>>
>>> Have you tried that?
>>
>> Unfortunately that doesn't help.
>> In state PHY_HALTED, the PHY state machine still calls the .adjust_link()
>> callback while the device is suspended.
> 
> Humm that is correct yes.
> 
>>
>> Do you have a clue? This is too far beyond my phy-foo...
> 
> I was initially contemplating a revert of
> 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy: Correctly process
> PHY_HALTED in phy_stop_machine()") but this is not the root of the
> problem. The problem really is that phy_stop() does not wait for the PHY
> state machine to be stopped so you cannot rely on that and past the
> function return be offered any guarantees that adjust_link is not called.
> 
> We seem to be getting away with that in most drivers because when we see
> phydev->link = 0, we either do nothing or actually turn of the HW block.
> 
> How about we export phy_stop_machine() to drivers which would provide a
> synchronization point that would ensure that no HW accesses are done
> past this point?
> 
> I am absolutely not clear on the implications of using a freezable
> workqueue with respect to the PHY state machine and how devices are
> going to wind-up being powered down or not...

Geert, as you may have notice a revert of the change was sent so 4.13
should be fine, but ultimately I would like to put the non-reverted code
back in after we add a few safeguards:

- David Daney reported he could crash the kernel by calling just
phy_disconnect() with no prior phy_stop() which is not "legal" but
should not crash either

- and you reported the bus errors on smsc911x when we call adjust_link
during suspend, and due to a lack of hard synchronization so phy_stop()
here does not give you enough guarantees to let you turn off power to
the smsc911x block

If that seems accurate then we can work on something that should be
working again (famous last words).

Thanks
-- 
Florian

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] net: Fix crashes due to activity during suspend
  2017-09-07 13:09       ` Florian Fainelli
@ 2017-09-13 17:33         ` Geert Uytterhoeven
  0 siblings, 0 replies; 10+ messages in thread
From: Geert Uytterhoeven @ 2017-09-13 17:33 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, marc_gonzalez, Mason, Geert Uytterhoeven,
	David S . Miller, Steve Glendinning, Lukas Wunner,
	Rafael J . Wysocki, netdev, Linux PM list, Linux-Renesas,
	linux-kernel

Hi Florian,

On Thu, Sep 7, 2017 at 3:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 08/23/2017 10:13 AM, Florian Fainelli wrote:
>> On 08/23/2017 04:45 AM, Geert Uytterhoeven wrote:
>>> On Tue, Aug 22, 2017 at 8:49 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>> On 08/22/2017 11:37 AM, Geert Uytterhoeven wrote:
>>>>> If an Ethernet device is used while the device is suspended, the system may
>>>>> crash.
>>>>>
>>>>> E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
>>>>> driven by a PM controlled clock.  If the Ethernet registers are accessed
>>>>> while the clock is not running, the system will crash with an imprecise
>>>>> external abort.
>>>>>
>>>>> This patch series fixes two of such crashes:
>>>>>   1. The first patch prevents the PHY polling state machine from accessing
>>>>>      PHY registers while a device is suspended,
>>>>>   2. The second patch prevents the net core from trying to transmit packets
>>>>>      when an smsc911x device is suspended.
>>>>>
>>>>> Both crashes can be reproduced on sh73a0/kzm9g and r8a73a4/ape6evm during
>>>>> s2ram (rarely), or by using pm_test (more likely to trigger):
>>>>>
>>>>>     # echo 0 > /sys/module/printk/parameters/console_suspend
>>>>>     # echo platform > /sys/power/pm_test
>>>>>     # echo mem > /sys/power/state
>>>>>
>>>>> With this series applied, my test systems survive a loop of 100 test
>>>>> suspends.
>>>>
>>>> It seems to me like part, if not the entire problem is that smsc91xx's
>>>> suspend and resume functions are way too simplistic and absolutely do
>>>> not manage the PHY during suspend/resume, the PHY state machine is not
>>>> even stopped, so of course, this will cause bus errors if you access
>>>> those registers.
>>>>
>>>> You are addressing this as part of patch 2, but this seems to me like
>>>> this is still a bit incomplete and you'd need at least phy_stop() and/or
>>>> phy_suspend() (does a power down of the PHY) and phy_start() and/or
>>>> phy_resume() calls to complete the PHY state machine shutdown during
>>>> suspend.
>>>>
>>>> Have you tried that?
>>>
>>> Unfortunately that doesn't help.
>>> In state PHY_HALTED, the PHY state machine still calls the .adjust_link()
>>> callback while the device is suspended.
>>
>> Humm that is correct yes.
>>
>>> Do you have a clue? This is too far beyond my phy-foo...
>>
>> I was initially contemplating a revert of
>> 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy: Correctly process
>> PHY_HALTED in phy_stop_machine()") but this is not the root of the
>> problem. The problem really is that phy_stop() does not wait for the PHY
>> state machine to be stopped so you cannot rely on that and past the
>> function return be offered any guarantees that adjust_link is not called.
>>
>> We seem to be getting away with that in most drivers because when we see
>> phydev->link = 0, we either do nothing or actually turn of the HW block.
>>
>> How about we export phy_stop_machine() to drivers which would provide a
>> synchronization point that would ensure that no HW accesses are done
>> past this point?
>>
>> I am absolutely not clear on the implications of using a freezable
>> workqueue with respect to the PHY state machine and how devices are
>> going to wind-up being powered down or not...
>
> Geert, as you may have notice a revert of the change was sent so 4.13
> should be fine, but ultimately I would like to put the non-reverted code
> back in after we add a few safeguards:

With the revert, I no longer need "[PATCH 1/2] net: phy: Freeze PHY polling
before suspending devices".
I just did more than 50 successful suspend/resume cycles to verify that.

I still need "[PATCH 2/2] net: smsc911x: Quiten netif during suspend", so
I'll submit a v2 for that.

> - and you reported the bus errors on smsc911x when we call adjust_link
> during suspend, and due to a lack of hard synchronization so phy_stop()
> here does not give you enough guarantees to let you turn off power to
> the smsc911x block
>
> If that seems accurate then we can work on something that should be
> working again (famous last words).

Sounds accurate to me.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-09-13 17:33 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-22 18:37 [PATCH 0/2] net: Fix crashes due to activity during suspend Geert Uytterhoeven
2017-08-22 18:37 ` [PATCH 1/2] net: phy: Freeze PHY polling before suspending devices Geert Uytterhoeven
2017-08-22 18:37 ` [PATCH 2/2] net: smsc911x: Quiten netif during suspend Geert Uytterhoeven
2017-08-22 20:17   ` Andrew Lunn
2017-08-22 18:49 ` [PATCH 0/2] net: Fix crashes due to activity " Florian Fainelli
2017-08-22 21:16   ` Geert Uytterhoeven
2017-08-23 11:45   ` Geert Uytterhoeven
2017-08-23 17:13     ` Florian Fainelli
2017-09-07 13:09       ` Florian Fainelli
2017-09-13 17:33         ` Geert Uytterhoeven

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).