All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY
@ 2023-09-14 15:34 Russell King (Oracle)
  2023-09-14 15:35 ` [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock Russell King (Oracle)
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-14 15:34 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

This series addresses a problem reported by Jijie Shao where the PHY
state machine can race with phy_stop() leading to an incorrect state.

The issue centres around phy_state_machine() dropping the phydev->lock
mutex briefly, which allows phy_stop() to get in half-way through the
state machine, and when the state machine resumes, it overwrites
phydev->state with a value incompatible with a stopped PHY. This causes
a subsequent phy_start() to issue a warning.

We address this firstly by using versions of functions that do not take
tne lock, moving them into the locked region. The only function that
this can't be done with is phy_suspend() which needs to call into the
driver without taking the lock.

For phy_suspend(), we split the state machine into two parts - the
initial part which runs under the phydev->lock, and the second part
which runs without the lock.

We finish off by using the split state machine in phy_stop() which
removes another unnecessary unlock-lock sequence from phylib.

Changes from RFC:
- Added Jijie Shao's tested-by

 drivers/net/phy/phy.c | 204 +++++++++++++++++++++++++++-----------------------
 1 file changed, 110 insertions(+), 94 deletions(-)

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock
  2023-09-14 15:34 [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY Russell King (Oracle)
@ 2023-09-14 15:35 ` Russell King (Oracle)
  2023-09-14 18:21   ` Florian Fainelli
       [not found]   ` <CGME20230918123304eucas1p2b628f00ed8df536372f1f2b445706021@eucas1p2.samsung.com>
  2023-09-14 15:35 ` [PATCH net-next 2/7] net: phy: call phy_error_precise() while holding the lock Russell King (Oracle)
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-14 15:35 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

phy_stop() calls phy_process_state_change() while holding the phydev
lock, so also arrange for phy_state_machine() to do the same, so that
this function is called with consistent locking.

Tested-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index df54c137c5f5..1e5218935eb3 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1506,6 +1506,7 @@ void phy_state_machine(struct work_struct *work)
 	if (err < 0)
 		phy_error_precise(phydev, func, err);
 
+	mutex_lock(&phydev->lock);
 	phy_process_state_change(phydev, old_state);
 
 	/* Only re-schedule a PHY state machine change if we are polling the
@@ -1516,7 +1517,6 @@ void phy_state_machine(struct work_struct *work)
 	 * state machine would be pointless and possibly error prone when
 	 * called from phy_disconnect() synchronously.
 	 */
-	mutex_lock(&phydev->lock);
 	if (phy_polling_mode(phydev) && phy_is_started(phydev))
 		phy_queue_state_machine(phydev, PHY_STATE_TIME);
 	mutex_unlock(&phydev->lock);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH net-next 2/7] net: phy: call phy_error_precise() while holding the lock
  2023-09-14 15:34 [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY Russell King (Oracle)
  2023-09-14 15:35 ` [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock Russell King (Oracle)
@ 2023-09-14 15:35 ` Russell King (Oracle)
  2023-09-14 18:21   ` Florian Fainelli
  2023-09-14 15:35 ` [PATCH net-next 3/7] net: phy: move call to start aneg Russell King (Oracle)
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-14 15:35 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

Move the locking out of phy_error_precise() and to its only call site,
merging with the locked region that has already been taken.

Tested-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phy.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 1e5218935eb3..990d387b31bd 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1231,9 +1231,7 @@ static void phy_error_precise(struct phy_device *phydev,
 			      const void *func, int err)
 {
 	WARN(1, "%pS: returned: %d\n", func, err);
-	mutex_lock(&phydev->lock);
 	phy_process_error(phydev);
-	mutex_unlock(&phydev->lock);
 }
 
 /**
@@ -1503,10 +1501,10 @@ void phy_state_machine(struct work_struct *work)
 	if (err == -ENODEV)
 		return;
 
+	mutex_lock(&phydev->lock);
 	if (err < 0)
 		phy_error_precise(phydev, func, err);
 
-	mutex_lock(&phydev->lock);
 	phy_process_state_change(phydev, old_state);
 
 	/* Only re-schedule a PHY state machine change if we are polling the
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH net-next 3/7] net: phy: move call to start aneg
  2023-09-14 15:34 [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY Russell King (Oracle)
  2023-09-14 15:35 ` [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock Russell King (Oracle)
  2023-09-14 15:35 ` [PATCH net-next 2/7] net: phy: call phy_error_precise() while holding the lock Russell King (Oracle)
@ 2023-09-14 15:35 ` Russell King (Oracle)
  2023-09-14 18:22   ` Florian Fainelli
  2023-09-14 15:35 ` [PATCH net-next 4/7] net: phy: move phy_suspend() to end of phy_state_machine() Russell King (Oracle)
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-14 15:35 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

Move the call to start auto-negotiation inside the lock in the PHYLIB
state machine, calling the locked variant _phy_start_aneg(). This
avoids unnecessarily releasing and re-acquiring the lock.

Tested-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phy.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 990d387b31bd..5bb33af2a4cb 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1489,14 +1489,15 @@ void phy_state_machine(struct work_struct *work)
 		break;
 	}
 
+	if (needs_aneg) {
+		err = _phy_start_aneg(phydev);
+		func = &_phy_start_aneg;
+	}
+
 	mutex_unlock(&phydev->lock);
 
-	if (needs_aneg) {
-		err = phy_start_aneg(phydev);
-		func = &phy_start_aneg;
-	} else if (do_suspend) {
+	if (do_suspend)
 		phy_suspend(phydev);
-	}
 
 	if (err == -ENODEV)
 		return;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH net-next 4/7] net: phy: move phy_suspend() to end of phy_state_machine()
  2023-09-14 15:34 [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY Russell King (Oracle)
                   ` (2 preceding siblings ...)
  2023-09-14 15:35 ` [PATCH net-next 3/7] net: phy: move call to start aneg Russell King (Oracle)
@ 2023-09-14 15:35 ` Russell King (Oracle)
  2023-09-14 18:22   ` Florian Fainelli
  2023-09-14 15:35 ` [PATCH net-next 5/7] net: phy: move phy_state_machine() Russell King (Oracle)
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-14 15:35 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

Move the call to phy_suspend() to the end of phy_state_machine() after
we release the lock so that we can combine the locked areas.
phy_suspend() can not be called while holding phydev->lock as it has
caused deadlocks in the past.

Tested-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phy.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 5bb33af2a4cb..756326f38b14 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1494,15 +1494,11 @@ void phy_state_machine(struct work_struct *work)
 		func = &_phy_start_aneg;
 	}
 
-	mutex_unlock(&phydev->lock);
-
-	if (do_suspend)
-		phy_suspend(phydev);
-
-	if (err == -ENODEV)
+	if (err == -ENODEV) {
+		mutex_unlock(&phydev->lock);
 		return;
+	}
 
-	mutex_lock(&phydev->lock);
 	if (err < 0)
 		phy_error_precise(phydev, func, err);
 
@@ -1519,6 +1515,9 @@ void phy_state_machine(struct work_struct *work)
 	if (phy_polling_mode(phydev) && phy_is_started(phydev))
 		phy_queue_state_machine(phydev, PHY_STATE_TIME);
 	mutex_unlock(&phydev->lock);
+
+	if (do_suspend)
+		phy_suspend(phydev);
 }
 
 /**
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH net-next 5/7] net: phy: move phy_state_machine()
  2023-09-14 15:34 [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY Russell King (Oracle)
                   ` (3 preceding siblings ...)
  2023-09-14 15:35 ` [PATCH net-next 4/7] net: phy: move phy_suspend() to end of phy_state_machine() Russell King (Oracle)
@ 2023-09-14 15:35 ` Russell King (Oracle)
  2023-09-14 18:22   ` Florian Fainelli
  2023-09-14 15:35 ` [PATCH net-next 6/7] net: phy: split locked and unlocked section of phy_state_machine() Russell King (Oracle)
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-14 15:35 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

Move phy_state_machine() before phy_stop() to avoid subsequent patches
introducing forward references.

Tested-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phy.c | 152 +++++++++++++++++++++---------------------
 1 file changed, 76 insertions(+), 76 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 756326f38b14..20e23fa9cf96 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1353,82 +1353,6 @@ void phy_free_interrupt(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(phy_free_interrupt);
 
-/**
- * phy_stop - Bring down the PHY link, and stop checking the status
- * @phydev: target phy_device struct
- */
-void phy_stop(struct phy_device *phydev)
-{
-	struct net_device *dev = phydev->attached_dev;
-	enum phy_state old_state;
-
-	if (!phy_is_started(phydev) && phydev->state != PHY_DOWN &&
-	    phydev->state != PHY_ERROR) {
-		WARN(1, "called from state %s\n",
-		     phy_state_to_str(phydev->state));
-		return;
-	}
-
-	mutex_lock(&phydev->lock);
-	old_state = phydev->state;
-
-	if (phydev->state == PHY_CABLETEST) {
-		phy_abort_cable_test(phydev);
-		netif_testing_off(dev);
-	}
-
-	if (phydev->sfp_bus)
-		sfp_upstream_stop(phydev->sfp_bus);
-
-	phydev->state = PHY_HALTED;
-	phy_process_state_change(phydev, old_state);
-
-	mutex_unlock(&phydev->lock);
-
-	phy_state_machine(&phydev->state_queue.work);
-	phy_stop_machine(phydev);
-
-	/* Cannot call flush_scheduled_work() here as desired because
-	 * of rtnl_lock(), but PHY_HALTED shall guarantee irq handler
-	 * will not reenable interrupts.
-	 */
-}
-EXPORT_SYMBOL(phy_stop);
-
-/**
- * phy_start - start or restart a PHY device
- * @phydev: target phy_device struct
- *
- * Description: Indicates the attached device's readiness to
- *   handle PHY-related work.  Used during startup to start the
- *   PHY, and after a call to phy_stop() to resume operation.
- *   Also used to indicate the MDIO bus has cleared an error
- *   condition.
- */
-void phy_start(struct phy_device *phydev)
-{
-	mutex_lock(&phydev->lock);
-
-	if (phydev->state != PHY_READY && phydev->state != PHY_HALTED) {
-		WARN(1, "called from state %s\n",
-		     phy_state_to_str(phydev->state));
-		goto out;
-	}
-
-	if (phydev->sfp_bus)
-		sfp_upstream_start(phydev->sfp_bus);
-
-	/* if phy was suspended, bring the physical link up again */
-	__phy_resume(phydev);
-
-	phydev->state = PHY_UP;
-
-	phy_start_machine(phydev);
-out:
-	mutex_unlock(&phydev->lock);
-}
-EXPORT_SYMBOL(phy_start);
-
 /**
  * phy_state_machine - Handle the state machine
  * @work: work_struct that describes the work to be done
@@ -1520,6 +1444,82 @@ void phy_state_machine(struct work_struct *work)
 		phy_suspend(phydev);
 }
 
+/**
+ * phy_stop - Bring down the PHY link, and stop checking the status
+ * @phydev: target phy_device struct
+ */
+void phy_stop(struct phy_device *phydev)
+{
+	struct net_device *dev = phydev->attached_dev;
+	enum phy_state old_state;
+
+	if (!phy_is_started(phydev) && phydev->state != PHY_DOWN &&
+	    phydev->state != PHY_ERROR) {
+		WARN(1, "called from state %s\n",
+		     phy_state_to_str(phydev->state));
+		return;
+	}
+
+	mutex_lock(&phydev->lock);
+	old_state = phydev->state;
+
+	if (phydev->state == PHY_CABLETEST) {
+		phy_abort_cable_test(phydev);
+		netif_testing_off(dev);
+	}
+
+	if (phydev->sfp_bus)
+		sfp_upstream_stop(phydev->sfp_bus);
+
+	phydev->state = PHY_HALTED;
+	phy_process_state_change(phydev, old_state);
+
+	mutex_unlock(&phydev->lock);
+
+	phy_state_machine(&phydev->state_queue.work);
+	phy_stop_machine(phydev);
+
+	/* Cannot call flush_scheduled_work() here as desired because
+	 * of rtnl_lock(), but PHY_HALTED shall guarantee irq handler
+	 * will not reenable interrupts.
+	 */
+}
+EXPORT_SYMBOL(phy_stop);
+
+/**
+ * phy_start - start or restart a PHY device
+ * @phydev: target phy_device struct
+ *
+ * Description: Indicates the attached device's readiness to
+ *   handle PHY-related work.  Used during startup to start the
+ *   PHY, and after a call to phy_stop() to resume operation.
+ *   Also used to indicate the MDIO bus has cleared an error
+ *   condition.
+ */
+void phy_start(struct phy_device *phydev)
+{
+	mutex_lock(&phydev->lock);
+
+	if (phydev->state != PHY_READY && phydev->state != PHY_HALTED) {
+		WARN(1, "called from state %s\n",
+		     phy_state_to_str(phydev->state));
+		goto out;
+	}
+
+	if (phydev->sfp_bus)
+		sfp_upstream_start(phydev->sfp_bus);
+
+	/* if phy was suspended, bring the physical link up again */
+	__phy_resume(phydev);
+
+	phydev->state = PHY_UP;
+
+	phy_start_machine(phydev);
+out:
+	mutex_unlock(&phydev->lock);
+}
+EXPORT_SYMBOL(phy_start);
+
 /**
  * phy_mac_interrupt - MAC says the link has changed
  * @phydev: phy_device struct with changed link
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH net-next 6/7] net: phy: split locked and unlocked section of phy_state_machine()
  2023-09-14 15:34 [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY Russell King (Oracle)
                   ` (4 preceding siblings ...)
  2023-09-14 15:35 ` [PATCH net-next 5/7] net: phy: move phy_state_machine() Russell King (Oracle)
@ 2023-09-14 15:35 ` Russell King (Oracle)
  2023-09-14 18:39   ` Florian Fainelli
  2023-09-14 15:36 ` [PATCH net-next 7/7] net: phy: convert phy_stop() to use split state machine Russell King (Oracle)
  2023-09-17 13:40 ` [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY patchwork-bot+netdevbpf
  7 siblings, 1 reply; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-14 15:35 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

Split out the locked and unlocked sections of phy_state_machine() into
two separate functions which can be called inside the phydev lock and
outside the phydev lock as appropriate, thus allowing us to combine
the locked regions in the caller of phy_state_machine() with the
locked region inside phy_state_machine().

This avoids unnecessarily dropping the phydev lock which may allow
races to occur.

Tested-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phy.c | 68 ++++++++++++++++++++++++++-----------------
 1 file changed, 42 insertions(+), 26 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 20e23fa9cf96..d78c2cc003ce 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1353,33 +1353,27 @@ void phy_free_interrupt(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(phy_free_interrupt);
 
-/**
- * phy_state_machine - Handle the state machine
- * @work: work_struct that describes the work to be done
- */
-void phy_state_machine(struct work_struct *work)
+enum phy_state_work {
+	PHY_STATE_WORK_NONE,
+	PHY_STATE_WORK_ANEG,
+	PHY_STATE_WORK_SUSPEND,
+};
+
+static enum phy_state_work _phy_state_machine(struct phy_device *phydev)
 {
-	struct delayed_work *dwork = to_delayed_work(work);
-	struct phy_device *phydev =
-			container_of(dwork, struct phy_device, state_queue);
+	enum phy_state_work state_work = PHY_STATE_WORK_NONE;
 	struct net_device *dev = phydev->attached_dev;
-	bool needs_aneg = false, do_suspend = false;
-	enum phy_state old_state;
+	enum phy_state old_state = phydev->state;
 	const void *func = NULL;
 	bool finished = false;
 	int err = 0;
 
-	mutex_lock(&phydev->lock);
-
-	old_state = phydev->state;
-
 	switch (phydev->state) {
 	case PHY_DOWN:
 	case PHY_READY:
 		break;
 	case PHY_UP:
-		needs_aneg = true;
-
+		state_work = PHY_STATE_WORK_ANEG;
 		break;
 	case PHY_NOLINK:
 	case PHY_RUNNING:
@@ -1391,7 +1385,7 @@ void phy_state_machine(struct work_struct *work)
 		if (err) {
 			phy_abort_cable_test(phydev);
 			netif_testing_off(dev);
-			needs_aneg = true;
+			state_work = PHY_STATE_WORK_ANEG;
 			phydev->state = PHY_UP;
 			break;
 		}
@@ -1399,7 +1393,7 @@ void phy_state_machine(struct work_struct *work)
 		if (finished) {
 			ethnl_cable_test_finished(phydev);
 			netif_testing_off(dev);
-			needs_aneg = true;
+			state_work = PHY_STATE_WORK_ANEG;
 			phydev->state = PHY_UP;
 		}
 		break;
@@ -1409,19 +1403,17 @@ void phy_state_machine(struct work_struct *work)
 			phydev->link = 0;
 			phy_link_down(phydev);
 		}
-		do_suspend = true;
+		state_work = PHY_STATE_WORK_SUSPEND;
 		break;
 	}
 
-	if (needs_aneg) {
+	if (state_work == PHY_STATE_WORK_ANEG) {
 		err = _phy_start_aneg(phydev);
 		func = &_phy_start_aneg;
 	}
 
-	if (err == -ENODEV) {
-		mutex_unlock(&phydev->lock);
-		return;
-	}
+	if (err == -ENODEV)
+		return state_work;
 
 	if (err < 0)
 		phy_error_precise(phydev, func, err);
@@ -1438,12 +1430,36 @@ void phy_state_machine(struct work_struct *work)
 	 */
 	if (phy_polling_mode(phydev) && phy_is_started(phydev))
 		phy_queue_state_machine(phydev, PHY_STATE_TIME);
-	mutex_unlock(&phydev->lock);
 
-	if (do_suspend)
+	return state_work;
+}
+
+/* unlocked part of the PHY state machine */
+static void _phy_state_machine_post_work(struct phy_device *phydev,
+					 enum phy_state_work state_work)
+{
+	if (state_work == PHY_STATE_WORK_SUSPEND)
 		phy_suspend(phydev);
 }
 
+/**
+ * phy_state_machine - Handle the state machine
+ * @work: work_struct that describes the work to be done
+ */
+void phy_state_machine(struct work_struct *work)
+{
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct phy_device *phydev =
+			container_of(dwork, struct phy_device, state_queue);
+	enum phy_state_work state_work;
+
+	mutex_lock(&phydev->lock);
+	state_work = _phy_state_machine(phydev);
+	mutex_unlock(&phydev->lock);
+
+	_phy_state_machine_post_work(phydev, state_work);
+}
+
 /**
  * phy_stop - Bring down the PHY link, and stop checking the status
  * @phydev: target phy_device struct
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH net-next 7/7] net: phy: convert phy_stop() to use split state machine
  2023-09-14 15:34 [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY Russell King (Oracle)
                   ` (5 preceding siblings ...)
  2023-09-14 15:35 ` [PATCH net-next 6/7] net: phy: split locked and unlocked section of phy_state_machine() Russell King (Oracle)
@ 2023-09-14 15:36 ` Russell King (Oracle)
  2023-09-14 18:39   ` Florian Fainelli
  2023-09-17 13:40 ` [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY patchwork-bot+netdevbpf
  7 siblings, 1 reply; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-14 15:36 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

Convert phy_stop() to use the new locked-section and unlocked-section
parts of the PHY state machine.

Tested-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phy.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index d78c2cc003ce..93a8676dd8d8 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1467,6 +1467,7 @@ void phy_state_machine(struct work_struct *work)
 void phy_stop(struct phy_device *phydev)
 {
 	struct net_device *dev = phydev->attached_dev;
+	enum phy_state_work state_work;
 	enum phy_state old_state;
 
 	if (!phy_is_started(phydev) && phydev->state != PHY_DOWN &&
@@ -1490,9 +1491,10 @@ void phy_stop(struct phy_device *phydev)
 	phydev->state = PHY_HALTED;
 	phy_process_state_change(phydev, old_state);
 
+	state_work = _phy_state_machine(phydev);
 	mutex_unlock(&phydev->lock);
 
-	phy_state_machine(&phydev->state_queue.work);
+	_phy_state_machine_post_work(phydev, state_work);
 	phy_stop_machine(phydev);
 
 	/* Cannot call flush_scheduled_work() here as desired because
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock
  2023-09-14 15:35 ` [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock Russell King (Oracle)
@ 2023-09-14 18:21   ` Florian Fainelli
       [not found]   ` <CGME20230918123304eucas1p2b628f00ed8df536372f1f2b445706021@eucas1p2.samsung.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Florian Fainelli @ 2023-09-14 18:21 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

On 9/14/23 08:35, Russell King (Oracle) wrote:
> phy_stop() calls phy_process_state_change() while holding the phydev
> lock, so also arrange for phy_state_machine() to do the same, so that
> this function is called with consistent locking.
> 
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 2/7] net: phy: call phy_error_precise() while holding the lock
  2023-09-14 15:35 ` [PATCH net-next 2/7] net: phy: call phy_error_precise() while holding the lock Russell King (Oracle)
@ 2023-09-14 18:21   ` Florian Fainelli
  0 siblings, 0 replies; 22+ messages in thread
From: Florian Fainelli @ 2023-09-14 18:21 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

On 9/14/23 08:35, Russell King (Oracle) wrote:
> Move the locking out of phy_error_precise() and to its only call site,
> merging with the locked region that has already been taken.
> 
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 3/7] net: phy: move call to start aneg
  2023-09-14 15:35 ` [PATCH net-next 3/7] net: phy: move call to start aneg Russell King (Oracle)
@ 2023-09-14 18:22   ` Florian Fainelli
  0 siblings, 0 replies; 22+ messages in thread
From: Florian Fainelli @ 2023-09-14 18:22 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

On 9/14/23 08:35, Russell King (Oracle) wrote:
> Move the call to start auto-negotiation inside the lock in the PHYLIB
> state machine, calling the locked variant _phy_start_aneg(). This
> avoids unnecessarily releasing and re-acquiring the lock.
> 
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 4/7] net: phy: move phy_suspend() to end of phy_state_machine()
  2023-09-14 15:35 ` [PATCH net-next 4/7] net: phy: move phy_suspend() to end of phy_state_machine() Russell King (Oracle)
@ 2023-09-14 18:22   ` Florian Fainelli
  0 siblings, 0 replies; 22+ messages in thread
From: Florian Fainelli @ 2023-09-14 18:22 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

On 9/14/23 08:35, Russell King (Oracle) wrote:
> Move the call to phy_suspend() to the end of phy_state_machine() after
> we release the lock so that we can combine the locked areas.
> phy_suspend() can not be called while holding phydev->lock as it has
> caused deadlocks in the past.
> 
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 5/7] net: phy: move phy_state_machine()
  2023-09-14 15:35 ` [PATCH net-next 5/7] net: phy: move phy_state_machine() Russell King (Oracle)
@ 2023-09-14 18:22   ` Florian Fainelli
  0 siblings, 0 replies; 22+ messages in thread
From: Florian Fainelli @ 2023-09-14 18:22 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

On 9/14/23 08:35, Russell King (Oracle) wrote:
> Move phy_state_machine() before phy_stop() to avoid subsequent patches
> introducing forward references.
> 
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 6/7] net: phy: split locked and unlocked section of phy_state_machine()
  2023-09-14 15:35 ` [PATCH net-next 6/7] net: phy: split locked and unlocked section of phy_state_machine() Russell King (Oracle)
@ 2023-09-14 18:39   ` Florian Fainelli
  0 siblings, 0 replies; 22+ messages in thread
From: Florian Fainelli @ 2023-09-14 18:39 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

On 9/14/23 08:35, Russell King (Oracle) wrote:
> Split out the locked and unlocked sections of phy_state_machine() into
> two separate functions which can be called inside the phydev lock and
> outside the phydev lock as appropriate, thus allowing us to combine
> the locked regions in the caller of phy_state_machine() with the
> locked region inside phy_state_machine().
> 
> This avoids unnecessarily dropping the phydev lock which may allow
> races to occur.
> 
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 7/7] net: phy: convert phy_stop() to use split state machine
  2023-09-14 15:36 ` [PATCH net-next 7/7] net: phy: convert phy_stop() to use split state machine Russell King (Oracle)
@ 2023-09-14 18:39   ` Florian Fainelli
  0 siblings, 0 replies; 22+ messages in thread
From: Florian Fainelli @ 2023-09-14 18:39 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

On 9/14/23 08:36, Russell King (Oracle) wrote:
> Convert phy_stop() to use the new locked-section and unlocked-section
> parts of the PHY state machine.
> 
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
-- 
Florian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY
  2023-09-14 15:34 [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY Russell King (Oracle)
                   ` (6 preceding siblings ...)
  2023-09-14 15:36 ` [PATCH net-next 7/7] net: phy: convert phy_stop() to use split state machine Russell King (Oracle)
@ 2023-09-17 13:40 ` patchwork-bot+netdevbpf
  7 siblings, 0 replies; 22+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-09-17 13:40 UTC (permalink / raw)
  To: Russell King
  Cc: andrew, hkallweit1, chenhao418, davem, edumazet, kuba, shaojijie,
	lanhao, liuyonglong, netdev, pabeni, shenjian15, wangjie125,
	wangpeiyang1

Hello:

This series was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:

On Thu, 14 Sep 2023 16:34:17 +0100 you wrote:
> This series addresses a problem reported by Jijie Shao where the PHY
> state machine can race with phy_stop() leading to an incorrect state.
> 
> The issue centres around phy_state_machine() dropping the phydev->lock
> mutex briefly, which allows phy_stop() to get in half-way through the
> state machine, and when the state machine resumes, it overwrites
> phydev->state with a value incompatible with a stopped PHY. This causes
> a subsequent phy_start() to issue a warning.
> 
> [...]

Here is the summary with links:
  - [net-next,1/7] net: phy: always call phy_process_state_change() under lock
    https://git.kernel.org/netdev/net-next/c/8da77df649c4
  - [net-next,2/7] net: phy: call phy_error_precise() while holding the lock
    https://git.kernel.org/netdev/net-next/c/ef113a60d0a9
  - [net-next,3/7] net: phy: move call to start aneg
    https://git.kernel.org/netdev/net-next/c/ea5968cd7d6e
  - [net-next,4/7] net: phy: move phy_suspend() to end of phy_state_machine()
    https://git.kernel.org/netdev/net-next/c/6e19b3502c59
  - [net-next,5/7] net: phy: move phy_state_machine()
    https://git.kernel.org/netdev/net-next/c/c398ef41b6d4
  - [net-next,6/7] net: phy: split locked and unlocked section of phy_state_machine()
    https://git.kernel.org/netdev/net-next/c/8635c0663e6b
  - [net-next,7/7] net: phy: convert phy_stop() to use split state machine
    https://git.kernel.org/netdev/net-next/c/adcbb85508c8

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock
       [not found]   ` <CGME20230918123304eucas1p2b628f00ed8df536372f1f2b445706021@eucas1p2.samsung.com>
@ 2023-09-18 12:33     ` Marek Szyprowski
  2023-09-18 12:55       ` Andrew Lunn
  2023-09-18 13:06       ` Russell King (Oracle)
  0 siblings, 2 replies; 22+ messages in thread
From: Marek Szyprowski @ 2023-09-18 12:33 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit
  Cc: chenhao418, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jijie Shao, lanhao, liuyonglong, netdev, Paolo Abeni, shenjian15,
	wangjie125, wangpeiyang1

Hi Russell,

On 14.09.2023 17:35, Russell King (Oracle) wrote:
> phy_stop() calls phy_process_state_change() while holding the phydev
> lock, so also arrange for phy_state_machine() to do the same, so that
> this function is called with consistent locking.
>
> Tested-by: Jijie Shao <shaojijie@huawei.com>
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

This change, merged to linux-next as commit 8da77df649c4 ("net: phy: 
always call phy_process_state_change() under lock") introduces the 
following deadlock with ASIX AX8817X USB driver:

--->8---

asix 1-1.4:1.0 (unnamed net_device) (uninitialized): PHY 
[usb-001:003:10] driver [Asix Electronics AX88772A] (irq=POLL)
Asix Electronics AX88772A usb-001:003:10: attached PHY driver 
(mii_bus:phy_addr=usb-001:003:10, irq=POLL)
asix 1-1.4:1.0 eth0: register 'asix' at usb-12110000.usb-1.4, ASIX 
AX88772 USB 2.0 Ethernet, a2:99:b6:cd:11:eb

asix 1-1.4:1.0 eth0: configuring for phy/internal link mode

============================================
WARNING: possible recursive locking detected
6.6.0-rc1-00239-g8da77df649c4-dirty #13949 Not tainted
--------------------------------------------
kworker/3:3/71 is trying to acquire lock:
c6c704cc (&dev->lock){+.+.}-{3:3}, at: phy_start_aneg+0x1c/0x38

but task is already holding lock:
c6c704cc (&dev->lock){+.+.}-{3:3}, at: phy_state_machine+0x100/0x2b8

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&dev->lock);
   lock(&dev->lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

3 locks held by kworker/3:3/71:
  #0: c1c090a8 ((wq_completion)events_power_efficient){+.+.}-{0:0}, at: 
process_one_work+0x148/0x608
  #1: f0bddf20 
((work_completion)(&(&dev->state_queue)->work)){+.+.}-{0:0}, at: 
process_one_work+0x148/0x608
  #2: c6c704cc (&dev->lock){+.+.}-{3:3}, at: phy_state_machine+0x100/0x2b8

stack backtrace:
CPU: 3 PID: 71 Comm: kworker/3:3 Not tainted 
6.6.0-rc1-00239-g8da77df649c4-dirty #13949
Hardware name: Samsung Exynos (Flattened Device Tree)
Workqueue: events_power_efficient phy_state_machine
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x58/0x70
  dump_stack_lvl from __lock_acquire+0x1300/0x2984
  __lock_acquire from lock_acquire+0x130/0x37c
  lock_acquire from __mutex_lock+0x94/0x94c
  __mutex_lock from mutex_lock_nested+0x1c/0x24
  mutex_lock_nested from phy_start_aneg+0x1c/0x38
  phy_start_aneg from phy_state_machine+0x10c/0x2b8
  phy_state_machine from process_one_work+0x204/0x608
  process_one_work from worker_thread+0x1e0/0x498
  worker_thread from kthread+0x104/0x138
  kthread from ret_from_fork+0x14/0x28
Exception stack(0xf0bddfb0 to 0xf0bddff8)
...

-------

This probably need to be fixed somewhere in drivers/net/usb/asix* but at 
the first glance I don't see any obvious place that need a fix.

> ---
>   drivers/net/phy/phy.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index df54c137c5f5..1e5218935eb3 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -1506,6 +1506,7 @@ void phy_state_machine(struct work_struct *work)
>   	if (err < 0)
>   		phy_error_precise(phydev, func, err);
>   
> +	mutex_lock(&phydev->lock);
>   	phy_process_state_change(phydev, old_state);
>   
>   	/* Only re-schedule a PHY state machine change if we are polling the
> @@ -1516,7 +1517,6 @@ void phy_state_machine(struct work_struct *work)
>   	 * state machine would be pointless and possibly error prone when
>   	 * called from phy_disconnect() synchronously.
>   	 */
> -	mutex_lock(&phydev->lock);
>   	if (phy_polling_mode(phydev) && phy_is_started(phydev))
>   		phy_queue_state_machine(phydev, PHY_STATE_TIME);
>   	mutex_unlock(&phydev->lock);

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock
  2023-09-18 12:33     ` Marek Szyprowski
@ 2023-09-18 12:55       ` Andrew Lunn
  2023-09-18 13:05         ` Marek Szyprowski
  2023-09-18 13:07         ` Russell King (Oracle)
  2023-09-18 13:06       ` Russell King (Oracle)
  1 sibling, 2 replies; 22+ messages in thread
From: Andrew Lunn @ 2023-09-18 12:55 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Russell King (Oracle),
	Heiner Kallweit, chenhao418, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Jijie Shao, lanhao, liuyonglong, netdev,
	Paolo Abeni, shenjian15, wangjie125, wangpeiyang1

> This probably need to be fixed somewhere in drivers/net/usb/asix* but at 
> the first glance I don't see any obvious place that need a fix.

static int __asix_mdio_read(struct net_device *netdev, int phy_id, int loc,
                            bool in_pm)
{
        struct usbnet *dev = netdev_priv(netdev);
        __le16 res;
        int ret;

        mutex_lock(&dev->phy_mutex);

Taking this lock here is the problem. Same for write.

There is some funky stuff going on in asix_devices.c. It using both
phylib and the much older mii code.

       Andrew

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock
  2023-09-18 12:55       ` Andrew Lunn
@ 2023-09-18 13:05         ` Marek Szyprowski
  2023-09-18 13:07         ` Russell King (Oracle)
  1 sibling, 0 replies; 22+ messages in thread
From: Marek Szyprowski @ 2023-09-18 13:05 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King (Oracle),
	Heiner Kallweit, chenhao418, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Jijie Shao, lanhao, liuyonglong, netdev,
	Paolo Abeni, shenjian15, wangjie125, wangpeiyang1

On 18.09.2023 14:55, Andrew Lunn wrote:
>> This probably need to be fixed somewhere in drivers/net/usb/asix* but at
>> the first glance I don't see any obvious place that need a fix.
> static int __asix_mdio_read(struct net_device *netdev, int phy_id, int loc,
>                              bool in_pm)
> {
>          struct usbnet *dev = netdev_priv(netdev);
>          __le16 res;
>          int ret;
>
>          mutex_lock(&dev->phy_mutex);
>
> Taking this lock here is the problem. Same for write.
>
> There is some funky stuff going on in asix_devices.c. It using both
> phylib and the much older mii code.

This must be something different. Removing those calls to phy_mutex from 
__asix_mdio_read/write doesn't fix this deadlock (I intentionally 
ignored the fact that some kind of synchronization is probably required 
there).

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock
  2023-09-18 12:33     ` Marek Szyprowski
  2023-09-18 12:55       ` Andrew Lunn
@ 2023-09-18 13:06       ` Russell King (Oracle)
  2023-09-18 13:15         ` Marek Szyprowski
  1 sibling, 1 reply; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-18 13:06 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Andrew Lunn, Heiner Kallweit, chenhao418, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Jijie Shao, lanhao, liuyonglong,
	netdev, Paolo Abeni, shenjian15, wangjie125, wangpeiyang1

On Mon, Sep 18, 2023 at 02:33:04PM +0200, Marek Szyprowski wrote:
> Hi Russell,
> 
> On 14.09.2023 17:35, Russell King (Oracle) wrote:
> > phy_stop() calls phy_process_state_change() while holding the phydev
> > lock, so also arrange for phy_state_machine() to do the same, so that
> > this function is called with consistent locking.
> >
> > Tested-by: Jijie Shao <shaojijie@huawei.com>
> > Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> 
> This change, merged to linux-next as commit 8da77df649c4 ("net: phy: 
> always call phy_process_state_change() under lock") introduces the 
> following deadlock with ASIX AX8817X USB driver:

Yay, latent bug found...

I guess this is asix_ax88772a_link_change_notify() which is causing
the problem, and yes, that phy_start_aneg() needs to be the unlocked
version (which we'll have to export.)

This should fix it.

diff --git a/drivers/net/phy/ax88796b.c b/drivers/net/phy/ax88796b.c
index 0f1e617a26c9..eb74a8cf8df1 100644
--- a/drivers/net/phy/ax88796b.c
+++ b/drivers/net/phy/ax88796b.c
@@ -90,7 +90,7 @@ static void asix_ax88772a_link_change_notify(struct phy_device *phydev)
 	 */
 	if (phydev->state == PHY_NOLINK) {
 		phy_init_hw(phydev);
-		phy_start_aneg(phydev);
+		_phy_start_aneg(phydev);
 	}
 }
 
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 93a8676dd8d8..a5fa077650e8 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -981,7 +981,7 @@ static int phy_check_link_status(struct phy_device *phydev)
  *   If the PHYCONTROL Layer is operating, we change the state to
  *   reflect the beginning of Auto-negotiation or forcing.
  */
-static int _phy_start_aneg(struct phy_device *phydev)
+int _phy_start_aneg(struct phy_device *phydev)
 {
 	int err;
 
@@ -1002,6 +1002,7 @@ static int _phy_start_aneg(struct phy_device *phydev)
 
 	return err;
 }
+EXPORT_SYMBOL(_phy_start_aneg);
 
 /**
  * phy_start_aneg - start auto-negotiation for this PHY device
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 1351b802ffcf..3cc52826f18e 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1736,6 +1736,7 @@ void phy_detach(struct phy_device *phydev);
 void phy_start(struct phy_device *phydev);
 void phy_stop(struct phy_device *phydev);
 int phy_config_aneg(struct phy_device *phydev);
+int _phy_start_aneg(struct phy_device *phydev);
 int phy_start_aneg(struct phy_device *phydev);
 int phy_aneg_done(struct phy_device *phydev);
 int phy_speed_down(struct phy_device *phydev, bool sync);
-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock
  2023-09-18 12:55       ` Andrew Lunn
  2023-09-18 13:05         ` Marek Szyprowski
@ 2023-09-18 13:07         ` Russell King (Oracle)
  1 sibling, 0 replies; 22+ messages in thread
From: Russell King (Oracle) @ 2023-09-18 13:07 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Marek Szyprowski, Heiner Kallweit, chenhao418, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Jijie Shao, lanhao, liuyonglong,
	netdev, Paolo Abeni, shenjian15, wangjie125, wangpeiyang1

On Mon, Sep 18, 2023 at 02:55:32PM +0200, Andrew Lunn wrote:
> > This probably need to be fixed somewhere in drivers/net/usb/asix* but at 
> > the first glance I don't see any obvious place that need a fix.
> 
> static int __asix_mdio_read(struct net_device *netdev, int phy_id, int loc,
>                             bool in_pm)
> {
>         struct usbnet *dev = netdev_priv(netdev);
>         __le16 res;
>         int ret;
> 
>         mutex_lock(&dev->phy_mutex);
> 
> Taking this lock here is the problem. Same for write.
> 
> There is some funky stuff going on in asix_devices.c. It using both
> phylib and the much older mii code.

I don't think that's the problem...

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock
  2023-09-18 13:06       ` Russell King (Oracle)
@ 2023-09-18 13:15         ` Marek Szyprowski
  0 siblings, 0 replies; 22+ messages in thread
From: Marek Szyprowski @ 2023-09-18 13:15 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andrew Lunn, Heiner Kallweit, chenhao418, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Jijie Shao, lanhao, liuyonglong,
	netdev, Paolo Abeni, shenjian15, wangjie125, wangpeiyang1

On 18.09.2023 15:06, Russell King (Oracle) wrote:
> On Mon, Sep 18, 2023 at 02:33:04PM +0200, Marek Szyprowski wrote:
>> Hi Russell,
>>
>> On 14.09.2023 17:35, Russell King (Oracle) wrote:
>>> phy_stop() calls phy_process_state_change() while holding the phydev
>>> lock, so also arrange for phy_state_machine() to do the same, so that
>>> this function is called with consistent locking.
>>>
>>> Tested-by: Jijie Shao <shaojijie@huawei.com>
>>> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
>> This change, merged to linux-next as commit 8da77df649c4 ("net: phy:
>> always call phy_process_state_change() under lock") introduces the
>> following deadlock with ASIX AX8817X USB driver:
> Yay, latent bug found...
>
> I guess this is asix_ax88772a_link_change_notify() which is causing
> the problem, and yes, that phy_start_aneg() needs to be the unlocked
> version (which we'll have to export.)
>
> This should fix it.

Thanks!

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>

Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>


> diff --git a/drivers/net/phy/ax88796b.c b/drivers/net/phy/ax88796b.c
> index 0f1e617a26c9..eb74a8cf8df1 100644
> --- a/drivers/net/phy/ax88796b.c
> +++ b/drivers/net/phy/ax88796b.c
> @@ -90,7 +90,7 @@ static void asix_ax88772a_link_change_notify(struct phy_device *phydev)
>   	 */
>   	if (phydev->state == PHY_NOLINK) {
>   		phy_init_hw(phydev);
> -		phy_start_aneg(phydev);
> +		_phy_start_aneg(phydev);
>   	}
>   }
>   
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 93a8676dd8d8..a5fa077650e8 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -981,7 +981,7 @@ static int phy_check_link_status(struct phy_device *phydev)
>    *   If the PHYCONTROL Layer is operating, we change the state to
>    *   reflect the beginning of Auto-negotiation or forcing.
>    */
> -static int _phy_start_aneg(struct phy_device *phydev)
> +int _phy_start_aneg(struct phy_device *phydev)
>   {
>   	int err;
>   
> @@ -1002,6 +1002,7 @@ static int _phy_start_aneg(struct phy_device *phydev)
>   
>   	return err;
>   }
> +EXPORT_SYMBOL(_phy_start_aneg);
>   
>   /**
>    * phy_start_aneg - start auto-negotiation for this PHY device
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 1351b802ffcf..3cc52826f18e 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -1736,6 +1736,7 @@ void phy_detach(struct phy_device *phydev);
>   void phy_start(struct phy_device *phydev);
>   void phy_stop(struct phy_device *phydev);
>   int phy_config_aneg(struct phy_device *phydev);
> +int _phy_start_aneg(struct phy_device *phydev);
>   int phy_start_aneg(struct phy_device *phydev);
>   int phy_aneg_done(struct phy_device *phydev);
>   int phy_speed_down(struct phy_device *phydev, bool sync);

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-09-18 16:07 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-14 15:34 [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY Russell King (Oracle)
2023-09-14 15:35 ` [PATCH net-next 1/7] net: phy: always call phy_process_state_change() under lock Russell King (Oracle)
2023-09-14 18:21   ` Florian Fainelli
     [not found]   ` <CGME20230918123304eucas1p2b628f00ed8df536372f1f2b445706021@eucas1p2.samsung.com>
2023-09-18 12:33     ` Marek Szyprowski
2023-09-18 12:55       ` Andrew Lunn
2023-09-18 13:05         ` Marek Szyprowski
2023-09-18 13:07         ` Russell King (Oracle)
2023-09-18 13:06       ` Russell King (Oracle)
2023-09-18 13:15         ` Marek Szyprowski
2023-09-14 15:35 ` [PATCH net-next 2/7] net: phy: call phy_error_precise() while holding the lock Russell King (Oracle)
2023-09-14 18:21   ` Florian Fainelli
2023-09-14 15:35 ` [PATCH net-next 3/7] net: phy: move call to start aneg Russell King (Oracle)
2023-09-14 18:22   ` Florian Fainelli
2023-09-14 15:35 ` [PATCH net-next 4/7] net: phy: move phy_suspend() to end of phy_state_machine() Russell King (Oracle)
2023-09-14 18:22   ` Florian Fainelli
2023-09-14 15:35 ` [PATCH net-next 5/7] net: phy: move phy_state_machine() Russell King (Oracle)
2023-09-14 18:22   ` Florian Fainelli
2023-09-14 15:35 ` [PATCH net-next 6/7] net: phy: split locked and unlocked section of phy_state_machine() Russell King (Oracle)
2023-09-14 18:39   ` Florian Fainelli
2023-09-14 15:36 ` [PATCH net-next 7/7] net: phy: convert phy_stop() to use split state machine Russell King (Oracle)
2023-09-14 18:39   ` Florian Fainelli
2023-09-17 13:40 ` [PATCH net-next 0/7] net: phy: avoid race when erroring stopping PHY patchwork-bot+netdevbpf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.