linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Remove private locks to avoid possible deadlock
@ 2016-12-07 20:05 Lino Sanfilippo
  2016-12-07 20:05 ` [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock Lino Sanfilippo
  2016-12-07 20:05 ` [PATCH 2/2] net: ethernet: stmmac: " Lino Sanfilippo
  0 siblings, 2 replies; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-07 20:05 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue
  Cc: pavel, davem, linux-kernel, netdev

Hi,

these patches fix possible deadlock situations in the sxgbe and stmmac
driver. Please note that the patches are only compile tested so it would
be great if someone could do tests with the concerning HW.

Regards,
Lino

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-07 20:05 Remove private locks to avoid possible deadlock Lino Sanfilippo
@ 2016-12-07 20:05 ` Lino Sanfilippo
  2016-12-07 23:15   ` Francois Romieu
  2016-12-07 20:05 ` [PATCH 2/2] net: ethernet: stmmac: " Lino Sanfilippo
  1 sibling, 1 reply; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-07 20:05 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue
  Cc: pavel, davem, linux-kernel, netdev, Lino Sanfilippo

The driver uses a private lock for synchronization between the xmit
function and the xmit completion handler, but since the NETIF_F_LLTX flag
is not set, the xmit function is also called with the xmit_lock held.

On the other hand the xmit completion handler first takes the private lock
and (in case that the tx queue has been stopped) the xmit_lock, leading
to a reverse locking order and the potential danger of a deadlock.

Fix this by removing the private lock completely and synchronizing the xmit
function and completion handler solely by means of the xmit_lock. By doing
this also remove the now unnecessary double check for a stopped tx queue.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
---
 drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h |  1 -
 drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c   | 27 +++++------------------
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h b/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h
index 5cb51b6..c61f260 100644
--- a/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h
+++ b/drivers/net/ethernet/samsung/sxgbe/sxgbe_common.h
@@ -384,7 +384,6 @@ struct sxgbe_tx_queue {
 	dma_addr_t *tx_skbuff_dma;
 	struct sk_buff **tx_skbuff;
 	struct timer_list txtimer;
-	spinlock_t tx_lock;	/* lock for tx queues */
 	unsigned int cur_tx;
 	unsigned int dirty_tx;
 	u32 tx_count_frames;
diff --git a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
index ea44a24..22d3b0b 100644
--- a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
+++ b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
@@ -426,9 +426,6 @@ static int init_tx_ring(struct device *dev, u8 queue_no,
 	tx_ring->dirty_tx = 0;
 	tx_ring->cur_tx = 0;
 
-	/* initialise TX queue lock */
-	spin_lock_init(&tx_ring->tx_lock);
-
 	return 0;
 
 dmamem_err:
@@ -743,7 +740,7 @@ static void sxgbe_tx_queue_clean(struct sxgbe_tx_queue *tqueue)
 
 	dev_txq = netdev_get_tx_queue(priv->dev, queue_no);
 
-	spin_lock(&tqueue->tx_lock);
+	__netif_tx_lock(dev_txq, smp_processor_id());
 
 	priv->xstats.tx_clean++;
 	while (tqueue->dirty_tx != tqueue->cur_tx) {
@@ -781,18 +778,13 @@ static void sxgbe_tx_queue_clean(struct sxgbe_tx_queue *tqueue)
 
 	/* wake up queue */
 	if (unlikely(netif_tx_queue_stopped(dev_txq) &&
-		     sxgbe_tx_avail(tqueue, tx_rsize) > SXGBE_TX_THRESH(priv))) {
-		netif_tx_lock(priv->dev);
-		if (netif_tx_queue_stopped(dev_txq) &&
-		    sxgbe_tx_avail(tqueue, tx_rsize) > SXGBE_TX_THRESH(priv)) {
-			if (netif_msg_tx_done(priv))
-				pr_debug("%s: restart transmit\n", __func__);
-			netif_tx_wake_queue(dev_txq);
-		}
-		netif_tx_unlock(priv->dev);
+	    sxgbe_tx_avail(tqueue, tx_rsize) > SXGBE_TX_THRESH(priv))) {
+		if (netif_msg_tx_done(priv))
+			pr_debug("%s: restart transmit\n", __func__);
+		netif_tx_wake_queue(dev_txq);
 	}
 
-	spin_unlock(&tqueue->tx_lock);
+	__netif_tx_unlock(dev_txq);
 }
 
 /**
@@ -1304,9 +1296,6 @@ static netdev_tx_t sxgbe_xmit(struct sk_buff *skb, struct net_device *dev)
 		      tqueue->hwts_tx_en)))
 		ctxt_desc_req = 1;
 
-	/* get the spinlock */
-	spin_lock(&tqueue->tx_lock);
-
 	if (priv->tx_path_in_lpi_mode)
 		sxgbe_disable_eee_mode(priv);
 
@@ -1316,8 +1305,6 @@ static netdev_tx_t sxgbe_xmit(struct sk_buff *skb, struct net_device *dev)
 			netdev_err(dev, "%s: Tx Ring is full when %d queue is awake\n",
 				   __func__, txq_index);
 		}
-		/* release the spin lock in case of BUSY */
-		spin_unlock(&tqueue->tx_lock);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -1436,8 +1423,6 @@ static netdev_tx_t sxgbe_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	priv->hw->dma->enable_dma_transmission(priv->ioaddr, txq_index);
 
-	spin_unlock(&tqueue->tx_lock);
-
 	return NETDEV_TX_OK;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-07 20:05 Remove private locks to avoid possible deadlock Lino Sanfilippo
  2016-12-07 20:05 ` [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock Lino Sanfilippo
@ 2016-12-07 20:05 ` Lino Sanfilippo
  2016-12-07 20:55   ` Pavel Machek
                     ` (2 more replies)
  1 sibling, 3 replies; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-07 20:05 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue
  Cc: pavel, davem, linux-kernel, netdev, Lino Sanfilippo

The driver uses a private lock for synchronization between the xmit
function and the xmit completion handler, but since the NETIF_F_LLTX flag
is not set, the xmit function is also called with the xmit_lock held.

On the other hand the xmit completion handler first takes the private lock
and (in case that the tx queue has been stopped) the xmit_lock, leading to
a reverse locking order and the potential danger of a deadlock.

Fix this by removing the private lock completely and synchronizing the xmit
function and completion handler solely by means of the xmit_lock. By doing
this remove also the now unnecessary double check for a stopped tx queue.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |  1 -
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 28 +++++------------------
 2 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 4d2a759..7e69b11 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -64,7 +64,6 @@ struct stmmac_priv {
 	dma_addr_t dma_tx_phy;
 	int tx_coalesce;
 	int hwts_tx_en;
-	spinlock_t tx_lock;
 	bool tx_path_in_lpi_mode;
 	struct timer_list txtimer;
 	bool tso;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index caf069a..db46ec4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1307,7 +1307,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
 	unsigned int bytes_compl = 0, pkts_compl = 0;
 	unsigned int entry = priv->dirty_tx;
 
-	spin_lock(&priv->tx_lock);
+	netif_tx_lock(priv->dev);
 
 	priv->xstats.tx_clean++;
 
@@ -1378,22 +1378,17 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
 	netdev_completed_queue(priv->dev, pkts_compl, bytes_compl);
 
 	if (unlikely(netif_queue_stopped(priv->dev) &&
-		     stmmac_tx_avail(priv) > STMMAC_TX_THRESH)) {
-		netif_tx_lock(priv->dev);
-		if (netif_queue_stopped(priv->dev) &&
-		    stmmac_tx_avail(priv) > STMMAC_TX_THRESH) {
-			if (netif_msg_tx_done(priv))
-				pr_debug("%s: restart transmit\n", __func__);
-			netif_wake_queue(priv->dev);
-		}
-		netif_tx_unlock(priv->dev);
+	    stmmac_tx_avail(priv) > STMMAC_TX_THRESH)) {
+		if (netif_msg_tx_done(priv))
+			pr_debug("%s: restart transmit\n", __func__);
+		netif_wake_queue(priv->dev);
 	}
 
 	if ((priv->eee_enabled) && (!priv->tx_path_in_lpi_mode)) {
 		stmmac_enable_eee_mode(priv);
 		mod_timer(&priv->eee_ctrl_timer, STMMAC_LPI_T(eee_timer));
 	}
-	spin_unlock(&priv->tx_lock);
+	netif_tx_unlock(priv->dev);
 }
 
 static inline void stmmac_enable_dma_irq(struct stmmac_priv *priv)
@@ -1998,8 +1993,6 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	u8 proto_hdr_len;
 	int i;
 
-	spin_lock(&priv->tx_lock);
-
 	/* Compute header lengths */
 	proto_hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
 
@@ -2011,7 +2004,6 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 			/* This is a hard error, log it. */
 			pr_err("%s: Tx Ring full when queue awake\n", __func__);
 		}
-		spin_unlock(&priv->tx_lock);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -2146,11 +2138,9 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 				       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -2182,10 +2172,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 			return stmmac_tso_xmit(skb, dev);
 	}
 
-	spin_lock(&priv->tx_lock);
-
 	if (unlikely(stmmac_tx_avail(priv) < nfrags + 1)) {
-		spin_unlock(&priv->tx_lock);
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
 			/* This is a hard error, log it. */
@@ -2357,11 +2344,9 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 		priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 					       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -3347,7 +3332,6 @@ int stmmac_dvr_probe(struct device *device,
 	netif_napi_add(ndev, &priv->napi, stmmac_poll, 64);
 
 	spin_lock_init(&priv->lock);
-	spin_lock_init(&priv->tx_lock);
 
 	ret = register_netdev(ndev);
 	if (ret) {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-07 20:05 ` [PATCH 2/2] net: ethernet: stmmac: " Lino Sanfilippo
@ 2016-12-07 20:55   ` Pavel Machek
  2016-12-07 20:59   ` Pavel Machek
  2016-12-07 21:37   ` Pavel Machek
  2 siblings, 0 replies; 36+ messages in thread
From: Pavel Machek @ 2016-12-07 20:55 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1719 bytes --]

Hi!

> The driver uses a private lock for synchronization between the xmit
> function and the xmit completion handler, but since the NETIF_F_LLTX flag
> is not set, the xmit function is also called with the xmit_lock held.
> 
> On the other hand the xmit completion handler first takes the private lock
> and (in case that the tx queue has been stopped) the xmit_lock, leading to
> a reverse locking order and the potential danger of a deadlock.
> 
> Fix this by removing the private lock completely and synchronizing the xmit
> function and completion handler solely by means of the xmit_lock. By doing
> this remove also the now unnecessary double check for a stopped tx queue.
> 
> Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>

Does not seem to apply to net-next based on
adc176c5472214971d77c1a61c83db9b01e9cdc7. Aha, that's the printk()
changes, probably would apply to mainline.

> index caf069a..db46ec4 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1307,7 +1307,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
>  	unsigned int bytes_compl = 0, pkts_compl = 0;
>  	unsigned int entry = priv->dirty_tx;
>  
> -	spin_lock(&priv->tx_lock);
> +	netif_tx_lock(priv->dev);
>  
>  	priv->xstats.tx_clean++;
>

Should it use "netif_tx_lock_bh"?

I could not reproduce the deadlock without this patch, nor can I
detect anything wrong with this patch, so I guess that is:

Tested-by: Pavel Machek <pavel@denx.de>

Thanks,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-07 20:05 ` [PATCH 2/2] net: ethernet: stmmac: " Lino Sanfilippo
  2016-12-07 20:55   ` Pavel Machek
@ 2016-12-07 20:59   ` Pavel Machek
  2016-12-07 21:37   ` Pavel Machek
  2 siblings, 0 replies; 36+ messages in thread
From: Pavel Machek @ 2016-12-07 20:59 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1881 bytes --]

Hi!

> The driver uses a private lock for synchronization between the xmit
> function and the xmit completion handler, but since the NETIF_F_LLTX flag
> is not set, the xmit function is also called with the xmit_lock held.
> 
> On the other hand the xmit completion handler first takes the private lock
> and (in case that the tx queue has been stopped) the xmit_lock, leading to
> a reverse locking order and the potential danger of a deadlock.
> 
> Fix this by removing the private lock completely and synchronizing the xmit
> function and completion handler solely by means of the xmit_lock. By doing
> this remove also the now unnecessary double check for a stopped tx queue.
> 
> Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>

Oops, sorry no, that broke the driver after a while:

(So please ignore my tested-by:)

root@wagabuibui:/data/tmp/udpt# ./udp-test raw 10.0.0.6 1234 1000 100
30
Sending 100 packets (1000b each) at an interval of 30ms, expected data
rate:3333333b/s (3373333b/s incl udp overhead)
[   30.948626] socfpga-dwmac ff702000.ethernet eth0: Link is Up -
100Mbps/Full - flow control rx/tx
[   31.076064] Link is Up - 100/Full
[   32.979526] random: crng init done
[  262.244030] ------------[ cut here ]------------
[  262.248669] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316
dev_watchdog+0x254/0x26c
[  262.256916] NETDEV WATCHDOG: eth0 (socfpga-dwmac): transmit queue 0
timed out
[  262.264028] Modules linked in:
[  262.267102] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.9.0-rc7-118095-g2d70d9b-dirty #339
[  262.275328] Hardware name: Altera SOCFPGA
[  262.279352] [<8010f758>] (unwind_backtrace) from [<8010affc>]
(show_stack+0x10/0x14)

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-07 20:05 ` [PATCH 2/2] net: ethernet: stmmac: " Lino Sanfilippo
  2016-12-07 20:55   ` Pavel Machek
  2016-12-07 20:59   ` Pavel Machek
@ 2016-12-07 21:37   ` Pavel Machek
  2016-12-07 21:43     ` Lino Sanfilippo
  2016-12-07 23:41     ` David Miller
  2 siblings, 2 replies; 36+ messages in thread
From: Pavel Machek @ 2016-12-07 21:37 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 5197 bytes --]

On Wed 2016-12-07 21:05:38, Lino Sanfilippo wrote:
> The driver uses a private lock for synchronization between the xmit
> function and the xmit completion handler, but since the NETIF_F_LLTX flag
> is not set, the xmit function is also called with the xmit_lock held.
> 
> On the other hand the xmit completion handler first takes the private lock
> and (in case that the tx queue has been stopped) the xmit_lock, leading to
> a reverse locking order and the potential danger of a deadlock.
> 
> Fix this by removing the private lock completely and synchronizing the xmit
> function and completion handler solely by means of the xmit_lock. By doing
> this remove also the now unnecessary double check for a stopped tx queue.
> 

FYI, here's modified version. I believe _bh versions are needed, and
I'm testing that version now. (Oh and I also ported it to net-next).

It survived 30 minutes of testing so far...

Best regards,
									Pavel

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Pavel Machek <pavel@denx.de>

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index dbacb80..eab04ae 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -64,7 +64,6 @@ struct stmmac_priv {
 	dma_addr_t dma_tx_phy;
 	int tx_coalesce;
 	int hwts_tx_en;
-	spinlock_t tx_lock;
 	bool tx_path_in_lpi_mode;
 	struct timer_list txtimer;
 	bool tso;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 982c952..7415bc2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1308,7 +1308,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
 	unsigned int bytes_compl = 0, pkts_compl = 0;
 	unsigned int entry = priv->dirty_tx;
 
-	spin_lock(&priv->tx_lock);
+	netif_tx_lock_bh(priv->dev);
 
 	priv->xstats.tx_clean++;
 
@@ -1378,23 +1378,18 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
 
 	netdev_completed_queue(priv->dev, pkts_compl, bytes_compl);
 
-	if (unlikely(netif_queue_stopped(priv->dev) &&
-		     stmmac_tx_avail(priv) > STMMAC_TX_THRESH)) {
-		netif_tx_lock(priv->dev);
-		if (netif_queue_stopped(priv->dev) &&
-		    stmmac_tx_avail(priv) > STMMAC_TX_THRESH) {
-			netif_dbg(priv, tx_done, priv->dev,
-				  "%s: restart transmit\n", __func__);
-			netif_wake_queue(priv->dev);
-		}
-		netif_tx_unlock(priv->dev);
+	if (netif_queue_stopped(priv->dev) &&
+	    stmmac_tx_avail(priv) > STMMAC_TX_THRESH) {
+		netif_dbg(priv, tx_done, priv->dev,
+			  "%s: restart transmit\n", __func__);
+		netif_wake_queue(priv->dev);
 	}
 
 	if ((priv->eee_enabled) && (!priv->tx_path_in_lpi_mode)) {
 		stmmac_enable_eee_mode(priv);
 		mod_timer(&priv->eee_ctrl_timer, STMMAC_LPI_T(eee_timer));
 	}
-	spin_unlock(&priv->tx_lock);
+	netif_tx_unlock_bh(priv->dev);
 }
 
 static inline void stmmac_enable_dma_irq(struct stmmac_priv *priv)
@@ -2006,8 +2001,6 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	u8 proto_hdr_len;
 	int i;
 
-	spin_lock(&priv->tx_lock);
-
 	/* Compute header lengths */
 	proto_hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
 
@@ -2021,7 +2014,6 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 				   "%s: Tx Ring full when queue awake\n",
 				   __func__);
 		}
-		spin_unlock(&priv->tx_lock);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -2156,11 +2148,9 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 				       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
 	dev_err(priv->device, "Tx dma map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -2192,10 +2182,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 			return stmmac_tso_xmit(skb, dev);
 	}
 
-	spin_lock(&priv->tx_lock);
-
 	if (unlikely(stmmac_tx_avail(priv) < nfrags + 1)) {
-		spin_unlock(&priv->tx_lock);
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
 			/* This is a hard error, log it. */
@@ -2366,11 +2353,9 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 		priv->hw->dma->set_tx_tail_ptr(priv->ioaddr, priv->tx_tail_addr,
 					       STMMAC_CHAN0);
 
-	spin_unlock(&priv->tx_lock);
 	return NETDEV_TX_OK;
 
 dma_map_err:
-	spin_unlock(&priv->tx_lock);
 	netdev_err(priv->dev, "Tx DMA map failed\n");
 	dev_kfree_skb(skb);
 	priv->dev->stats.tx_dropped++;
@@ -3357,7 +3342,6 @@ int stmmac_dvr_probe(struct device *device,
 	netif_napi_add(ndev, &priv->napi, stmmac_poll, 64);
 
 	spin_lock_init(&priv->lock);
-	spin_lock_init(&priv->tx_lock);
 
 	ret = register_netdev(ndev);
 	if (ret) {




-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-07 21:37   ` Pavel Machek
@ 2016-12-07 21:43     ` Lino Sanfilippo
  2016-12-07 22:34       ` Lino Sanfilippo
  2016-12-07 23:41     ` David Miller
  1 sibling, 1 reply; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-07 21:43 UTC (permalink / raw)
  To: Pavel Machek
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

Hi Pavel,

On 07.12.2016 22:37, Pavel Machek wrote:
> On Wed 2016-12-07 21:05:38, Lino Sanfilippo wrote:
>> The driver uses a private lock for synchronization between the xmit
>> function and the xmit completion handler, but since the NETIF_F_LLTX flag
>> is not set, the xmit function is also called with the xmit_lock held.
>> 
>> On the other hand the xmit completion handler first takes the private lock
>> and (in case that the tx queue has been stopped) the xmit_lock, leading to
>> a reverse locking order and the potential danger of a deadlock.
>> 
>> Fix this by removing the private lock completely and synchronizing the xmit
>> function and completion handler solely by means of the xmit_lock. By doing
>> this remove also the now unnecessary double check for a stopped tx queue.
>> 
> 
> FYI, here's modified version. I believe _bh versions are needed, and
> I'm testing that version now. (Oh and I also ported it to net-next).
> 
> It survived 30 minutes of testing so far...
> 

First off, thanks for testing.
Hmm. I dont understand why _bh would be needed. We call that function from
BH context only (napi poll and timer).
Any idea?

Lino

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-07 21:43     ` Lino Sanfilippo
@ 2016-12-07 22:34       ` Lino Sanfilippo
  2016-12-07 23:21         ` Pavel Machek
  0 siblings, 1 reply; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-07 22:34 UTC (permalink / raw)
  To: Pavel Machek
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

On 07.12.2016 22:43, Lino Sanfilippo wrote:
> Hi Pavel,
> 
> On 07.12.2016 22:37, Pavel Machek wrote:
>> On Wed 2016-12-07 21:05:38, Lino Sanfilippo wrote:
>>> The driver uses a private lock for synchronization between the xmit
>>> function and the xmit completion handler, but since the NETIF_F_LLTX flag
>>> is not set, the xmit function is also called with the xmit_lock held.
>>> 
>>> On the other hand the xmit completion handler first takes the private lock
>>> and (in case that the tx queue has been stopped) the xmit_lock, leading to
>>> a reverse locking order and the potential danger of a deadlock.
>>> 
>>> Fix this by removing the private lock completely and synchronizing the xmit
>>> function and completion handler solely by means of the xmit_lock. By doing
>>> this remove also the now unnecessary double check for a stopped tx queue.
>>> 
>> 
>> FYI, here's modified version. I believe _bh versions are needed, and
>> I'm testing that version now. (Oh and I also ported it to net-next).
>> 
>> It survived 30 minutes of testing so far...
>> 
> 
> First off, thanks for testing.
> Hmm. I dont understand why _bh would be needed. We call that function from
> BH context only (napi poll and timer).
> Any idea?
> 

Could this once again be caused by irq coalescing? When the tx queue has been stopped
the cleanup handler has to wakeup the queue within a certain time span, otherwise the
watchdog will complain (as it happened in your test). Could you retest this with
irq coalescing disabled?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-07 20:05 ` [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock Lino Sanfilippo
@ 2016-12-07 23:15   ` Francois Romieu
  2016-12-08 20:32     ` Lino Sanfilippo
  0 siblings, 1 reply; 36+ messages in thread
From: Francois Romieu @ 2016-12-07 23:15 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, pavel, davem, linux-kernel, netdev

Lino Sanfilippo <LinoSanfilippo@gmx.de> :
> The driver uses a private lock for synchronization between the xmit
> function and the xmit completion handler, but since the NETIF_F_LLTX flag
> is not set, the xmit function is also called with the xmit_lock held.
> 
> On the other hand the xmit completion handler first takes the private lock
> and (in case that the tx queue has been stopped) the xmit_lock, leading
> to a reverse locking order and the potential danger of a deadlock.

netif_tx_stop_queue is used by:
1. xmit function before releasing lock and returning.
2. sxgbe_restart_tx_queue()
   <- sxgbe_tx_interrupt
   <- sxgbe_reset_all_tx_queues()
      <- sxgbe_tx_timeout()

Given xmit won't be called again until tx queue is enabled, it's not clear
how a deadlock could happen due to #1.

Regardless of deadlocks anywhere else, #2 has some serious problem due to
the lack of exclusion between the tx queue restart handler and the xmit
handler.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-07 22:34       ` Lino Sanfilippo
@ 2016-12-07 23:21         ` Pavel Machek
  0 siblings, 0 replies; 36+ messages in thread
From: Pavel Machek @ 2016-12-07 23:21 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1945 bytes --]

On Wed 2016-12-07 23:34:19, Lino Sanfilippo wrote:
> On 07.12.2016 22:43, Lino Sanfilippo wrote:
> > Hi Pavel,
> > 
> > On 07.12.2016 22:37, Pavel Machek wrote:
> >> On Wed 2016-12-07 21:05:38, Lino Sanfilippo wrote:
> >>> The driver uses a private lock for synchronization between the xmit
> >>> function and the xmit completion handler, but since the NETIF_F_LLTX flag
> >>> is not set, the xmit function is also called with the xmit_lock held.
> >>> 
> >>> On the other hand the xmit completion handler first takes the private lock
> >>> and (in case that the tx queue has been stopped) the xmit_lock, leading to
> >>> a reverse locking order and the potential danger of a deadlock.
> >>> 
> >>> Fix this by removing the private lock completely and synchronizing the xmit
> >>> function and completion handler solely by means of the xmit_lock. By doing
> >>> this remove also the now unnecessary double check for a stopped tx queue.
> >>> 
> >> 
> >> FYI, here's modified version. I believe _bh versions are needed, and
> >> I'm testing that version now. (Oh and I also ported it to net-next).
> >> 
> >> It survived 30 minutes of testing so far...
> >> 
> > 
> > First off, thanks for testing.
> > Hmm. I dont understand why _bh would be needed. We call that function from
> > BH context only (napi poll and timer).
> > Any idea?
> > 
> 
> Could this once again be caused by irq coalescing? When the tx queue has been stopped
> the cleanup handler has to wakeup the queue within a certain time span, otherwise the
> watchdog will complain (as it happened in your test). Could you retest this with
> irq coalescing disabled?

I actually had TX coalescing disabled, with

-#define STMMAC_TX_FRAMES       64
+#define STMMAC_TX_FRAMES       0


										Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-07 21:37   ` Pavel Machek
  2016-12-07 21:43     ` Lino Sanfilippo
@ 2016-12-07 23:41     ` David Miller
  2016-12-08 14:08       ` Pavel Machek
  1 sibling, 1 reply; 36+ messages in thread
From: David Miller @ 2016-12-07 23:41 UTC (permalink / raw)
  To: pavel
  Cc: LinoSanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, linux-kernel, netdev

From: Pavel Machek <pavel@ucw.cz>
Date: Wed, 7 Dec 2016 22:37:57 +0100

> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 982c952..7415bc2 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1308,7 +1308,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
>  	unsigned int bytes_compl = 0, pkts_compl = 0;
>  	unsigned int entry = priv->dirty_tx;
>  
> -	spin_lock(&priv->tx_lock);
> +	netif_tx_lock_bh(priv->dev);
>  
>  	priv->xstats.tx_clean++;
>  

stmmac_tx_clean() runs from either the timer or the NAPI poll handler,
both execute from software interrupts, therefore _bh() should be
unnecessary.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-07 23:41     ` David Miller
@ 2016-12-08 14:08       ` Pavel Machek
  2016-12-08 15:26         ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2016-12-08 14:08 UTC (permalink / raw)
  To: David Miller
  Cc: LinoSanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1160 bytes --]

On Wed 2016-12-07 18:41:11, David Miller wrote:
> From: Pavel Machek <pavel@ucw.cz>
> Date: Wed, 7 Dec 2016 22:37:57 +0100
> 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index 982c952..7415bc2 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -1308,7 +1308,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
> >  	unsigned int bytes_compl = 0, pkts_compl = 0;
> >  	unsigned int entry = priv->dirty_tx;
> >  
> > -	spin_lock(&priv->tx_lock);
> > +	netif_tx_lock_bh(priv->dev);
> >  
> >  	priv->xstats.tx_clean++;
> >  
> 
> stmmac_tx_clean() runs from either the timer or the NAPI poll handler,
> both execute from software interrupts, therefore _bh() should be
> unnecessary.

I've tried the test again with netif_tx_lock() (not _bh()) and it
survived for more then four hours. Strange...

Best regards,
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-08 14:08       ` Pavel Machek
@ 2016-12-08 15:26         ` David Miller
  2016-12-08 15:46           ` Pavel Machek
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2016-12-08 15:26 UTC (permalink / raw)
  To: pavel
  Cc: LinoSanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, linux-kernel, netdev

From: Pavel Machek <pavel@ucw.cz>
Date: Thu, 8 Dec 2016 15:08:46 +0100

> On Wed 2016-12-07 18:41:11, David Miller wrote:
>> From: Pavel Machek <pavel@ucw.cz>
>> Date: Wed, 7 Dec 2016 22:37:57 +0100
>> 
>> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> > index 982c952..7415bc2 100644
>> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> > @@ -1308,7 +1308,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
>> >  	unsigned int bytes_compl = 0, pkts_compl = 0;
>> >  	unsigned int entry = priv->dirty_tx;
>> >  
>> > -	spin_lock(&priv->tx_lock);
>> > +	netif_tx_lock_bh(priv->dev);
>> >  
>> >  	priv->xstats.tx_clean++;
>> >  
>> 
>> stmmac_tx_clean() runs from either the timer or the NAPI poll handler,
>> both execute from software interrupts, therefore _bh() should be
>> unnecessary.
> 
> I've tried the test again with netif_tx_lock() (not _bh()) and it
> survived for more then four hours. Strange...

It's not strange, it's completely expected.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
  2016-12-08 15:26         ` David Miller
@ 2016-12-08 15:46           ` Pavel Machek
  0 siblings, 0 replies; 36+ messages in thread
From: Pavel Machek @ 2016-12-08 15:46 UTC (permalink / raw)
  To: David Miller
  Cc: LinoSanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1518 bytes --]

On Thu 2016-12-08 10:26:41, David Miller wrote:
> From: Pavel Machek <pavel@ucw.cz>
> Date: Thu, 8 Dec 2016 15:08:46 +0100
> 
> > On Wed 2016-12-07 18:41:11, David Miller wrote:
> >> From: Pavel Machek <pavel@ucw.cz>
> >> Date: Wed, 7 Dec 2016 22:37:57 +0100
> >> 
> >> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> >> > index 982c952..7415bc2 100644
> >> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> >> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> >> > @@ -1308,7 +1308,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
> >> >  	unsigned int bytes_compl = 0, pkts_compl = 0;
> >> >  	unsigned int entry = priv->dirty_tx;
> >> >  
> >> > -	spin_lock(&priv->tx_lock);
> >> > +	netif_tx_lock_bh(priv->dev);
> >> >  
> >> >  	priv->xstats.tx_clean++;
> >> >  
> >> 
> >> stmmac_tx_clean() runs from either the timer or the NAPI poll handler,
> >> both execute from software interrupts, therefore _bh() should be
> >> unnecessary.
> > 
> > I've tried the test again with netif_tx_lock() (not _bh()) and it
> > survived for more then four hours. Strange...
> 
> It's not strange, it's completely expected.

Well, I tried that exact test before, and it survived for something
like 10 minutes. So yes... this surprised me.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-07 23:15   ` Francois Romieu
@ 2016-12-08 20:32     ` Lino Sanfilippo
  2016-12-08 21:54       ` Pavel Machek
  2016-12-08 23:19       ` Francois Romieu
  0 siblings, 2 replies; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-08 20:32 UTC (permalink / raw)
  To: Francois Romieu
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, pavel, davem, linux-kernel, netdev

Hi,

On 08.12.2016 00:15, Francois Romieu wrote:
> Lino Sanfilippo <LinoSanfilippo@gmx.de> :
>> The driver uses a private lock for synchronization between the xmit
>> function and the xmit completion handler, but since the NETIF_F_LLTX flag
>> is not set, the xmit function is also called with the xmit_lock held.
>> 
>> On the other hand the xmit completion handler first takes the private lock
>> and (in case that the tx queue has been stopped) the xmit_lock, leading
>> to a reverse locking order and the potential danger of a deadlock.
> 
> netif_tx_stop_queue is used by:
> 1. xmit function before releasing lock and returning.
> 2. sxgbe_restart_tx_queue()
>    <- sxgbe_tx_interrupt
>    <- sxgbe_reset_all_tx_queues()
>       <- sxgbe_tx_timeout()
> 
> Given xmit won't be called again until tx queue is enabled, it's not clear
> how a deadlock could happen due to #1.
> 


After spending more thoughts on this I tend to agree with you. Yes, we have the
different locking order for the xmit_lock and the private lock in two concurrent
threads. And one of the first things one learns about locking is that this is a
good way to create a deadlock sooner or later. But in our case the deadlock 
can only occur if the xmit function and the tx completion handler perceive different
 states for the tx queue, or to be more specific: 
the completion handler sees the tx queue in state "stopped" while the xmit handler 
sees it in state "running" at the same time. Only then both functions would try to
take both locks, which could lead to a deadlock.

OTOH Pavel said that he actually could produce a deadlock. Now I wonder if this is caused
by that locking scheme (in a way I have not figured out yet) or if it is a different issue.

Regards,
Lino

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-08 20:32     ` Lino Sanfilippo
@ 2016-12-08 21:54       ` Pavel Machek
  2016-12-08 22:12         ` Lino Sanfilippo
  2016-12-08 23:19       ` Francois Romieu
  1 sibling, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2016-12-08 21:54 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 2207 bytes --]

On Thu 2016-12-08 21:32:12, Lino Sanfilippo wrote:
> Hi,
> 
> On 08.12.2016 00:15, Francois Romieu wrote:
> > Lino Sanfilippo <LinoSanfilippo@gmx.de> :
> >> The driver uses a private lock for synchronization between the xmit
> >> function and the xmit completion handler, but since the NETIF_F_LLTX flag
> >> is not set, the xmit function is also called with the xmit_lock held.
> >> 
> >> On the other hand the xmit completion handler first takes the private lock
> >> and (in case that the tx queue has been stopped) the xmit_lock, leading
> >> to a reverse locking order and the potential danger of a deadlock.
> > 
> > netif_tx_stop_queue is used by:
> > 1. xmit function before releasing lock and returning.
> > 2. sxgbe_restart_tx_queue()
> >    <- sxgbe_tx_interrupt
> >    <- sxgbe_reset_all_tx_queues()
> >       <- sxgbe_tx_timeout()
> > 
> > Given xmit won't be called again until tx queue is enabled, it's not clear
> > how a deadlock could happen due to #1.
> > 
> 
> 
> After spending more thoughts on this I tend to agree with you. Yes, we have the
> different locking order for the xmit_lock and the private lock in two concurrent
> threads. And one of the first things one learns about locking is that this is a
> good way to create a deadlock sooner or later. But in our case the deadlock 
> can only occur if the xmit function and the tx completion handler perceive different
>  states for the tx queue, or to be more specific: 
> the completion handler sees the tx queue in state "stopped" while the xmit handler 
> sees it in state "running" at the same time. Only then both functions would try to
> take both locks, which could lead to a deadlock.
> 
> OTOH Pavel said that he actually could produce a deadlock. Now I wonder if this is caused
> by that locking scheme (in a way I have not figured out yet) or if it is a different issue.

Pavel has some problems, but that's on different hardware.. and it is
possible that it is deadlock (or something else) somewhere else.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-08 21:54       ` Pavel Machek
@ 2016-12-08 22:12         ` Lino Sanfilippo
  2016-12-08 22:18           ` Pavel Machek
  0 siblings, 1 reply; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-08 22:12 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

Hi,

On 08.12.2016 22:54, Pavel Machek wrote:
> On Thu 2016-12-08 21:32:12, Lino Sanfilippo wrote:
>> Hi,
>> 
>> On 08.12.2016 00:15, Francois Romieu wrote:
>> > Lino Sanfilippo <LinoSanfilippo@gmx.de> :
>> >> The driver uses a private lock for synchronization between the xmit
>> >> function and the xmit completion handler, but since the NETIF_F_LLTX flag
>> >> is not set, the xmit function is also called with the xmit_lock held.
>> >> 
>> >> On the other hand the xmit completion handler first takes the private lock
>> >> and (in case that the tx queue has been stopped) the xmit_lock, leading
>> >> to a reverse locking order and the potential danger of a deadlock.
>> > 
>> > netif_tx_stop_queue is used by:
>> > 1. xmit function before releasing lock and returning.
>> > 2. sxgbe_restart_tx_queue()
>> >    <- sxgbe_tx_interrupt
>> >    <- sxgbe_reset_all_tx_queues()
>> >       <- sxgbe_tx_timeout()
>> > 
>> > Given xmit won't be called again until tx queue is enabled, it's not clear
>> > how a deadlock could happen due to #1.
>> > 
>> 
>> 
>> After spending more thoughts on this I tend to agree with you. Yes, we have the
>> different locking order for the xmit_lock and the private lock in two concurrent
>> threads. And one of the first things one learns about locking is that this is a
>> good way to create a deadlock sooner or later. But in our case the deadlock 
>> can only occur if the xmit function and the tx completion handler perceive different
>>  states for the tx queue, or to be more specific: 
>> the completion handler sees the tx queue in state "stopped" while the xmit handler 
>> sees it in state "running" at the same time. Only then both functions would try to
>> take both locks, which could lead to a deadlock.
>> 
>> OTOH Pavel said that he actually could produce a deadlock. Now I wonder if this is caused
>> by that locking scheme (in a way I have not figured out yet) or if it is a different issue.
> 
> Pavel has some problems, but that's on different hardware.. and it is
> possible that it is deadlock (or something else) somewhere else.
> 

Right, it is different hardware. But the locking situation in xmit function and tx completion handler
is very similar in both drivers. So if a deadlock is not possible in sxgbe it should 
also not be possible in stmmac (at least not due to the different locking order). 
So maybe there is no real issue that we could fix with removing the private lock and we should
keep it as it is.

Regards,
Lino

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-08 22:12         ` Lino Sanfilippo
@ 2016-12-08 22:18           ` Pavel Machek
  2016-12-08 22:45             ` Lino Sanfilippo
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2016-12-08 22:18 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 2997 bytes --]

On Thu 2016-12-08 23:12:10, Lino Sanfilippo wrote:
> Hi,
> 
> On 08.12.2016 22:54, Pavel Machek wrote:
> > On Thu 2016-12-08 21:32:12, Lino Sanfilippo wrote:
> >> Hi,
> >> 
> >> On 08.12.2016 00:15, Francois Romieu wrote:
> >> > Lino Sanfilippo <LinoSanfilippo@gmx.de> :
> >> >> The driver uses a private lock for synchronization between the xmit
> >> >> function and the xmit completion handler, but since the NETIF_F_LLTX flag
> >> >> is not set, the xmit function is also called with the xmit_lock held.
> >> >> 
> >> >> On the other hand the xmit completion handler first takes the private lock
> >> >> and (in case that the tx queue has been stopped) the xmit_lock, leading
> >> >> to a reverse locking order and the potential danger of a deadlock.
> >> > 
> >> > netif_tx_stop_queue is used by:
> >> > 1. xmit function before releasing lock and returning.
> >> > 2. sxgbe_restart_tx_queue()
> >> >    <- sxgbe_tx_interrupt
> >> >    <- sxgbe_reset_all_tx_queues()
> >> >       <- sxgbe_tx_timeout()
> >> > 
> >> > Given xmit won't be called again until tx queue is enabled, it's not clear
> >> > how a deadlock could happen due to #1.
> >> > 
> >> 
> >> 
> >> After spending more thoughts on this I tend to agree with you. Yes, we have the
> >> different locking order for the xmit_lock and the private lock in two concurrent
> >> threads. And one of the first things one learns about locking is that this is a
> >> good way to create a deadlock sooner or later. But in our case the deadlock 
> >> can only occur if the xmit function and the tx completion handler perceive different
> >>  states for the tx queue, or to be more specific: 
> >> the completion handler sees the tx queue in state "stopped" while the xmit handler 
> >> sees it in state "running" at the same time. Only then both functions would try to
> >> take both locks, which could lead to a deadlock.
> >> 
> >> OTOH Pavel said that he actually could produce a deadlock. Now I wonder if this is caused
> >> by that locking scheme (in a way I have not figured out yet) or if it is a different issue.
> > 
> > Pavel has some problems, but that's on different hardware.. and it is
> > possible that it is deadlock (or something else) somewhere else.
> > 
> 
> Right, it is different hardware. But the locking situation in xmit function and tx completion handler
> is very similar in both drivers. So if a deadlock is not possible in sxgbe it should 
> also not be possible in stmmac (at least not due to the different locking order). 
> So maybe there is no real issue that we could fix with removing the private lock and we should
> keep it as it is.

Well.. the locking is pretty confused there. Having private lock that
mirrors lock from network layer is confusing and ugly... that should
be reason to fix it.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-08 22:18           ` Pavel Machek
@ 2016-12-08 22:45             ` Lino Sanfilippo
  0 siblings, 0 replies; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-08 22:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

On 08.12.2016 23:18, Pavel Machek wrote:
> On Thu 2016-12-08 23:12:10, Lino Sanfilippo wrote:
>> Hi,
>> 
>> On 08.12.2016 22:54, Pavel Machek wrote:
>> > On Thu 2016-12-08 21:32:12, Lino Sanfilippo wrote:
>> >> Hi,
>> >> 
>> >> On 08.12.2016 00:15, Francois Romieu wrote:
>> >> > Lino Sanfilippo <LinoSanfilippo@gmx.de> :
>> >> >> The driver uses a private lock for synchronization between the xmit
>> >> >> function and the xmit completion handler, but since the NETIF_F_LLTX flag
>> >> >> is not set, the xmit function is also called with the xmit_lock held.
>> >> >> 
>> >> >> On the other hand the xmit completion handler first takes the private lock
>> >> >> and (in case that the tx queue has been stopped) the xmit_lock, leading
>> >> >> to a reverse locking order and the potential danger of a deadlock.
>> >> > 
>> >> > netif_tx_stop_queue is used by:
>> >> > 1. xmit function before releasing lock and returning.
>> >> > 2. sxgbe_restart_tx_queue()
>> >> >    <- sxgbe_tx_interrupt
>> >> >    <- sxgbe_reset_all_tx_queues()
>> >> >       <- sxgbe_tx_timeout()
>> >> > 
>> >> > Given xmit won't be called again until tx queue is enabled, it's not clear
>> >> > how a deadlock could happen due to #1.
>> >> > 
>> >> 
>> >> 
>> >> After spending more thoughts on this I tend to agree with you. Yes, we have the
>> >> different locking order for the xmit_lock and the private lock in two concurrent
>> >> threads. And one of the first things one learns about locking is that this is a
>> >> good way to create a deadlock sooner or later. But in our case the deadlock 
>> >> can only occur if the xmit function and the tx completion handler perceive different
>> >>  states for the tx queue, or to be more specific: 
>> >> the completion handler sees the tx queue in state "stopped" while the xmit handler 
>> >> sees it in state "running" at the same time. Only then both functions would try to
>> >> take both locks, which could lead to a deadlock.
>> >> 
>> >> OTOH Pavel said that he actually could produce a deadlock. Now I wonder if this is caused
>> >> by that locking scheme (in a way I have not figured out yet) or if it is a different issue.
>> > 
>> > Pavel has some problems, but that's on different hardware.. and it is
>> > possible that it is deadlock (or something else) somewhere else.
>> > 
>> 
>> Right, it is different hardware. But the locking situation in xmit function and tx completion handler
>> is very similar in both drivers. So if a deadlock is not possible in sxgbe it should 
>> also not be possible in stmmac (at least not due to the different locking order). 
>> So maybe there is no real issue that we could fix with removing the private lock and we should
>> keep it as it is.
> 
> Well.. the locking is pretty confused there. Having private lock that
> mirrors lock from network layer is confusing and ugly... that should
> be reason to fix it.
> 									Pavel
> 

Ok. Then I will resend the patches for both drivers with a different (less dramatic) commit message
in which the change is not longer described as a fix for deadlock but rather as a code 
cleanup/improvement, ok?

Regards,
Lino

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-08 20:32     ` Lino Sanfilippo
  2016-12-08 21:54       ` Pavel Machek
@ 2016-12-08 23:19       ` Francois Romieu
  2016-12-09 11:21         ` Pavel Machek
  1 sibling, 1 reply; 36+ messages in thread
From: Francois Romieu @ 2016-12-08 23:19 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, pavel, davem, linux-kernel, netdev

Lino Sanfilippo <LinoSanfilippo@gmx.de> :
[...]
> OTOH Pavel said that he actually could produce a deadlock. Now I wonder if
> this is caused by that locking scheme (in a way I have not figured out yet)
> or if it is a different issue.

stmmac_tx_err races with stmmac_xmit.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-08 23:19       ` Francois Romieu
@ 2016-12-09 11:21         ` Pavel Machek
  2016-12-10  2:25           ` Lino Sanfilippo
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2016-12-09 11:21 UTC (permalink / raw)
  To: Francois Romieu
  Cc: Lino Sanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 849 bytes --]

On Fri 2016-12-09 00:19:43, Francois Romieu wrote:
> Lino Sanfilippo <LinoSanfilippo@gmx.de> :
> [...]
> > OTOH Pavel said that he actually could produce a deadlock. Now I wonder if
> > this is caused by that locking scheme (in a way I have not figured out yet)
> > or if it is a different issue.
> 
> stmmac_tx_err races with stmmac_xmit.

Umm, yes, that looks real.

And that means that removing tx_lock will not be completely trivial
:-(. Lino, any ideas there?

netif_tx_lock_irqsave() would help, but afaict that one does not
exist.

Plus, does someone know how to trigger the status == tx_hard_error? I
tried powering down the switch, but that did not do it.

Thanks,								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-09 11:21         ` Pavel Machek
@ 2016-12-10  2:25           ` Lino Sanfilippo
  2016-12-11 20:11             ` Pavel Machek
  0 siblings, 1 reply; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-10  2:25 UTC (permalink / raw)
  To: Pavel Machek, Francois Romieu
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

Hi,

On 09.12.2016 12:21, Pavel Machek wrote:
> On Fri 2016-12-09 00:19:43, Francois Romieu wrote:
>> Lino Sanfilippo <LinoSanfilippo@gmx.de> :
>> [...]
>> > OTOH Pavel said that he actually could produce a deadlock. Now I wonder if
>> > this is caused by that locking scheme (in a way I have not figured out yet)
>> > or if it is a different issue.
>> 
>> stmmac_tx_err races with stmmac_xmit.
> 
> Umm, yes, that looks real.
> 
> And that means that removing tx_lock will not be completely trivial
> :-(. Lino, any ideas there?
> 

Ok, the race is there but it looks like a problem that is not related to 
the use or removal of the private lock.
By a glimpse into other drivers (e.g sky2 or e1000), a possible way to handle a 
tx error is to start a separate task and restart the tx path in that task instead
the irq handler (or timer in case of the watchdog).

In that task we could do:
1. deactivate napi
2. deactivate irqs
3. wait for running napi/irqs do complete (_sync)
4. call stmmac_tx_err()
5. reenable napi
6. reenable irqs

We have to ensure that no xmit() is executing while stmmac_tx_err() does the cleanup,
so stmmac_tx_err() should IMO rather call netif_tx_disable() instead of netif_stop_queue()
(the former grabs the xmit lock before it sets __QUEUE_STATE_DRV_XOFF to disable
the queue).

Regards,
Lino

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-10  2:25           ` Lino Sanfilippo
@ 2016-12-11 20:11             ` Pavel Machek
  2016-12-15 19:27               ` Lino Sanfilippo
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2016-12-11 20:11 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 2356 bytes --]

Hi!

> On 09.12.2016 12:21, Pavel Machek wrote:
> > On Fri 2016-12-09 00:19:43, Francois Romieu wrote:
> >> Lino Sanfilippo <LinoSanfilippo@gmx.de> :
> >> [...]
> >> > OTOH Pavel said that he actually could produce a deadlock. Now I wonder if
> >> > this is caused by that locking scheme (in a way I have not figured out yet)
> >> > or if it is a different issue.
> >> 
> >> stmmac_tx_err races with stmmac_xmit.
> > 
> > Umm, yes, that looks real.
> > 
> > And that means that removing tx_lock will not be completely trivial
> > :-(. Lino, any ideas there?
> > 
> 
> Ok, the race is there but it looks like a problem that is not related to 
> the use or removal of the private lock.

Well, removal of the private lock will make it trickier to fix :-(.

> By a glimpse into other drivers (e.g sky2 or e1000), a possible way to handle a 
> tx error is to start a separate task and restart the tx path in that task instead
> the irq handler (or timer in case of the watchdog).
> 
> In that task we could do:
> 1. deactivate napi
> 2. deactivate irqs
> 3. wait for running napi/irqs do complete (_sync)
> 4. call stmmac_tx_err()
> 5. reenable napi
> 6. reenable irqs
> 
> We have to ensure that no xmit() is executing while stmmac_tx_err() does the cleanup,
> so stmmac_tx_err() should IMO rather call netif_tx_disable() instead of netif_stop_queue()
> (the former grabs the xmit lock before it sets __QUEUE_STATE_DRV_XOFF to disable
> the queue).

Do you understand what stmmac_tx_err(priv); is supposed to do? In
particular, if it is called while the driver is working ok -- should
the driver survive that?

Because it does not currently, and I don't know how to test that
code. Unplugging the cable does not provoke that.

I tried

        } else if (unlikely(status == tx_hard_error))
                stmmac_tx_err(priv);
+
+       {
+               static int i;
+               i++;
+               if (i==1000) {
+                       i = 0;
+                       printk("Simulated error\n");
+                       stmmac_tx_err(priv);
+               }
+       }
 }

but the driver does not survive that :-(.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-11 20:11             ` Pavel Machek
@ 2016-12-15 19:27               ` Lino Sanfilippo
  2016-12-15 21:03                 ` Pavel Machek
  0 siblings, 1 reply; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-15 19:27 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

Hi Pavel,

sorry for the late reply.

On 11.12.2016 21:11, Pavel Machek wrote:
> 
> Do you understand what stmmac_tx_err(priv); is supposed to do? In
> particular, if it is called while the driver is working ok -- should
> the driver survive that?

As far as I understood it is supposed to fixup an errorneous tx path, e.g. a
missing tx completion for transmitted frames.

Some drivers do this by restarting only the HW parts responsible for tx, some
others by restarting the complete hardware. 
But IMO it should also be ok to be called if the HW is still working fine.

> Because it does not currently, and I don't know how to test that
> code. Unplugging the cable does not provoke that.
> 
> I tried
> 
>         } else if (unlikely(status == tx_hard_error))
>                 stmmac_tx_err(priv);
> +
> +       {
> +               static int i;
> +               i++;
> +               if (i==1000) {
> +                       i = 0;
> +                       printk("Simulated error\n");
> +                       stmmac_tx_err(priv);
> +               }
> +       }
>  }
> 

Ok, there is this race that Francois mentioned so it is not surprising that
the driver does not survive the call of stmmac_tx_err() as it is called now.
Thats why I suggested to do a proper shutdown and restart of the tx path to
avoid the race.

Regards,
Lino

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-15 19:27               ` Lino Sanfilippo
@ 2016-12-15 21:03                 ` Pavel Machek
  2016-12-15 21:32                   ` Lino Sanfilippo
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2016-12-15 21:03 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1784 bytes --]

Hi!

> sorry for the late reply.

No problem. Thanks for the help.

> On 11.12.2016 21:11, Pavel Machek wrote:
> > 
> > Do you understand what stmmac_tx_err(priv); is supposed to do? In
> > particular, if it is called while the driver is working ok -- should
> > the driver survive that?
> 
> As far as I understood it is supposed to fixup an errorneous tx path, e.g. a
> missing tx completion for transmitted frames.
> 
> Some drivers do this by restarting only the HW parts responsible for tx, some
> others by restarting the complete hardware. 
> But IMO it should also be ok to be called if the HW is still working fine.
> 
> > Because it does not currently, and I don't know how to test that
> > code. Unplugging the cable does not provoke that.
> > 
> > I tried
> > 
> >         } else if (unlikely(status == tx_hard_error))
> >                 stmmac_tx_err(priv);
> > +
> > +       {
> > +               static int i;
> > +               i++;
> > +               if (i==1000) {
> > +                       i = 0;
> > +                       printk("Simulated error\n");
> > +                       stmmac_tx_err(priv);
> > +               }
> > +       }
> >  }
> > 
> 
> Ok, there is this race that Francois mentioned so it is not surprising that
> the driver does not survive the call of stmmac_tx_err() as it is called now.
> Thats why I suggested to do a proper shutdown and restart of the tx path to
> avoid the race.

I actually did experiment with adding locking there, too, and no, no
luck. It seems stmmac_tx_err() is more broken than just locking.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-15 21:03                 ` Pavel Machek
@ 2016-12-15 21:32                   ` Lino Sanfilippo
  2016-12-15 22:33                     ` Lino Sanfilippo
  0 siblings, 1 reply; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-15 21:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

On 15.12.2016 22:03, Pavel Machek wrote:

> 
> I actually did experiment with adding locking there, too, and no, no
> luck. It seems stmmac_tx_err() is more broken than just locking.
> 

Ah ok. Then maybe priv->hw->dma->stop_tx() does not do the job correctly (stop the
tx path properly) and the HW is still active on the tx path while the tx buffers are
freed. OTOH stmmac_release() also stops the phy before the tx (and rx) paths are stopped.
Did you try to stop the phy fist in stmmac_tx_err_work(), too?

Regards,
Lino

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-15 21:32                   ` Lino Sanfilippo
@ 2016-12-15 22:33                     ` Lino Sanfilippo
  2016-12-17 17:31                       ` Pavel Machek
  0 siblings, 1 reply; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-15 22:33 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 556 bytes --]

On 15.12.2016 22:32, Lino Sanfilippo wrote:

> Ah ok. Then maybe priv->hw->dma->stop_tx() does not do the job correctly (stop the
> tx path properly) and the HW is still active on the tx path while the tx buffers are
> freed. OTOH stmmac_release() also stops the phy before the tx (and rx) paths are stopped.
> Did you try to stop the phy fist in stmmac_tx_err_work(), too?
> 
> Regards,
> Lino
> 

And this is the "sledgehammer" approach: Do a complete shutdown and restart
of the hardware in case of tx error (against net-next and only compile tested).


[-- Attachment #2: 0001-Sledgehammer.patch --]
[-- Type: text/x-patch, Size: 4140 bytes --]

>From 0eda87ce6cbc2fb6d25653f30121f30f89332199 Mon Sep 17 00:00:00 2001
From: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Date: Thu, 15 Dec 2016 23:18:15 +0100
Subject: [PATCH] Sledgehammer

---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |  1 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 70 ++++++++++-------------
 2 files changed, 31 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index eab04ae..9c240d7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -131,6 +131,7 @@ struct stmmac_priv {
 	u32 rx_tail_addr;
 	u32 tx_tail_addr;
 	u32 mss;
+	struct work_struct tx_err_work;
 
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *dbgfs_dir;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3e40578..7546574 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1403,37 +1403,6 @@ static inline void stmmac_disable_dma_irq(struct stmmac_priv *priv)
 }
 
 /**
- * stmmac_tx_err - to manage the tx error
- * @priv: driver private structure
- * Description: it cleans the descriptors and restarts the transmission
- * in case of transmission errors.
- */
-static void stmmac_tx_err(struct stmmac_priv *priv)
-{
-	int i;
-	netif_stop_queue(priv->dev);
-
-	priv->hw->dma->stop_tx(priv->ioaddr);
-	dma_free_tx_skbufs(priv);
-	for (i = 0; i < DMA_TX_SIZE; i++)
-		if (priv->extend_desc)
-			priv->hw->desc->init_tx_desc(&priv->dma_etx[i].basic,
-						     priv->mode,
-						     (i == DMA_TX_SIZE - 1));
-		else
-			priv->hw->desc->init_tx_desc(&priv->dma_tx[i],
-						     priv->mode,
-						     (i == DMA_TX_SIZE - 1));
-	priv->dirty_tx = 0;
-	priv->cur_tx = 0;
-	netdev_reset_queue(priv->dev);
-	priv->hw->dma->start_tx(priv->ioaddr);
-
-	priv->dev->stats.tx_errors++;
-	netif_wake_queue(priv->dev);
-}
-
-/**
  * stmmac_dma_interrupt - DMA ISR
  * @priv: driver private structure
  * Description: this is the DMA ISR. It is called by the main ISR.
@@ -1466,7 +1435,7 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 			priv->xstats.threshold = tc;
 		}
 	} else if (unlikely(status == tx_hard_error))
-		stmmac_tx_err(priv);
+		schedule_work(&priv->tx_err_work);
 }
 
 /**
@@ -1870,13 +1839,7 @@ static int stmmac_open(struct net_device *dev)
 	return ret;
 }
 
-/**
- *  stmmac_release - close entry point of the driver
- *  @dev : device pointer.
- *  Description:
- *  This is the stop entry point of the driver.
- */
-static int stmmac_release(struct net_device *dev)
+static int stmmac_do_release(struct net_device *dev)
 {
 	struct stmmac_priv *priv = netdev_priv(dev);
 
@@ -1919,10 +1882,36 @@ static int stmmac_release(struct net_device *dev)
 #endif
 
 	stmmac_release_ptp(priv);
+}
+
+/**
+ *  stmmac_release - close entry point of the driver
+ *  @dev : device pointer.
+ *  Description:
+ *  This is the stop entry point of the driver.
+ */
+static int stmmac_release(struct net_device *dev)
+{
+	struct stmmac_priv *priv = netdev_priv(dev);
+
+	cancel_work_sync(&priv->tx_err_work);
+
+	stmmac_do_release(dev);
 
 	return 0;
 }
 
+static void stmmac_tx_err_work(struct work_struct *work)
+{
+	struct stmmac_priv *priv = container_of(work, struct stmmac_priv,
+						tx_err_work);
+	/* restart netdev */
+	rtnl_lock();
+	stmmac_release(priv->dev);
+	stmmac_open(priv->dev);
+	rtnl_unlock();
+}
+
 /**
  *  stmmac_tso_allocator - close entry point of the driver
  *  @priv: driver private structure
@@ -2688,7 +2677,7 @@ static void stmmac_tx_timeout(struct net_device *dev)
 	struct stmmac_priv *priv = netdev_priv(dev);
 
 	/* Clear Tx resources and restart transmitting again */
-	stmmac_tx_err(priv);
+	schedule_work(&priv->tx_err_work);
 }
 
 /**
@@ -3338,6 +3327,7 @@ int stmmac_dvr_probe(struct device *device,
 	netif_napi_add(ndev, &priv->napi, stmmac_poll, 64);
 
 	spin_lock_init(&priv->lock);
+	INIT_WORK(&priv->tx_err_work, stmmac_tx_err_work);
 
 	ret = register_netdev(ndev);
 	if (ret) {
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-15 22:33                     ` Lino Sanfilippo
@ 2016-12-17 17:31                       ` Pavel Machek
  2016-12-18  0:15                         ` Francois Romieu
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2016-12-17 17:31 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1355 bytes --]

On Thu 2016-12-15 23:33:22, Lino Sanfilippo wrote:
> On 15.12.2016 22:32, Lino Sanfilippo wrote:
> 
> > Ah ok. Then maybe priv->hw->dma->stop_tx() does not do the job correctly (stop the
> > tx path properly) and the HW is still active on the tx path while the tx buffers are
> > freed. OTOH stmmac_release() also stops the phy before the tx (and rx) paths are stopped.
> > Did you try to stop the phy fist in stmmac_tx_err_work(), too?
> > 
> > Regards,
> > Lino
> > 
> 
> And this is the "sledgehammer" approach: Do a complete shutdown and restart
> of the hardware in case of tx error (against net-next and only
>compile tested).

Wow, thanks a lot. I'll try to get the driver back to the non-working
state, and try it.

I believe I have some idea what is wrong there. (Missing memory barriers).

> +static void stmmac_tx_err_work(struct work_struct *work)
> +{
> +	struct stmmac_priv *priv = container_of(work, struct stmmac_priv,
> +						tx_err_work);
> +	/* restart netdev */
> +	rtnl_lock();
> +	stmmac_release(priv->dev);
> +	stmmac_open(priv->dev);
> +	rtnl_unlock();
> +}

Won't this up/down the interface, in a way userspace can observe?

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-17 17:31                       ` Pavel Machek
@ 2016-12-18  0:15                         ` Francois Romieu
  2016-12-18 16:15                           ` Lino Sanfilippo
  2016-12-19 10:02                           ` Pavel Machek
  0 siblings, 2 replies; 36+ messages in thread
From: Francois Romieu @ 2016-12-18  0:15 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Lino Sanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

Pavel Machek <pavel@ucw.cz> :
[...]
> Won't this up/down the interface, in a way userspace can observe?

It won't up/down the interface as it doesn't exactly mimic what the
network code does (there's more than rtnl_lock).

For the same reason it's broken if it races with the transmit path: it
can release driver resources while the transmit path uses these.

Btw the points below may not matter/hurt much for a proof a concept
but they would need to be addressed as well:
1) unchecked (and avoidable) extra error paths due to stmmac_release()
2) racy cancel_work_sync. Low probability as it is, an irq + error could
   take place right after cancel_work_sync

Lino, have you considered via-rhine.c since its "move work from irq to
workqueue context" changes that started in
7ab87ff4c770eed71e3777936299292739fcd0fe [*] ?

It's a shameless plug - originated in r8169.c - but it should be rather
close to what the sxgbe and friends require. Thought / opinion ?

[*] Including fixes/changes in:
- 3a5a883a8a663b930908cae4abe5ec913b9b2fd2
- e1efa87241272104d6a12c8b9fcdc4f62634d447
- 810f19bcb862f8889b27e0c9d9eceac9593925dd
- e45af497950a89459a0c4b13ffd91e1729fffef4
- a926592f5e4e900f3fa903298c4619a131e60963
- 559bcac35facfed49ab4f408e162971612dcfdf3

-- 
Ueimor

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-18  0:15                         ` Francois Romieu
@ 2016-12-18 16:15                           ` Lino Sanfilippo
  2016-12-18 17:23                             ` Pavel Machek
                                               ` (2 more replies)
  2016-12-19 10:02                           ` Pavel Machek
  1 sibling, 3 replies; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-18 16:15 UTC (permalink / raw)
  To: Francois Romieu, Pavel Machek
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

Hi,

On 18.12.2016 01:15, Francois Romieu wrote:
> Pavel Machek <pavel@ucw.cz> :
> [...]
>> Won't this up/down the interface, in a way userspace can observe?
> 
> It won't up/down the interface as it doesn't exactly mimic what the
> network code does (there's more than rtnl_lock).
> 

Right. Userspace wont see link down/up, but it will see carrier off/on.
But this is AFAIK acceptable for a rare situation like a tx error.

> For the same reason it's broken if it races with the transmit path: it
> can release driver resources while the transmit path uses these.
> 
> Btw the points below may not matter/hurt much for a proof a concept
> but they would need to be addressed as well:
> 1) unchecked (and avoidable) extra error paths due to stmmac_release()
> 2) racy cancel_work_sync. Low probability as it is, an irq + error could
>    take place right after cancel_work_sync

It was indeed only meant as a proof of concept. Nevertheless the race is not 
good, since one can run into it when faking the tx error for testings purposes.
So below is a slightly improved version of the restart handling.
Its not meant as a final version either. But maybe we can use it as a starting
point.

> Lino, have you considered via-rhine.c since its "move work from irq to
> workqueue context" changes that started in
> 7ab87ff4c770eed71e3777936299292739fcd0fe [*] ?
> It's a shameless plug - originated in r8169.c - but it should be rather
> close to what the sxgbe and friends require. Thought / opinion ?
> 

Not really. There are a few drivers that I use to look into if I want to know
how certain things are done correctly (e.g the sky2 or the intel drivers) because
I think they are well implemented.
But you seem to have put some thoughts into various race condition problems
in the via-rhine driver and I can image that sxgbe and stmmac still have some
of these issues, too.

> [*] Including fixes/changes in:
> - 3a5a883a8a663b930908cae4abe5ec913b9b2fd2

Ok, the issues you fixed here are concerning the tx_curr and tx_dirty
pointers. For now this is not needed in stmmac and sxgbe since the
tx completion handlers in both drivers are not lock-free like in 
the via-rhine.c but are synchronized with xmit by means of the xmit_lock.

> - e1efa87241272104d6a12c8b9fcdc4f62634d447

Yep, a sync of the dma descriptors before the hardware gets ownership of the tx tail
idx is missing in the stmmac, too. 

> - 810f19bcb862f8889b27e0c9d9eceac9593925dd
> - e45af497950a89459a0c4b13ffd91e1729fffef4
> - a926592f5e4e900f3fa903298c4619a131e60963

I think we should use netif_tx_disable() instead of netif_stop_queue(), too, in 
case of restart to avoid a possible schedule of the xmit function while we restart. 
So this is also part of the new patch.

Again the patch is only compile tested.

Regards,
Lino

---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |  1 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 95 +++++++++++++++--------
 2 files changed, 63 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index eab04ae..9c240d7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -131,6 +131,7 @@ struct stmmac_priv {
 	u32 rx_tail_addr;
 	u32 tx_tail_addr;
 	u32 mss;
+	struct work_struct tx_err_work;
 
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *dbgfs_dir;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3e40578..5762750 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1403,37 +1403,6 @@ static inline void stmmac_disable_dma_irq(struct stmmac_priv *priv)
 }
 
 /**
- * stmmac_tx_err - to manage the tx error
- * @priv: driver private structure
- * Description: it cleans the descriptors and restarts the transmission
- * in case of transmission errors.
- */
-static void stmmac_tx_err(struct stmmac_priv *priv)
-{
-	int i;
-	netif_stop_queue(priv->dev);
-
-	priv->hw->dma->stop_tx(priv->ioaddr);
-	dma_free_tx_skbufs(priv);
-	for (i = 0; i < DMA_TX_SIZE; i++)
-		if (priv->extend_desc)
-			priv->hw->desc->init_tx_desc(&priv->dma_etx[i].basic,
-						     priv->mode,
-						     (i == DMA_TX_SIZE - 1));
-		else
-			priv->hw->desc->init_tx_desc(&priv->dma_tx[i],
-						     priv->mode,
-						     (i == DMA_TX_SIZE - 1));
-	priv->dirty_tx = 0;
-	priv->cur_tx = 0;
-	netdev_reset_queue(priv->dev);
-	priv->hw->dma->start_tx(priv->ioaddr);
-
-	priv->dev->stats.tx_errors++;
-	netif_wake_queue(priv->dev);
-}
-
-/**
  * stmmac_dma_interrupt - DMA ISR
  * @priv: driver private structure
  * Description: this is the DMA ISR. It is called by the main ISR.
@@ -1466,7 +1435,7 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 			priv->xstats.threshold = tc;
 		}
 	} else if (unlikely(status == tx_hard_error))
-		stmmac_tx_err(priv);
+		schedule_work(&priv->tx_err_work);
 }
 
 /**
@@ -1902,6 +1871,7 @@ static int stmmac_release(struct net_device *dev)
 	if (priv->lpi_irq > 0)
 		free_irq(priv->lpi_irq, dev);
 
+	cancel_work_sync(&priv->tx_err_work);
 	/* Stop TX/RX DMA and clear the descriptors */
 	priv->hw->dma->stop_tx(priv->ioaddr);
 	priv->hw->dma->stop_rx(priv->ioaddr);
@@ -1920,9 +1890,67 @@ static int stmmac_release(struct net_device *dev)
 
 	stmmac_release_ptp(priv);
 
+
 	return 0;
 }
 
+static void stmmac_shutdown(struct net_device *dev)
+{
+	struct stmmac_priv *priv = netdev_priv(dev);
+
+	/* make sure xmit is not scheduled any more */
+	netif_tx_disable(dev);
+
+	if (priv->eee_enabled)
+		del_timer_sync(&priv->eee_ctrl_timer);
+
+	/* Stop and disconnect the PHY */
+	if (dev->phydev) {
+		phy_stop(dev->phydev);
+		phy_disconnect(dev->phydev);
+	}
+
+	napi_disable(&priv->napi);
+
+	del_timer_sync(&priv->txtimer);
+
+	/* Free the IRQ lines */
+	free_irq(dev->irq, dev);
+	if (priv->wol_irq != dev->irq)
+		free_irq(priv->wol_irq, dev);
+	if (priv->lpi_irq > 0)
+		free_irq(priv->lpi_irq, dev);
+
+	/* Stop TX/RX DMA and clear the descriptors */
+	priv->hw->dma->stop_tx(priv->ioaddr);
+	priv->hw->dma->stop_rx(priv->ioaddr);
+
+	/* Release and free the Rx/Tx resources */
+	free_dma_desc_resources(priv);
+
+	/* Disable the MAC Rx/Tx */
+	stmmac_set_mac(priv->ioaddr, false);
+
+	netif_carrier_off(dev);
+
+#ifdef CONFIG_DEBUG_FS
+	stmmac_exit_fs(dev);
+#endif
+
+	stmmac_release_ptp(priv);
+}
+
+static void stmmac_tx_err_work(struct work_struct *work)
+{
+	struct stmmac_priv *priv = container_of(work, struct stmmac_priv,
+						tx_err_work);
+	/* restart netdev */
+	rtnl_lock();
+	stmmac_shutdown(priv->dev);
+	stmmac_open(priv->dev);
+	rtnl_unlock();
+}
+
 /**
  *  stmmac_tso_allocator - close entry point of the driver
  *  @priv: driver private structure
@@ -2688,7 +2716,7 @@ static void stmmac_tx_timeout(struct net_device *dev)
 	struct stmmac_priv *priv = netdev_priv(dev);
 
 	/* Clear Tx resources and restart transmitting again */
-	stmmac_tx_err(priv);
+	schedule_work(&priv->tx_err_work);
 }
 
 /**
@@ -3338,6 +3366,7 @@ int stmmac_dvr_probe(struct device *device,
 	netif_napi_add(ndev, &priv->napi, stmmac_poll, 64);
 
 	spin_lock_init(&priv->lock);
+	INIT_WORK(&priv->tx_err_work, stmmac_tx_err_work);
 
 	ret = register_netdev(ndev);
 	if (ret) {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-18 16:15                           ` Lino Sanfilippo
@ 2016-12-18 17:23                             ` Pavel Machek
  2016-12-18 18:30                             ` Pavel Machek
  2016-12-18 20:16                             ` Pavel Machek
  2 siblings, 0 replies; 36+ messages in thread
From: Pavel Machek @ 2016-12-18 17:23 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 590 bytes --]

Hi!

> > - e1efa87241272104d6a12c8b9fcdc4f62634d447
> 
> Yep, a sync of the dma descriptors before the hardware gets ownership of the tx tail
> idx is missing in the stmmac, too. 

Thanks for the hint. Actually, the driver uses smp_wmb() which is
completely crazy, and probably misses rmb() in clean_tx, too. Anyway,
I did not notice there are dma_ variants, too... we clearly need them.

Thanks and best regards,
								Pavel
								
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-18 16:15                           ` Lino Sanfilippo
  2016-12-18 17:23                             ` Pavel Machek
@ 2016-12-18 18:30                             ` Pavel Machek
  2016-12-19 22:49                               ` Lino Sanfilippo
  2016-12-18 20:16                             ` Pavel Machek
  2 siblings, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2016-12-18 18:30 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 811 bytes --]

Hi!

> > - e1efa87241272104d6a12c8b9fcdc4f62634d447
> 
> Yep, a sync of the dma descriptors before the hardware gets ownership of the tx tail
> idx is missing in the stmmac, too.

I can reproduce failure with 4.4 fairly easily. I tried with dma_
variant of barriers, and it failed, too

[ 1018.410012] stmmac: early irq
[ 1023.939702] fpga_cmd_read:wait_event timed out!
[ 1033.128692] ------------[ cut here ]------------
[ 1033.133329] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303
dev_watchdog+0x264/0x284()
[ 1033.141748] NETDEV WATCHDOG: eth0 (socfpga-dwmac): transmit queue 0
timed out
[ 1033.148861] Modules linked in:

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-18 16:15                           ` Lino Sanfilippo
  2016-12-18 17:23                             ` Pavel Machek
  2016-12-18 18:30                             ` Pavel Machek
@ 2016-12-18 20:16                             ` Pavel Machek
  2 siblings, 0 replies; 36+ messages in thread
From: Pavel Machek @ 2016-12-18 20:16 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1266 bytes --]

Hi!

> > For the same reason it's broken if it races with the transmit path: it
> > can release driver resources while the transmit path uses these.
> > 
> > Btw the points below may not matter/hurt much for a proof a concept
> > but they would need to be addressed as well:
> > 1) unchecked (and avoidable) extra error paths due to stmmac_release()
> > 2) racy cancel_work_sync. Low probability as it is, an irq + error could
> >    take place right after cancel_work_sync
> 
> It was indeed only meant as a proof of concept. Nevertheless the race is not 
> good, since one can run into it when faking the tx error for testings purposes.
> So below is a slightly improved version of the restart handling.
> Its not meant as a final version either. But maybe we can use it as a starting
> point.

Certainly works better than version we currently have in tree. I'm
running it in a loop, and it survived 10 minutes of testing so
far. (Previous version killed the hardware at first iteration.)

> Again the patch is only compile tested.

Tested-by: Pavel Machek <pavel@denx.de>

Thanks!
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-18  0:15                         ` Francois Romieu
  2016-12-18 16:15                           ` Lino Sanfilippo
@ 2016-12-19 10:02                           ` Pavel Machek
  2016-12-20  0:05                             ` Francois Romieu
  1 sibling, 1 reply; 36+ messages in thread
From: Pavel Machek @ 2016-12-19 10:02 UTC (permalink / raw)
  To: Francois Romieu
  Cc: Lino Sanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 2119 bytes --]

Hi!

> Lino, have you considered via-rhine.c since its "move work from irq to
> workqueue context" changes that started in
> 7ab87ff4c770eed71e3777936299292739fcd0fe [*] ?
> 
> It's a shameless plug - originated in r8169.c - but it should be rather
> close to what the sxgbe and friends require. Thought / opinion ?
> 
> [*] Including fixes/changes in:
> - 3a5a883a8a663b930908cae4abe5ec913b9b2fd2
> - e1efa87241272104d6a12c8b9fcdc4f62634d447
> - 810f19bcb862f8889b27e0c9d9eceac9593925dd
> - e45af497950a89459a0c4b13ffd91e1729fffef4
> - a926592f5e4e900f3fa903298c4619a131e60963
> - 559bcac35facfed49ab4f408e162971612dcfdf3

Considering the memory barriers... is something like this neccessary
in the via-rhine?

AFAICT... we need a barrier after making sure that descriptor is no
longer owned by DMA (to make sure we don't get stale data in rest of
descriptor)... and we need a barrier before giving the descriptor to
the dma, to make sure DMA engine sees the complete update....?

Thanks,
								Pavel

diff --git a/drivers/net/ethernet/via/via-rhine.c b/drivers/net/ethernet/via/via-rhine.c
index ba5c542..3806e72 100644
--- a/drivers/net/ethernet/via/via-rhine.c
+++ b/drivers/net/ethernet/via/via-rhine.c
@@ -1952,6 +1952,7 @@ static void rhine_tx(struct net_device *dev)
 			  entry, txstatus);
 		if (txstatus & DescOwn)
 			break;
+		dma_rmb();
 		skb = rp->tx_skbuff[entry];
 		if (txstatus & 0x8000) {
 			netif_dbg(rp, tx_done, dev,
@@ -2061,6 +2062,7 @@ static int rhine_rx(struct net_device *dev, int limit)
 
 		if (desc_status & DescOwn)
 			break;
+		dma_rmb();
 
 		netif_dbg(rp, rx_status, dev, "%s() status %08x\n", __func__,
 			  desc_status);
@@ -2146,6 +2148,7 @@ static int rhine_rx(struct net_device *dev, int limit)
 			u64_stats_update_end(&rp->rx_stats.syncp);
 		}
 give_descriptor_to_nic:
+		dma_wmb();
 		desc->rx_status = cpu_to_le32(DescOwn);
 		entry = (++rp->cur_rx) % RX_RING_SIZE;
 	}

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-18 18:30                             ` Pavel Machek
@ 2016-12-19 22:49                               ` Lino Sanfilippo
  0 siblings, 0 replies; 36+ messages in thread
From: Lino Sanfilippo @ 2016-12-19 22:49 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Francois Romieu, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

Hi,

On 18.12.2016 19:30, Pavel Machek wrote:
> Hi!
> 
>> > - e1efa87241272104d6a12c8b9fcdc4f62634d447
>> 
>> Yep, a sync of the dma descriptors before the hardware gets ownership of the tx tail
>> idx is missing in the stmmac, too.
> 
> I can reproduce failure with 4.4 fairly easily. I tried with dma_
> variant of barriers, and it failed, too
> 
> [ 1018.410012] stmmac: early irq
> [ 1023.939702] fpga_cmd_read:wait_event timed out!
> [ 1033.128692] ------------[ cut here ]------------
> [ 1033.133329] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303
> dev_watchdog+0x264/0x284()
> [ 1033.141748] NETDEV WATCHDOG: eth0 (socfpga-dwmac): transmit queue 0
> timed out
> [ 1033.148861] Modules linked in:

This watchdog warning clearly says that for some reason the tx queue was stopped
but never woken up in a certain timespan (5 sec in our case, which is a lot).

Does this occur after the queue has been stopped and woken up again a few times or
is it already the first time the queue is stopped (and never woken up)? 

Regards,
Lino

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
  2016-12-19 10:02                           ` Pavel Machek
@ 2016-12-20  0:05                             ` Francois Romieu
  0 siblings, 0 replies; 36+ messages in thread
From: Francois Romieu @ 2016-12-20  0:05 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Lino Sanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, davem, linux-kernel, netdev

Pavel Machek <pavel@ucw.cz> :
[...]
> Considering the memory barriers... is something like this neccessary
> in the via-rhine ?

Yes.

> AFAICT... we need a barrier after making sure that descriptor is no
> longer owned by DMA (to make sure we don't get stale data in rest of
> descriptor)... and we need a barrier before giving the descriptor to
> the dma, to make sure DMA engine sees the complete update....?

I would not expect stale data while processing a single transmit
descriptor as the transmit completion does not use the rest of the
descriptor at all in the via-rhine driver. However I agree that transmit
descriptors should be read by the cpu with adequate ordering so the
dma_rmb() should stay.

Same kind of narrative for dma_wmb rhine_rx (s/read/written/ and
s/cpu/device/).

> diff --git a/drivers/net/ethernet/via/via-rhine.c b/drivers/net/ethernet/via/via-rhine.c
> index ba5c542..3806e72 100644
> --- a/drivers/net/ethernet/via/via-rhine.c
> +++ b/drivers/net/ethernet/via/via-rhine.c
[...]
> @@ -2061,6 +2062,7 @@ static int rhine_rx(struct net_device *dev, int limit)
>  
>  		if (desc_status & DescOwn)
>  			break;
> +		dma_rmb();
>  

I agree with your explanation for this one (late vlan processing in a 
different word from the same descriptor).

-- 
Ueimor

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2016-12-20  0:05 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-07 20:05 Remove private locks to avoid possible deadlock Lino Sanfilippo
2016-12-07 20:05 ` [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock Lino Sanfilippo
2016-12-07 23:15   ` Francois Romieu
2016-12-08 20:32     ` Lino Sanfilippo
2016-12-08 21:54       ` Pavel Machek
2016-12-08 22:12         ` Lino Sanfilippo
2016-12-08 22:18           ` Pavel Machek
2016-12-08 22:45             ` Lino Sanfilippo
2016-12-08 23:19       ` Francois Romieu
2016-12-09 11:21         ` Pavel Machek
2016-12-10  2:25           ` Lino Sanfilippo
2016-12-11 20:11             ` Pavel Machek
2016-12-15 19:27               ` Lino Sanfilippo
2016-12-15 21:03                 ` Pavel Machek
2016-12-15 21:32                   ` Lino Sanfilippo
2016-12-15 22:33                     ` Lino Sanfilippo
2016-12-17 17:31                       ` Pavel Machek
2016-12-18  0:15                         ` Francois Romieu
2016-12-18 16:15                           ` Lino Sanfilippo
2016-12-18 17:23                             ` Pavel Machek
2016-12-18 18:30                             ` Pavel Machek
2016-12-19 22:49                               ` Lino Sanfilippo
2016-12-18 20:16                             ` Pavel Machek
2016-12-19 10:02                           ` Pavel Machek
2016-12-20  0:05                             ` Francois Romieu
2016-12-07 20:05 ` [PATCH 2/2] net: ethernet: stmmac: " Lino Sanfilippo
2016-12-07 20:55   ` Pavel Machek
2016-12-07 20:59   ` Pavel Machek
2016-12-07 21:37   ` Pavel Machek
2016-12-07 21:43     ` Lino Sanfilippo
2016-12-07 22:34       ` Lino Sanfilippo
2016-12-07 23:21         ` Pavel Machek
2016-12-07 23:41     ` David Miller
2016-12-08 14:08       ` Pavel Machek
2016-12-08 15:26         ` David Miller
2016-12-08 15:46           ` Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).