All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker
@ 2017-02-25 14:22 Eric Dumazet
  2017-02-26 16:32 ` Saeed Mahameed
  2017-02-27  3:31 ` [PATCH net] net: solve a NAPI race Eric Dumazet
  0 siblings, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-25 14:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan, Saeed Mahameed

From: Eric Dumazet <edumazet@google.com>

While playing with hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

I could track this down to the cx-3 (mlx4 driver) card that was
apparently sending an IRQ before we would arm it.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

This patch adds a new rx_irq_miss field that is incremented every time
the hard IRQ handler could not grab NAPI_STATE_SCHED.

Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq was incremented
and attempts to read more packets, if it can re-acquire NAPI_STATE_SCHED

Note that this work around would probably not work if the IRQ is spread
over many cpus, since it assume the hard irq and softirq are handled by
the same cpu. This kind of setup is buggy anyway because of reordering
issues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |   32 +++++++++++++----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    1 
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 867292880c07a15124a0cf099d1fcda09926548e..7c262dc6a9971ca99a890bc3cff49289f0ef43fc 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1132,10 +1132,14 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq)
 	struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq);
 	struct mlx4_en_priv *priv = netdev_priv(cq->dev);
 
-	if (likely(priv->port_up))
-		napi_schedule_irqoff(&cq->napi);
-	else
+	if (likely(priv->port_up)) {
+		if (napi_schedule_prep(&cq->napi))
+			__napi_schedule_irqoff(&cq->napi);
+		else
+			cq->rx_irq_missed++;
+	} else {
 		mlx4_en_arm_cq(priv, cq);
+	}
 }
 
 /* Rx CQ polling - called by NAPI */
@@ -1144,9 +1148,12 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 	struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
 	struct net_device *dev = cq->dev;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	int done;
+	int done = 0;
+	u32 rx_irq_missed;
 
-	done = mlx4_en_process_rx_cq(dev, cq, budget);
+again:
+	rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
+	done += mlx4_en_process_rx_cq(dev, cq, budget - done);
 
 	/* If we used up all the quota - we're probably not done yet... */
 	if (done == budget) {
@@ -1171,10 +1178,21 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 		 */
 		if (done)
 			done--;
+		if (napi_complete_done(napi, done))
+			mlx4_en_arm_cq(priv, cq);
+		return done;
 	}
-	/* Done for now */
-	if (napi_complete_done(napi, done))
+	if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed))
+		goto again;
+
+	if (napi_complete_done(napi, done)) {
 		mlx4_en_arm_cq(priv, cq);
+
+		/* We might have received an interrupt too soon */
+		if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed) &&
+		    napi_reschedule(napi))
+			goto again;
+	}
 	return done;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 4941b692e9479bc7800e92f9bfb13baf3ff7d0d4..030f305fcca7fa4b3b9a15be3fe11e572d665003 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -369,6 +369,7 @@ struct mlx4_en_cq {
 	int                     ring;
 	struct net_device      *dev;
 	struct napi_struct	napi;
+	u32			rx_irq_missed;
 	int size;
 	int buf_size;
 	int vector;

^ permalink raw reply related	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2017-03-02 10:25 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-25 14:22 [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker Eric Dumazet
2017-02-26 16:32 ` Saeed Mahameed
2017-02-26 17:34   ` Eric Dumazet
2017-02-26 17:40     ` Eric Dumazet
2017-02-26 18:03       ` Eric Dumazet
2017-02-27  3:31 ` [PATCH net] net: solve a NAPI race Eric Dumazet
2017-02-27 14:06   ` Eric Dumazet
2017-02-27 14:21   ` [PATCH v2 " Eric Dumazet
2017-02-27 16:19     ` David Miller
2017-02-27 16:44       ` Eric Dumazet
2017-02-27 17:10         ` Eric Dumazet
2017-02-28  2:08         ` David Miller
2017-03-01  0:22           ` Francois Romieu
2017-03-01  1:04             ` Stephen Hemminger
2017-02-27 21:00       ` Alexander Duyck
2017-02-27 20:18     ` [PATCH v3 " Eric Dumazet
2017-02-27 22:14       ` Stephen Hemminger
2017-02-27 22:35         ` Eric Dumazet
2017-02-27 22:44           ` Stephen Hemminger
2017-02-27 22:48             ` David Miller
2017-02-27 23:23               ` Stephen Hemminger
2017-02-28 10:14           ` David Laight
2017-02-28 13:04             ` Eric Dumazet
2017-02-28 13:20             ` Eric Dumazet
2017-02-28 16:17       ` Stephen Hemminger
2017-02-28 16:57         ` Eric Dumazet
2017-02-28 18:34       ` [PATCH v4 " Eric Dumazet
2017-03-01 17:53         ` David Miller
2017-02-28 17:20     ` [PATCH v2 " Alexander Duyck
2017-02-28 17:47       ` Eric Dumazet
2017-03-01 10:41       ` David Laight
2017-03-01 16:14         ` Alexander Duyck
2017-03-01 17:32           ` Eric Dumazet
2017-03-02 10:24             ` David Laight

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.