* [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker
@ 2017-02-25 14:22 Eric Dumazet
  2017-02-26 16:32 ` Saeed Mahameed
  2017-02-27  3:31 ` [PATCH net] net: solve a NAPI race Eric Dumazet
  0 siblings, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-25 14:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan, Saeed Mahameed

From: Eric Dumazet <edumazet@google.com>

While playing with hardware timestamping of RX packets, I found
that some packets were received by the TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but it was hurting RPC
workloads.

I could track this down to the ConnectX-3 (mlx4 driver) card that was
apparently sending an IRQ before we would arm it.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.
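
For reference, the usual sequence in a NAPI driver looks roughly like
this (an illustrative sketch only, using hypothetical foo_* names, not
the actual mlx4 code):

#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct foo_ring {
	struct napi_struct napi;
	/* device-specific ring state ... */
};

/* Hard IRQ handler: mask the vector and hand the work over to NAPI. */
static irqreturn_t foo_msix_handler(int irq, void *data)
{
	struct foo_ring *ring = data;

	foo_mask_irq(ring);			/* device-specific write */
	napi_schedule_irqoff(&ring->napi);	/* grabs NAPI_STATE_SCHED */
	return IRQ_HANDLED;
}

/* NAPI poll: the IRQ is re-armed only after napi_complete_done()
 * has cleared NAPI_STATE_SCHED.
 */
static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_ring *ring = container_of(napi, struct foo_ring, napi);
	int done = foo_clean_rx(ring, budget);	/* hypothetical RX cleanup */

	if (done < budget && napi_complete_done(napi, done))
		foo_unmask_irq(ring);		/* device-specific write */
	return done;
}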

This patch adds a new rx_irq_missed field that is incremented every time
the hard IRQ handler could not grab NAPI_STATE_SCHED.

Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq_missed was
incremented and attempts to read more packets, if it can re-acquire
NAPI_STATE_SCHED.

Note that this workaround would probably not work if the IRQ is spread
over many cpus, since it assumes the hard irq and softirq are handled by
the same cpu. This kind of setup is buggy anyway because of reordering
issues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |   32 +++++++++++++----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    1 
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 867292880c07a15124a0cf099d1fcda09926548e..7c262dc6a9971ca99a890bc3cff49289f0ef43fc 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1132,10 +1132,14 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq)
 	struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq);
 	struct mlx4_en_priv *priv = netdev_priv(cq->dev);
 
-	if (likely(priv->port_up))
-		napi_schedule_irqoff(&cq->napi);
-	else
+	if (likely(priv->port_up)) {
+		if (napi_schedule_prep(&cq->napi))
+			__napi_schedule_irqoff(&cq->napi);
+		else
+			cq->rx_irq_missed++;
+	} else {
 		mlx4_en_arm_cq(priv, cq);
+	}
 }
 
 /* Rx CQ polling - called by NAPI */
@@ -1144,9 +1148,12 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 	struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
 	struct net_device *dev = cq->dev;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	int done;
+	int done = 0;
+	u32 rx_irq_missed;
 
-	done = mlx4_en_process_rx_cq(dev, cq, budget);
+again:
+	rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
+	done += mlx4_en_process_rx_cq(dev, cq, budget - done);
 
 	/* If we used up all the quota - we're probably not done yet... */
 	if (done == budget) {
@@ -1171,10 +1178,21 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 		 */
 		if (done)
 			done--;
+		if (napi_complete_done(napi, done))
+			mlx4_en_arm_cq(priv, cq);
+		return done;
 	}
-	/* Done for now */
-	if (napi_complete_done(napi, done))
+	if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed))
+		goto again;
+
+	if (napi_complete_done(napi, done)) {
 		mlx4_en_arm_cq(priv, cq);
+
+		/* We might have received an interrupt too soon */
+		if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed) &&
+		    napi_reschedule(napi))
+			goto again;
+	}
 	return done;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 4941b692e9479bc7800e92f9bfb13baf3ff7d0d4..030f305fcca7fa4b3b9a15be3fe11e572d665003 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -369,6 +369,7 @@ struct mlx4_en_cq {
 	int                     ring;
 	struct net_device      *dev;
 	struct napi_struct	napi;
+	u32			rx_irq_missed;
 	int size;
 	int buf_size;
 	int vector;


* Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker
  2017-02-25 14:22 [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker Eric Dumazet
@ 2017-02-26 16:32 ` Saeed Mahameed
  2017-02-26 17:34   ` Eric Dumazet
  2017-02-27  3:31 ` [PATCH net] net: solve a NAPI race Eric Dumazet
  1 sibling, 1 reply; 34+ messages in thread
From: Saeed Mahameed @ 2017-02-26 16:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Sat, Feb 25, 2017 at 4:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While playing with hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
>
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
>
> This would happen with a very low probability, but hurting RPC
> workloads.
>
> I could track this down to the cx-3 (mlx4 driver) card that was
> apparently sending an IRQ before we would arm it.
>

Hi Eric,

This is highly unlikely; the hardware should not do that, and if this
is really the case we need to hunt down the root cause rather than
work around it.

> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
>
> This patch adds a new rx_irq_miss field that is incremented every time
> the hard IRQ handler could not grab NAPI_STATE_SCHED.
>
> Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq was incremented
> and attempts to read more packets, if it can re-acquire NAPI_STATE_SCHED

Are you sure this is not some kind of race condition with the busy
polling mechanism introduced in ("net/mlx4_en: use napi_complete_done()
return value")? Maybe when the busy polling thread detects that it
wants to yield, it arms the CQ too early (when napi is not ready)?

Anyway, Tariq and I would like to further investigate the fired IRQ
while the CQ is not armed. It smells like a bug at the driver/napi
level; it is not expected HW behavior.

Any pointers on how to reproduce? How often would the "rx_irq_missed"
counter advance on a line-rate RX load?

>
> Note that this work around would probably not work if the IRQ is spread
> over many cpus, since it assume the hard irq and softirq are handled by
> the same cpu. This kind of setup is buggy anyway because of reordering
> issues.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |   32 +++++++++++++----
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    1
>  2 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 867292880c07a15124a0cf099d1fcda09926548e..7c262dc6a9971ca99a890bc3cff49289f0ef43fc 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1132,10 +1132,14 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq)
>         struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq);
>         struct mlx4_en_priv *priv = netdev_priv(cq->dev);
>
> -       if (likely(priv->port_up))
> -               napi_schedule_irqoff(&cq->napi);
> -       else
> +       if (likely(priv->port_up)) {
> +               if (napi_schedule_prep(&cq->napi))
> +                       __napi_schedule_irqoff(&cq->napi);
> +               else
> +                       cq->rx_irq_missed++;
> +       } else {
>                 mlx4_en_arm_cq(priv, cq);
> +       }
>  }
>
>  /* Rx CQ polling - called by NAPI */
> @@ -1144,9 +1148,12 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
>         struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
>         struct net_device *dev = cq->dev;
>         struct mlx4_en_priv *priv = netdev_priv(dev);
> -       int done;
> +       int done = 0;
> +       u32 rx_irq_missed;
>
> -       done = mlx4_en_process_rx_cq(dev, cq, budget);
> +again:
> +       rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
> +       done += mlx4_en_process_rx_cq(dev, cq, budget - done);
>
>         /* If we used up all the quota - we're probably not done yet... */
>         if (done == budget) {
> @@ -1171,10 +1178,21 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
>                  */
>                 if (done)
>                         done--;
> +               if (napi_complete_done(napi, done))
> +                       mlx4_en_arm_cq(priv, cq);
> +               return done;
>         }
> -       /* Done for now */
> -       if (napi_complete_done(napi, done))
> +       if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed))
> +               goto again;
> +
> +       if (napi_complete_done(napi, done)) {
>                 mlx4_en_arm_cq(priv, cq);
> +
> +               /* We might have received an interrupt too soon */
> +               if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed) &&
> +                   napi_reschedule(napi))
> +                       goto again;
> +       }
>         return done;
>  }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 4941b692e9479bc7800e92f9bfb13baf3ff7d0d4..030f305fcca7fa4b3b9a15be3fe11e572d665003 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -369,6 +369,7 @@ struct mlx4_en_cq {
>         int                     ring;
>         struct net_device      *dev;
>         struct napi_struct      napi;
> +       u32                     rx_irq_missed;
>         int size;
>         int buf_size;
>         int vector;
>
>


* Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker
  2017-02-26 16:32 ` Saeed Mahameed
@ 2017-02-26 17:34   ` Eric Dumazet
  2017-02-26 17:40     ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2017-02-26 17:34 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Sun, 2017-02-26 at 18:32 +0200, Saeed Mahameed wrote:
> On Sat, Feb 25, 2017 at 4:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > While playing with hardware timestamping of RX packets, I found
> > that some packets were received by TCP stack with a ~200 ms delay...
> >
> > Since the timestamp was provided by the NIC, and my probe was added
> > in tcp_v4_rcv() while in BH handler, I was confident it was not
> > a sender issue, or a drop in the network.
> >
> > This would happen with a very low probability, but hurting RPC
> > workloads.
> >
> > I could track this down to the cx-3 (mlx4 driver) card that was
> > apparently sending an IRQ before we would arm it.
> >
> 
> Hi Eric,
> 
> This is highly unlikely, the hardware should not do that, and if this
> is really the case
> we need to hunt down the root cause and not work around it.

Well, I definitely see the interrupt coming while the napi bit is not
available.


> 
> > A NAPI driver normally arms the IRQ after the napi_complete_done(),
> > after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> > it.
> >
> > This patch adds a new rx_irq_miss field that is incremented every time
> > the hard IRQ handler could not grab NAPI_STATE_SCHED.
> >
> > Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq was incremented
> > and attempts to read more packets, if it can re-acquire NAPI_STATE_SCHED
> 
> Are you sure this is not some kind of race condition with the busy
> polling mechanism
> Introduced in ("net/mlx4_en: use napi_complete_done() return value") ?
> Maybe the busy polling thread when it detects that it wants to yield,
> it arms the CQ too early (when napi is not ready)?
> 

Good question.

No busy polling in my tests.

I have triggered it on a 2x10 Gbit host

(bonding of two 10Gbit mlx4 ports)


ethtool -C eth1 adaptive-rx on rx-usecs-low 0
ethtool -C eth2 adaptive-rx on rx-usecs-low 0
./super_netperf 9 -H lpaa6 -t TCP_RR -l 10000 -- -r 1000000,1000000 &


4 rx and tx queues per NIC, fq packet scheduler on them.

TCP timestamps are on. (This might be important to get the last packet
of a given size; in my case 912 bytes, with a PSH flag.)


> Anyway Tariq and I would like to further investigate the fired IRQ
> while CQ is not armed.  It smells
> like  a bug in the driver/napi level, it is not a HW expected behavior.
> 
> Any pointers on how to reproduce ?  how often would the "rx_irq_miss"
> counter advance on a linerate RX load ?


About 1000 times per second on my hosts, receiving about 1.2 Mpps.
But most of these misses are not an issue because the next packet
arrives maybe less than 10 usec later.

Note that the bug is hard to notice, because TCP would fast retransmit,
or the next packet would come soon enough.

You have to be unlucky enough that the RX queue that missed the NAPI
schedule receives no more packets before the 200 ms RTO timer fires, and
the packet stuck in the RX ring is the last packet of the RPC.


I believe part of the problem is that NAPI_STATE_SCHED can be grabbed by
a process while hard irqs are not disabled.

I do not believe this bug is mlx4 specific.

Anything doing the following while hard irqs are not masked:

local_bh_disable();
napi_reschedule(&priv->rx_cq[ring]->napi);
local_bh_enable();

(like in mlx4_en_recover_from_oom())

can really trigger the issue.

Unfortunately I do not see how the core layer can handle this.
Only the driver hard irq handler could possibly know that it could not
grab NAPI_STATE_SCHED.


* Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker
  2017-02-26 17:34   ` Eric Dumazet
@ 2017-02-26 17:40     ` Eric Dumazet
  2017-02-26 18:03       ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2017-02-26 17:40 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Sun, 2017-02-26 at 09:34 -0800, Eric Dumazet wrote:

> I do not believe this bug is mlx4 specific.
> 
> Anything doing the following while hard irq were not masked :
> 
> local_bh_disable();
> napi_reschedule(&priv->rx_cq[ring]->napi);
> local_bh_enable();
> 
> Like in mlx4_en_recover_from_oom()
> 
> Can trigger the issue really.
> 
> Unfortunately I do not see how core layer can handle this.
> Only the driver hard irq could possibly know that it could not grab
> NAPI_STATE_SCHED

Actually we could use an additional bit for that, one that the driver
would set even if NAPI_STATE_SCHED could not be grabbed.

Let me try something.


* Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker
  2017-02-26 17:40     ` Eric Dumazet
@ 2017-02-26 18:03       ` Eric Dumazet
  0 siblings, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-26 18:03 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Sun, 2017-02-26 at 09:40 -0800, Eric Dumazet wrote:
> NAPI_STATE_SCHED
> 
> Actually we could use an additional bit for that, that the driver would
> set even if NAPI_STATE_SCHED could not be grabbed.

Just to be clear:

Drivers would require no change; this would be done in
existing helpers.


* [PATCH net] net: solve a NAPI race
  2017-02-25 14:22 [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker Eric Dumazet
  2017-02-26 16:32 ` Saeed Mahameed
@ 2017-02-27  3:31 ` Eric Dumazet
  2017-02-27 14:06   ` Eric Dumazet
  2017-02-27 14:21   ` [PATCH v2 " Eric Dumazet
  1 sibling, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27  3:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan, Saeed Mahameed

From: Eric Dumazet <edumazet@google.com>

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by the TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but it was hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

The problem is that if another point in the stack grabs the
NAPI_STATE_SCHED bit while IRQs are not disabled, we might later have an
IRQ firing and finding this bit set, right before napi_complete_done()
clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

This patch adds a new NAPI_STATE_MISSED bit that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED.

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |   29 +++++++----------------
 net/core/dev.c            |   44 ++++++++++++++++++++++++++++++++++--
 2 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
 
 enum {
 	NAPI_STATE_SCHED,	/* Poll is scheduled */
+	NAPI_STATE_MISSED,	/* reschedule a napi poll */
 	NAPI_STATE_DISABLE,	/* Disable pending */
 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
 	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
 };
 
 enum {
-	NAPIF_STATE_SCHED	 = (1UL << NAPI_STATE_SCHED),
-	NAPIF_STATE_DISABLE	 = (1UL << NAPI_STATE_DISABLE),
-	NAPIF_STATE_NPSVC	 = (1UL << NAPI_STATE_NPSVC),
-	NAPIF_STATE_HASHED	 = (1UL << NAPI_STATE_HASHED),
-	NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
-	NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+	NAPIF_STATE_SCHED	 = BIT(NAPI_STATE_SCHED),
+	NAPIF_STATE_MISSED	 = BIT(NAPI_STATE_MISSED),
+	NAPIF_STATE_DISABLE	 = BIT(NAPI_STATE_DISABLE),
+	NAPIF_STATE_NPSVC	 = BIT(NAPI_STATE_NPSVC),
+	NAPIF_STATE_HASHED	 = BIT(NAPI_STATE_HASHED),
+	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
 };
 
 enum gro_result {
@@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
 	return test_bit(NAPI_STATE_DISABLE, &n->state);
 }
 
-/**
- *	napi_schedule_prep - check if NAPI can be scheduled
- *	@n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
-	return !napi_disable_pending(n) &&
-		!test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
 
 /**
  *	napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..82d868049ba78260a5f28376842657b72bd31994 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,32 @@ void __napi_schedule(struct napi_struct *n)
 EXPORT_SYMBOL(__napi_schedule);
 
 /**
+ *	napi_schedule_prep - check if napi can be scheduled
+ *	@n: napi context
+ *
+ * Test if NAPI routine is already running, and if not mark
+ * it as running.  This is used as a condition variable
+ * insure only one NAPI poll instance runs.  We also make
+ * sure there is no pending NAPI disable.
+ */
+bool napi_schedule_prep(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (unlikely(val & NAPIF_STATE_DISABLE))
+			return false;
+		new = val | NAPIF_STATE_SCHED;
+		if (unlikely(val & NAPIF_STATE_SCHED))
+			new |= NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return !(val & NAPIF_STATE_SCHED);
+}
+EXPORT_SYMBOL(napi_schedule_prep);
+
+/**
  * __napi_schedule_irqoff - schedule for receive
  * @n: entry to schedule
  *
@@ -4897,7 +4923,7 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
 
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
-	unsigned long flags;
+	unsigned long flags, val, new;
 
 	/*
 	 * 1) Don't let napi dequeue from the cpu poll list
@@ -4927,7 +4953,21 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 		list_del_init(&n->poll_list);
 		local_irq_restore(flags);
 	}
-	WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
+
+	do {
+		val = READ_ONCE(n->state);
+
+		WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
+
+		new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
+
+		if (unlikely(val & NAPIF_STATE_MISSED))
+			new |= NAPIF_STATE_SCHED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	if (unlikely(val & NAPIF_STATE_MISSED))
+		__napi_schedule(n);
+
 	return true;
 }
 EXPORT_SYMBOL(napi_complete_done);


* Re: [PATCH net] net: solve a NAPI race
  2017-02-27  3:31 ` [PATCH net] net: solve a NAPI race Eric Dumazet
@ 2017-02-27 14:06   ` Eric Dumazet
  2017-02-27 14:21   ` [PATCH v2 " Eric Dumazet
  1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 14:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan, Saeed Mahameed

On Sun, 2017-02-26 at 19:31 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> While playing with mlx4 hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
> 
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
> 
> This would happen with a very low probability, but hurting RPC
> workloads.
> 
> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
> 
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
> 
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
> 
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED
> 
> Then napi_complete_done() properly reschedules the napi to make sure
> we do not miss something.
> 
> Since we manipulate multiple bits at once, use cmpxchg() like in
> sk_busy_loop() to provide proper transactions.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/netdevice.h |   29 +++++++----------------
>  net/core/dev.c            |   44 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 51 insertions(+), 22 deletions(-)

I will send a v2 of this patch.

Points trying to grab NAPI_STATE_SCHED not from the device driver IRQ
handler should not set NAPI_STATE_MISSED if they fail; otherwise this
adds extra work for no purpose.

One of these points is napi_watchdog(), which can fire pretty often.


* [PATCH v2 net] net: solve a NAPI race
  2017-02-27  3:31 ` [PATCH net] net: solve a NAPI race Eric Dumazet
  2017-02-27 14:06   ` Eric Dumazet
@ 2017-02-27 14:21   ` Eric Dumazet
  2017-02-27 16:19     ` David Miller
                       ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 14:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tariq Toukan, Saeed Mahameed

From: Eric Dumazet <edumazet@google.com>

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by the TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but it was hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

The problem is that if another point in the stack grabs the
NAPI_STATE_SCHED bit while IRQs are not disabled, we might later have an
IRQ firing and finding this bit set, right before napi_complete_done()
clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

This patch adds a new NAPI_STATE_MISSED bit that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED.

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep(): no need to set NAPI_STATE_MISSED from this point.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |   29 ++++++-------------
 net/core/dev.c            |   53 +++++++++++++++++++++++++++++++++---
 2 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
 
 enum {
 	NAPI_STATE_SCHED,	/* Poll is scheduled */
+	NAPI_STATE_MISSED,	/* reschedule a napi */
 	NAPI_STATE_DISABLE,	/* Disable pending */
 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
 	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
 };
 
 enum {
-	NAPIF_STATE_SCHED	 = (1UL << NAPI_STATE_SCHED),
-	NAPIF_STATE_DISABLE	 = (1UL << NAPI_STATE_DISABLE),
-	NAPIF_STATE_NPSVC	 = (1UL << NAPI_STATE_NPSVC),
-	NAPIF_STATE_HASHED	 = (1UL << NAPI_STATE_HASHED),
-	NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
-	NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+	NAPIF_STATE_SCHED	 = BIT(NAPI_STATE_SCHED),
+	NAPIF_STATE_MISSED	 = BIT(NAPI_STATE_MISSED),
+	NAPIF_STATE_DISABLE	 = BIT(NAPI_STATE_DISABLE),
+	NAPIF_STATE_NPSVC	 = BIT(NAPI_STATE_NPSVC),
+	NAPIF_STATE_HASHED	 = BIT(NAPI_STATE_HASHED),
+	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
 };
 
 enum gro_result {
@@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
 	return test_bit(NAPI_STATE_DISABLE, &n->state);
 }
 
-/**
- *	napi_schedule_prep - check if NAPI can be scheduled
- *	@n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
-	return !napi_disable_pending(n) &&
-		!test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
 
 /**
  *	napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..edeb916487015f279036ecf7ff5d9096dff365d3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,32 @@ void __napi_schedule(struct napi_struct *n)
 EXPORT_SYMBOL(__napi_schedule);
 
 /**
+ *	napi_schedule_prep - check if napi can be scheduled
+ *	@n: napi context
+ *
+ * Test if NAPI routine is already running, and if not mark
+ * it as running.  This is used as a condition variable
+ * insure only one NAPI poll instance runs.  We also make
+ * sure there is no pending NAPI disable.
+ */
+bool napi_schedule_prep(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (unlikely(val & NAPIF_STATE_DISABLE))
+			return false;
+		new = val | NAPIF_STATE_SCHED;
+		if (unlikely(val & NAPIF_STATE_SCHED))
+			new |= NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return !(val & NAPIF_STATE_SCHED);
+}
+EXPORT_SYMBOL(napi_schedule_prep);
+
+/**
  * __napi_schedule_irqoff - schedule for receive
  * @n: entry to schedule
  *
@@ -4897,7 +4923,7 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
 
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
-	unsigned long flags;
+	unsigned long flags, val, new;
 
 	/*
 	 * 1) Don't let napi dequeue from the cpu poll list
@@ -4927,7 +4953,21 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 		list_del_init(&n->poll_list);
 		local_irq_restore(flags);
 	}
-	WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
+
+	do {
+		val = READ_ONCE(n->state);
+
+		WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
+
+		new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
+
+		if (unlikely(val & NAPIF_STATE_MISSED))
+			new |= NAPIF_STATE_SCHED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	if (unlikely(val & NAPIF_STATE_MISSED))
+		__napi_schedule(n);
+
 	return true;
 }
 EXPORT_SYMBOL(napi_complete_done);
@@ -5088,8 +5128,13 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
 	struct napi_struct *napi;
 
 	napi = container_of(timer, struct napi_struct, timer);
-	if (napi->gro_list)
-		napi_schedule_irqoff(napi);
+
+	/* Note : we use a relaxed variant of napi_schedule_prep() not setting
+	 * NAPI_STATE_MISSED, since we do not react to a device IRQ.
+	 */
+	if (napi->gro_list && !napi_disable_pending(napi) &&
+	    !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
+		__napi_schedule_irqoff(napi);
 
 	return HRTIMER_NORESTART;
 }


* Re: [PATCH v2 net] net: solve a NAPI race
  2017-02-27 14:21   ` [PATCH v2 " Eric Dumazet
@ 2017-02-27 16:19     ` David Miller
  2017-02-27 16:44       ` Eric Dumazet
  2017-02-27 21:00       ` Alexander Duyck
  2017-02-27 20:18     ` [PATCH v3 " Eric Dumazet
  2017-02-28 17:20     ` [PATCH v2 " Alexander Duyck
  2 siblings, 2 replies; 34+ messages in thread
From: David Miller @ 2017-02-27 16:19 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tariqt, saeedm

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2017 06:21:38 -0800

> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
> 
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
> 
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
> 
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED

Various rules were meant to protect these sequences, and make sure
nothing like this race could happen.

Can you show the specific sequence that fails?

One of the basic protections is that the device IRQ is not re-enabled
until napi_complete_done() is finished, most drivers do something like
this:

	napi_complete_done();
		- clears NAPI_STATE_SCHED
	enable device IRQ

So I don't understand how it is possible that "later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it".

While napi_complete_done() is running, the device's IRQ is still
disabled, so there cannot be an IRQ firing before napi_complete_done()
is finished.


* Re: [PATCH v2 net] net: solve a NAPI race
  2017-02-27 16:19     ` David Miller
@ 2017-02-27 16:44       ` Eric Dumazet
  2017-02-27 17:10         ` Eric Dumazet
  2017-02-28  2:08         ` David Miller
  2017-02-27 21:00       ` Alexander Duyck
  1 sibling, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 16:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tariqt, saeedm

On Mon, 2017-02-27 at 11:19 -0500, David Miller wrote:

> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
> 
> Can you show the specific sequence that fails?
> 
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished, most drivers do something like
> this:
> 
> 	napi_complete_done();
> 		- sets NAPI_STATE_SCHED
> 	enable device IRQ
> 
> So I don't understand how it is possible that "later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it".
> 
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.


Any point doing a napi_schedule() not from the device hard irq handler
is subject to the race for NICs using some kind of edge-triggered
interrupts.

Since we do not provide an ndo to disable device interrupts,
the following can happen.

thread 1                                 thread 2 (could be on same cpu)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is available.
                                          device hard irq

                                          // does nothing because NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();
                                          
napi_complete_done(napi, 2);
rearm_irq();


Note that rearm_irq() will not force the device to send an additional IRQ
for the packet it already signaled (3rd packet in my example)

At least for mlx4, only 4th packet will trigger the IRQ again.

In the old days, the race would not happen since napi->poll() was called
in direct response to a prior device IRQ: edge-triggered hard irqs from
the device for this queue were already disabled.


* Re: [PATCH v2 net] net: solve a NAPI race
  2017-02-27 16:44       ` Eric Dumazet
@ 2017-02-27 17:10         ` Eric Dumazet
  2017-02-28  2:08         ` David Miller
  1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 17:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tariqt, saeedm

On Mon, 2017-02-27 at 08:44 -0800, Eric Dumazet wrote:

> // busy polling or napi_watchdog()

BTW, we also can add to the beginning of busy_poll_stop():

clear_bit(NAPI_STATE_MISSED, &napi->state);


* [PATCH v3 net] net: solve a NAPI race
  2017-02-27 14:21   ` [PATCH v2 " Eric Dumazet
  2017-02-27 16:19     ` David Miller
@ 2017-02-27 20:18     ` Eric Dumazet
  2017-02-27 22:14       ` Stephen Hemminger
                         ` (2 more replies)
  2017-02-28 17:20     ` [PATCH v2 " Alexander Duyck
  2 siblings, 3 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 20:18 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by the TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but it was hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

The problem is that if another point in the stack grabs the
NAPI_STATE_SCHED bit while IRQs are not disabled, we might later have an
IRQ firing and finding this bit set, right before napi_complete_done()
clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is available.
                                          device hard irq

                                          // does nothing because NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();
                                          
napi_complete_done(napi, 2);
rearm_irq();


Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)



This patch adds a new NAPI_STATE_MISSED bit that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED.

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep(): no need to set NAPI_STATE_MISSED from this point.

In v3, I added more details in the changelog and now clear
NAPI_STATE_MISSED in busy_poll_stop().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |   29 +++++-----------
 net/core/dev.c            |   63 +++++++++++++++++++++++++++++++++---
 2 files changed, 68 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
 
 enum {
 	NAPI_STATE_SCHED,	/* Poll is scheduled */
+	NAPI_STATE_MISSED,	/* reschedule a napi */
 	NAPI_STATE_DISABLE,	/* Disable pending */
 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
 	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
 };
 
 enum {
-	NAPIF_STATE_SCHED	 = (1UL << NAPI_STATE_SCHED),
-	NAPIF_STATE_DISABLE	 = (1UL << NAPI_STATE_DISABLE),
-	NAPIF_STATE_NPSVC	 = (1UL << NAPI_STATE_NPSVC),
-	NAPIF_STATE_HASHED	 = (1UL << NAPI_STATE_HASHED),
-	NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
-	NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+	NAPIF_STATE_SCHED	 = BIT(NAPI_STATE_SCHED),
+	NAPIF_STATE_MISSED	 = BIT(NAPI_STATE_MISSED),
+	NAPIF_STATE_DISABLE	 = BIT(NAPI_STATE_DISABLE),
+	NAPIF_STATE_NPSVC	 = BIT(NAPI_STATE_NPSVC),
+	NAPIF_STATE_HASHED	 = BIT(NAPI_STATE_HASHED),
+	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
 };
 
 enum gro_result {
@@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
 	return test_bit(NAPI_STATE_DISABLE, &n->state);
 }
 
-/**
- *	napi_schedule_prep - check if NAPI can be scheduled
- *	@n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
-	return !napi_disable_pending(n) &&
-		!test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
 
 /**
  *	napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..afcab3670aac18a9c193bf5a09e36e3dc9d0d63c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,32 @@ void __napi_schedule(struct napi_struct *n)
 EXPORT_SYMBOL(__napi_schedule);
 
 /**
+ *	napi_schedule_prep - check if napi can be scheduled
+ *	@n: napi context
+ *
+ * Test if NAPI routine is already running, and if not mark
+ * it as running.  This is used as a condition variable
+ * insure only one NAPI poll instance runs.  We also make
+ * sure there is no pending NAPI disable.
+ */
+bool napi_schedule_prep(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (unlikely(val & NAPIF_STATE_DISABLE))
+			return false;
+		new = val | NAPIF_STATE_SCHED;
+		if (unlikely(val & NAPIF_STATE_SCHED))
+			new |= NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return !(val & NAPIF_STATE_SCHED);
+}
+EXPORT_SYMBOL(napi_schedule_prep);
+
+/**
  * __napi_schedule_irqoff - schedule for receive
  * @n: entry to schedule
  *
@@ -4897,7 +4923,7 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
 
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
-	unsigned long flags;
+	unsigned long flags, val, new;
 
 	/*
 	 * 1) Don't let napi dequeue from the cpu poll list
@@ -4927,7 +4953,21 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 		list_del_init(&n->poll_list);
 		local_irq_restore(flags);
 	}
-	WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
+
+	do {
+		val = READ_ONCE(n->state);
+
+		WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
+
+		new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
+
+		if (unlikely(val & NAPIF_STATE_MISSED))
+			new |= NAPIF_STATE_SCHED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	if (unlikely(val & NAPIF_STATE_MISSED))
+		__napi_schedule(n);
+
 	return true;
 }
 EXPORT_SYMBOL(napi_complete_done);
@@ -4953,6 +4993,16 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
 {
 	int rc;
 
+	/* Busy polling means there is a high chance device driver hard irq
+	 * could not grab NAPI_STATE_SCHED, and that NAPI_STATE_MISSED was
+	 * set in napi_schedule_prep().
+	 * Since we are about to call napi->poll() once more, we can safely
+	 * clear NAPI_STATE_MISSED.
+	 *
+	 * Note: x86 could use a single "lock and ..." instruction
+	 * to perform these two clear_bit()
+	 */
+	clear_bit(NAPI_STATE_MISSED, &napi->state);
 	clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
 
 	local_bh_disable();
@@ -5088,8 +5138,13 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
 	struct napi_struct *napi;
 
 	napi = container_of(timer, struct napi_struct, timer);
-	if (napi->gro_list)
-		napi_schedule_irqoff(napi);
+
+	/* Note : we use a relaxed variant of napi_schedule_prep() not setting
+	 * NAPI_STATE_MISSED, since we do not react to a device IRQ.
+	 */
+	if (napi->gro_list && !napi_disable_pending(napi) &&
+	    !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
+		__napi_schedule_irqoff(napi);
 
 	return HRTIMER_NORESTART;
 }


* Re: [PATCH v2 net] net: solve a NAPI race
  2017-02-27 16:19     ` David Miller
  2017-02-27 16:44       ` Eric Dumazet
@ 2017-02-27 21:00       ` Alexander Duyck
  1 sibling, 0 replies; 34+ messages in thread
From: Alexander Duyck @ 2017-02-27 21:00 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Netdev, Tariq Toukan, Saeed Mahameed

On Mon, Feb 27, 2017 at 8:19 AM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 27 Feb 2017 06:21:38 -0800
>
>> A NAPI driver normally arms the IRQ after the napi_complete_done(),
>> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
>> it.
>>
>> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
>> while IRQ are not disabled, we might have later an IRQ firing and
>> finding this bit set, right before napi_complete_done() clears it.
>>
>> This can happen with busy polling users, or if gro_flush_timeout is
>> used. But some other uses of napi_schedule() in drivers can cause this
>> as well.
>>
>> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
>> can set if it could not grab NAPI_STATE_SCHED
>
> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
>
> Can you show the specific sequence that fails?
>
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished, most drivers do something like
> this:
>
>         napi_complete_done();
>                 - clears NAPI_STATE_SCHED
>         enable device IRQ
>
> So I don't understand how it is possible that "later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it".
>
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.

So there are some drivers that will need to have the interrupts
enabled when busy polling, and I assume that can cause this kind of
issue. Specifically, in the case of i40e, the part will not flush
completed descriptors until either 4 completed descriptors are ready
to be written back or an interrupt fires.

Our other drivers have code in them that will force the interrupt to
unmask and fire once every 2 seconds in the unlikely event that an
interrupt was lost, which can occur on some platforms.

- Alex


* Re: [PATCH v3 net] net: solve a NAPI race
  2017-02-27 20:18     ` [PATCH v3 " Eric Dumazet
@ 2017-02-27 22:14       ` Stephen Hemminger
  2017-02-27 22:35         ` Eric Dumazet
  2017-02-28 16:17       ` Stephen Hemminger
  2017-02-28 18:34       ` [PATCH v4 " Eric Dumazet
  2 siblings, 1 reply; 34+ messages in thread
From: Stephen Hemminger @ 2017-02-27 22:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Mon, 27 Feb 2017 12:18:31 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> thread 1                                 thread 2 (could be on same cpu)
> 
> // busy polling or napi_watchdog()
> napi_schedule();
> ...
> napi->poll()
> 
> device polling:
> read 2 packets from ring buffer
>                                           Additional 3rd packet is available.
>                                           device hard irq
> 
>                                           // does nothing because NAPI_STATE_SCHED bit is owned by thread 1
>                                           napi_schedule();
>                                           
> napi_complete_done(napi, 2);
> rearm_irq();

The original design (as Davem mentioned) was that IRQs must be disabled
during device polling. If that was true, then the race above
would be impossible. Also, NAPI assumes interrupts are level-triggered.


* Re: [PATCH v3 net] net: solve a NAPI race
  2017-02-27 22:14       ` Stephen Hemminger
@ 2017-02-27 22:35         ` Eric Dumazet
  2017-02-27 22:44           ` Stephen Hemminger
  2017-02-28 10:14           ` David Laight
  0 siblings, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-27 22:35 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev

On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:

> The original design (as Davem mentioned) was that IRQ's must be disabled
> during device polling. If that was true, then the race above
> would be impossible.

I would love to see an alternative patch.


* Re: [PATCH v3 net] net: solve a NAPI race
  2017-02-27 22:35         ` Eric Dumazet
@ 2017-02-27 22:44           ` Stephen Hemminger
  2017-02-27 22:48             ` David Miller
  2017-02-28 10:14           ` David Laight
  1 sibling, 1 reply; 34+ messages in thread
From: Stephen Hemminger @ 2017-02-27 22:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Mon, 27 Feb 2017 14:35:17 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> 
> > The original design (as Davem mentioned) was that IRQ's must be disabled
> > during device polling. If that was true, then the race above
> > would be impossible.  
> 
> I would love to see an alternative patch.

Turn off busy poll?
The poll stuff runs the risk of breaking more things.


* Re: [PATCH v3 net] net: solve a NAPI race
  2017-02-27 22:44           ` Stephen Hemminger
@ 2017-02-27 22:48             ` David Miller
  2017-02-27 23:23               ` Stephen Hemminger
  0 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2017-02-27 22:48 UTC (permalink / raw)
  To: stephen; +Cc: eric.dumazet, netdev

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Mon, 27 Feb 2017 14:44:55 -0800

> On Mon, 27 Feb 2017 14:35:17 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
>> 
>> > The original design (as Davem mentioned) was that IRQ's must be disabled
>> > during device polling. If that was true, then the race above
>> > would be impossible.  
>> 
>> I would love to see an alternative patch.
> 
> Turn off busy poll? 
> The poll stuff runs risk of breaking more things.

Eric is exactly trying to make busy poll even more prominent in
the stack, not less prominent.

It's an important component of some performance improvements he is
working on.


* Re: [PATCH v3 net] net: solve a NAPI race
  2017-02-27 22:48             ` David Miller
@ 2017-02-27 23:23               ` Stephen Hemminger
  0 siblings, 0 replies; 34+ messages in thread
From: Stephen Hemminger @ 2017-02-27 23:23 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Mon, 27 Feb 2017 17:48:54 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <stephen@networkplumber.org>
> Date: Mon, 27 Feb 2017 14:44:55 -0800
> 
> > On Mon, 27 Feb 2017 14:35:17 -0800
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >   
> >> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> >>   
> >> > The original design (as Davem mentioned) was that IRQ's must be disabled
> >> > during device polling. If that was true, then the race above
> >> > would be impossible.    
> >> 
> >> I would love to see an alternative patch.  
> > 
> > Turn off busy poll? 
> > The poll stuff runs risk of breaking more things.  
> 
> Eric is exactly trying to make busy poll even more prominent in
> the stack, not less prominent.
> 
> It's an important component of some performance improvements he is
> working on.

Maybe making IRQ control part of the network device model
(instead of a side effect left to the device driver to handle) would
be less problematic.

Really just shooting in the dark because I don't have any of the problematic
hardware to play with.


* Re: [PATCH v2 net] net: solve a NAPI race
  2017-02-27 16:44       ` Eric Dumazet
  2017-02-27 17:10         ` Eric Dumazet
@ 2017-02-28  2:08         ` David Miller
  2017-03-01  0:22           ` Francois Romieu
  1 sibling, 1 reply; 34+ messages in thread
From: David Miller @ 2017-02-28  2:08 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tariqt, saeedm

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Feb 2017 08:44:14 -0800

> Any point doing a napi_schedule() not from device hard irq handler
> is subject to the race for NIC using some kind of edge trigger
> interrupts.
> 
> Since we do not provide a ndo to disable device interrupts, the
> following can happen.

Ok, now I understand.

I think even without considering the race you are trying to solve,
this situation is really dangerous.

I am sure that every ->poll() handler out there was written by an
author who completely assumed that if they are executing then the
device's interrupts for that NAPI instance are disabled.  And this is
with very few, if any, exceptions.

So if we saw a driver doing something like:

	reg->irq_enable ^= value;

after napi_complete_done(), it would be quite understandable.

We really made a mistake taking the napi_schedule() call out of
the domain of the driver so that it could manage the interrupt
state properly.

I'm not against your missed bit fix as a short-term cure for now, it's
just that somewhere down the road we need to manage the interrupt
properly.


* RE: [PATCH v3 net] net: solve a NAPI race
  2017-02-27 22:35         ` Eric Dumazet
  2017-02-27 22:44           ` Stephen Hemminger
@ 2017-02-28 10:14           ` David Laight
  2017-02-28 13:04             ` Eric Dumazet
  2017-02-28 13:20             ` Eric Dumazet
  1 sibling, 2 replies; 34+ messages in thread
From: David Laight @ 2017-02-28 10:14 UTC (permalink / raw)
  To: 'Eric Dumazet', Stephen Hemminger; +Cc: David Miller, netdev

From: Eric Dumazet
> Sent: 27 February 2017 22:35
> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> 
> > The original design (as Davem mentioned) was that IRQ's must be disabled
> > during device polling. If that was true, then the race above
> > would be impossible.
> 
> I would love to see an alternative patch.

Can you test for 'receive data available' after telling the NAPI
logic that you've finished?
You'd then need to force a reschedule.
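
Something along these lines, perhaps (just a sketch; foo_arm_irq() and
foo_rx_pending() are hypothetical driver helpers, the latter peeking at
the RX ring/CQ, and 'again' is a label at the top of the poll routine
as in Eric's mlx4 patch):

	if (done < budget && napi_complete_done(napi, done)) {
		foo_arm_irq(ring);	/* device-specific write */
		/* Re-check for work that arrived after the last ring scan:
		 * if something is pending, take NAPI_STATE_SCHED back and
		 * poll again instead of waiting for the next IRQ.
		 */
		if (foo_rx_pending(ring) && napi_reschedule(napi))
			goto again;
	}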

I think your proposed patch will do a reschedule if any packet arrives
during the receive processing, not just when one arrives right at the end.
You might want to 'unset' the reschedule flag before each check of the
receive ring.

I also wonder about the cost of processing the MSI-X (I guess) interrupts
compared to the cost of posted PCIe writes to disable and/or mask the
interrupt generation.
Clearly you don't want to do PCIe reads.

	David



* Re: [PATCH v3 net] net: solve a NAPI race
  2017-02-28 10:14           ` David Laight
@ 2017-02-28 13:04             ` Eric Dumazet
  2017-02-28 13:20             ` Eric Dumazet
  1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-28 13:04 UTC (permalink / raw)
  To: David Laight; +Cc: Stephen Hemminger, David Miller, netdev

On Tue, 2017-02-28 at 10:14 +0000, David Laight wrote:
> From: Eric Dumazet
> > Sent: 27 February 2017 22:35
> > On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> > 
> > > The original design (as Davem mentioned) was that IRQ's must be disabled
> > > during device polling. If that was true, then the race above
> > > would be impossible.
> > 
> > I would love to see an alternative patch.
> 
> Can you test for 'receive data available' after telling the NAPI
> logic that you've finished?
> You'd then need to force a reschedule.
> 
> I think your proposed patch will do a reschedule if any packet arrives
> during the receive processing, not just when one arrives right at the end.
> You might want to 'unset' the reschedule flag before each check of the
> receive ring.
> 
> I also wonder about the cost of processing the MSI-X (I guess) interrupts
> compared to the cost of posted PCIe writes to disable and/or mask the
> interrupt generation.
> Clearly you don't want to do PCIe reads.

Have you seen the mlx4 patch I provided?

Then, I did not want to review 100+ NAPI drivers and provide patches for
them.

This generic solution is basically free. Same number of atomic
operations.

Given it took more than 2 years to even spot the bug, I have no idea how
people on netdev expect me to review all drivers. This is crazy.


* Re: [PATCH v3 net] net: solve a NAPI race
  2017-02-28 10:14           ` David Laight
  2017-02-28 13:04             ` Eric Dumazet
@ 2017-02-28 13:20             ` Eric Dumazet
  1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-28 13:20 UTC (permalink / raw)
  To: David Laight; +Cc: Stephen Hemminger, David Miller, netdev

On Tue, 2017-02-28 at 10:14 +0000, David Laight wrote:
> From: Eric Dumazet
> > Sent: 27 February 2017 22:35
> > On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> > 
> > > The original design (as Davem mentioned) was that IRQ's must be disabled
> > > during device polling. If that was true, then the race above
> > > would be impossible.
> > 
> > I would love to see an alternative patch.
> 
> Can you test for 'receive data available' after telling the NAPI
> logic that you've finished?
> You'd then need to force a reschedule.
> 

You understand that a 'reschedule' only fires another invocation of
napi->poll() right away?

Hot cache lines, so basically zero cost.
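
(For reference, requeueing the napi boils down to roughly this, a simplified
rendering of ____napi_schedule() in net/core/dev.c; the list entry and the
softirq flag are already hot in cache.)

static inline void ____napi_schedule(struct softnet_data *sd,
				     struct napi_struct *napi)
{
	list_add_tail(&napi->poll_list, &sd->poll_list);
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}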

In my stress tests, this happens 0.01% of the time.  The bug is tiny.

(Otherwise we would have spotted it earlier.)

> I think your proposed patch will do a reschedule if any packet arrives
> during the receive processing, not just when one arrives right at the end.
> You might want to 'unset' the reschedule flag before each check of the
> receive ring.

Well, no:

The interrupt has been masked before the napi poll was scheduled.

As David and Stephen say, this condition must not happen, unless you
are really unlucky while doing busy polling, which is an opportunistic
call of napi->poll() while you have idle cycles on your cpu.

> 
> I also wonder about the cost of processing the MSI-X (I guess) interrupts
> compared to the cost of posted PCIe writes to disable and/or mask the
> interrupt generation.
> Clearly you don't want to do PCIe reads.

Seems irrelevant to the bug we are discussing.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 net] net: solve a NAPI race
  2017-02-27 20:18     ` [PATCH v3 " Eric Dumazet
  2017-02-27 22:14       ` Stephen Hemminger
@ 2017-02-28 16:17       ` Stephen Hemminger
  2017-02-28 16:57         ` Eric Dumazet
  2017-02-28 18:34       ` [PATCH v4 " Eric Dumazet
  2 siblings, 1 reply; 34+ messages in thread
From: Stephen Hemminger @ 2017-02-28 16:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Mon, 27 Feb 2017 12:18:31 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.

Where were IRQs re-enabled?


> thread 1                                 thread 2 (could be on same cpu)
> 
> // busy polling or napi_watchdog()
> napi_schedule();
> ...
> napi->poll()
> 
> device polling:
> read 2 packets from ring buffer
>                                           Additional 3rd packet is available.
>                                           device hard irq
> 
>                                           // does nothing because NAPI_STATE_SCHED bit is owned by thread 1
>                                           napi_schedule();
>                                           
> napi_complete_done(napi, 2);
> rearm_irq();

Maybe it is just as simple as using irqsave/irqrestore in the driver.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 net] net: solve a NAPI race
  2017-02-28 16:17       ` Stephen Hemminger
@ 2017-02-28 16:57         ` Eric Dumazet
  0 siblings, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-28 16:57 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev

On Tue, 2017-02-28 at 08:17 -0800, Stephen Hemminger wrote:

> Maybe just as simple as using irqsave/irqrestore in driver.


The CPUs can be different.

irqsave will not help.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net] net: solve a NAPI race
  2017-02-27 14:21   ` [PATCH v2 " Eric Dumazet
  2017-02-27 16:19     ` David Miller
  2017-02-27 20:18     ` [PATCH v3 " Eric Dumazet
@ 2017-02-28 17:20     ` Alexander Duyck
  2017-02-28 17:47       ` Eric Dumazet
  2017-03-01 10:41       ` David Laight
  2 siblings, 2 replies; 34+ messages in thread
From: Alexander Duyck @ 2017-02-28 17:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Mon, Feb 27, 2017 at 6:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While playing with mlx4 hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
>
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
>
> This would happen with a very low probability, but hurting RPC
> workloads.
>
> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
>
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
>
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
>
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED
>
> Then napi_complete_done() properly reschedules the napi to make sure
> we do not miss something.
>
> Since we manipulate multiple bits at once, use cmpxchg() like in
> sk_busy_loop() to provide proper transactions.
>
> In v2, I changed napi_watchdog() to use a relaxed variant of
> napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/netdevice.h |   29 ++++++-------------
>  net/core/dev.c            |   53 +++++++++++++++++++++++++++++++++---
>  2 files changed, 58 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -330,6 +330,7 @@ struct napi_struct {
>
>  enum {
>         NAPI_STATE_SCHED,       /* Poll is scheduled */
> +       NAPI_STATE_MISSED,      /* reschedule a napi */
>         NAPI_STATE_DISABLE,     /* Disable pending */
>         NAPI_STATE_NPSVC,       /* Netpoll - don't dequeue from poll_list */
>         NAPI_STATE_HASHED,      /* In NAPI hash (busy polling possible) */
> @@ -338,12 +339,13 @@ enum {
>  };
>
>  enum {
> -       NAPIF_STATE_SCHED        = (1UL << NAPI_STATE_SCHED),
> -       NAPIF_STATE_DISABLE      = (1UL << NAPI_STATE_DISABLE),
> -       NAPIF_STATE_NPSVC        = (1UL << NAPI_STATE_NPSVC),
> -       NAPIF_STATE_HASHED       = (1UL << NAPI_STATE_HASHED),
> -       NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
> -       NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
> +       NAPIF_STATE_SCHED        = BIT(NAPI_STATE_SCHED),
> +       NAPIF_STATE_MISSED       = BIT(NAPI_STATE_MISSED),
> +       NAPIF_STATE_DISABLE      = BIT(NAPI_STATE_DISABLE),
> +       NAPIF_STATE_NPSVC        = BIT(NAPI_STATE_NPSVC),
> +       NAPIF_STATE_HASHED       = BIT(NAPI_STATE_HASHED),
> +       NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
> +       NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
>  };
>
>  enum gro_result {
> @@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
>         return test_bit(NAPI_STATE_DISABLE, &n->state);
>  }
>
> -/**
> - *     napi_schedule_prep - check if NAPI can be scheduled
> - *     @n: NAPI context
> - *
> - * Test if NAPI routine is already running, and if not mark
> - * it as running.  This is used as a condition variable to
> - * insure only one NAPI poll instance runs.  We also make
> - * sure there is no pending NAPI disable.
> - */
> -static inline bool napi_schedule_prep(struct napi_struct *n)
> -{
> -       return !napi_disable_pending(n) &&
> -               !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
> -}
> +bool napi_schedule_prep(struct napi_struct *n);
>
>  /**
>   *     napi_schedule - schedule NAPI poll
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 304f2deae5f9897e60a79ed8b69d6ef208295ded..edeb916487015f279036ecf7ff5d9096dff365d3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4884,6 +4884,32 @@ void __napi_schedule(struct napi_struct *n)
>  EXPORT_SYMBOL(__napi_schedule);
>
>  /**
> + *     napi_schedule_prep - check if napi can be scheduled
> + *     @n: napi context
> + *
> + * Test if NAPI routine is already running, and if not mark
> + * it as running.  This is used as a condition variable
> + * insure only one NAPI poll instance runs.  We also make
> + * sure there is no pending NAPI disable.
> + */
> +bool napi_schedule_prep(struct napi_struct *n)
> +{
> +       unsigned long val, new;
> +
> +       do {
> +               val = READ_ONCE(n->state);
> +               if (unlikely(val & NAPIF_STATE_DISABLE))
> +                       return false;
> +               new = val | NAPIF_STATE_SCHED;
> +               if (unlikely(val & NAPIF_STATE_SCHED))
> +                       new |= NAPIF_STATE_MISSED;

You might want to consider just using a combination of AND, divide,
multiply, and OR to avoid having any conditional branches added on
this code path.  Basically the logic would look like:
    new = val | NAPIF_STATE_SCHED;
    new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED * NAPIF_STATE_MISSED;

In assembler that all ends up getting translated out to AND, SHL, OR.
You avoid the branching, or MOV/OR/TEST/CMOV type code you would end
up with otherwise.
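
A standalone toy program illustrating the transformation (bit values are
arbitrary stand-ins, not the real NAPI state layout):

#include <stdio.h>

#define STATE_SCHED  0x1UL		/* stand-in for NAPIF_STATE_SCHED  */
#define STATE_MISSED 0x2UL		/* stand-in for NAPIF_STATE_MISSED */

/* Branch-free: (val & SCHED)/SCHED is 0 or 1, so multiplying it by MISSED
 * yields either 0 or the MISSED bit, compiling to and/shift/or.
 */
static unsigned long next_state(unsigned long val)
{
	unsigned long new = val | STATE_SCHED;

	new |= (val & STATE_SCHED) / STATE_SCHED * STATE_MISSED;
	return new;
}

int main(void)
{
	printf("%#lx\n", next_state(0x0UL));	/* 0x1: SCHED set, no MISSED   */
	printf("%#lx\n", next_state(0x1UL));	/* 0x3: SCHED held, MISSED set */
	return 0;
}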

> +       } while (cmpxchg(&n->state, val, new) != val);
> +
> +       return !(val & NAPIF_STATE_SCHED);
> +}
> +EXPORT_SYMBOL(napi_schedule_prep);
> +
> +/**
>   * __napi_schedule_irqoff - schedule for receive
>   * @n: entry to schedule
>   *
> @@ -4897,7 +4923,7 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
>
>  bool napi_complete_done(struct napi_struct *n, int work_done)
>  {
> -       unsigned long flags;
> +       unsigned long flags, val, new;
>
>         /*
>          * 1) Don't let napi dequeue from the cpu poll list
> @@ -4927,7 +4953,21 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>                 list_del_init(&n->poll_list);
>                 local_irq_restore(flags);
>         }
> -       WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
> +
> +       do {
> +               val = READ_ONCE(n->state);
> +
> +               WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
> +
> +               new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
> +
> +               if (unlikely(val & NAPIF_STATE_MISSED))
> +                       new |= NAPIF_STATE_SCHED;

Same kind of thing here.  You could probably just get away with something like:
    new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
    new |= (val & NAPIF_STATE_MISSED) / NAPIF_STATE_MISSED * NAPIF_STATE_SCHED;

> +       } while (cmpxchg(&n->state, val, new) != val);
> +
> +       if (unlikely(val & NAPIF_STATE_MISSED))
> +               __napi_schedule(n);
> +
>         return true;
>  }

If you rescheduled the napi, should you really be returning true?  It
seems like you should be returning "!(val & NAPIF_STATE_MISSED)" to try
to avoid letting this occur again.

>  EXPORT_SYMBOL(napi_complete_done);
> @@ -5088,8 +5128,13 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
>         struct napi_struct *napi;
>
>         napi = container_of(timer, struct napi_struct, timer);
> -       if (napi->gro_list)
> -               napi_schedule_irqoff(napi);
> +
> +       /* Note : we use a relaxed variant of napi_schedule_prep() not setting
> +        * NAPI_STATE_MISSED, since we do not react to a device IRQ.
> +        */
> +       if (napi->gro_list && !napi_disable_pending(napi) &&
> +           !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
> +               __napi_schedule_irqoff(napi);
>
>         return HRTIMER_NORESTART;
>  }
>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net] net: solve a NAPI race
  2017-02-28 17:20     ` [PATCH v2 " Alexander Duyck
@ 2017-02-28 17:47       ` Eric Dumazet
  2017-03-01 10:41       ` David Laight
  1 sibling, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2017-02-28 17:47 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Tue, 2017-02-28 at 09:20 -0800, Alexander Duyck wrote:
> On Mon, Feb 27, 2017 at 6:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > +bool napi_schedule_prep(struct napi_struct *n)
> > +{
> > +       unsigned long val, new;
> > +
> > +       do {
> > +               val = READ_ONCE(n->state);
> > +               if (unlikely(val & NAPIF_STATE_DISABLE))
> > +                       return false;
> > +               new = val | NAPIF_STATE_SCHED;
> > +               if (unlikely(val & NAPIF_STATE_SCHED))
> > +                       new |= NAPIF_STATE_MISSED;
> 
> You might want to consider just using a combination AND, divide,
> multiply, and OR to avoid having to have any conditional branches
> being added due to this code path.  Basically the logic would look
> like:


>     new = val | NAPIF_STATE_SCHED;
>     new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED * NAPIF_STATE_MISSED;
> 
> In assembler that all ends up getting translated out to AND, SHL, OR.
> You avoid the branching, or MOV/OR/TEST/CMOV type code you would end
> up with otherwise.

Sure, I can try to optimize this a bit ;)


> > +       } while (cmpxchg(&n->state, val, new) != val);
> > +
> > +       if (unlikely(val & NAPIF_STATE_MISSED))
> > +               __napi_schedule(n);
> > +
> >         return true;
> >  }
> 
> If you rescheduled napi should you really be returning true?  Seems
> like you should be returning "!(val & NAPIF_STATE_MISSED)" to try to
> avoid letting this occur again.

Good idea.

Hmm... you mean that many drivers test napi_complete_done() return
value ?

;)

Thanks !

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 net] net: solve a NAPI race
  2017-02-27 20:18     ` [PATCH v3 " Eric Dumazet
  2017-02-27 22:14       ` Stephen Hemminger
  2017-02-28 16:17       ` Stephen Hemminger
@ 2017-02-28 18:34       ` Eric Dumazet
  2017-03-01 17:53         ` David Miller
  2 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2017-02-28 18:34 UTC (permalink / raw)
  To: David Miller, Alexander Duyck; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is available.
                                          device hard irq

                                          // does nothing because NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();
                                          
napi_complete_done(napi, 2);
rearm_irq();


Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)

This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

In v3, I added more details in the changelog and cleared
NAPI_STATE_MISSED in busy_poll_stop()

In v4, I added the ideas given by Alexander Duyck in v3 review

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
---
 include/linux/netdevice.h |   29 ++++---------
 net/core/dev.c            |   76 ++++++++++++++++++++++++++++++++++--
 2 files changed, 81 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
 
 enum {
 	NAPI_STATE_SCHED,	/* Poll is scheduled */
+	NAPI_STATE_MISSED,	/* reschedule a napi */
 	NAPI_STATE_DISABLE,	/* Disable pending */
 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
 	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
 };
 
 enum {
-	NAPIF_STATE_SCHED	 = (1UL << NAPI_STATE_SCHED),
-	NAPIF_STATE_DISABLE	 = (1UL << NAPI_STATE_DISABLE),
-	NAPIF_STATE_NPSVC	 = (1UL << NAPI_STATE_NPSVC),
-	NAPIF_STATE_HASHED	 = (1UL << NAPI_STATE_HASHED),
-	NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
-	NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+	NAPIF_STATE_SCHED	 = BIT(NAPI_STATE_SCHED),
+	NAPIF_STATE_MISSED	 = BIT(NAPI_STATE_MISSED),
+	NAPIF_STATE_DISABLE	 = BIT(NAPI_STATE_DISABLE),
+	NAPIF_STATE_NPSVC	 = BIT(NAPI_STATE_NPSVC),
+	NAPIF_STATE_HASHED	 = BIT(NAPI_STATE_HASHED),
+	NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+	NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
 };
 
 enum gro_result {
@@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
 	return test_bit(NAPI_STATE_DISABLE, &n->state);
 }
 
-/**
- *	napi_schedule_prep - check if NAPI can be scheduled
- *	@n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
-	return !napi_disable_pending(n) &&
-		!test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
 
 /**
  *	napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..ab4fe5304834e43790fa023d99d7fd1844f8f728 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,39 @@ void __napi_schedule(struct napi_struct *n)
 EXPORT_SYMBOL(__napi_schedule);
 
 /**
+ *	napi_schedule_prep - check if napi can be scheduled
+ *	@n: napi context
+ *
+ * Test if NAPI routine is already running, and if not mark
+ * it as running.  This is used as a condition variable
+ * insure only one NAPI poll instance runs.  We also make
+ * sure there is no pending NAPI disable.
+ */
+bool napi_schedule_prep(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (unlikely(val & NAPIF_STATE_DISABLE))
+			return false;
+		new = val | NAPIF_STATE_SCHED;
+
+		/* Sets STATE_MISSED bit if STATE_SCHED was already set
+		 * This was suggested by Alexander Duyck, as compiler
+		 * emits better code than :
+		 * if (val & NAPIF_STATE_SCHED)
+		 *     new |= NAPIF_STATE_MISSED;
+		 */
+		new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED *
+						   NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return !(val & NAPIF_STATE_SCHED);
+}
+EXPORT_SYMBOL(napi_schedule_prep);
+
+/**
  * __napi_schedule_irqoff - schedule for receive
  * @n: entry to schedule
  *
@@ -4897,7 +4930,7 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
 
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
-	unsigned long flags;
+	unsigned long flags, val, new;
 
 	/*
 	 * 1) Don't let napi dequeue from the cpu poll list
@@ -4927,7 +4960,27 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 		list_del_init(&n->poll_list);
 		local_irq_restore(flags);
 	}
-	WARN_ON_ONCE(!test_and_clear_bit(NAPI_STATE_SCHED, &n->state));
+
+	do {
+		val = READ_ONCE(n->state);
+
+		WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
+
+		new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
+
+		/* If STATE_MISSED was set, leave STATE_SCHED set,
+		 * because we will call napi->poll() one more time.
+		 * This C code was suggested by Alexander Duyck to help gcc.
+		 */
+		new |= (val & NAPIF_STATE_MISSED) / NAPIF_STATE_MISSED *
+						    NAPIF_STATE_SCHED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	if (unlikely(val & NAPIF_STATE_MISSED)) {
+		__napi_schedule(n);
+		return false;
+	}
+
 	return true;
 }
 EXPORT_SYMBOL(napi_complete_done);
@@ -4953,6 +5006,16 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
 {
 	int rc;
 
+	/* Busy polling means there is a high chance device driver hard irq
+	 * could not grab NAPI_STATE_SCHED, and that NAPI_STATE_MISSED was
+	 * set in napi_schedule_prep().
+	 * Since we are about to call napi->poll() once more, we can safely
+	 * clear NAPI_STATE_MISSED.
+	 *
+	 * Note: x86 could use a single "lock and ..." instruction
+	 * to perform these two clear_bit()
+	 */
+	clear_bit(NAPI_STATE_MISSED, &napi->state);
 	clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
 
 	local_bh_disable();
@@ -5088,8 +5151,13 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
 	struct napi_struct *napi;
 
 	napi = container_of(timer, struct napi_struct, timer);
-	if (napi->gro_list)
-		napi_schedule_irqoff(napi);
+
+	/* Note : we use a relaxed variant of napi_schedule_prep() not setting
+	 * NAPI_STATE_MISSED, since we do not react to a device IRQ.
+	 */
+	if (napi->gro_list && !napi_disable_pending(napi) &&
+	    !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
+		__napi_schedule_irqoff(napi);
 
 	return HRTIMER_NORESTART;
 }

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net] net: solve a NAPI race
  2017-02-28  2:08         ` David Miller
@ 2017-03-01  0:22           ` Francois Romieu
  2017-03-01  1:04             ` Stephen Hemminger
  0 siblings, 1 reply; 34+ messages in thread
From: Francois Romieu @ 2017-03-01  0:22 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, tariqt, saeedm

David Miller <davem@davemloft.net> :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 27 Feb 2017 08:44:14 -0800
> 
> > Any point doing a napi_schedule() not from device hard irq handler
> > is subject to the race for NIC using some kind of edge trigger
> > interrupts.
> > 
> > Since we do not provide a ndo to disable device interrupts, the
> > following can happen.
> 
> Ok, now I understand.
> 
> I think even without considering the race you are trying to solve,
> this situation is really dangerous.
> 
> I am sure that every ->poll() handler out there was written by an
> author who completely assumed that if they are executing then the
> device's interrupts for that NAPI instance are disabled.  And this is
> with very few, if any, exceptions.

Shareable PCI irqs used to remind authors that such an assumption was
not always right.  OTOH it was still manageable as long as only
level-triggered irqs were involved.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net] net: solve a NAPI race
  2017-03-01  0:22           ` Francois Romieu
@ 2017-03-01  1:04             ` Stephen Hemminger
  0 siblings, 0 replies; 34+ messages in thread
From: Stephen Hemminger @ 2017-03-01  1:04 UTC (permalink / raw)
  To: Francois Romieu; +Cc: David Miller, eric.dumazet, netdev, tariqt, saeedm

On Wed, 1 Mar 2017 01:22:40 +0100
Francois Romieu <romieu@fr.zoreil.com> wrote:

> David Miller <davem@davemloft.net> :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 27 Feb 2017 08:44:14 -0800
> >   
> > > Any point doing a napi_schedule() not from device hard irq handler
> > > is subject to the race for NIC using some kind of edge trigger
> > > interrupts.
> > > 
> > > Since we do not provide a ndo to disable device interrupts, the
> > > following can happen.  
> > 
> > Ok, now I understand.
> > 
> > I think even without considering the race you are trying to solve,
> > this situation is really dangerous.
> > 
> > I am sure that every ->poll() handler out there was written by an
> > author who completely assumed that if they are executing then the
> > device's interrupts for that NAPI instance are disabled.  And this is
> > with very few, if any, exceptions.  
> 
> Shareable pci irq used to remind author that such an assumption was
> not always right. Otoh it was still manageable as long as level only
> triggered irq were involved.
> 

When I had to deal with that in sky2, the best way was to have a single
NAPI poll handler shared between both ports. Works well and avoids races
in interrupt handling and enabling.
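
Roughly what that shared-poll arrangement looks like, sketched with
hypothetical foo_* names rather than the actual sky2 structures:

struct foo_hw {
	struct napi_struct	napi;		/* one NAPI instance for the whole chip */
	struct foo_port		*port[2];
};

static int foo_shared_poll(struct napi_struct *napi, int budget)
{
	struct foo_hw *hw = container_of(napi, struct foo_hw, napi);
	int done = 0;
	int i;

	for (i = 0; i < 2 && done < budget; i++)
		if (hw->port[i])
			done += foo_clean_rx(hw->port[i], budget - done);

	if (done < budget && napi_complete_done(napi, done))
		foo_enable_chip_irq(hw);	/* single rearm covers both ports */

	return done;
}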

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v2 net] net: solve a NAPI race
  2017-02-28 17:20     ` [PATCH v2 " Alexander Duyck
  2017-02-28 17:47       ` Eric Dumazet
@ 2017-03-01 10:41       ` David Laight
  2017-03-01 16:14         ` Alexander Duyck
  1 sibling, 1 reply; 34+ messages in thread
From: David Laight @ 2017-03-01 10:41 UTC (permalink / raw)
  To: 'Alexander Duyck', Eric Dumazet
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

From: Alexander Duyck
> Sent: 28 February 2017 17:20
...
> You might want to consider just using a combination AND, divide,
> multiply, and OR to avoid having to have any conditional branches
> being added due to this code path.  Basically the logic would look
> like:
>     new = val | NAPIF_STATE_SCHED;
>     new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED * NAPIF_STATE_MISSED;
> 
> In assembler that all ends up getting translated out to AND, SHL, OR.
> You avoid the branching, or MOV/OR/TEST/CMOV type code you would end
> up with otherwise.

It is a shame gcc doesn't implement that optimisation.
It also doesn't do a good job of (a & b)/b * c, since it
always emits a shr and a sal (gcc 4.7.3 and 5.4).

Worthy of a #define or static inline.
Something like:
#define BIT_IF(v, a, b) ((b & (b-1)) ? (v & a)/a * b : a > b ? (v & a) / (a/b) : (v & a) * (b/a))
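
Hypothetical usage against the NAPI bits discussed above, just to show how
the macro would slot in:

	/* set NAPIF_STATE_MISSED in new iff NAPIF_STATE_SCHED was set in val */
	new = val | NAPIF_STATE_SCHED;
	new |= BIT_IF(val, NAPIF_STATE_SCHED, NAPIF_STATE_MISSED);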

	David

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net] net: solve a NAPI race
  2017-03-01 10:41       ` David Laight
@ 2017-03-01 16:14         ` Alexander Duyck
  2017-03-01 17:32           ` Eric Dumazet
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2017-03-01 16:14 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Dumazet, David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Wed, Mar 1, 2017 at 2:41 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Alexander Duyck
>> Sent: 28 February 2017 17:20
> ...
>> You might want to consider just using a combination AND, divide,
>> multiply, and OR to avoid having to have any conditional branches
>> being added due to this code path.  Basically the logic would look
>> like:
>>     new = val | NAPIF_STATE_SCHED;
>>     new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED * NAPIF_STATE_MISSED;
>>
>> In assembler that all ends up getting translated out to AND, SHL, OR.
>> You avoid the branching, or MOV/OR/TEST/CMOV type code you would end
>> up with otherwise.
>
> It is a shame gcc doesn't contain that optimisation.
> It also doesn't even make a good job of (a & b)/b * c since it
> always does a shr and a sal (gcc 4.7.3 and 5.4).

What build flags are you using?  With -Os or -O2 I have seen it
convert the /b * c into a single shift.

> Worthy of a #define or static inline.
> Something like:
> #define BIT_IF(v, a, b) ((b & (b-1) ? (v & a)/a * b : a > b ? (v & a) / (a/b) : (v & a) * (b/a))
>
>         David

Feel free to put together a patch.  I use this kind of thing in the
Intel drivers in multiple spots to shift stuff from TX_FLAGS into
descriptor flags.

- Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net] net: solve a NAPI race
  2017-03-01 16:14         ` Alexander Duyck
@ 2017-03-01 17:32           ` Eric Dumazet
  2017-03-02 10:24             ` David Laight
  0 siblings, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2017-03-01 17:32 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, David Miller, netdev, Tariq Toukan, Saeed Mahameed

On Wed, 2017-03-01 at 08:14 -0800, Alexander Duyck wrote:

> What build flags are you using?  With -Os or -O2 I have seen it
> convert the /b * c into a single shift.
> 


Because b and c are unsigned in our case.

I presume David tried signed integers; this is why gcc does that.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 net] net: solve a NAPI race
  2017-02-28 18:34       ` [PATCH v4 " Eric Dumazet
@ 2017-03-01 17:53         ` David Miller
  0 siblings, 0 replies; 34+ messages in thread
From: David Miller @ 2017-03-01 17:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.duyck, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Feb 2017 10:34:50 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> While playing with mlx4 hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
> 
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
> 
> This would happen with a very low probability, but hurting RPC
> workloads.
> 
> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
> 
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
> 
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v2 net] net: solve a NAPI race
  2017-03-01 17:32           ` Eric Dumazet
@ 2017-03-02 10:24             ` David Laight
  0 siblings, 0 replies; 34+ messages in thread
From: David Laight @ 2017-03-02 10:24 UTC (permalink / raw)
  To: 'Eric Dumazet', Alexander Duyck
  Cc: David Miller, netdev, Tariq Toukan, Saeed Mahameed

From: Eric Dumazet
> Sent: 01 March 2017 17:33
> On Wed, 2017-03-01 at 08:14 -0800, Alexander Duyck wrote:
> 
> > What build flags are you using?  With -Os or -O2 I have seen it
> > convert the /b * c into a single shift.
> >
> 
> 
> Because b & c are unsigned in our case.
> 
> I presume David tried signed integers, this is why gcc does that.

I was using integer constants but an unsigned variable.
Changing the constants to unsigned makes no difference (they would get
promoted anyway).

After some experiments I can get gcc to generate a single shift if the
variable being tested is 32 bits, the constants are 64 bits and a
64-bit result is required.
In every other case it does either and-shr-shl or shr-and-shl (all the
shr are logical).
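
Something along these lines reproduces the comparison (hypothetical helper
names; compile with e.g. gcc -O2 -S and inspect the generated shifts):

unsigned long map_bit_64(unsigned int v)
{
	/* 32-bit value tested, 64-bit constants and result */
	return (v & 0x2UL) / 0x2UL * 0x100UL;
}

unsigned int map_bit_32(unsigned int v)
{
	/* all 32-bit: and + shr + shl as described above */
	return (v & 0x2u) / 0x2u * 0x100u;
}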

	David

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2017-03-02 10:25 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-25 14:22 [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker Eric Dumazet
2017-02-26 16:32 ` Saeed Mahameed
2017-02-26 17:34   ` Eric Dumazet
2017-02-26 17:40     ` Eric Dumazet
2017-02-26 18:03       ` Eric Dumazet
2017-02-27  3:31 ` [PATCH net] net: solve a NAPI race Eric Dumazet
2017-02-27 14:06   ` Eric Dumazet
2017-02-27 14:21   ` [PATCH v2 " Eric Dumazet
2017-02-27 16:19     ` David Miller
2017-02-27 16:44       ` Eric Dumazet
2017-02-27 17:10         ` Eric Dumazet
2017-02-28  2:08         ` David Miller
2017-03-01  0:22           ` Francois Romieu
2017-03-01  1:04             ` Stephen Hemminger
2017-02-27 21:00       ` Alexander Duyck
2017-02-27 20:18     ` [PATCH v3 " Eric Dumazet
2017-02-27 22:14       ` Stephen Hemminger
2017-02-27 22:35         ` Eric Dumazet
2017-02-27 22:44           ` Stephen Hemminger
2017-02-27 22:48             ` David Miller
2017-02-27 23:23               ` Stephen Hemminger
2017-02-28 10:14           ` David Laight
2017-02-28 13:04             ` Eric Dumazet
2017-02-28 13:20             ` Eric Dumazet
2017-02-28 16:17       ` Stephen Hemminger
2017-02-28 16:57         ` Eric Dumazet
2017-02-28 18:34       ` [PATCH v4 " Eric Dumazet
2017-03-01 17:53         ` David Miller
2017-02-28 17:20     ` [PATCH v2 " Alexander Duyck
2017-02-28 17:47       ` Eric Dumazet
2017-03-01 10:41       ` David Laight
2017-03-01 16:14         ` Alexander Duyck
2017-03-01 17:32           ` Eric Dumazet
2017-03-02 10:24             ` David Laight
