All of lore.kernel.org
 help / color / mirror / Atom feed
From: Saeed Mahameed <saeedm@dev.mellanox.co.il>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>,
	Tariq Toukan <tariqt@mellanox.com>,
	Saeed Mahameed <saeedm@mellanox.com>
Subject: Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker
Date: Sun, 26 Feb 2017 18:32:16 +0200	[thread overview]
Message-ID: <CALzJLG-OcTAKkRQQHXV=8N-m+yATaX76UpGE4eWgK_8N52YePg@mail.gmail.com> (raw)
In-Reply-To: <1488032577.9415.131.camel@edumazet-glaptop3.roam.corp.google.com>

On Sat, Feb 25, 2017 at 4:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While playing with hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
>
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
>
> This would happen with a very low probability, but hurting RPC
> workloads.
>
> I could track this down to the cx-3 (mlx4 driver) card that was
> apparently sending an IRQ before we would arm it.
>

Hi Eric,

This is highly unlikely, the hardware should not do that, and if this
is really the case
we need to hunt down the root cause and not work around it.

> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
>
> This patch adds a new rx_irq_miss field that is incremented every time
> the hard IRQ handler could not grab NAPI_STATE_SCHED.
>
> Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq was incremented
> and attempts to read more packets, if it can re-acquire NAPI_STATE_SCHED

Are you sure this is not some kind of race condition with the busy
polling mechanism
Introduced in ("net/mlx4_en: use napi_complete_done() return value") ?
Maybe the busy polling thread when it detects that it wants to yield,
it arms the CQ too early (when napi is not ready)?

Anyway Tariq and I would like to further investigate the fired IRQ
while CQ is not armed.  It smells
like  a bug in the driver/napi level, it is not a HW expected behavior.

Any pointers on how to reproduce ?  how often would the "rx_irq_miss"
counter advance on a linerate RX load ?

>
> Note that this work around would probably not work if the IRQ is spread
> over many cpus, since it assume the hard irq and softirq are handled by
> the same cpu. This kind of setup is buggy anyway because of reordering
> issues.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |   32 +++++++++++++----
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    1
>  2 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 867292880c07a15124a0cf099d1fcda09926548e..7c262dc6a9971ca99a890bc3cff49289f0ef43fc 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1132,10 +1132,14 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq)
>         struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq);
>         struct mlx4_en_priv *priv = netdev_priv(cq->dev);
>
> -       if (likely(priv->port_up))
> -               napi_schedule_irqoff(&cq->napi);
> -       else
> +       if (likely(priv->port_up)) {
> +               if (napi_schedule_prep(&cq->napi))
> +                       __napi_schedule_irqoff(&cq->napi);
> +               else
> +                       cq->rx_irq_missed++;
> +       } else {
>                 mlx4_en_arm_cq(priv, cq);
> +       }
>  }
>
>  /* Rx CQ polling - called by NAPI */
> @@ -1144,9 +1148,12 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
>         struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
>         struct net_device *dev = cq->dev;
>         struct mlx4_en_priv *priv = netdev_priv(dev);
> -       int done;
> +       int done = 0;
> +       u32 rx_irq_missed;
>
> -       done = mlx4_en_process_rx_cq(dev, cq, budget);
> +again:
> +       rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
> +       done += mlx4_en_process_rx_cq(dev, cq, budget - done);
>
>         /* If we used up all the quota - we're probably not done yet... */
>         if (done == budget) {
> @@ -1171,10 +1178,21 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
>                  */
>                 if (done)
>                         done--;
> +               if (napi_complete_done(napi, done))
> +                       mlx4_en_arm_cq(priv, cq);
> +               return done;
>         }
> -       /* Done for now */
> -       if (napi_complete_done(napi, done))
> +       if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed))
> +               goto again;
> +
> +       if (napi_complete_done(napi, done)) {
>                 mlx4_en_arm_cq(priv, cq);
> +
> +               /* We might have received an interrupt too soon */
> +               if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed) &&
> +                   napi_reschedule(napi))
> +                       goto again;
> +       }
>         return done;
>  }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 4941b692e9479bc7800e92f9bfb13baf3ff7d0d4..030f305fcca7fa4b3b9a15be3fe11e572d665003 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -369,6 +369,7 @@ struct mlx4_en_cq {
>         int                     ring;
>         struct net_device      *dev;
>         struct napi_struct      napi;
> +       u32                     rx_irq_missed;
>         int size;
>         int buf_size;
>         int vector;
>
>

  reply	other threads:[~2017-02-26 16:32 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-25 14:22 [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker Eric Dumazet
2017-02-26 16:32 ` Saeed Mahameed [this message]
2017-02-26 17:34   ` Eric Dumazet
2017-02-26 17:40     ` Eric Dumazet
2017-02-26 18:03       ` Eric Dumazet
2017-02-27  3:31 ` [PATCH net] net: solve a NAPI race Eric Dumazet
2017-02-27 14:06   ` Eric Dumazet
2017-02-27 14:21   ` [PATCH v2 " Eric Dumazet
2017-02-27 16:19     ` David Miller
2017-02-27 16:44       ` Eric Dumazet
2017-02-27 17:10         ` Eric Dumazet
2017-02-28  2:08         ` David Miller
2017-03-01  0:22           ` Francois Romieu
2017-03-01  1:04             ` Stephen Hemminger
2017-02-27 21:00       ` Alexander Duyck
2017-02-27 20:18     ` [PATCH v3 " Eric Dumazet
2017-02-27 22:14       ` Stephen Hemminger
2017-02-27 22:35         ` Eric Dumazet
2017-02-27 22:44           ` Stephen Hemminger
2017-02-27 22:48             ` David Miller
2017-02-27 23:23               ` Stephen Hemminger
2017-02-28 10:14           ` David Laight
2017-02-28 13:04             ` Eric Dumazet
2017-02-28 13:20             ` Eric Dumazet
2017-02-28 16:17       ` Stephen Hemminger
2017-02-28 16:57         ` Eric Dumazet
2017-02-28 18:34       ` [PATCH v4 " Eric Dumazet
2017-03-01 17:53         ` David Miller
2017-02-28 17:20     ` [PATCH v2 " Alexander Duyck
2017-02-28 17:47       ` Eric Dumazet
2017-03-01 10:41       ` David Laight
2017-03-01 16:14         ` Alexander Duyck
2017-03-01 17:32           ` Eric Dumazet
2017-03-02 10:24             ` David Laight

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALzJLG-OcTAKkRQQHXV=8N-m+yATaX76UpGE4eWgK_8N52YePg@mail.gmail.com' \
    --to=saeedm@dev.mellanox.co.il \
    --cc=davem@davemloft.net \
    --cc=eric.dumazet@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@mellanox.com \
    --cc=tariqt@mellanox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.