From: Eric Dumazet <eric.dumazet@gmail.com>
To: Saeed Mahameed <saeedm@dev.mellanox.co.il>
Cc: David Miller <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>,
	Tariq Toukan <tariqt@mellanox.com>,
	Saeed Mahameed <saeedm@mellanox.com>
Subject: Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker
Date: Sun, 26 Feb 2017 09:34:17 -0800
Message-ID: <1488130457.9415.153.camel@edumazet-glaptop3.roam.corp.google.com>
In-Reply-To: <CALzJLG-OcTAKkRQQHXV=8N-m+yATaX76UpGE4eWgK_8N52YePg@mail.gmail.com>

On Sun, 2017-02-26 at 18:32 +0200, Saeed Mahameed wrote:
> On Sat, Feb 25, 2017 at 4:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > While playing with hardware timestamping of RX packets, I found
> > that some packets were received by the TCP stack with a ~200 ms delay...
> >
> > Since the timestamp was provided by the NIC, and my probe was added
> > in tcp_v4_rcv() while in the BH handler, I was confident it was not
> > a sender issue, or a drop in the network.
> >
> > This would happen with very low probability, but enough to hurt RPC
> > workloads.
> >
> > I could track this down to the CX-3 (mlx4 driver) card, which was
> > apparently sending an IRQ before we would arm it.
> >
> 
> Hi Eric,
> 
> This is highly unlikely; the hardware should not do that, and if this
> is really the case, we need to hunt down the root cause rather than
> work around it.

Well, I definitely see the interrupt coming while the NAPI bit
(NAPI_STATE_SCHED) is not available.


> 
> > A NAPI driver normally arms the IRQ after napi_complete_done(),
> > once NAPI_STATE_SCHED is cleared, so that the hard IRQ handler can
> > grab it.
> >
> > This patch adds a new rx_irq_miss field that is incremented every time
> > the hard IRQ handler could not grab NAPI_STATE_SCHED.
> >
> > Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq_miss was
> > incremented, and attempts to read more packets if it can re-acquire
> > NAPI_STATE_SCHED.
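
For reference, the poll-side pattern the changelog refers to looks
roughly like this (a simplified sketch of mlx4_en_poll_rx_cq(), not
the literal driver code):

int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
{
	struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
	struct mlx4_en_priv *priv = netdev_priv(cq->dev);
	int done = mlx4_en_process_rx_cq(cq->dev, cq, budget);

	if (done == budget)
		return budget;		/* stay on the NAPI poll list */

	/* napi_complete_done() clears NAPI_STATE_SCHED; only then is
	 * the CQ interrupt re-armed, so that the next hard IRQ can
	 * grab the bit again via napi_schedule_prep().
	 */
	if (napi_complete_done(napi, done))
		mlx4_en_arm_cq(priv, cq);
	return done;
}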
> 
> Are you sure this is not some kind of race condition with the busy
> polling mechanism introduced in ("net/mlx4_en: use napi_complete_done()
> return value")?  Maybe when the busy polling thread detects that it
> wants to yield, it arms the CQ too early (while napi is not ready)?
> 

Good question.

No busy polling in my tests.

I triggered it on a 2x10 Gbit host
(a bond of two 10 Gbit mlx4 ports) by running:


ethtool -C eth1 adaptive-rx on rx-usecs-low 0
ethtool -C eth2 adaptive-rx on rx-usecs-low 0
./super_netperf 9 -H lpaa6 -t TCP_RR -l 10000 -- -r 1000000,1000000 &


4 RX and TX queues per NIC, with the fq packet scheduler on them.

TCP timestamps are on. (This might be important to get the last packet
of a given size; in my case 912 bytes, with a PSH flag.)


> Anyway, Tariq and I would like to further investigate the IRQ firing
> while the CQ is not armed.  It smells like a bug at the driver/napi
> level; it is not expected HW behavior.
> 
> Any pointers on how to reproduce?  How often would the "rx_irq_miss"
> counter advance under a linerate RX load?


About 1000 times per second on my hosts, which receive about 1.2 Mpps.
But most of these misses are not an issue, because the next packet
arrives maybe less than 10 usec later.

Note that the bug is hard to notice, because either TCP fast
retransmit recovers, or the next packet arrives soon enough.

You have to be unlucky enough that the RX queue that missed the NAPI
schedule receives no more packets before the 200 ms RTO timer fires,
and that the packet stuck in the RX ring is the last packet of the RPC.


I believe part of the problem is that NAPI_STATE_SCHED can be grabbed
by a process while hard IRQs are not disabled.

I do not believe this bug is mlx4 specific.

Anything doing the following while hard IRQs are not masked:

local_bh_disable();
napi_reschedule(&priv->rx_cq[ring]->napi);
local_bh_enable();

like mlx4_en_recover_from_oom() does, can really trigger the issue.
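
On the hard IRQ side, the schedule attempt is then lost silently; the
mlx4 RX IRQ path looks roughly like this (simplified sketch):

void mlx4_en_rx_irq(struct mlx4_cq *mcq)
{
	struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq);
	struct mlx4_en_priv *priv = netdev_priv(cq->dev);

	/* napi_schedule_irqoff() fires NAPI only if napi_schedule_prep()
	 * can grab NAPI_STATE_SCHED.  If a process context owns the bit
	 * (as above), the attempt fails silently, the handler has no way
	 * to know, and nothing guarantees the CQ is re-armed in time:
	 * packets can sit in the RX ring until the 200 ms RTO fires.
	 */
	if (likely(priv->port_up))
		napi_schedule_irqoff(&cq->napi);
	else
		mlx4_en_arm_cq(priv, cq);
}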

Unfortunately I do not see how the core layer can handle this.
Only the driver hard IRQ handler can possibly know that it could not
grab NAPI_STATE_SCHED.
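
Concretely, the race breaker amounts to something like the sketch
below. Only the rx_irq_miss field name comes from the changelog above;
the rx_irq_miss_seen cache and the exact code placement are
illustrative, not the literal patch:

/* hard IRQ handler: open-code napi_schedule_irqoff() so that a
 * failed schedule attempt becomes visible to the poller
 */
if (napi_schedule_prep(&cq->napi))
	__napi_schedule_irqoff(&cq->napi);
else
	cq->rx_irq_miss++;	/* could not grab NAPI_STATE_SCHED */

/* in mlx4_en_poll_rx_cq(): once NAPI_STATE_SCHED is released, check
 * whether the hard IRQ reported a miss and, if so, try to re-acquire
 * the bit and poll again instead of arming the CQ
 */
if (napi_complete_done(napi, done)) {
	u32 miss = READ_ONCE(cq->rx_irq_miss);

	if (unlikely(miss != cq->rx_irq_miss_seen) &&
	    napi_reschedule(napi))
		cq->rx_irq_miss_seen = miss;	/* will be polled again */
	else
		mlx4_en_arm_cq(priv, cq);
}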


Thread overview: 34+ messages
2017-02-25 14:22 [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker Eric Dumazet
2017-02-26 16:32 ` Saeed Mahameed
2017-02-26 17:34   ` Eric Dumazet [this message]
2017-02-26 17:40     ` Eric Dumazet
2017-02-26 18:03       ` Eric Dumazet
2017-02-27  3:31 ` [PATCH net] net: solve a NAPI race Eric Dumazet
2017-02-27 14:06   ` Eric Dumazet
2017-02-27 14:21   ` [PATCH v2 " Eric Dumazet
2017-02-27 16:19     ` David Miller
2017-02-27 16:44       ` Eric Dumazet
2017-02-27 17:10         ` Eric Dumazet
2017-02-28  2:08         ` David Miller
2017-03-01  0:22           ` Francois Romieu
2017-03-01  1:04             ` Stephen Hemminger
2017-02-27 21:00       ` Alexander Duyck
2017-02-27 20:18     ` [PATCH v3 " Eric Dumazet
2017-02-27 22:14       ` Stephen Hemminger
2017-02-27 22:35         ` Eric Dumazet
2017-02-27 22:44           ` Stephen Hemminger
2017-02-27 22:48             ` David Miller
2017-02-27 23:23               ` Stephen Hemminger
2017-02-28 10:14           ` David Laight
2017-02-28 13:04             ` Eric Dumazet
2017-02-28 13:20             ` Eric Dumazet
2017-02-28 16:17       ` Stephen Hemminger
2017-02-28 16:57         ` Eric Dumazet
2017-02-28 18:34       ` [PATCH v4 " Eric Dumazet
2017-03-01 17:53         ` David Miller
2017-02-28 17:20     ` [PATCH v2 " Alexander Duyck
2017-02-28 17:47       ` Eric Dumazet
2017-03-01 10:41       ` David Laight
2017-03-01 16:14         ` Alexander Duyck
2017-03-01 17:32           ` Eric Dumazet
2017-03-02 10:24             ` David Laight
