From: Ian Kumlien <ian.kumlien@gmail.com>
To: Linux Kernel Network Developers <netdev@vger.kernel.org>
Cc: Saeed Mahameed <saeedm@mellanox.com>,
	Leon Romanovsky <leonro@mellanox.com>,
	kliteyn@mellanox.com
Subject: [VXLAN] [MLX5] Lost traffic and issues
Date: Fri, 28 Feb 2020 16:02:19 +0100
Message-ID: <CAA85sZsO9EaS8fZJqx6=QJA+7epe88UE2zScqw-KHZYDRMjk5A@mail.gmail.com>

Hi,

Including netdev - to see if someone else has a clue.

We have a few machines in a cloud, and when upgrading from 4.16.7 to
5.4.15 we ran into unexpected and intermittent problems.
(I have tested 5.5.6 and the problems persist.)

What we saw, using several monitoring points, was that traffic
disappeared after bond0 - packets visible when tcpdumping on "bond0"
never made it any further.

We had tcpdump running on:
1, The DHCP nodes (local tap interfaces)
2, The router instances on the L3 node
3, The local node where the VM runs (tap, bridge and eventually the
interface carrying the VXLAN traffic)
4, Port mirroring on the 100Gbit switch, to see what ended up on the
physical wire
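
Roughly, the captures looked like this (interface names are
examples - the actual tap/bridge names differ per node - and udp
4789 assumes the default VXLAN port):

  # overlay: the DHCP exchange on a VM-facing tap
  tcpdump -n -e -v -i tapXXXX 'port 67 or port 68'
  # underlay: VXLAN-encapsulated traffic leaving the host
  tcpdump -n -e -v -i bond0 'udp port 4789'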

What we can see is that, of the four-step DHCP handshake, only two
steps work - the fourth step gets dropped "on the nic".

We can see it go out bond0, VLAN-tagged and encapsulated in a VXLAN
packet - however, the switch never sees it.

There have been a few mlx5 changes wrt VXLAN that could be the
culprits, but it's really hard to judge.
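
To narrow it down, something like this is what we've been using to
list the candidate commits (assuming a mainline git tree; the range
and path are just where we've been looking):

  git log --oneline -i --grep=vxlan v4.16..v5.4 -- \
      drivers/net/ethernet/mellanox/mlx5/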

dmesg | grep mlx
[    2.231399] mlx5_core 0000:0b:00.0: firmware version: 16.26.1040
[    2.912595] mlx5_core 0000:0b:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[    2.935012] mlx5_core 0000:0b:00.0: Port module event: module 0, Cable plugged
[    2.949528] mlx5_core 0000:0b:00.1: firmware version: 16.26.1040
[    3.638647] mlx5_core 0000:0b:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[    3.661206] mlx5_core 0000:0b:00.1: Port module event: module 1, Cable plugged
[    3.675562] mlx5_core 0000:0b:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(64) RxCqeCmprss(0)
[    3.846149] mlx5_core 0000:0b:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(64) RxCqeCmprss(0)
[    4.021738] mlx5_core 0000:0b:00.0 enp11s0f0: renamed from eth0
[    4.021962] mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0

I have also tried turning all offloads off, but the problem persists -
it's really weird that only some packets seem to be affected.
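
For reference, "all offloads" means roughly the following - the
tunnel feature names are the ones 'ethtool -k enp11s0f0' lists on
this NIC, the rest are the usual suspects (and the same on the
second port):

  # VXLAN/tunnel offloads
  ethtool -K enp11s0f0 tx-udp_tnl-segmentation off tx-udp_tnl-csum-segmentation off
  # checksums and segmentation in general
  ethtool -K enp11s0f0 rx off tx off tso off gso off gro off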

To be clear, the bond0 interface is 2x100Gbit, using 802.3ad (LACP)
with layer2+3 hashing.
This seems to be offloaded into the NIC (can that be turned off?),
and messages about modifying the "lag map" were quite frequent until
we did a firmware upgrade - even with the upgraded firmware they
continue, just to a lesser extent.
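
For completeness, the bond is configured along these lines (iproute2
syntax; enp11s0f1 is my assumption for the second port's name, the
first one is from the dmesg above):

  ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer2+3
  ip link set enp11s0f0 down; ip link set enp11s0f0 master bond0
  ip link set enp11s0f1 down; ip link set enp11s0f1 master bond0
  ip link set bond0 up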

With 5.5.7 approaching, we'd like a path forward for handling this...
