linux-rdma.vger.kernel.org archive mirror
* IPoIB child interfaces not working with mlx5
@ 2021-03-19  7:44 Jinpu Wang
  2021-03-20  9:30 ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-03-19  7:44 UTC (permalink / raw)
  To: linux-rdma, Jason Gunthorpe, Leon Romanovsky, Doug Ledford

Hi Jason and Leon,

We recently switched from MLNX-OFED to upstream OFED, and we noticed that
IPoIB stops working on upstream kernel 5.4.102 with a Mellanox CX-5
HCA. It works fine on CX-2/CX-3. I also tested a 5.11 kernel, and it
behaves the same.

The symptom is that the IPoIB child interfaces are UP and ready, but ping
doesn't work at all. A simple ifdown/ifup of the child interface doesn't
change anything.
The workaround is to bring up the parent interface: "ip link set ib0 up"
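
As a rough illustration of this workaround, the sketch below lists child
interfaces whose parent is still down. It relies on the M-DOWN flag that "ip"
prints for the children in this report and on the one-line-per-interface format
of "ip -o link show". The function name is made up for this example, and
reading M-DOWN as "parent not up" is an assumption based on the output shown
in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of iproute2): reads "ip -o link show"
# output on stdin and prints "<child> <parent>" for every interface of
# the form child@parent whose flag list contains M-DOWN.
children_with_down_parent() {
    awk '$2 ~ /@/ && $3 ~ /M-DOWN/ {
        split($2, part, "@")       # "ib0.beef@ib0:" -> "ib0.beef", "ib0:"
        sub(/:$/, "", part[2])     # drop the trailing colon
        print part[1], part[2]
    }'
}

# Example use (needs root; brings up each parent that is still down):
#   ip -o link show | children_with_down_parent |
#       while read -r child parent; do ip link set "$parent" up; done
```

Keeping the parsing separate from the "ip link set" call makes it possible to
review the list before changing any interface state.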

basic config from "ip a"
jwang@ps401a-914.nst:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
group default qlen 1000
    link/ether 0c:c4:7a:ff:07:d0 brd ff:ff:ff:ff:ff:ff
    inet 10.41.3.146/22 brd 10.41.3.255 scope global eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
default qlen 1000
    link/ether 0c:c4:7a:ff:07:d1 brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group
default qlen 1024
    link/infiniband
00:00:11:07:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:52 brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ib1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group
default qlen 1024
    link/infiniband
00:00:19:07:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:53 brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
6: ib0.beef@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 4092
qdisc mq state UP group default qlen 1024
    link/infiniband
00:00:11:4b:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:52 brd
00:ff:ff:ff:ff:12:40:1b:be:ef:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.42.3.146/20 brd 10.42.15.255 scope global ib0.beef
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9b03:66:de52/64 scope link
       valid_lft forever preferred_lft forever
7: ib0.dddd@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 4092
qdisc mq state UP group default qlen 1024
    link/infiniband
00:00:12:87:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:52 brd
00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 2a02:247f:401:1:2:0:a:392/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9b03:66:de52/64 scope link
       valid_lft forever preferred_lft forever
8: ib1.beef@ib1: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 4092
qdisc mq state UP group default qlen 1024
    link/infiniband
00:00:19:4b:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:53 brd
00:ff:ff:ff:ff:12:40:1b:be:ef:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.43.3.146/20 brd 10.43.15.255 scope global ib1.beef
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9b03:66:de53/64 scope link
       valid_lft forever preferred_lft forever
9: ib1.dddd@ib1: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 4092
qdisc mq state UP group default qlen 1024
    link/infiniband
00:00:1a:87:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:53 brd
00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 2a02:247f:402:1:2:0:a:392/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9b03:66:de53/64 scope link
       valid_lft forever preferred_lft forever

jwang@ps401a-914.nst:~$ dmesg | egrep 'mlx|ib'
[    0.000000] Command line:
BOOT_IMAGE=(http)/live-images/liveboot-2021.76/vmlinuz
BOOTIF=0c:c4:7a:ff:07:d0 boot=live
fetch=http://mgmt/live-images/liveboot-2021.76/root.squashfs
consoleblank=0 PHASE=Testing crashkernel=512M quiet
salt-master=salt-master.stg.profitbricks.net saltenv=base
pillarenv=base ib_ipoib.debug_level=1 liveboot.sdn2
[    0.889525] Kernel command line:
BOOT_IMAGE=(http)/live-images/liveboot-2021.76/vmlinuz
BOOTIF=0c:c4:7a:ff:07:d0 boot=live
fetch=http://mgmt/live-images/liveboot-2021.76/root.squashfs
consoleblank=0 PHASE=Testing crashkernel=512M quiet
salt-master=salt-master.stg.profitbricks.net saltenv=base
pillarenv=base ib_ipoib.debug_level=1 liveboot.sdn2
[    1.997444] Calibrating delay loop (skipped), value calculated
using timer frequency.. 4200.00 BogoMIPS (lpj=21000000)
[    2.422119] MDS CPU bug present and SMT on, data leak possible. See
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html
for more details.
[    2.422119] TAA CPU bug present and SMT on, data leak possible. See
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html
for more details.
[    2.992059] pci_bus 0000:03: extended config space not accessible
[    3.024991] pci 0000:03:00.0: vgaarb: bridge control possible
[    5.287548] tsc: Refined TSC clocksource calibration: 2099.999 MHz
[   16.839146] systemd[1]: File
/lib/systemd/system/systemd-journald.service:12 configures an IP
firewall (IPAddressDeny=any), but the local system does not support
BPF/cgroup based firewalling.
[   16.874155] systemd[1]:
/lib/systemd/system/tap-offloads-trk.service:10: PIDFile= references
path below legacy directory /var/run/, updating
/var/run/tap-offloads-trk.pid → /run/tap-offloads-trk.pid; please
update the unit file accordingly.
[   16.893383] systemd[1]: Listening on initctl Compatibility Named Pipe.
[   23.244067] mlx5_core 0000:af:00.0: firmware version: 16.27.2008
[   23.244103] mlx5_core 0000:af:00.0: 126.016 Gb/s available PCIe
bandwidth (8.0 GT/s PCIe x16 link)
[   23.274277] libata version 3.00 loaded.
[   23.555901] mlx5_core 0000:af:00.0: Port module event: module 0,
Cable plugged
[   23.556314] mlx5_core 0000:af:00.0: mlx5_pcie_event:296:(pid 7):
PCIe slot advertised sufficient power (75W).
[   23.573895] mlx5_core 0000:af:00.1: firmware version: 16.27.2008
[   23.573950] mlx5_core 0000:af:00.1: 126.016 Gb/s available PCIe
bandwidth (8.0 GT/s PCIe x16 link)
[   23.885989] mlx5_core 0000:af:00.1: Port module event: module 1,
Cable plugged
[   23.886133] mlx5_core 0000:af:00.1: mlx5_pcie_event:296:(pid 3256):
PCIe slot advertised sufficient power (75W).
[   27.924069] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   27.924076] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   27.999211] ib0: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   28.000387] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   28.000393] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   28.086111] ib1: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.415045] ib0: Event 12 on device mlx5_0 port 1
[   29.415147] ib0: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.415661] ib0: Event 12 on device mlx5_0 port 1
[   29.415742] ib0: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.416497] ib0: Event 12 on device mlx5_0 port 1
[   29.416591] ib0: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.419656] ib0: Event 17 on device mlx5_0 port 1
[   29.419669] ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.420226] ib0: Event 11 on device mlx5_0 port 1
[   29.420240] ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.420257] ib1: Event 12 on device mlx5_1 port 1
[   29.420317] ib1: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.420840] ib1: Event 12 on device mlx5_1 port 1
[   29.420898] ib1: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.421190] ib1: Event 12 on device mlx5_1 port 1
[   29.421247] ib1: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.421632] ib1: Event 11 on device mlx5_1 port 1
[   29.421640] ib1: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.422261] ib1: Event 17 on device mlx5_1 port 1
[   29.422276] ib1: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.749430] ib0: Event 9 on device mlx5_0 port 1
[   29.749441] ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.751349] ib1: Event 9 on device mlx5_1 port 1
[   29.751365] ib1: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   46.707421] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   46.707434] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   46.725944] ib0.beef: bringing up interface
[   46.968005] ib0.beef: Created ah 00000000cb29051b
[   47.000529] IPv6: ADDRCONF(NETDEV_CHANGE): ib0.beef: link becomes ready
[   47.004101] ib0.beef: Created ah 000000001338d4ae
[   47.007399] ib0.beef: Created ah 000000002947be1d
[   47.010668] ib0.beef: Created ah 00000000a8586948
[   47.013871] ib0.beef: Created ah 00000000e584ea42
[   47.033747] ib0.beef: Created ah 0000000086cb1ff9
[   47.189454] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   47.189465] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   47.215051] ib0.dddd: bringing up interface
[   47.457634] ib0.dddd: Created ah 000000009bb41171
[   47.490564] IPv6: ADDRCONF(NETDEV_CHANGE): ib0.dddd: link becomes ready
[   47.494065] ib0.dddd: Created ah 00000000531ff3b3
[   47.497206] ib0.dddd: Created ah 0000000006238049
[   47.500281] ib0.dddd: Created ah 00000000a2776703
[   47.503453] ib0.dddd: Created ah 000000006f839ea0
[   47.506697] ib0.dddd: Created ah 00000000d3218392
[   47.523579] ib0.dddd: Created ah 000000004e8a14c7
[   48.894389] ib0.dddd: Created ah 00000000c664dbd4
[   48.897657] ib0.beef: Created ah 00000000c446a0e6
[   49.593055] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   49.593064] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   49.610051] ib1.beef: bringing up interface
[   49.857979] ib1.beef: Created ah 000000003571492a
[   49.890521] IPv6: ADDRCONF(NETDEV_CHANGE): ib1.beef: link becomes ready
[   49.893951] ib1.beef: Created ah 00000000aea98452
[   49.897011] ib1.beef: Created ah 000000004e23c357
[   49.899995] ib1.beef: Created ah 00000000ed62df50
[   49.903036] ib1.beef: Created ah 0000000041605d6d
[   49.915754] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   49.915765] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024)
StrdSz(256) RxCqeCmprss(0)
[   49.923955] ib1.beef: Created ah 00000000f5d6b457
[   49.943153] ib1.dddd: bringing up interface
[   50.187608] ib1.dddd: Created ah 00000000cebeba47
[   50.220523] IPv6: ADDRCONF(NETDEV_CHANGE): ib1.dddd: link becomes ready
[   50.224347] ib1.dddd: Created ah 00000000c6f96f11
[   50.227539] ib1.dddd: Created ah 000000004fe70418
[   50.230691] ib1.dddd: Created ah 00000000ae96df99
[   50.233810] ib1.dddd: Created ah 000000004af47f93
[   50.236892] ib1.dddd: Created ah 0000000064aca082
[   50.264221] ib1.dddd: Created ah 00000000f330012e
[   51.774399] ib1.beef: Created ah 000000007f1ef527
[   52.094689] ib1.dddd: Created ah 00000000210b80b4
[   57.215935] ib0.dddd: Created ah 00000000f07b9547
[   57.216368] ib1.beef: Created ah 00000000f3a87dc7
[   57.219420] ib1.beef: Created ah 00000000b7d4d592
[   57.225647] ib0.beef: Created ah 00000000e65557a4
[   57.228334] ib1.dddd: Created ah 000000001914b301
[   57.228819] ib0.beef: Created ah 0000000070b21f1c
[   57.264003] ib1.beef: Created ah 0000000070b3a6e8
[   57.264079] ib0.beef: Created ah 00000000be1feac1,
[  137.514460] ib0.beef: neigh free for ffffff
ff12:601b:beef:0000:0000:0001:ff66:de52
[  137.514461] ib0.dddd: neigh free for ffffff
ff12:601b:dddd:0000:0000:0001:ff0a:0392
[  137.514471] ib0.dddd: neigh free for ffffff
ff12:601b:dddd:0000:0000:0001:ff66:de52
[  137.514473] ib0.beef: neigh free for ffffff
ff12:401b:beef:0000:0000:0000:0000:0016
[  137.514477] ib0.dddd: neigh free for ffffff
ff12:601b:dddd:0000:0000:0000:0000:0016
[  137.514478] ib0.beef: neigh free for ffffff
ff12:601b:beef:0000:0000:0000:0000:0016
[  140.074531] ib1.beef: neigh free for ffffff
ff12:401b:beef:0000:0000:0000:0000:0016
[  140.074541] ib1.beef: neigh free for ffffff
ff12:601b:beef:0000:0000:0000:0000:0016
[  140.074545] ib1.beef: neigh free for ffffff
ff12:601b:beef:0000:0000:0001:ff66:de53
[  140.714539] ib1.dddd: neigh free for ffffff
ff12:601b:dddd:0000:0000:0001:ff0a:0392
[  140.714549] ib1.dddd: neigh free for ffffff
ff12:601b:dddd:0000:0000:0000:0000:0016
[  140.714553] ib1.dddd: neigh free for ffffff
ff12:601b:dddd:0000:0000:0001:ff66:de53
[  144.470916] ib0.dddd: Created ah 000000009d40e279
[  177.320655] ib0.dddd: Created ah 0000000023a374d0
[  177.321583] ib1.beef: Created ah 00000000b54aadfc
[  177.324385] ib0.beef: Created ah 00000000f4507818
[  177.325263] ib1.beef: Created ah 00000000132b48ff
[  177.328056] ib0.beef: Created ah 000000004e093b7c
[  177.328715] ib1.dddd: Created ah 00000000b274652f
[  177.358792] ib0.beef: Created ah 0000000076e40813
[  177.358863] ib1.dddd: Created ah 00000000146f0ae3
[  177.361796] ib1.beef: Created ah 00000000d7c8cff5
[  177.362033] ib0.beef: Created ah 0000000086031b72
[  177.365082] ib0.dddd: Created ah 0000000083e723db
[  177.365086] ib1.beef: Created ah 0000000029b2b4cb
[  200.215825] ib1.beef: neigh free for ffffff
ff12:401b:beef:0000:0000:0000:ffff:ffff

I suspect it might be related to the change in this patchset:
https://lore.kernel.org/linux-rdma/20180729083500.5352-1-leon@kernel.org/

Is this expected behavior? How can we fix it?

Thanks!
-- 
Jinpu Wang


* Re: IPoIB child interfaces not working with mlx5
  2021-03-19  7:44 IPoIB child interfaces not working with mlx5 Jinpu Wang
@ 2021-03-20  9:30 ` Leon Romanovsky
       [not found]   ` <CAD+HZHUHbuBeoB4cCLc78gsmZAEyEr+fiWtpuTrxyzRBzMBf_g@mail.gmail.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-03-20  9:30 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-rdma, Jason Gunthorpe, Doug Ledford

On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> Hi Jason and Leon,
> 
> We recently switched from MLNX-OFED to upstream OFED, and we noticed that
> IPoIB stops working on upstream kernel 5.4.102 with a Mellanox CX-5
> HCA. It works fine on CX-2/CX-3. I also tested a 5.11 kernel, and it
> behaves the same.

Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?

Thanks


* Re: IPoIB child interfaces not working with mlx5
       [not found]   ` <CAD+HZHUHbuBeoB4cCLc78gsmZAEyEr+fiWtpuTrxyzRBzMBf_g@mail.gmail.com>
@ 2021-03-21 13:07     ` Leon Romanovsky
  2021-03-22  6:08       ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-03-21 13:07 UTC (permalink / raw)
  To: Jack Wang; +Cc: Doug Ledford, Jason Gunthorpe, Jinpu Wang, linux-rdma

On Sat, Mar 20, 2021 at 02:09:50PM +0100, Jack Wang wrote:
> On Sat, Mar 20, 2021 at 12:17, Leon Romanovsky <leon@kernel.org> wrote:
> 
> > On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> > > Hi Jason and Leon,
> > >
> > > We recently switched from MLNX-OFED to upstream OFED, and we noticed that
> > > IPoIB stops working on upstream kernel 5.4.102 with a Mellanox CX-5
> > > HCA. It works fine on CX-2/CX-3. I also tested a 5.11 kernel, and it
> > > behaves the same.
> >
> > Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?
> >
> > Thanks
> 
>  Yes.

> Is this expected behavior?

Yes, we wanted to make IPoIB behave like any other netdev interface: if
the parent interface isn't enabled, no traffic should pass. Moreover,
in our internal implementation of enhanced IPoIB, we reuse the same
resources for both parent and child, which requires us to wait for the
"UP" event before allowing traffic.

Thanks



* Re: IPoIB child interfaces not working with mlx5
  2021-03-21 13:07     ` Leon Romanovsky
@ 2021-03-22  6:08       ` Jinpu Wang
  2021-03-22  6:56         ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-03-22  6:08 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Sun, Mar 21, 2021 at 2:07 PM Leon Romanovsky <leon@kernel.org> wrote:
> > Is this expected behavior?
>
> Yes, we wanted to make IPoIB behave like any other netdev interface: if
> the parent interface isn't enabled, no traffic should pass. Moreover,
> in our internal implementation of enhanced IPoIB, we reuse the same
> resources for both parent and child, which requires us to wait for the
> "UP" event before allowing traffic.
>
> Thanks

Hi Leon,

Thanks for the clarification. Is this behavior documented somewhere?
Is it specific to "enhanced IPoIB" on CX-5?
Will it work differently without MLX5_CORE_IPOIB enabled?

I think it would be helpful, if possible, to add a message reminding the
admin to enable the parent when only the child is configured.

Thanks!



* Re: IPoIB child interfaces not working with mlx5
  2021-03-22  6:08       ` Jinpu Wang
@ 2021-03-22  6:56         ` Leon Romanovsky
  2021-04-20  9:14           ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-03-22  6:56 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Mon, Mar 22, 2021 at 07:08:01AM +0100, Jinpu Wang wrote:
> Hi Leon,
> 
> Thanks for the clarification. Is this behavior documented somewhere?
> Is it specific to "enhanced IPoIB" on CX-5?

It is specific to "enhanced IPoIB", not to the device. I don't know where
we can document it.

> Will it work differently without MLX5_CORE_IPOIB enabled?

Yes, without MLX5_CORE_IPOIB, the devices will work in "legacy IPoIB",
exactly like CX-3. The best thing would be to change the IPoIB ULP to behave
like a netdev, but we were not comfortable doing it back then because such a
change is user-visible.
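
To see which of the two modes a given kernel build can use, one option is to
look for the MLX5_CORE_IPOIB option in the kernel config file. The sketch
below is only an illustration: the function name and the example config path
are assumptions, and the config option only says whether enhanced IPoIB
support was built, not whether a particular device ends up using it.

```shell
#!/bin/sh
# Hypothetical helper: prints "enhanced" if the given kernel config file
# has CONFIG_MLX5_CORE_IPOIB=y, otherwise "legacy".
ipoib_mode_from_config() {
    if grep -q '^CONFIG_MLX5_CORE_IPOIB=y$' "$1"; then
        echo enhanced
    else
        echo legacy
    fi
}

# Example use (the config path varies by distribution):
#   ipoib_mode_from_config "/boot/config-$(uname -r)"
```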

> 
> I think it would be helpful, if possible, to add a message reminding the
> admin to enable the parent when only the child is configured.

Care to send a patch?

Thanks



* Re: IPoIB child interfaces not working with mlx5
  2021-03-22  6:56         ` Leon Romanovsky
@ 2021-04-20  9:14           ` Jinpu Wang
  2021-04-20 11:29             ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-04-20  9:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jinpu Wang, Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Mon, Mar 22, 2021 at 7:56 AM Leon Romanovsky <leon@kernel.org> wrote:
> It is specific to "enhanced IPoIB", not to the device. I don't know where
> we can document it.
>
> > Will it work differently without MLX5_CORE_IPOIB enabled?
>
> Yes, without MLX5_CORE_IPOIB, the devices will work in "legacy IPoIB",
> exactly like CX-3. The best thing would be to change the IPoIB ULP to behave
> like a netdev, but we were not comfortable doing it back then because such a
> change is user-visible.

Hi Leon,

More testing reveals new problems with MLX5_CORE_IPOIB.
With MLX5_CORE_IPOIB, ping works on both hosts, but iperf3 doesn't send any data.
I'm running "iperf3 -s" on A
and "sudo iperf3 -t 30000 -c ip6_of_A" on B.
Example output:

[  5] local 2a02:247f:401:1:2:0:a:391 port 41288 connected to 2a02:247f:401:1:2:0:a:392 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   165 KBytes  1.35 Mbits/sec    2   3.93 KBytes
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   3.93 KBytes

[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   3.93 KBytes

When I disable MLX5_CORE_IPOIB and run the same test, iperf3
runs without problems.

[  5] local 2a02:247f:401:1:2:0:a:391 port 51866 connected to 2a02:247f:401:1:2:0:a:392 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   293 MBytes  2.46 Gbits/sec    0   1.50 MBytes
[  5]   1.00-2.00   sec   290 MBytes  2.43 Gbits/sec    0   1.50 MBytes
[  5]   2.00-3.00   sec   289 MBytes  2.42 Gbits/sec    0   1.50 MBytes
[  5]   3.00-4.00   sec   290 MBytes  2.43 Gbits/sec    0   1.50 MBytes
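
For longer runs, the difference between the two outputs can be summarized by
counting the intervals in which nothing was transferred. The following is a
minimal sketch, assuming the default human-readable iperf3 client output shown
in this message and stream IDs below 100 (so that "[  5]" splits into two
whitespace-separated fields). The function name is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads iperf3 client output on stdin and prints the
# number of per-interval lines that report a bitrate of 0.00 bits/sec.
count_stalled_intervals() {
    awk '$4 == "sec" && $7 == "0.00" && $8 == "bits/sec" { n++ }
         END { print n + 0 }'
}

# Example use:
#   iperf3 -t 30 -c ip6_of_A | count_stalled_intervals
```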

On both sides we have:
jwang@ps401a-913.nst:/mnt/jwang$ ibstat
CA 'mlx5_0'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.27.2008
	Hardware version: 0
	Node GUID: 0x98039b03006c7912
	System image GUID: 0x98039b03006c7912
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 14
		LMC: 0
		SM lid: 19
		Capability mask: 0x2651e848
		Port GUID: 0x98039b03006c7912
		Link layer: InfiniBand
CA 'mlx5_1'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.27.2008
	Hardware version: 0
	Node GUID: 0x98039b03006c7913
	System image GUID: 0x98039b03006c7912
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 15
		LMC: 0
		SM lid: 45
		Capability mask: 0x2651e848
		Port GUID: 0x98039b03006c7913
		Link layer: InfiniBand

The initial tests were done on 5.4.102.
I also did a brief test on ~linux-5.12-rc4 with MLX5_CORE_IPOIB;
iperf3 fails there in the same way as on 5.4.102.

cat /etc/network/interfaces.d/infiniband
auto ib0.beef
iface ib0.beef inet static
    address 10.42.3.145
    netmask 20
    up sysctl -w net.ipv4.conf.ib0/beef.forwarding=1
    up ethtool -K $IFACE gro off
    pre-up ip link set ib0 up
    dad-attempts 600

auto ib0.dddd
iface ib0.dddd inet6 static
    address 2a02:247f:401:1:2:0:a:391
    netmask 64
    pre-up ip link set ib0 up
    up sysctl -w net.ipv6.conf.ib0/dddd.forwarding=1 net.ipv6.conf.ib0/dddd.proxy_ndp=1
    up ip -6 route add fd57:1:0:4::/64 dev $IFACE
    up ethtool -K $IFACE gro off
    dad-attempts 600

auto ib1.beef
iface ib1.beef inet static
    address 10.43.3.145
    netmask 20
    up sysctl -w net.ipv4.conf.ib1/beef.forwarding=1
    up ethtool -K $IFACE gro off
    pre-up ip link set ib1 up
    dad-attempts 600

auto ib1.dddd
iface ib1.dddd inet6 static
    address 2a02:247f:402:1:2:0:a:391
    netmask 64
    pre-up ip link set ib1 up
    up sysctl -w net.ipv6.conf.ib1/dddd.forwarding=1 net.ipv6.conf.ib1/dddd.proxy_ndp=1
    up ip -6 route add fd57:2:0:4::/64 dev $IFACE
    up ethtool -K $IFACE gro off
    dad-attempts 600

Thanks!


* Re: IPoIB child interfaces not working with mlx5
  2021-04-20  9:14           ` Jinpu Wang
@ 2021-04-20 11:29             ` Leon Romanovsky
  2021-05-07  6:53               ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-04-20 11:29 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Jinpu Wang, Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Tue, Apr 20, 2021 at 11:14:41AM +0200, Jinpu Wang wrote:
> More testing reveals new problems with MLX5_CORE_IPOIB.
> With MLX5_CORE_IPOIB, ping works on both hosts, but iperf3 doesn't send any data.

In our regression tests, iperf3 works.

Let's take it offline.

Thanks


* Re: IPoIB child interfaces not working with mlx5
  2021-04-20 11:29             ` Leon Romanovsky
@ 2021-05-07  6:53               ` Jinpu Wang
  2021-05-07  8:03                 ` Zhu Yanjun
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-05-07  6:53 UTC (permalink / raw)
  To: Leon Romanovsky, Itay Aveksis
  Cc: Jinpu Wang, Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Tue, Apr 20, 2021 at 1:29 PM Leon Romanovsky <leon@kernel.org> wrote:
> > More testing reveals new problems with MLX5_CORE_IPOIB.
> > With MLX5_CORE_IPOIB, ping works on both hosts, but iperf3 doesn't send any data.

Just an update: we finally found the cause of the failure on our side.

We need to set the child interface to the same MTU as the parent.
jwang@ps401a-913.nst:/mnt/jwang$ ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
mode DEFAULT group default qlen 1000
    link/ether 0c:c4:7a:ff:07:ce brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
    link/ether 0c:c4:7a:ff:07:cf brd ff:ff:ff:ff:ff:ff
6: ha_transport: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether f6:ff:16:93:08:8a brd ff:ff:ff:ff:ff:ff
11: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP
mode DEFAULT group default qlen 1024
    link/infiniband
00:00:00:83:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:12 brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
12: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP
mode DEFAULT group default qlen 1024
    link/infiniband
00:00:01:58:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:13 brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
13: ib0.dddd@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq
state UP mode DEFAULT group default qlen 1024
    link/infiniband
00:00:10:8c:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:12 brd
00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
14: ib1.dddd@ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc mq
state UP mode DEFAULT group default qlen 1024
    link/infiniband
00:00:11:8c:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:13 brd
00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff

Initially, the ib0 MTU was 2044 and the ib0.dddd MTU was 4092.
After I reduced the ib0.dddd MTU to 2044 on both sides, iperf3 worked fine.
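
For anyone checking their own hosts, here is a minimal sketch of how a parent/child MTU mismatch like this could be spotted. It assumes one-record-per-line input of the kind `ip -o link show` prints; the function names are made up for illustration, and the sample text is abbreviated from the listing above (link-layer addresses dropped).

```python
import re

# Matches the start of one `ip -o link` record, e.g.
# "14: ib1.dddd@ib1: <BROADCAST,...> mtu 4092 qdisc mq ..."
RECORD = re.compile(r"^\d+:\s+([^:@\s]+)(?:@([^:\s]+))?:\s.*?\bmtu\s+(\d+)")


def parse_links(text):
    """Return {name: (parent_or_None, mtu)} from one-line-per-link output."""
    links = {}
    for line in text.splitlines():
        m = RECORD.match(line)
        if m:
            name, parent, mtu = m.groups()
            links[name] = (parent, int(mtu))
    return links


def mtu_mismatches(links):
    """List (child, child_mtu, parent, parent_mtu) where the MTUs differ."""
    out = []
    for name, (parent, mtu) in sorted(links.items()):
        # Only children whose parent is also present in the listing are checked.
        if parent in links and links[parent][1] != mtu:
            out.append((name, mtu, parent, links[parent][1]))
    return out


# Abbreviated from the `ip link list` output above (hypothetical one-line form).
SAMPLE = """\
11: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1024
12: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1024
13: ib0.dddd@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1024
14: ib1.dddd@ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc mq state UP mode DEFAULT group default qlen 1024
"""

if __name__ == "__main__":
    for child, cmtu, parent, pmtu in mtu_mismatches(parse_links(SAMPLE)):
        print(f"{child}: mtu {cmtu} != {parent}: mtu {pmtu}")
```

On the sample, only ib1.dddd is reported, since ib0.dddd was already lowered to 2044.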

Could you explain why the MTU must be exactly the same as the parent's
in enhanced IPoIB mode? Is there anything else we must treat specially?
I guess it is related to:

> > > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > > resources for both parent and child, this requires us to wait for "UP"
> > > > > event before allowing traffic.

Thanks!
Jinpu


* Re: IPoIB child interfaces not working with mlx5
  2021-05-07  6:53               ` Jinpu Wang
@ 2021-05-07  8:03                 ` Zhu Yanjun
  2021-05-07  8:11                   ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Zhu Yanjun @ 2021-05-07  8:03 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: Leon Romanovsky, Itay Aveksis, Jinpu Wang, Jack Wang,
	Doug Ledford, Jason Gunthorpe, RDMA mailing list

On Fri, May 7, 2021 at 3:53 PM Jinpu Wang <jinpu.wang@ionos.com> wrote:
>
> On Tue, Apr 20, 2021 at 1:29 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, Apr 20, 2021 at 11:14:41AM +0200, Jinpu Wang wrote:
> > > On Mon, Mar 22, 2021 at 7:56 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Mon, Mar 22, 2021 at 07:08:01AM +0100, Jinpu Wang wrote:
> > > > > On Sun, Mar 21, 2021 at 2:07 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > >
> > > > > > On Sat, Mar 20, 2021 at 02:09:50PM +0100, Jack Wang wrote:
> > > > > > > On Sat, Mar 20, 2021 at 12:17, Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > >
> > > > > > > > On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> > > > > > > > > Hi Jason and Leon,
> > > > > > > > >
> > > > > > > > > We recently switched from MLNX-OFED to upstream OFED, and we noticed
> > > > > > > > > that IPoIB stopped working with upstream kernel 5.4.102 with a Mellanox
> > > > > > > > > CX-5 HCA; it works fine on CX-2/CX-3. I also tested a 5.11 kernel and it
> > > > > > > > > behaves the same.
> > > > > > > >
> > > > > > > > Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > >
> > > > > > >  Yes.
> > > > > >
> > > > > > > Is this expected behavior?
> > > > > >
> > > > > > Yes, we wanted to make IPoIB behave like any other netdev interfaces and
> > > > > > if parent interface isn't enabled, no traffic should pass. More on that,
> > > > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > > > resources for both parent and child, this requires us to wait for "UP"
> > > > > > event before allowing traffic.
> > > > > >
> > > > > > Thanks
> > > > > Hi Leon,
> > > > >
> > > > > Thanks for the clarification, is this behavior documented somewhere?
> > > > > is it specific to "enhanced IPoIB" for CX-5?
> > > >
> > > > It is specific to "enhanced IPoIB" and not to device. I don't know where
> > > > we can document it.
> > > >
> > > > > Will it work differently without MLX5_CORE_IPOIB enabled?
> > > >
> > > > Yes, without MLX5_CORE_IPOIB, the devices will work in "legacy IPoIB",
> > > > exactly as cx-3. The best thing will be to change IPoIB ULP to behave
> > > > like netdev, but we were not comfortable to do it back then due to
> > > > user visible nature of such change.
> > > >
> > > Hi Leon,
> > >
> > > More testing reveals new problems with MLX5_CORE_IPOIB.
> > > With MLX5_CORE_IPOIB, ping works on both hosts, but iperf3 doesn't send any data.
>
> Just want to give an update: we finally found the key factor that led
> to the failure on our side.
>
> We need to set the child interface to the same MTU as the parent.
> jwang@ps401a-913.nst:/mnt/jwang$ ip link list
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
> mode DEFAULT group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
> mode DEFAULT group default qlen 1000
>     link/ether 0c:c4:7a:ff:07:ce brd ff:ff:ff:ff:ff:ff
> 3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
> DEFAULT group default qlen 1000
>     link/ether 0c:c4:7a:ff:07:cf brd ff:ff:ff:ff:ff:ff
> 6: ha_transport: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
> state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether f6:ff:16:93:08:8a brd ff:ff:ff:ff:ff:ff
> 11: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP
> mode DEFAULT group default qlen 1024
>     link/infiniband
> 00:00:00:83:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:12 brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
> 12: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP
> mode DEFAULT group default qlen 1024
>     link/infiniband
> 00:00:01:58:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:13 brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
> 13: ib0.dddd@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq
> state UP mode DEFAULT group default qlen 1024
>     link/infiniband
> 00:00:10:8c:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:12 brd
> 00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
> 14: ib1.dddd@ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc mq
> state UP mode DEFAULT group default qlen 1024
>     link/infiniband
> 00:00:11:8c:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:13 brd
> 00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
>
> Initially, the ib0 MTU was 2044 and the ib0.dddd MTU was 4092.
> After I reduced the ib0.dddd MTU to 2044 on both sides, iperf3 worked fine.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-configuring_ipoib

"
When using datagram mode, the unreliable, disconnected queue pair type
does not allow any packets larger than the InfiniBand link-layer’s
MTU. The IPoIB layer adds a 4 byte IPoIB header on top of the IP
packet being transmitted. As a result, the IPoIB MTU must be 4 bytes
less than the InfiniBand link-layer MTU. As 2048 is a common
InfiniBand link-layer MTU, the common IPoIB device MTU in datagram
mode is 2044.
"
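
The arithmetic in the quoted passage, as a small sketch. The 4-byte header size comes from the quote; the function name is made up for illustration, and 4096 is included only because it is the InfiniBand link-layer MTU that would correspond to the 4092 seen on the child interface above.

```python
# Size of the IPoIB encapsulation header, as stated in the quoted passage.
IPOIB_HEADER_BYTES = 4


def ipoib_datagram_mtu(ib_link_mtu):
    """Largest IP packet that fits in one datagram-mode IPoIB frame."""
    return ib_link_mtu - IPOIB_HEADER_BYTES


if __name__ == "__main__":
    for ib_mtu in (2048, 4096):
        print(ib_mtu, "->", ipoib_datagram_mtu(ib_mtu))
```

So 2044 and 4092 are both valid datagram-mode values in their own right; which one applies depends on the link-layer MTU actually in use.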

>
> Could you explain why the MTU must be exactly the same as the parent's
> in enhanced IPoIB mode? Is there anything else we must treat specially?
> I guess it is related to:
>
> > > > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > > > resources for both parent and child, this requires us to wait for "UP"
> > > > > > event before allowing traffic.
>
> Thanks!
> Jinpu


* Re: IPoIB child interfaces not working with mlx5
  2021-05-07  8:03                 ` Zhu Yanjun
@ 2021-05-07  8:11                   ` Jinpu Wang
  0 siblings, 0 replies; 10+ messages in thread
From: Jinpu Wang @ 2021-05-07  8:11 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Leon Romanovsky, Itay Aveksis, Jinpu Wang, Jack Wang,
	Doug Ledford, Jason Gunthorpe, RDMA mailing list

On Fri, May 7, 2021 at 10:03 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>
> On Fri, May 7, 2021 at 3:53 PM Jinpu Wang <jinpu.wang@ionos.com> wrote:
> >
> > On Tue, Apr 20, 2021 at 1:29 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Tue, Apr 20, 2021 at 11:14:41AM +0200, Jinpu Wang wrote:
> > > > On Mon, Mar 22, 2021 at 7:56 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > On Mon, Mar 22, 2021 at 07:08:01AM +0100, Jinpu Wang wrote:
> > > > > > On Sun, Mar 21, 2021 at 2:07 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > >
> > > > > > > On Sat, Mar 20, 2021 at 02:09:50PM +0100, Jack Wang wrote:
> > > > > > > > On Sat, Mar 20, 2021 at 12:17, Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > > >
> > > > > > > > > On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> > > > > > > > > > Hi Jason and Leon,
> > > > > > > > > >
> > > > > > > > > > We recently switched from MLNX-OFED to upstream OFED, and we noticed
> > > > > > > > > > that IPoIB stopped working with upstream kernel 5.4.102 with a Mellanox
> > > > > > > > > > CX-5 HCA; it works fine on CX-2/CX-3. I also tested a 5.11 kernel and it
> > > > > > > > > > behaves the same.
> > > > > > > > >
> > > > > > > > > Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > >
> > > > > > > >  Yes.
> > > > > > >
> > > > > > > > Is this expected behavior?
> > > > > > >
> > > > > > > Yes, we wanted to make IPoIB behave like any other netdev interfaces and
> > > > > > > if parent interface isn't enabled, no traffic should pass. More on that,
> > > > > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > > > > resources for both parent and child, this requires us to wait for "UP"
> > > > > > > event before allowing traffic.
> > > > > > >
> > > > > > > Thanks
> > > > > > Hi Leon,
> > > > > >
> > > > > > Thanks for the clarification, is this behavior documented somewhere?
> > > > > > is it specific to "enhanced IPoIB" for CX-5?
> > > > >
> > > > > It is specific to "enhanced IPoIB" and not to device. I don't know where
> > > > > we can document it.
> > > > >
> > > > > > Will it work differently without MLX5_CORE_IPOIB enabled?
> > > > >
> > > > > Yes, without MLX5_CORE_IPOIB, the devices will work in "legacy IPoIB",
> > > > > exactly as cx-3. The best thing will be to change IPoIB ULP to behave
> > > > > like netdev, but we were not comfortable to do it back then due to
> > > > > user visible nature of such change.
> > > > >
> > > > Hi Leon,
> > > >
> > > > More testing reveals new problems with MLX5_CORE_IPOIB.
> > > > With MLX5_CORE_IPOIB, ping works on both hosts, but iperf3 doesn't send any data.
> >
> > Just want to give an update: we finally found the key factor that led
> > to the failure on our side.
> >
> > We need to set the child interface to the same MTU as the parent.
> > jwang@ps401a-913.nst:/mnt/jwang$ ip link list
> > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
> > mode DEFAULT group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
> > mode DEFAULT group default qlen 1000
> >     link/ether 0c:c4:7a:ff:07:ce brd ff:ff:ff:ff:ff:ff
> > 3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
> > DEFAULT group default qlen 1000
> >     link/ether 0c:c4:7a:ff:07:cf brd ff:ff:ff:ff:ff:ff
> > 6: ha_transport: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue
> > state UNKNOWN mode DEFAULT group default qlen 1000
> >     link/ether f6:ff:16:93:08:8a brd ff:ff:ff:ff:ff:ff
> > 11: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP
> > mode DEFAULT group default qlen 1024
> >     link/infiniband
> > 00:00:00:83:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:12 brd
> > 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
> > 12: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP
> > mode DEFAULT group default qlen 1024
> >     link/infiniband
> > 00:00:01:58:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:13 brd
> > 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
> > 13: ib0.dddd@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq
> > state UP mode DEFAULT group default qlen 1024
> >     link/infiniband
> > 00:00:10:8c:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:12 brd
> > 00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
> > 14: ib1.dddd@ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc mq
> > state UP mode DEFAULT group default qlen 1024
> >     link/infiniband
> > 00:00:11:8c:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:13 brd
> > 00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
> >
> > Initially, the ib0 MTU was 2044 and the ib0.dddd MTU was 4092.
> > After I reduced the ib0.dddd MTU to 2044 on both sides, iperf3 worked fine.
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-configuring_ipoib
>
> "
> When using datagram mode, the unreliable, disconnected queue pair type
> does not allow any packets larger than the InfiniBand link-layer’s
> MTU. The IPoIB layer adds a 4 byte IPoIB header on top of the IP
> packet being transmitted. As a result, the IPoIB MTU must be 4 bytes
> less than the InfiniBand link-layer MTU. As 2048 is a common
> InfiniBand link-layer MTU, the common IPoIB device MTU in datagram
> mode is 2044.
> "
Thanks for the hint, Yanjun.
>
> >
> > Could you explain why the MTU must be exactly the same as the parent's
> > in enhanced IPoIB mode? Is there anything else we must treat specially?
> > I guess it is related to:
> >
> > > > > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > > > > resources for both parent and child, this requires us to wait for "UP"
> > > > > > > event before allowing traffic.
> >
> > Thanks!
> > Jinpu


Thread overview: 10+ messages
2021-03-19  7:44 IPoIB child interfaces not working with mlx5 Jinpu Wang
2021-03-20  9:30 ` Leon Romanovsky
     [not found]   ` <CAD+HZHUHbuBeoB4cCLc78gsmZAEyEr+fiWtpuTrxyzRBzMBf_g@mail.gmail.com>
2021-03-21 13:07     ` Leon Romanovsky
2021-03-22  6:08       ` Jinpu Wang
2021-03-22  6:56         ` Leon Romanovsky
2021-04-20  9:14           ` Jinpu Wang
2021-04-20 11:29             ` Leon Romanovsky
2021-05-07  6:53               ` Jinpu Wang
2021-05-07  8:03                 ` Zhu Yanjun
2021-05-07  8:11                   ` Jinpu Wang
