* IPoIB child interfaces not working with mlx5
@ 2021-03-19  7:44 Jinpu Wang
  2021-03-20  9:30 ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-03-19  7:44 UTC (permalink / raw)
  To: linux-rdma, Jason Gunthorpe, Leon Romanovsky, Doug Ledford

Hi Jason and Leon,

We recently switched from MLNX-OFED to upstream OFED, and we noticed that
IPoIB stops working with upstream kernel 5.4.102 on a Mellanox CX-5
HCA; it works fine on CX-2/CX-3. I also tested a 5.11 kernel and it
behaves the same.

The symptoms: the IPoIB child interfaces are UP and ready, but ping
doesn't work at all, and a simple ifdown/ifup of the child interface
doesn't change anything.

The workaround is to bring up the parent interface: "ip link set ib0 up"

Basic config from "ip a":

jwang@ps401a-914.nst:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 0c:c4:7a:ff:07:d0 brd ff:ff:ff:ff:ff:ff
    inet 10.41.3.146/22 brd 10.41.3.255 scope global eth0
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0c:c4:7a:ff:07:d1 brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1024
    link/infiniband 00:00:11:07:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:52 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ib1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 1024
    link/infiniband 00:00:19:07:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:53 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
6: ib0.beef@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 4092 qdisc mq state UP group default qlen 1024
    link/infiniband 00:00:11:4b:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:52 brd 00:ff:ff:ff:ff:12:40:1b:be:ef:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.42.3.146/20 brd 10.42.15.255 scope global ib0.beef
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9b03:66:de52/64 scope link
       valid_lft forever preferred_lft forever
7: ib0.dddd@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 4092 qdisc mq state UP group default qlen 1024
    link/infiniband 00:00:12:87:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:52 brd 00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 2a02:247f:401:1:2:0:a:392/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9b03:66:de52/64 scope link
       valid_lft forever preferred_lft forever
8: ib1.beef@ib1: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 4092 qdisc mq state UP group default qlen 1024
    link/infiniband 00:00:19:4b:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:53 brd 00:ff:ff:ff:ff:12:40:1b:be:ef:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.43.3.146/20 brd 10.43.15.255 scope global ib1.beef
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9b03:66:de53/64 scope link
       valid_lft forever preferred_lft forever
9: ib1.dddd@ib1: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 4092 qdisc mq state UP group default qlen 1024
    link/infiniband 00:00:1a:87:fe:80:00:00:00:00:00:00:98:03:9b:03:00:66:de:53 brd 00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
    inet6 2a02:247f:402:1:2:0:a:392/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9b03:66:de53/64 scope link
       valid_lft forever preferred_lft forever

jwang@ps401a-914.nst:~$ dmesg | egrep 'mlx|ib'
[    0.000000] Command line: BOOT_IMAGE=(http)/live-images/liveboot-2021.76/vmlinuz BOOTIF=0c:c4:7a:ff:07:d0 boot=live fetch=http://mgmt/live-images/liveboot-2021.76/root.squashfs consoleblank=0 PHASE=Testing crashkernel=512M quiet salt-master=salt-master.stg.profitbricks.net saltenv=base pillarenv=base ib_ipoib.debug_level=1 liveboot.sdn2
[    0.889525] Kernel command line: BOOT_IMAGE=(http)/live-images/liveboot-2021.76/vmlinuz BOOTIF=0c:c4:7a:ff:07:d0 boot=live fetch=http://mgmt/live-images/liveboot-2021.76/root.squashfs consoleblank=0 PHASE=Testing crashkernel=512M quiet salt-master=salt-master.stg.profitbricks.net saltenv=base pillarenv=base ib_ipoib.debug_level=1 liveboot.sdn2
[    1.997444] Calibrating delay loop (skipped), value calculated using timer frequency.. 4200.00 BogoMIPS (lpj=21000000)
[    2.422119] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[    2.422119] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[    2.992059] pci_bus 0000:03: extended config space not accessible
[    3.024991] pci 0000:03:00.0: vgaarb: bridge control possible
[    5.287548] tsc: Refined TSC clocksource calibration: 2099.999 MHz
[   16.839146] systemd[1]: File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
[   16.874155] systemd[1]: /lib/systemd/system/tap-offloads-trk.service:10: PIDFile= references path below legacy directory /var/run/, updating /var/run/tap-offloads-trk.pid → /run/tap-offloads-trk.pid; please update the unit file accordingly.
[   16.893383] systemd[1]: Listening on initctl Compatibility Named Pipe.
[   23.244067] mlx5_core 0000:af:00.0: firmware version: 16.27.2008
[   23.244103] mlx5_core 0000:af:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[   23.274277] libata version 3.00 loaded.
[   23.555901] mlx5_core 0000:af:00.0: Port module event: module 0, Cable plugged
[   23.556314] mlx5_core 0000:af:00.0: mlx5_pcie_event:296:(pid 7): PCIe slot advertised sufficient power (75W).
[   23.573895] mlx5_core 0000:af:00.1: firmware version: 16.27.2008
[   23.573950] mlx5_core 0000:af:00.1: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[   23.885989] mlx5_core 0000:af:00.1: Port module event: module 1, Cable plugged
[   23.886133] mlx5_core 0000:af:00.1: mlx5_pcie_event:296:(pid 3256): PCIe slot advertised sufficient power (75W).
[   27.924069] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   27.924076] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   27.999211] ib0: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   28.000387] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   28.000393] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   28.086111] ib1: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.415045] ib0: Event 12 on device mlx5_0 port 1
[   29.415147] ib0: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.415661] ib0: Event 12 on device mlx5_0 port 1
[   29.415742] ib0: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.416497] ib0: Event 12 on device mlx5_0 port 1
[   29.416591] ib0: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.419656] ib0: Event 17 on device mlx5_0 port 1
[   29.419669] ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.420226] ib0: Event 11 on device mlx5_0 port 1
[   29.420240] ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.420257] ib1: Event 12 on device mlx5_1 port 1
[   29.420317] ib1: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.420840] ib1: Event 12 on device mlx5_1 port 1
[   29.420898] ib1: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.421190] ib1: Event 12 on device mlx5_1 port 1
[   29.421247] ib1: Not flushing - IPOIB_FLAG_ADMIN_UP not set.
[   29.421632] ib1: Event 11 on device mlx5_1 port 1
[   29.421640] ib1: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.422261] ib1: Event 17 on device mlx5_1 port 1
[   29.422276] ib1: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.749430] ib0: Event 9 on device mlx5_0 port 1
[   29.749441] ib0: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   29.751349] ib1: Event 9 on device mlx5_1 port 1
[   29.751365] ib1: Not flushing - IPOIB_FLAG_INITIALIZED not set.
[   46.707421] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   46.707434] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   46.725944] ib0.beef: bringing up interface
[   46.968005] ib0.beef: Created ah 00000000cb29051b
[   47.000529] IPv6: ADDRCONF(NETDEV_CHANGE): ib0.beef: link becomes ready
[   47.004101] ib0.beef: Created ah 000000001338d4ae
[   47.007399] ib0.beef: Created ah 000000002947be1d
[   47.010668] ib0.beef: Created ah 00000000a8586948
[   47.013871] ib0.beef: Created ah 00000000e584ea42
[   47.033747] ib0.beef: Created ah 0000000086cb1ff9
[   47.189454] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   47.189465] mlx5_core 0000:af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   47.215051] ib0.dddd: bringing up interface
[   47.457634] ib0.dddd: Created ah 000000009bb41171
[   47.490564] IPv6: ADDRCONF(NETDEV_CHANGE): ib0.dddd: link becomes ready
[   47.494065] ib0.dddd: Created ah 00000000531ff3b3
[   47.497206] ib0.dddd: Created ah 0000000006238049
[   47.500281] ib0.dddd: Created ah 00000000a2776703
[   47.503453] ib0.dddd: Created ah 000000006f839ea0
[   47.506697] ib0.dddd: Created ah 00000000d3218392
[   47.523579] ib0.dddd: Created ah 000000004e8a14c7
[   48.894389] ib0.dddd: Created ah 00000000c664dbd4
[   48.897657] ib0.beef: Created ah 00000000c446a0e6
[   49.593055] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   49.593064] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   49.610051] ib1.beef: bringing up interface
[   49.857979] ib1.beef: Created ah 000000003571492a
[   49.890521] IPv6: ADDRCONF(NETDEV_CHANGE): ib1.beef: link becomes ready
[   49.893951] ib1.beef: Created ah 00000000aea98452
[   49.897011] ib1.beef: Created ah 000000004e23c357
[   49.899995] ib1.beef: Created ah 00000000ed62df50
[   49.903036] ib1.beef: Created ah 0000000041605d6d
[   49.915754] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   49.915765] mlx5_core 0000:af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[   49.923955] ib1.beef: Created ah 00000000f5d6b457
[   49.943153] ib1.dddd: bringing up interface
[   50.187608] ib1.dddd: Created ah 00000000cebeba47
[   50.220523] IPv6: ADDRCONF(NETDEV_CHANGE): ib1.dddd: link becomes ready
[   50.224347] ib1.dddd: Created ah 00000000c6f96f11
[   50.227539] ib1.dddd: Created ah 000000004fe70418
[   50.230691] ib1.dddd: Created ah 00000000ae96df99
[   50.233810] ib1.dddd: Created ah 000000004af47f93
[   50.236892] ib1.dddd: Created ah 0000000064aca082
[   50.264221] ib1.dddd: Created ah 00000000f330012e
[   51.774399] ib1.beef: Created ah 000000007f1ef527
[   52.094689] ib1.dddd: Created ah 00000000210b80b4
[   57.215935] ib0.dddd: Created ah 00000000f07b9547
[   57.216368] ib1.beef: Created ah 00000000f3a87dc7
[   57.219420] ib1.beef: Created ah 00000000b7d4d592
[   57.225647] ib0.beef: Created ah 00000000e65557a4
[   57.228334] ib1.dddd: Created ah 000000001914b301
[   57.228819] ib0.beef: Created ah 0000000070b21f1c
[   57.264003] ib1.beef: Created ah 0000000070b3a6e8
[   57.264079] ib0.beef: Created ah 00000000be1feac1,
[  137.514460] ib0.beef: neigh free for ffffff ff12:601b:beef:0000:0000:0001:ff66:de52
[  137.514461] ib0.dddd: neigh free for ffffff ff12:601b:dddd:0000:0000:0001:ff0a:0392
[  137.514471] ib0.dddd: neigh free for ffffff ff12:601b:dddd:0000:0000:0001:ff66:de52
[  137.514473] ib0.beef: neigh free for ffffff ff12:401b:beef:0000:0000:0000:0000:0016
[  137.514477] ib0.dddd: neigh free for ffffff ff12:601b:dddd:0000:0000:0000:0000:0016
[  137.514478] ib0.beef: neigh free for ffffff ff12:601b:beef:0000:0000:0000:0000:0016
[  140.074531] ib1.beef: neigh free for ffffff ff12:401b:beef:0000:0000:0000:0000:0016
[  140.074541] ib1.beef: neigh free for ffffff ff12:601b:beef:0000:0000:0000:0000:0016
[  140.074545] ib1.beef: neigh free for ffffff ff12:601b:beef:0000:0000:0001:ff66:de53
[  140.714539] ib1.dddd: neigh free for ffffff ff12:601b:dddd:0000:0000:0001:ff0a:0392
[  140.714549] ib1.dddd: neigh free for ffffff ff12:601b:dddd:0000:0000:0000:0000:0016
[  140.714553] ib1.dddd: neigh free for ffffff ff12:601b:dddd:0000:0000:0001:ff66:de53
[  144.470916] ib0.dddd: Created ah 000000009d40e279
[  177.320655] ib0.dddd: Created ah 0000000023a374d0
[  177.321583] ib1.beef: Created ah 00000000b54aadfc
[  177.324385] ib0.beef: Created ah 00000000f4507818
[  177.325263] ib1.beef: Created ah 00000000132b48ff
[  177.328056] ib0.beef: Created ah 000000004e093b7c
[  177.328715] ib1.dddd: Created ah 00000000b274652f
[  177.358792] ib0.beef: Created ah 0000000076e40813
[  177.358863] ib1.dddd: Created ah 00000000146f0ae3
[  177.361796] ib1.beef: Created ah 00000000d7c8cff5
[  177.362033] ib0.beef: Created ah 0000000086031b72
[  177.365082] ib0.dddd: Created ah 0000000083e723db
[  177.365086] ib1.beef: Created ah 0000000029b2b4cb
[  200.215825] ib1.beef: neigh free for ffffff ff12:401b:beef:0000:0000:0000:ffff:ffff

I suspect it might be related to a change in this patchset:
https://lore.kernel.org/linux-rdma/20180729083500.5352-1-leon@kernel.org/

Is this expected behavior? How can we fix it?

Thanks!
-- 
Jinpu Wang

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: IPoIB child interfaces not working with mlx5
  2021-03-19  7:44 IPoIB child interfaces not working with mlx5 Jinpu Wang
@ 2021-03-20  9:30 ` Leon Romanovsky
       [not found]   ` <CAD+HZHUHbuBeoB4cCLc78gsmZAEyEr+fiWtpuTrxyzRBzMBf_g@mail.gmail.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-03-20  9:30 UTC
  To: Jinpu Wang; +Cc: linux-rdma, Jason Gunthorpe, Doug Ledford

On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> Hi Jason and Leon,
>
> We recently switch to use upstream OFED from MLNX-OFED, and we notice
> IPoIB stop working with upstream kernel 5.4.102 with mellanox CX-5
> HCA, it's working fine on CX-2/CX-3. I tested also on 5.11 kernel it
> behaves the same.

Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?

Thanks
[parent not found: <CAD+HZHUHbuBeoB4cCLc78gsmZAEyEr+fiWtpuTrxyzRBzMBf_g@mail.gmail.com>]
* Re: IPoIB child interfaces not working with mlx5
       [not found]   ` <CAD+HZHUHbuBeoB4cCLc78gsmZAEyEr+fiWtpuTrxyzRBzMBf_g@mail.gmail.com>
@ 2021-03-21 13:07     ` Leon Romanovsky
  2021-03-22  6:08       ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-03-21 13:07 UTC
  To: Jack Wang; +Cc: Doug Ledford, Jason Gunthorpe, Jinpu Wang, linux-rdma

On Sat, Mar 20, 2021 at 02:09:50PM +0100, Jack Wang wrote:
> Leon Romanovsky <leon@kernel.org> wrote on Sat, Mar 20, 2021 at 12:17:
>
> > On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> > > Hi Jason and Leon,
> > >
> > > We recently switch to use upstream OFED from MLNX-OFED, and we notice
> > > IPoIB stop working with upstream kernel 5.4.102 with mellanox CX-5
> > > HCA, it's working fine on CX-2/CX-3. I tested also on 5.11 kernel it
> > > behaves the same.
> >
> > Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?
> >
> > Thanks
>
> Yes.

> Is this expected behavor?

Yes, we wanted to make IPoIB behave like any other netdev interface:
if the parent interface isn't enabled, no traffic should pass. Moreover,
in our internal implementation of enhanced IPoIB, we reuse the same
resources for both parent and child, which requires us to wait for the
"UP" event before allowing traffic.

Thanks
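
A minimal sketch, not taken from the thread, of how the condition described above could be spotted on a host. In the "ip a" output of the original report every child interface carries the M-DOWN flag, which, as far as I understand iproute2, is printed when the lower (parent) device is not administratively up. Assuming that holds for the installed iproute2 version, the hypothetical helper `children_with_parent_down` reads `ip -o link show` output on stdin and prints each affected child together with its parent.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): list child
# interfaces whose parent is administratively down.
# Input:  `ip -o link show` output on stdin (one interface per line).
# Output: "<child> <parent>" per affected interface.
# Assumption: iproute2 marks such children with the M-DOWN flag.
children_with_parent_down() {
    awk '
        # Field 2 looks like "ib0.beef@ib0:" for a child interface,
        # field 3 holds the flag list in angle brackets.
        $2 ~ /@/ && $3 ~ /M-DOWN/ {
            split($2, name, "@")
            parent = name[2]
            sub(/:$/, "", parent)
            print name[1] " " parent
        }
    '
}

# Usage on a live host (prints the workaround from the thread per child):
#   ip -o link show | children_with_parent_down |
#       while read -r child parent; do
#           echo "$child: parent is down, try: ip link set $parent up"
#       done
```

The helper only parses text, so it can be tried against saved output before running it on a production machine.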
* Re: IPoIB child interfaces not working with mlx5
  2021-03-21 13:07     ` Leon Romanovsky
@ 2021-03-22  6:08       ` Jinpu Wang
  2021-03-22  6:56         ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-03-22  6:08 UTC
  To: Leon Romanovsky; +Cc: Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Sun, Mar 21, 2021 at 2:07 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sat, Mar 20, 2021 at 02:09:50PM +0100, Jack Wang wrote:
> > Leon Romanovsky <leon@kernel.org> wrote on Sat, Mar 20, 2021 at 12:17:
> >
> > > On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> > > > Hi Jason and Leon,
> > > >
> > > > We recently switch to use upstream OFED from MLNX-OFED, and we notice
> > > > IPoIB stop working with upstream kernel 5.4.102 with mellanox CX-5
> > > > HCA, it's working fine on CX-2/CX-3. I tested also on 5.11 kernel it
> > > > behaves the same.
> > >
> > > Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?
> > >
> > > Thanks
> >
> > Yes.
>
> > Is this expected behavor?
>
> Yes, we wanted to make IPoIB behave like any other netdev interfaces and
> if parent interface isn't enabled, no traffic should pass. More on that,
> in our internal implementation of enhanced IPoIB, we are reusing same
> resources for both parent and child, this requires us to wait for "UP"
> event before allowing traffic.
>
> Thanks

Hi Leon,

Thanks for the clarification. Is this behavior documented somewhere?
Is it specific to "enhanced IPoIB" for CX-5?
Will it work differently without MLX5_CORE_IPOIB enabled?

I think it would be helpful to add a message, if possible, to remind
the admin to enable the parent when only the child is configured.

Thanks!
* Re: IPoIB child interfaces not working with mlx5
  2021-03-22  6:08       ` Jinpu Wang
@ 2021-03-22  6:56         ` Leon Romanovsky
  2021-04-20  9:14           ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-03-22  6:56 UTC
  To: Jinpu Wang; +Cc: Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Mon, Mar 22, 2021 at 07:08:01AM +0100, Jinpu Wang wrote:
> On Sun, Mar 21, 2021 at 2:07 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Sat, Mar 20, 2021 at 02:09:50PM +0100, Jack Wang wrote:
> > > Leon Romanovsky <leon@kernel.org> wrote on Sat, Mar 20, 2021 at 12:17:
> > >
> > > > On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> > > > > Hi Jason and Leon,
> > > > >
> > > > > We recently switch to use upstream OFED from MLNX-OFED, and we notice
> > > > > IPoIB stop working with upstream kernel 5.4.102 with mellanox CX-5
> > > > > HCA, it's working fine on CX-2/CX-3. I tested also on 5.11 kernel it
> > > > > behaves the same.
> > > >
> > > > Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?
> > > >
> > > > Thanks
> > >
> > > Yes.
> >
> > > Is this expected behavor?
> >
> > Yes, we wanted to make IPoIB behave like any other netdev interfaces and
> > if parent interface isn't enabled, no traffic should pass. More on that,
> > in our internal implementation of enhanced IPoIB, we are reusing same
> > resources for both parent and child, this requires us to wait for "UP"
> > event before allowing traffic.
> >
> > Thanks
> Hi Leon,
>
> Thanks for the clarification, is this behavior documented somewhere?
> is it specific to "enhanced IPoIB" for CX-5?

It is specific to "enhanced IPoIB", not to the device. I don't know where
we can document it.

> Will it work differently if without MLX5_CORE_IPOIB enabled?

Yes, without MLX5_CORE_IPOIB, the devices will work in "legacy IPoIB",
exactly like CX-3. The best thing would be to change the IPoIB ULP to
behave like a netdev, but we were not comfortable doing that back then
because of the user-visible nature of such a change.

>
> I think it would be helpful to add a message if possible to remind
> admin to enable parent if only child if configured.

Care to send a patch?

Thanks

>
> Thanks!
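
As a rough illustration of the distinction drawn above, the sketch below classifies a kernel build by whether the MLX5_CORE_IPOIB option is set. This is an assumption-laden sketch: `ipoib_mode_from_config` is a hypothetical helper name, the config file location varies by distribution, and the build option only says what the kernel is capable of, not which mode a given device actually runs in.

```shell
#!/bin/sh
# Hypothetical helper: report "enhanced" when the kernel config read from
# stdin has CONFIG_MLX5_CORE_IPOIB=y, otherwise "legacy".
# Assumption: the option is a boolean, so only "=y" counts as enabled.
ipoib_mode_from_config() {
    if grep -q '^CONFIG_MLX5_CORE_IPOIB=y$'; then
        echo enhanced
    else
        echo legacy
    fi
}

# Usage (paths are assumptions; use whichever exists on the host):
#   ipoib_mode_from_config < "/boot/config-$(uname -r)"
#   zcat /proc/config.gz | ipoib_mode_from_config
```

On a kernel reported as "enhanced", the parent-must-be-up behaviour discussed in this thread is the one to expect for mlx5 devices.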
* Re: IPoIB child interfaces not working with mlx5
  2021-03-22  6:56         ` Leon Romanovsky
@ 2021-04-20  9:14           ` Jinpu Wang
  2021-04-20 11:29             ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-04-20  9:14 UTC
  To: Leon Romanovsky
  Cc: Jinpu Wang, Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Mon, Mar 22, 2021 at 7:56 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Mar 22, 2021 at 07:08:01AM +0100, Jinpu Wang wrote:
> > On Sun, Mar 21, 2021 at 2:07 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Sat, Mar 20, 2021 at 02:09:50PM +0100, Jack Wang wrote:
> > > > Leon Romanovsky <leon@kernel.org> wrote on Sat, Mar 20, 2021 at 12:17:
> > > >
> > > > > On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> > > > > > Hi Jason and Leon,
> > > > > >
> > > > > > We recently switch to use upstream OFED from MLNX-OFED, and we notice
> > > > > > IPoIB stop working with upstream kernel 5.4.102 with mellanox CX-5
> > > > > > HCA, it's working fine on CX-2/CX-3. I tested also on 5.11 kernel it
> > > > > > behaves the same.
> > > > >
> > > > > Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?
> > > > >
> > > > > Thanks
> > > >
> > > > Yes.
> > >
> > > > Is this expected behavor?
> > >
> > > Yes, we wanted to make IPoIB behave like any other netdev interfaces and
> > > if parent interface isn't enabled, no traffic should pass. More on that,
> > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > resources for both parent and child, this requires us to wait for "UP"
> > > event before allowing traffic.
> > >
> > > Thanks
> > Hi Leon,
> >
> > Thanks for the clarification, is this behavior documented somewhere?
> > is it specific to "enhanced IPoIB" for CX-5?
>
> It is specific to "enhanced IPoIB" and not to device. I don't know where
> we can document it.
>
> > Will it work differently if without MLX5_CORE_IPOIB enabled?
>
> Yes, without MLX5_CORE_IPOIB, the devices will work in "legacy IPoIB",
> exactly as cx-3. The best thing will be to change IPoIB ULP to behave
> like netdev, but we were not comfortable to do it back then due to
> user visible nature of such change.
>
Hi Leon,

More testing reveals new problems with MLX5_CORE_IPOIB.
With MLX5_CORE_IPOIB, ping works on both hosts, but iperf3 doesn't send any data.

I'm running "iperf3 -s" on A and "sudo iperf3 -t 30000 -c ip6_of_A" on B.

Example output:
[  5] local 2a02:247f:401:1:2:0:a:391 port 41288 connected to 2a02:247f:401:1:2:0:a:392 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   165 KBytes  1.35 Mbits/sec    2   3.93 KBytes
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   3.93 KBytes
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   3.93 KBytes

When I disable MLX5_CORE_IPOIB and run the same test, iperf3 runs
without problems:
[  5] local 2a02:247f:401:1:2:0:a:391 port 51866 connected to 2a02:247f:401:1:2:0:a:392 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   293 MBytes  2.46 Gbits/sec    0   1.50 MBytes
[  5]   1.00-2.00   sec   290 MBytes  2.43 Gbits/sec    0   1.50 MBytes
[  5]   2.00-3.00   sec   289 MBytes  2.42 Gbits/sec    0   1.50 MBytes
[  5]   3.00-4.00   sec   290 MBytes  2.43 Gbits/sec    0   1.50 MBytes

On both sides we have:
jwang@ps401a-913.nst:/mnt/jwang$ ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.27.2008
        Hardware version: 0
        Node GUID: 0x98039b03006c7912
        System image GUID: 0x98039b03006c7912
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 14
                LMC: 0
                SM lid: 19
                Capability mask: 0x2651e848
                Port GUID: 0x98039b03006c7912
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.27.2008
        Hardware version: 0
        Node GUID: 0x98039b03006c7913
        System image GUID: 0x98039b03006c7912
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 15
                LMC: 0
                SM lid: 45
                Capability mask: 0x2651e848
                Port GUID: 0x98039b03006c7913
                Link layer: InfiniBand

The initial tests were done on 5.4.102. I also did a brief test on
~linux-5.12-rc4 with MLX5_CORE_IPOIB; iperf3 fails there in the same
way as on 5.4.102.

cat /etc/network/interfaces.d/infiniband
auto ib0.beef
iface ib0.beef inet static
    address 10.42.3.145
    netmask 20
    up sysctl -w net.ipv4.conf.ib0/beef.forwarding=1
    up ethtool -K $IFACE gro off
    pre-up ip link set ib0 up
    dad-attempts 600

auto ib0.dddd
iface ib0.dddd inet6 static
    address 2a02:247f:401:1:2:0:a:391
    netmask 64
    pre-up ip link set ib0 up
    up sysctl -w net.ipv6.conf.ib0/dddd.forwarding=1 net.ipv6.conf.ib0/dddd.proxy_ndp=1
    up ip -6 route add fd57:1:0:4::/64 dev $IFACE
    up ethtool -K $IFACE gro off
    dad-attempts 600

auto ib1.beef
iface ib1.beef inet static
    address 10.43.3.145
    netmask 20
    up sysctl -w net.ipv4.conf.ib1/beef.forwarding=1
    up ethtool -K $IFACE gro off
    pre-up ip link set ib1 up
    dad-attempts 600

auto ib1.dddd
iface ib1.dddd inet6 static
    address 2a02:247f:402:1:2:0:a:391
    netmask 64
    pre-up ip link set ib1 up
    up sysctl -w net.ipv6.conf.ib1/dddd.forwarding=1 net.ipv6.conf.ib1/dddd.proxy_ndp=1
    up ip -6 route add fd57:2:0:4::/64 dev $IFACE
    up ethtool -K $IFACE gro off
    dad-attempts 600

Thanks!
* Re: IPoIB child interfaces not working with mlx5
  2021-04-20  9:14           ` Jinpu Wang
@ 2021-04-20 11:29             ` Leon Romanovsky
  2021-05-07  6:53               ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-04-20 11:29 UTC
  To: Jinpu Wang
  Cc: Jinpu Wang, Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Tue, Apr 20, 2021 at 11:14:41AM +0200, Jinpu Wang wrote:
> On Mon, Mar 22, 2021 at 7:56 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Mon, Mar 22, 2021 at 07:08:01AM +0100, Jinpu Wang wrote:
> > > On Sun, Mar 21, 2021 at 2:07 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Sat, Mar 20, 2021 at 02:09:50PM +0100, Jack Wang wrote:
> > > > > Leon Romanovsky <leon@kernel.org> wrote on Sat, Mar 20, 2021 at 12:17:
> > > > >
> > > > > > On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> > > > > > > Hi Jason and Leon,
> > > > > > >
> > > > > > > We recently switch to use upstream OFED from MLNX-OFED, and we notice
> > > > > > > IPoIB stop working with upstream kernel 5.4.102 with mellanox CX-5
> > > > > > > HCA, it's working fine on CX-2/CX-3. I tested also on 5.11 kernel it
> > > > > > > behaves the same.
> > > > > >
> > > > > > Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?
> > > > > >
> > > > > > Thanks
> > > > >
> > > > > Yes.
> > > >
> > > > > Is this expected behavor?
> > > >
> > > > Yes, we wanted to make IPoIB behave like any other netdev interfaces and
> > > > if parent interface isn't enabled, no traffic should pass. More on that,
> > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > resources for both parent and child, this requires us to wait for "UP"
> > > > event before allowing traffic.
> > > >
> > > > Thanks
> > > Hi Leon,
> > >
> > > Thanks for the clarification, is this behavior documented somewhere?
> > > is it specific to "enhanced IPoIB" for CX-5?
> >
> > It is specific to "enhanced IPoIB" and not to device. I don't know where
> > we can document it.
> >
> > > Will it work differently if without MLX5_CORE_IPOIB enabled?
> >
> > Yes, without MLX5_CORE_IPOIB, the devices will work in "legacy IPoIB",
> > exactly as cx-3. The best thing will be to change IPoIB ULP to behave
> > like netdev, but we were not comfortable to do it back then due to
> > user visible nature of such change.
> >
> Hi Leon,
>
> More testing reveals new problems with MLX5_CORE_IPOIB.
> w MLX5_CORE_IPOIB, ping wors on both hosts, but iperf3 doens't send any data.

In our regression tests, iperf3 works.
Let's take it offline.

Thanks
* Re: IPoIB child interfaces not working with mlx5
  2021-04-20 11:29             ` Leon Romanovsky
@ 2021-05-07  6:53               ` Jinpu Wang
  2021-05-07  8:03                 ` Zhu Yanjun
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-05-07  6:53 UTC
  To: Leon Romanovsky, Itay Aveksis
  Cc: Jinpu Wang, Jack Wang, Doug Ledford, Jason Gunthorpe, linux-rdma

On Tue, Apr 20, 2021 at 1:29 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Apr 20, 2021 at 11:14:41AM +0200, Jinpu Wang wrote:
> > On Mon, Mar 22, 2021 at 7:56 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Mon, Mar 22, 2021 at 07:08:01AM +0100, Jinpu Wang wrote:
> > > > On Sun, Mar 21, 2021 at 2:07 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > On Sat, Mar 20, 2021 at 02:09:50PM +0100, Jack Wang wrote:
> > > > > > Leon Romanovsky <leon@kernel.org> wrote on Sat, Mar 20, 2021 at 12:17:
> > > > > >
> > > > > > > On Fri, Mar 19, 2021 at 08:44:29AM +0100, Jinpu Wang wrote:
> > > > > > > > Hi Jason and Leon,
> > > > > > > >
> > > > > > > > We recently switch to use upstream OFED from MLNX-OFED, and we notice
> > > > > > > > IPoIB stop working with upstream kernel 5.4.102 with mellanox CX-5
> > > > > > > > HCA, it's working fine on CX-2/CX-3. I tested also on 5.11 kernel it
> > > > > > > > behaves the same.
> > > > > > >
> > > > > > > Are you using "enhanced IPoIB" for CX-5 devices? MLX5_CORE_IPOIB?
> > > > > > >
> > > > > > > Thanks
> > > > > >
> > > > > > Yes.
> > > > >
> > > > > > Is this expected behavor?
> > > > >
> > > > > Yes, we wanted to make IPoIB behave like any other netdev interfaces and
> > > > > if parent interface isn't enabled, no traffic should pass. More on that,
> > > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > > resources for both parent and child, this requires us to wait for "UP"
> > > > > event before allowing traffic.
> > > > >
> > > > > Thanks
> > > > Hi Leon,
> > > >
> > > > Thanks for the clarification, is this behavior documented somewhere?
> > > > is it specific to "enhanced IPoIB" for CX-5?
> > >
> > > It is specific to "enhanced IPoIB" and not to device. I don't know where
> > > we can document it.
> > >
> > > > Will it work differently if without MLX5_CORE_IPOIB enabled?
> > >
> > > Yes, without MLX5_CORE_IPOIB, the devices will work in "legacy IPoIB",
> > > exactly as cx-3. The best thing will be to change IPoIB ULP to behave
> > > like netdev, but we were not comfortable to do it back then due to
> > > user visible nature of such change.
> > >
> > Hi Leon,
> >
> > More testing reveals new problems with MLX5_CORE_IPOIB.
> > w MLX5_CORE_IPOIB, ping wors on both hosts, but iperf3 doens't send any data.

Just want to give an update: we finally found the key to the failure
on our side.

We need to set the child interface to the same MTU as the parent.

jwang@ps401a-913.nst:/mnt/jwang$ ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 0c:c4:7a:ff:07:ce brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 0c:c4:7a:ff:07:cf brd ff:ff:ff:ff:ff:ff
6: ha_transport: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether f6:ff:16:93:08:8a brd ff:ff:ff:ff:ff:ff
11: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1024
    link/infiniband 00:00:00:83:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:12 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
12: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1024
    link/infiniband 00:00:01:58:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:13 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
13: ib0.dddd@ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1024
    link/infiniband 00:00:10:8c:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:12 brd 00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff
14: ib1.dddd@ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc mq state UP mode DEFAULT group default qlen 1024
    link/infiniband 00:00:11:8c:fe:80:00:00:00:00:00:00:98:03:9b:03:00:6c:79:13 brd 00:ff:ff:ff:ff:12:40:1b:dd:dd:00:00:00:00:00:00:ff:ff:ff:ff

Initially, the ib0 MTU was 2044 and the ib0.dddd MTU was 4092. After I
reduced the ib0.dddd MTU to 2044 on both sides, iperf3 works fine.

Could you explain why the MTU must be exactly the same in enhanced
IPoIB mode? Is there anything else we must treat specially?

I guess it is related to:
> > > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > > resources for both parent and child, this requires us to wait for "UP"
> > > > > event before allowing traffic.

Thanks!
Jinpu
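
The MTU finding above can be checked mechanically. The following is a minimal sketch under stated assumptions, not something posted in the thread: `mtu_mismatches` is a hypothetical helper that reads `ip -o link show` output on stdin and reports every child interface whose MTU differs from its parent's. It assumes parent and child both appear in the same listing and that the interface name has the `child@parent:` form shown in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: report child interfaces whose MTU differs from the
# MTU of their parent.
# Input:  `ip -o link show` output on stdin (one interface per line).
# Output: "<child> <child_mtu> <parent> <parent_mtu>" per mismatch.
mtu_mismatches() {
    awk '
        {
            # Strip the trailing colon, then split "child@parent".
            ifname = $2
            sub(/:$/, "", ifname)
            n = split(ifname, part, "@")
            # Find the value following the "mtu" keyword.
            mtu = ""
            for (i = 3; i < NF; i++)
                if ($i == "mtu") { mtu = $(i + 1); break }
            mtu_of[part[1]] = mtu
            if (n == 2) {
                parent_of[part[1]] = part[2]
                order[++count] = part[1]
            }
        }
        END {
            # Compare only children whose parent was seen in the input.
            for (k = 1; k <= count; k++) {
                child = order[k]
                parent = parent_of[child]
                if ((parent in mtu_of) && mtu_of[child] != mtu_of[parent])
                    print child " " mtu_of[child] " " parent " " mtu_of[parent]
            }
        }
    '
}

# Usage on a live host:
#   ip -o link show | mtu_mismatches
# A reported child can then be aligned with its parent, for example:
#   ip link set dev ib1.dddd mtu 2044
```

Run against the listing above, the sketch would flag ib1.dddd (4092) against ib1 (2044) and leave ib0.dddd alone, which matches the state described in the message.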
* Re: IPoIB child interfaces not working with mlx5
  2021-05-07  6:53 ` Jinpu Wang
@ 2021-05-07  8:03 ` Zhu Yanjun
  2021-05-07  8:11 ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Zhu Yanjun @ 2021-05-07 8:03 UTC
To: Jinpu Wang
Cc: Leon Romanovsky, Itay Aveksis, Jinpu Wang, Jack Wang, Doug Ledford, Jason Gunthorpe, RDMA mailing list

On Fri, May 7, 2021 at 3:53 PM Jinpu Wang <jinpu.wang@ionos.com> wrote:
> [...]
> Just want to give an update: we finally found the key factor behind
> the failure on our side.
>
> We need to set the child interface to the same MTU as the parent.
> [...]
> Initially, the ib0 MTU was 2044 and the ib0.dddd MTU was 4092.
> After I reduced the ib0.dddd MTU to 2044 on both sides, iperf3 worked fine.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-configuring_ipoib

"
When using datagram mode, the unreliable, disconnected queue pair type
does not allow any packets larger than the InfiniBand link-layer’s
MTU. The IPoIB layer adds a 4 byte IPoIB header on top of the IP
packet being transmitted. As a result, the IPoIB MTU must be 4 bytes
less than the InfiniBand link-layer MTU. As 2048 is a common
InfiniBand link-layer MTU, the common IPoIB device MTU in datagram
mode is 2044.
"

>
> Could you explain why the MTU must be set to exactly the same value in
> enhanced IPoIB mode? Is there anything else we must treat specially?
> I guess it is related to:
> > > > > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > > > > resources for both parent and child, this requires us to wait for "UP"
> > > > > > > event before allowing traffic.
>
> Thanks!
> Jinpu
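The arithmetic in the quoted Red Hat passage can be written out as a short sketch. The 4-byte header and the 2048 -> 2044 case come from the quote; `IPOIB_HEADER_BYTES` and `ipoib_datagram_mtu` are hypothetical names, and reading the 4092 value seen in this thread as a 4096-byte link MTU is an inference from the same rule, not something stated by anyone in the thread.

```python
# Size of the IPoIB encapsulation header, per the quoted Red Hat text.
IPOIB_HEADER_BYTES = 4


def ipoib_datagram_mtu(ib_link_mtu: int) -> int:
    """IPoIB device MTU in datagram mode for a given InfiniBand link MTU."""
    if ib_link_mtu <= IPOIB_HEADER_BYTES:
        raise ValueError("link MTU must be larger than the IPoIB header")
    return ib_link_mtu - IPOIB_HEADER_BYTES


if __name__ == "__main__":
    # 2048 is the case given in the quote; 4096 is an assumed link MTU
    # that would yield the 4092 shown on ib1.dddd in the listing.
    for link_mtu in (2048, 4096):
        print(link_mtu, "->", ipoib_datagram_mtu(link_mtu))
```

Under this reading, a child at 4092 on top of a parent at 2044 would be configured for a larger link MTU than its parent, which matches the mismatch reported above.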
* Re: IPoIB child interfaces not working with mlx5
  2021-05-07  8:03 ` Zhu Yanjun
@ 2021-05-07  8:11 ` Jinpu Wang
  0 siblings, 0 replies; 10+ messages in thread
From: Jinpu Wang @ 2021-05-07 8:11 UTC
To: Zhu Yanjun
Cc: Leon Romanovsky, Itay Aveksis, Jinpu Wang, Jack Wang, Doug Ledford, Jason Gunthorpe, RDMA mailing list

On Fri, May 7, 2021 at 10:03 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>
> On Fri, May 7, 2021 at 3:53 PM Jinpu Wang <jinpu.wang@ionos.com> wrote:
> > [...]
> > Just want to give an update: we finally found the key factor behind
> > the failure on our side.
> >
> > We need to set the child interface to the same MTU as the parent.
> > [...]
> > Initially, the ib0 MTU was 2044 and the ib0.dddd MTU was 4092.
> > After I reduced the ib0.dddd MTU to 2044 on both sides, iperf3 worked fine.
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-configuring_ipoib
>
> "
> When using datagram mode, the unreliable, disconnected queue pair type
> does not allow any packets larger than the InfiniBand link-layer’s
> MTU. The IPoIB layer adds a 4 byte IPoIB header on top of the IP
> packet being transmitted. As a result, the IPoIB MTU must be 4 bytes
> less than the InfiniBand link-layer MTU. As 2048 is a common
> InfiniBand link-layer MTU, the common IPoIB device MTU in datagram
> mode is 2044.
> "

Thanks for the hint, Yanjun.

>
> >
> > Could you explain why the MTU must be set to exactly the same value in
> > enhanced IPoIB mode? Is there anything else we must treat specially?
> > I guess it is related to:
> >
> > > > > > > in our internal implementation of enhanced IPoIB, we are reusing same
> > > > > > > resources for both parent and child, this requires us to wait for "UP"
> > > > > > > event before allowing traffic.
> >
> > Thanks!
> > Jinpu
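Taken together, the thread reports two conditions for enhanced IPoIB child interfaces: the parent must be up, and the child MTU must match the parent MTU. A minimal sketch that only builds the corresponding `ip link set` command lines (it does not run them) could look like this; `fix_commands` is a hypothetical helper, and the interface names and MTU in the example are taken from the listing in this thread.

```python
import shlex


def fix_commands(parent: str, child: str, mtu: int):
    """Build argv lists that bring the parent up and align the child MTU."""
    return [
        ["ip", "link", "set", "dev", parent, "up"],
        ["ip", "link", "set", "dev", child, "mtu", str(mtu)],
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for argv in fix_commands("ib1", "ib1.dddd", 2044):
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere; actually applying the commands needs root privileges and should be done on both hosts, as described in the report.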
end of thread, other threads:[~2021-05-07  8:11 UTC | newest]

Thread overview: 10+ messages
2021-03-19  7:44 IPoIB child interfaces not working with mlx5 Jinpu Wang
2021-03-20  9:30 ` Leon Romanovsky
     [not found]   ` <CAD+HZHUHbuBeoB4cCLc78gsmZAEyEr+fiWtpuTrxyzRBzMBf_g@mail.gmail.com>
2021-03-21 13:07     ` Leon Romanovsky
2021-03-22  6:08       ` Jinpu Wang
2021-03-22  6:56         ` Leon Romanovsky
2021-04-20  9:14           ` Jinpu Wang
2021-04-20 11:29             ` Leon Romanovsky
2021-05-07  6:53               ` Jinpu Wang
2021-05-07  8:03                 ` Zhu Yanjun
2021-05-07  8:11                   ` Jinpu Wang