* Missing infiniband network interfaces after update to 5.14/5.15
From: Jinpu Wang @ 2021-11-11 7:48 UTC
To: RDMA mailing list, Jason Gunthorpe, Leon Romanovsky, Haris Iqbal

Hi Jason, hi Leon,

We are seeing exactly the same error reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=2014094

I suspect it's related to
https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/

Do you have any idea what goes wrong?

Thanks!
Jinpu Wang

* Re: Missing infiniband network interfaces after update to 5.14/5.15
From: Leon Romanovsky @ 2021-11-11 11:29 UTC
To: Jinpu Wang
Cc: RDMA mailing list, Jason Gunthorpe, Haris Iqbal

On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> Hi Jason, hi Leon,
>
> We are seeing exactly the same error reported here:
> https://bugzilla.redhat.com/show_bug.cgi?id=2014094
>
> I suspect it's related to
> https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
>
> Do you have any idea what goes wrong?

I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842

and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
report.

[leonro@c-235-8-1-005 ~]$ uname -a
Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[leonro@c-235-8-1-005 ~]$ rdma dev
0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63

[leonro@c-235-8-1-005 ~]$ uname -a
Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[leonro@c-235-8-1-005 ~]$ rdma dev
0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
[leonro@c-235-8-1-005 ~]$ lspci |grep nox
08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

Thanks

>
> Thanks!
> Jinpu Wang

* Re: Missing infiniband network interfaces after update to 5.14/5.15
From: Jinpu Wang @ 2021-11-12 8:23 UTC
To: Leon Romanovsky
Cc: RDMA mailing list, Jason Gunthorpe, Haris Iqbal

On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > Hi Jason, hi Leon,
> >
> > We are seeing exactly the same error reported here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> >
> > I suspect it's related to
> > https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> >
> > Do you have any idea what goes wrong?
>
> I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
>
> and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> report.
>
> [leonro@c-235-8-1-005 ~]$ uname -a
> Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> [leonro@c-235-8-1-005 ~]$ rdma dev
> 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
>
> [leonro@c-235-8-1-005 ~]$ uname -a
> Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> [leonro@c-235-8-1-005 ~]$ rdma dev
> 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
>
> Thanks
>
Hi,

I tried different hosts with CX-3/CX-5 and they all work fine. I can
only reproduce it on hosts with a somewhat older HCA:
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

The bug report at
https://bugzilla.redhat.com/show_bug.cgi?id=2014094 mentions a
ConnectX HCA too:

01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)

With the instrumentation, I have only narrowed it down to:
	port = setup_port(coredev, port_num, &attr);
	if (IS_ERR(port)) {
		ret = PTR_ERR(port);
		pr_info("setup ports failed %d\n", ret);
		goto err_put;
	}

[ 43.795268] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[ 43.830809] setup ports failed -12
[ 43.830814] infiniband mlx4_0: Couldn't register device with driver model

My guess is the ConnectX HCA may be missing some features, which leads
to ENOMEM. I will continue instrumenting if there is no other hint.

Thanks

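The IS_ERR()/PTR_ERR() check in the snippet above relies on the kernel convention of encoding a negative errno in the pointer value itself, which is how setup_port() can hand back -12 (-ENOMEM on Linux) through a pointer return. The following is a minimal userspace sketch of that convention, not the kernel's own headers: err_ptr(), ptr_err(), is_err() and setup_port_model() are illustrative stand-ins introduced here.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095 /* same bound the kernel uses for encoded errors */

/* Userspace stand-ins for the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(). */
void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

int is_err(const void *ptr)
{
	/* Error values occupy the top MAX_ERRNO addresses of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for setup_port(): fails with -ENOMEM on request. */
void *setup_port_model(int should_fail)
{
	static int fake_port;

	if (should_fail)
		return err_ptr(-ENOMEM);
	return &fake_port;
}
```

A caller would test the result with is_err() and, on failure, recover the errno with ptr_err(), mirroring the "ret = PTR_ERR(port)" line in the instrumented code.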
* Re: Missing infiniband network interfaces after update to 5.14/5.15
From: Jason Gunthorpe @ 2021-11-12 14:23 UTC
To: Jinpu Wang
Cc: Leon Romanovsky, RDMA mailing list, Haris Iqbal

On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > > Hi Jason, hi Leon,
> > >
> > > We are seeing exactly the same error reported here:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> > >
> > > I suspect it's related to
> > > https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> > >
> > > Do you have any idea what goes wrong?
> >
> > I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> > https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
> >
> > and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> > report.
> >
> > [leonro@c-235-8-1-005 ~]$ uname -a
> > Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > [leonro@c-235-8-1-005 ~]$ rdma dev
> > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> >
> > [leonro@c-235-8-1-005 ~]$ uname -a
> > Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > [leonro@c-235-8-1-005 ~]$ rdma dev
> > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> > 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> >
> > Thanks
> >
> Hi,
>
> I tried different hosts with CX-3/CX-5 and they all work fine. I can
> only reproduce it on hosts with a somewhat older HCA:
> 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>
> The bug report at
> https://bugzilla.redhat.com/show_bug.cgi?id=2014094 mentions a
> ConnectX HCA too:
>
> 01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)
>
> With the instrumentation, I have only narrowed it down to:
> 	port = setup_port(coredev, port_num, &attr);
> 	if (IS_ERR(port)) {
> 		ret = PTR_ERR(port);
> 		pr_info("setup ports failed %d\n", ret);
> 		goto err_put;
> 	}

Keep going with the tracing, there are lots of allocations in there.

> My guess is the ConnectX HCA may be missing some features, which leads
> to ENOMEM. I will continue instrumenting if there is no other hint.

Since there is no memory allocation failure splat I'm guessing some
memory allocation hit an overflow and silently failed - i.e. mlx4 is
possibly setting some value to something bogus

Jason

* Re: Missing infiniband network interfaces after update to 5.14/5.15
From: Leon Romanovsky @ 2021-11-14 7:05 UTC
To: Jason Gunthorpe
Cc: Jinpu Wang, RDMA mailing list, Haris Iqbal

On Fri, Nov 12, 2021 at 10:23:56AM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> > On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > > > Hi Jason, hi Leon,
> > > >
> > > > We are seeing exactly the same error reported here:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> > > >
> > > > I suspect it's related to
> > > > https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> > > >
> > > > Do you have any idea what goes wrong?
> > >
> > > I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> > > https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
> > >
> > > and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> > > report.
> > >
> > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > >
> > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> > > 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > >
> > > Thanks
> > >
> > Hi,
> >
> > I tried different hosts with CX-3/CX-5 and they all work fine. I can
> > only reproduce it on hosts with a somewhat older HCA:
> > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> >
> > The bug report at
> > https://bugzilla.redhat.com/show_bug.cgi?id=2014094 mentions a
> > ConnectX HCA too:
> >
> > 01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)
> >
> > With the instrumentation, I have only narrowed it down to:
> > 	port = setup_port(coredev, port_num, &attr);
> > 	if (IS_ERR(port)) {
> > 		ret = PTR_ERR(port);
> > 		pr_info("setup ports failed %d\n", ret);
> > 		goto err_put;
> > 	}
>
> Keep going with the tracing, there are lots of allocations in there.
>
> > My guess is the ConnectX HCA may be missing some features, which leads
> > to ENOMEM. I will continue instrumenting if there is no other hint.
>
> Since there is no memory allocation failure splat I'm guessing some
> memory allocation hit an overflow and silently failed - i.e. mlx4 is
> possibly setting some value to something bogus

Yes, look for the values returned from FW.

>
> Jason

* Re: Missing infiniband network interfaces after update to 5.14/5.15
From: Jinpu Wang @ 2021-11-15 8:18 UTC
To: Leon Romanovsky
Cc: Jason Gunthorpe, RDMA mailing list, Haris Iqbal

On Sun, Nov 14, 2021 at 8:05 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Fri, Nov 12, 2021 at 10:23:56AM -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> > > On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > > > > Hi Jason, hi Leon,
> > > > >
> > > > > We are seeing exactly the same error reported here:
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> > > > >
> > > > > I suspect it's related to
> > > > > https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> > > > >
> > > > > Do you have any idea what goes wrong?
> > > >
> > > > I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> > > > https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
> > > >
> > > > and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> > > > report.
> > > >
> > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > >
> > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > > [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> > > > 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > > 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > >
> > > > Thanks
> > > >
> > > Hi,
> > >
> > > I tried different hosts with CX-3/CX-5 and they all work fine. I can
> > > only reproduce it on hosts with a somewhat older HCA:
> > > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> > >
> > > The bug report at
> > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094 mentions a
> > > ConnectX HCA too:
> > >
> > > 01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)
> > >
> > > With the instrumentation, I have only narrowed it down to:
> > > 	port = setup_port(coredev, port_num, &attr);
> > > 	if (IS_ERR(port)) {
> > > 		ret = PTR_ERR(port);
> > > 		pr_info("setup ports failed %d\n", ret);
> > > 		goto err_put;
> > > 	}
> >
> > Keep going with the tracing, there are lots of allocations in there.
> >
> > > My guess is the ConnectX HCA may be missing some features, which leads
> > > to ENOMEM. I will continue instrumenting if there is no other hint.
> >
> > Since there is no memory allocation failure splat I'm guessing some
> > memory allocation hit an overflow and silently failed - i.e. mlx4 is
> > possibly setting some value to something bogus
>
> Yes, look for the values returned from FW.

Hi Leon, hi Jason

I've found the problem: the device doesn't support per-port diag
counters, and the driver then fails the registration, which is
too harsh.

I'm not sure how to fix it properly. What are your thoughts?

Thanks

[ 3426.452062] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[ 3426.452067] <mlx4_ib> mlx4_ib_alloc_diag_counters: #### i =1, per_port 0
// The device does not set MLX4_DEV_CAP_FLAG2_DIAG_PER_PORT, which leads to the allocation failure.
[ 3426.494000] <mlx4_ib> mlx4_ib_alloc_hw_port_stats: mlx4_ib_alloc_hw_port_stats name null
[ 3426.494170] <mlx4_ib> mlx4_ib_alloc_hw_port_stats: mlx4_ib_alloc_hw_port_stats name null
[ 3426.494174] ibdev ops alloc_hw_stats_port failed
[ 3426.494175] alloc_hw_stats_port failed
[ 3426.494177] setup_hw_port_stats failed, -12
[ 3426.494181] setup ports failed -12
[ 3426.494190] infiniband mlx4_0: Couldn't register device with driver model

>
> >
> > Jason

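The log above points at a common failure shape: an allocation callback signals "not supported" by returning NULL, and its caller cannot tell that apart from an out-of-memory condition, so it reports -12 (-ENOMEM). The sketch below models that shape in plain userspace C. It is only an illustration under stated assumptions: hw_stats_model, alloc_port_stats_model() and setup_port_stats_model() are hypothetical names, not the mlx4 or RDMA core functions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct hw_stats_model {
	int num_counters;
};

/* Hypothetical driver callback: NULL when per-port counters are unsupported. */
struct hw_stats_model *alloc_port_stats_model(int per_port_supported)
{
	struct hw_stats_model *stats;

	if (!per_port_supported)
		return NULL;

	stats = calloc(1, sizeof(*stats));
	if (stats)
		stats->num_counters = 4; /* arbitrary illustrative value */
	return stats;
}

/* Hypothetical core helper: every NULL is reported as -ENOMEM. */
int setup_port_stats_model(int per_port_supported, struct hw_stats_model **out)
{
	struct hw_stats_model *stats = alloc_port_stats_model(per_port_supported);

	if (!stats)
		return -ENOMEM;
	*out = stats;
	return 0;
}
```

In this model, a device without per-port counters makes setup_port_stats_model() return -ENOMEM even though no allocation was ever attempted, which matches the absence of an allocation-failure splat noted earlier in the thread.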
* Re: Missing infiniband network interfaces after update to 5.14/5.15
From: Jinpu Wang @ 2021-11-15 9:20 UTC
To: Leon Romanovsky
Cc: Jason Gunthorpe, RDMA mailing list, Haris Iqbal

On Mon, Nov 15, 2021 at 9:18 AM Jinpu Wang <jinpu.wang@ionos.com> wrote:
>
> On Sun, Nov 14, 2021 at 8:05 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Fri, Nov 12, 2021 at 10:23:56AM -0400, Jason Gunthorpe wrote:
> > > On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> > > > On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > > > > > Hi Jason, hi Leon,
> > > > > >
> > > > > > We are seeing exactly the same error reported here:
> > > > > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> > > > > >
> > > > > > I suspect it's related to
> > > > > > https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> > > > > >
> > > > > > Do you have any idea what goes wrong?
> > > > >
> > > > > I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> > > > > https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
> > > > >
> > > > > and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> > > > > report.
> > > > >
> > > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > > Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > > >
> > > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > > Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > > > [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> > > > > 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > > > 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > > >
> > > > > Thanks
> > > > >
> > > > Hi,
> > > >
> > > > I tried different hosts with CX-3/CX-5 and they all work fine. I can
> > > > only reproduce it on hosts with a somewhat older HCA:
> > > > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> > > >
> > > > The bug report at
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094 mentions a
> > > > ConnectX HCA too:
> > > >
> > > > 01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)
> > > >
> > > > With the instrumentation, I have only narrowed it down to:
> > > > 	port = setup_port(coredev, port_num, &attr);
> > > > 	if (IS_ERR(port)) {
> > > > 		ret = PTR_ERR(port);
> > > > 		pr_info("setup ports failed %d\n", ret);
> > > > 		goto err_put;
> > > > 	}
> > >
> > > Keep going with the tracing, there are lots of allocations in there.
> > >
> > > > My guess is the ConnectX HCA may be missing some features, which leads
> > > > to ENOMEM. I will continue instrumenting if there is no other hint.
> > >
> > > Since there is no memory allocation failure splat I'm guessing some
> > > memory allocation hit an overflow and silently failed - i.e. mlx4 is
> > > possibly setting some value to something bogus
> >
> > Yes, look for the values returned from FW.
>
> Hi Leon, hi Jason
>
> I've found the problem: the device doesn't support per-port diag
> counters, and the driver then fails the registration, which is
> too harsh.
>
> I'm not sure how to fix it properly. What are your thoughts?
>
> Thanks
>
With this change, the device can be detected properly. If you think
it's the right direction, I can submit a patch.

Thanks!

+
+static const struct ib_device_ops mlx4_ib_hw_stats_ops1 = {
+	.alloc_hw_device_stats = mlx4_ib_alloc_hw_device_stats,
+	.get_hw_stats = mlx4_ib_get_hw_stats,
+};
+
 static int mlx4_ib_alloc_diag_counters(struct mlx4_ib_dev *ibdev)
 {
 	struct mlx4_ib_diag_counters *diag = ibdev->diag_counters;
@@ -2230,8 +2238,11 @@ static int mlx4_ib_alloc_diag_counters(struct mlx4_ib_dev *ibdev)

 	for (i = 0; i < MLX4_DIAG_COUNTERS_TYPES; i++) {
 		/* i == 1 means we are building port counters */
-		if (i && !per_port)
-			continue;
+		if (i && !per_port) {
+			ib_set_device_ops(&ibdev->ib_dev, &mlx4_ib_hw_stats_ops1);
+			return 0;
+		}

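The idea behind the change above is to register a statistics ops table that simply omits the per-port allocation callback when the device lacks per-port counters. A minimal userspace sketch of that pattern follows. It assumes that the registering code skips port statistics whenever the callback pointer is NULL; that is an assumption of this model, not something shown in the thread, and stats_ops_model, alloc_ok(), alloc_unsupported() and register_device_model() are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>

struct stats_ops_model {
	int (*alloc_device_stats)(void); /* returns 0 on success */
	int (*alloc_port_stats)(void);   /* optional; may be NULL */
};

int alloc_ok(void)
{
	return 0;
}

/* Models a per-port callback on hardware that cannot provide the counters. */
int alloc_unsupported(void)
{
	return -ENOMEM;
}

/* Hypothetical registration path: absent callbacks are skipped, not failed. */
int register_device_model(const struct stats_ops_model *ops)
{
	int ret;

	if (ops->alloc_device_stats) {
		ret = ops->alloc_device_stats();
		if (ret)
			return ret;
	}
	/* Port counters are set up only when the driver offers the callback. */
	if (ops->alloc_port_stats) {
		ret = ops->alloc_port_stats();
		if (ret)
			return ret;
	}
	return 0;
}
```

With a table that includes the failing per-port callback, registration in this model fails with -ENOMEM; with a device-only table it succeeds, which is the behaviour the patch aims for on older ConnectX hardware.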
* Re: Missing infiniband network interfaces after update to 5.14/5.15
From: Leon Romanovsky @ 2021-11-17 12:15 UTC
To: Jinpu Wang
Cc: Jason Gunthorpe, RDMA mailing list, Haris Iqbal

On Mon, Nov 15, 2021 at 10:20:50AM +0100, Jinpu Wang wrote:
> On Mon, Nov 15, 2021 at 9:18 AM Jinpu Wang <jinpu.wang@ionos.com> wrote:
> >
> > On Sun, Nov 14, 2021 at 8:05 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Fri, Nov 12, 2021 at 10:23:56AM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> > > > > On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > >
> > > > > > On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > > > > > > Hi Jason, hi Leon,
> > > > > > >
> > > > > > > We are seeing exactly the same error reported here:
> > > > > > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> > > > > > >
> > > > > > > I suspect it's related to
> > > > > > > https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> > > > > > >
> > > > > > > Do you have any idea what goes wrong?
> > > > > >
> > > > > > I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> > > > > > https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
> > > > > >
> > > > > > and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> > > > > > report.
> > > > > >
> > > > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > > > Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > > > >
> > > > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > > > Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > > > > [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> > > > > > 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > > > > 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > Hi,
> > > > >
> > > > > I tried different hosts with CX-3/CX-5 and they all work fine. I can
> > > > > only reproduce it on hosts with a somewhat older HCA:
> > > > > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> > > > >
> > > > > The bug report at
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094 mentions a
> > > > > ConnectX HCA too:
> > > > >
> > > > > 01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)
> > > > >
> > > > > With the instrumentation, I have only narrowed it down to:
> > > > > 	port = setup_port(coredev, port_num, &attr);
> > > > > 	if (IS_ERR(port)) {
> > > > > 		ret = PTR_ERR(port);
> > > > > 		pr_info("setup ports failed %d\n", ret);
> > > > > 		goto err_put;
> > > > > 	}
> > > >
> > > > Keep going with the tracing, there are lots of allocations in there.
> > > >
> > > > > My guess is the ConnectX HCA may be missing some features, which leads
> > > > > to ENOMEM. I will continue instrumenting if there is no other hint.
> > > >
> > > > Since there is no memory allocation failure splat I'm guessing some
> > > > memory allocation hit an overflow and silently failed - i.e. mlx4 is
> > > > possibly setting some value to something bogus
> > >
> > > Yes, look for the values returned from FW.
> >
> > Hi Leon, hi Jason
> >
> > I've found the problem: the device doesn't support per-port diag
> > counters, and the driver then fails the registration, which is
> > too harsh.
> >
> > I'm not sure how to fix it properly. What are your thoughts?
> >
> > Thanks
> >
> With this change, the device can be detected properly. If you think
> it's the right direction, I can submit a patch.

Thanks, it looks like the right fix.

>
> Thanks!
> +
> +static const struct ib_device_ops mlx4_ib_hw_stats_ops1 = {
> +	.alloc_hw_device_stats = mlx4_ib_alloc_hw_device_stats,
> +	.get_hw_stats = mlx4_ib_get_hw_stats,
> +};
> +
>  static int mlx4_ib_alloc_diag_counters(struct mlx4_ib_dev *ibdev)
>  {
>  	struct mlx4_ib_diag_counters *diag = ibdev->diag_counters;
> @@ -2230,8 +2238,11 @@ static int mlx4_ib_alloc_diag_counters(struct mlx4_ib_dev *ibdev)
>
>  	for (i = 0; i < MLX4_DIAG_COUNTERS_TYPES; i++) {
>  		/* i == 1 means we are building port counters */
> -		if (i && !per_port)
> -			continue;
> +		if (i && !per_port) {
> +			ib_set_device_ops(&ibdev->ib_dev, &mlx4_ib_hw_stats_ops1);
> +			return 0;
> +		}

* Re: Missing infiniband network interfaces after update to 5.14/5.15
From: Jason Gunthorpe @ 2021-11-11 12:58 UTC
To: Jinpu Wang
Cc: RDMA mailing list, Leon Romanovsky, Haris Iqbal

On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> Hi Jason, hi Leon,
>
> We are seeing exactly the same error reported here:
> https://bugzilla.redhat.com/show_bug.cgi?id=2014094
>
> I suspect it's related to
> https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
>
> Do you have any idea what goes wrong?

Instrument ib_setup_port_attrs() until you find why it failed.

Jason

* Re: Missing infiniband network interfaces after update to 5.14/5.15
From: Jinpu Wang @ 2021-11-11 13:48 UTC
To: Jason Gunthorpe
Cc: RDMA mailing list, Leon Romanovsky, Haris Iqbal

On Thu, Nov 11, 2021 at 1:58 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > Hi Jason, hi Leon,
> >
> > We are seeing exactly the same error reported here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> >
> > I suspect it's related to
> > https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> >
> > Do you have any idea what goes wrong?
>
> Instrument ib_setup_port_attrs() until you find why it failed.
>
> Jason

Thanks Jason and Leon, I will add some debug messages and find out the reason.

Thread overview: 10+ messages
2021-11-11  7:48 Missing infiniband network interfaces after update to 5.14/5.15 Jinpu Wang
2021-11-11 11:29 ` Leon Romanovsky
2021-11-12  8:23   ` Jinpu Wang
2021-11-12 14:23     ` Jason Gunthorpe
2021-11-14  7:05       ` Leon Romanovsky
2021-11-15  8:18         ` Jinpu Wang
2021-11-15  9:20           ` Jinpu Wang
2021-11-17 12:15             ` Leon Romanovsky
2021-11-11 12:58 ` Jason Gunthorpe
2021-11-11 13:48   ` Jinpu Wang