* Missing infiniband network interfaces after update to 5.14/5.15
@ 2021-11-11  7:48 Jinpu Wang
  2021-11-11 11:29 ` Leon Romanovsky
  2021-11-11 12:58 ` Jason Gunthorpe
  0 siblings, 2 replies; 10+ messages in thread
From: Jinpu Wang @ 2021-11-11  7:48 UTC
  To: RDMA mailing list, Jason Gunthorpe, Leon Romanovsky, Haris Iqbal

Hi Jason, hi Leon,

We are seeing exactly the same error reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=2014094

I suspect it's related to
https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/

Do you have any idea what is going wrong?

Thanks!
Jinpu Wang

* Re: Missing infiniband network interfaces after update to 5.14/5.15
  2021-11-11  7:48 Missing infiniband network interfaces after update to 5.14/5.15 Jinpu Wang
@ 2021-11-11 11:29 ` Leon Romanovsky
  2021-11-12  8:23   ` Jinpu Wang
  2021-11-11 12:58 ` Jason Gunthorpe
  1 sibling, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-11-11 11:29 UTC
  To: Jinpu Wang; +Cc: RDMA mailing list, Jason Gunthorpe, Haris Iqbal

On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> Hi Jason, hi Leon,
> 
> We are seeing exactly the same error reported here:
> https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> 
> I suspect it's related to
> https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> 
> Do you have any idea, what goes wrong?

I can't reproduce it with the latest Fedora 34 RPM, which I downloaded from here:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842

I can't reproduce it with the kernel-5.14.7-200.fc34.x86_64 version mentioned
in the bug report either.

[leonro@c-235-8-1-005 ~]$ uname -a
Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[leonro@c-235-8-1-005 ~]$ rdma dev
0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63

[leonro@c-235-8-1-005 ~]$ uname -a
Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[leonro@c-235-8-1-005 ~]$ rdma dev
0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
[leonro@c-235-8-1-005 ~]$ lspci |grep nox
08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

Thanks

> 
> Thanks!
> Jinpu Wang

* Re: Missing infiniband network interfaces after update to 5.14/5.15
  2021-11-11  7:48 Missing infiniband network interfaces after update to 5.14/5.15 Jinpu Wang
  2021-11-11 11:29 ` Leon Romanovsky
@ 2021-11-11 12:58 ` Jason Gunthorpe
  2021-11-11 13:48   ` Jinpu Wang
  1 sibling, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2021-11-11 12:58 UTC
  To: Jinpu Wang; +Cc: RDMA mailing list, Leon Romanovsky, Haris Iqbal

On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> Hi Jason, hi Leon,
> 
> We are seeing exactly the same error reported here:
> https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> 
> I suspect it's related to
> https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> 
> Do you have any idea, what goes wrong?

Instrument ib_setup_port_attrs() until you find out why it failed.

Jason

* Re: Missing infiniband network interfaces after update to 5.14/5.15
  2021-11-11 12:58 ` Jason Gunthorpe
@ 2021-11-11 13:48   ` Jinpu Wang
  0 siblings, 0 replies; 10+ messages in thread
From: Jinpu Wang @ 2021-11-11 13:48 UTC
  To: Jason Gunthorpe; +Cc: RDMA mailing list, Leon Romanovsky, Haris Iqbal

On Thu, Nov 11, 2021 at 1:58 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > Hi Jason, hi Leon,
> >
> > We are seeing exactly the same error reported here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> >
> > I suspect it's related to
> > https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> >
> > Do you have any idea, what goes wrong?
>
> instrument ib_setup_port_attrs() until you find why it failed
>
> Jason
Thanks Jason and Leon, I will add some debug messages and find out the reason.

* Re: Missing infiniband network interfaces after update to 5.14/5.15
  2021-11-11 11:29 ` Leon Romanovsky
@ 2021-11-12  8:23   ` Jinpu Wang
  2021-11-12 14:23     ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-11-12  8:23 UTC
  To: Leon Romanovsky; +Cc: RDMA mailing list, Jason Gunthorpe, Haris Iqbal

On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > Hi Jason, hi Leon,
> >
> > We are seeing exactly the same error reported here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> >
> > I suspect it's related to
> > https://lore.kernel.org/all/cover.1623427137.git.leonro@nvidia.com/
> >
> > Do you have any idea, what goes wrong?
>
> I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
>
> and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> report.
>
> [leonro@c-235-8-1-005 ~]$ uname -a
> Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> [leonro@c-235-8-1-005 ~]$ rdma dev
> 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
>
> [leonro@c-235-8-1-005 ~]$ uname -a
> Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> [leonro@c-235-8-1-005 ~]$ rdma dev
> 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
>
> Thanks
>
Hi,

I tried different hosts with CX-3/CX-5 and they all work fine. I can
only reproduce it on hosts with a somewhat older HCA:
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

The bug report, https://bugzilla.redhat.com/show_bug.cgi?id=2014094,
mentions a ConnectX HCA too:

01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)

With the instrumentation, I have only narrowed it down to this call:

                port = setup_port(coredev, port_num, &attr);
                if (IS_ERR(port)) {
                        ret = PTR_ERR(port);
                        /* debug print added while tracing */
                        pr_info("setup ports failed %d\n", ret);
                        goto err_put;
                }

[   43.795268] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[   43.830809] setup ports failed -12
[   43.830814] infiniband mlx4_0: Couldn't register device with driver model

My guess is that the ConnectX HCA is missing some feature, which leads
to ENOMEM. I will continue instrumenting if there are no other hints.

Thanks
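
For readers less familiar with the IS_ERR()/PTR_ERR() idiom in the snippet
above, here is a minimal userspace sketch of how a negative errno travels
inside a pointer and comes back out as the -12 (ENOMEM on Linux) seen in the
log. The helper names err_ptr(), ptr_err(), is_err() and fake_setup_port()
are made up for this illustration; they only imitate the kernel helpers and
are not the kernel's own definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Largest errno value that can be encoded in a pointer (same idea as the kernel). */
#define MAX_ERRNO 4095

/* Encode a negative errno in the top page of the address space. */
static inline void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the negative errno from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer value lies in the reserved error range. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for setup_port(): fails with -ENOMEM on demand. */
static void *fake_setup_port(int fail)
{
	static int port_storage;

	return fail ? err_ptr(-ENOMEM) : (void *)&port_storage;
}
```

A caller checks is_err() first and only then converts with ptr_err(), which
is the pattern the instrumented code follows.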

* Re: Missing infiniband network interfaces after update to 5.14/5.15
  2021-11-12  8:23   ` Jinpu Wang
@ 2021-11-12 14:23     ` Jason Gunthorpe
  2021-11-14  7:05       ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2021-11-12 14:23 UTC
  To: Jinpu Wang; +Cc: Leon Romanovsky, RDMA mailing list, Haris Iqbal

On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> Hi,
> 
> I tried different host with CX-3/CX-5, they all work fine. and I can
> only reproduce on hosts with a bit old HCA:
> 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
> 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> 
> The bug report link
> https://bugzilla.redhat.com/show_bug.cgi?id=2014094, mentioned HCA
> ConnectX too.
> 
> 01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI
> ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe
> 2.0 x8 5.0GT/s In... (rev b0)
> with the instrument, I only narrow it down to
> 1438                 port = setup_port(coredev, port_num, &attr);
> 1439                 if (IS_ERR(port)) {
> 1440                         ret = PTR_ERR(port);
> 1441                         pr_info("setup ports failed %d\n", ret);
> 1442                         goto err_put;
> 1443                 }

Keep going with the tracing, there are lots of allocations in there.

> My guess is the ConnectX HCA may be missing some features, which leads
> to ENOMEM, I will continue the instrument if no other hint.

Since there is no memory allocation failure splat, I'm guessing some
memory allocation hit an overflow and silently failed, i.e. mlx4 is
possibly setting some value to something bogus.

Jason
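
To illustrate the hypothesis above (an allocation that fails quietly because
the requested size overflows), here is a small userspace sketch of an
overflow-checked array allocation, in the spirit of the kernel's kcalloc().
checked_calloc() and alloc_counters() are hypothetical names, and the count
argument merely stands in for a value reported by the device; this is not
the mlx4 code path.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * If n * size would wrap around, return NULL without attempting the
 * allocation at all. No warning is printed, so the caller only sees NULL.
 */
static void *checked_calloc(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* product overflows: fail quietly */

	return calloc(n, size);
}

/* Hypothetical caller: a bogus count turns into a plain -ENOMEM. */
static int alloc_counters(size_t count, uint64_t **out)
{
	uint64_t *buf;

	assert(out != NULL);
	buf = checked_calloc(count, sizeof(*buf));
	if (!buf)
		return -ENOMEM;
	*out = buf;
	return 0;
}
```

From the caller's point of view, the overflow case and a real out-of-memory
case look identical, which is why the absence of a splat is a useful hint.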

* Re: Missing infiniband network interfaces after update to 5.14/5.15
  2021-11-12 14:23     ` Jason Gunthorpe
@ 2021-11-14  7:05       ` Leon Romanovsky
  2021-11-15  8:18         ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2021-11-14  7:05 UTC
  To: Jason Gunthorpe; +Cc: Jinpu Wang, RDMA mailing list, Haris Iqbal

On Fri, Nov 12, 2021 at 10:23:56AM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> > with the instrument, I only narrow it down to
> > 1438                 port = setup_port(coredev, port_num, &attr);
> > 1439                 if (IS_ERR(port)) {
> > 1440                         ret = PTR_ERR(port);
> > 1441                         pr_info("setup ports failed %d\n", ret);
> > 1442                         goto err_put;
> > 1443                 }
> 
> Keep going with the tracing, there are lots of allocations in there.
> 
> > My guess is the ConnectX HCA may be missing some features, which leads
> > to ENOMEM, I will continue the instrument if no other hint.
> 
> Since there is no memory allocation failure splat, I'm guessing some
> memory allocation hit an overflow and silently failed, i.e. mlx4 is
> possibly setting some value to something bogus.

Yes, look for the values returned from FW.

> 
> Jason

* Re: Missing infiniband network interfaces after update to 5.14/5.15
  2021-11-14  7:05       ` Leon Romanovsky
@ 2021-11-15  8:18         ` Jinpu Wang
  2021-11-15  9:20           ` Jinpu Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-11-15  8:18 UTC
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, RDMA mailing list, Haris Iqbal

On Sun, Nov 14, 2021 at 8:05 AM Leon Romanovsky <leon@kernel.org> wrote:
> Yes, look for the values returned from FW.

Hi Leon, hi Jason,

I've found the problem: the device doesn't support per-port diag
counters, and the driver then fails the registration, which is
too harsh.

I'm not sure how to fix it properly. What are your thoughts?

Thanks


[ 3426.452062] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[ 3426.452067] <mlx4_ib> mlx4_ib_alloc_diag_counters: #### i =1, per_port 0
    // MLX4_DEV_CAP_FLAG2_DIAG_PER_PORT is not set on this device, which leads to the allocation failure.
[ 3426.494000] <mlx4_ib> mlx4_ib_alloc_hw_port_stats: mlx4_ib_alloc_hw_port_stats name null
[ 3426.494170] <mlx4_ib> mlx4_ib_alloc_hw_port_stats: mlx4_ib_alloc_hw_port_stats name null
[ 3426.494174] ibdev ops alloc_hw_stats_port failed
[ 3426.494175] alloc_hw_stats_port failed
[ 3426.494177] setup_hw_port_stats failed, -12
[ 3426.494181] setup ports failed -12
[ 3426.494190] infiniband mlx4_0: Couldn't register device with driver model
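
The chain in the log above can be modelled roughly as follows. This is a
simplified userspace sketch, not the kernel code: every fake_* name, the
struct layouts and the counter names are invented for illustration, and the
assumption that the core maps a NULL return from the driver callback to
-ENOMEM is inferred from the "-12" in the trace.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct hw_stats {
	int num_counters;
};

/* Hypothetical driver state: the port tables exist only with per-port support. */
struct fake_dev {
	const char *const *port_names;
	int num_port_counters;
};

/* Models the original loop: without per-port support the port tables stay empty. */
static void fake_alloc_diag_counters(struct fake_dev *dev, bool per_port)
{
	static const char *const port_counter_names[] = { "counter_a", "counter_b" };

	dev->port_names = NULL;
	dev->num_port_counters = 0;
	if (!per_port)
		return;	/* corresponds to the "continue" in the driver */
	dev->port_names = port_counter_names;
	dev->num_port_counters = 2;
}

/* Driver callback: with no names prepared there is nothing to hand back. */
static struct hw_stats *fake_alloc_hw_port_stats(const struct fake_dev *dev)
{
	struct hw_stats *stats;

	if (!dev->port_names)
		return NULL;
	stats = malloc(sizeof(*stats));
	if (stats)
		stats->num_counters = dev->num_port_counters;
	return stats;
}

/* Core side: a NULL from the callback is reported as out-of-memory. */
static int fake_setup_hw_port_stats(const struct fake_dev *dev,
				    struct hw_stats **out)
{
	struct hw_stats *stats;

	assert(out != NULL);
	stats = fake_alloc_hw_port_stats(dev);
	if (!stats)
		return -ENOMEM;
	*out = stats;
	return 0;
}
```

In this model the missing capability is never reported as "not supported";
it surfaces only as -ENOMEM, and that is enough to fail the whole
registration.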



* Re: Missing infiniband network interfaces after update to 5.14/5.15
  2021-11-15  8:18         ` Jinpu Wang
@ 2021-11-15  9:20           ` Jinpu Wang
  2021-11-17 12:15             ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Jinpu Wang @ 2021-11-15  9:20 UTC
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, RDMA mailing list, Haris Iqbal

On Mon, Nov 15, 2021 at 9:18 AM Jinpu Wang <jinpu.wang@ionos.com> wrote:
> Hi Leon, hi Jason
>
> I've found the problem, the device doesn't support per port diag
> counters, and the driver then fails the register which is
> too harsh.
>
> I'm not sure how to fix it properly, your thought?
>
> Thanks
>

With this change, the device is detected properly. If you think
it's the right direction, I can submit a patch.

Thanks!
+
+static const struct ib_device_ops mlx4_ib_hw_stats_ops1 = {
+       .alloc_hw_device_stats = mlx4_ib_alloc_hw_device_stats,
+       .get_hw_stats = mlx4_ib_get_hw_stats,
+};
+
 static int mlx4_ib_alloc_diag_counters(struct mlx4_ib_dev *ibdev)
 {
        struct mlx4_ib_diag_counters *diag = ibdev->diag_counters;
@@ -2230,8 +2238,11 @@ static int mlx4_ib_alloc_diag_counters(struct mlx4_ib_dev *ibdev)

        for (i = 0; i < MLX4_DIAG_COUNTERS_TYPES; i++) {
                /* i == 1 means we are building port counters */
-               if (i && !per_port)
-                       continue;
+               if (i && !per_port) {
+                       ib_set_device_ops(&ibdev->ib_dev, &mlx4_ib_hw_stats_ops1);
+                       return 0;
+               }
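
The idea behind the change above can be sketched in userspace as follows:
keep two ops tables and install the one without a per-port callback when the
device lacks per-port counters. All fake_* names are hypothetical, and the
sketch assumes that the core skips per-port stats when no callback is
registered, which is what the patch relies on; it is not a copy of the
driver or core code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_ops {
	int (*alloc_hw_device_stats)(void);
	int (*alloc_hw_port_stats)(void);	/* optional: may stay NULL */
};

static int fake_device_stats(void) { return 0; }
static int fake_port_stats(void) { return 0; }

/* Copy only the callbacks that are set in src, leaving the others untouched. */
static void fake_set_ops(struct fake_ops *dst, const struct fake_ops *src)
{
	assert(dst != NULL && src != NULL);
	if (src->alloc_hw_device_stats)
		dst->alloc_hw_device_stats = src->alloc_hw_device_stats;
	if (src->alloc_hw_port_stats)
		dst->alloc_hw_port_stats = src->alloc_hw_port_stats;
}

static const struct fake_ops full_ops = {
	.alloc_hw_device_stats = fake_device_stats,
	.alloc_hw_port_stats = fake_port_stats,
};

/* Counterpart of the second ops table in the patch: device-level stats only. */
static const struct fake_ops device_only_ops = {
	.alloc_hw_device_stats = fake_device_stats,
};

static void fake_register_stats_ops(struct fake_ops *ops, bool per_port)
{
	fake_set_ops(ops, per_port ? &full_ops : &device_only_ops);
}

/* Core side (assumed behaviour): no callback means nothing to set up. */
static int fake_setup_port(const struct fake_ops *ops)
{
	if (!ops->alloc_hw_port_stats)
		return 0;
	return ops->alloc_hw_port_stats();
}
```

With the device-only table installed, port setup has no callback to fail on,
so registration can proceed and the device-level counters are still exposed.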

* Re: Missing infiniband network interfaces after update to 5.14/5.15
  2021-11-15  9:20           ` Jinpu Wang
@ 2021-11-17 12:15             ` Leon Romanovsky
  0 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2021-11-17 12:15 UTC
  To: Jinpu Wang; +Cc: Jason Gunthorpe, RDMA mailing list, Haris Iqbal

On Mon, Nov 15, 2021 at 10:20:50AM +0100, Jinpu Wang wrote:
> On Mon, Nov 15, 2021 at 9:18 AM Jinpu Wang <jinpu.wang@ionos.com> wrote:
> > Hi Leon, hi Jason
> >
> > I've found the problem, the device doesn't support per port diag
> > counters, and the driver then fails the register which is
> > too harsh.
> >
> > I'm not sure how to fix it properly, your thought?
> >
> > Thanks
> >
> with this change,  the device can be detected properly. if you think
> it's the right direction, I can submit a patch.

Thanks, it looks like the right fix.

> 
> Thanks!
> +
> +static const struct ib_device_ops mlx4_ib_hw_stats_ops1 = {
> +       .alloc_hw_device_stats = mlx4_ib_alloc_hw_device_stats,
> +       .get_hw_stats = mlx4_ib_get_hw_stats,
> +};
> +
>  static int mlx4_ib_alloc_diag_counters(struct mlx4_ib_dev *ibdev)
>  {
>         struct mlx4_ib_diag_counters *diag = ibdev->diag_counters;
> @@ -2230,8 +2238,11 @@ static int mlx4_ib_alloc_diag_counters(struct mlx4_ib_dev *ibdev)
> 
>         for (i = 0; i < MLX4_DIAG_COUNTERS_TYPES; i++) {
>                 /* i == 1 means we are building port counters */
> -               if (i && !per_port)
> -                       continue;
> +               if (i && !per_port) {
> +                       ib_set_device_ops(&ibdev->ib_dev, &mlx4_ib_hw_stats_ops1);
> +                       return 0;
> +               }
