linux-kernel.vger.kernel.org archive mirror
* [BUG] mellanox IB driver fails to load on large config
@ 2015-07-10 19:15 andrew banman
  2015-07-11 20:20 ` Or Gerlitz
  0 siblings, 1 reply; 11+ messages in thread
From: andrew banman @ 2015-07-10 19:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Doug Ledford, Sean Hefty, Hal Rosenstock, Or Gerlitz,
	David S. Miller, Roland Dreier, Matan Barak, Moni Shoua,
	Jack Morgenstein, Yishai Hadas, Eran Ben Elisha, Ira Weiny,
	linux-rdma

I'm seeing a large number of allocation errors originating from the Mellanox IB
driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:

8<---
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 64; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 65; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 66; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 67; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 68; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 69; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 70; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 71; reverting to legacy
......
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 123; reverting to legacy
--->8

The failing function is in drivers/infiniband/hw/mlx4/main.c:

8<---
static void mlx4_ib_alloc_eqs(struct mlx4_dev *dev, struct mlx4_ib_dev *ibdev)
...
                        /* Set IRQ for specific name (per ring) */
                        if (mlx4_assign_eq(dev, name, NULL,
                                           &ibdev->eq_table[eq])) {
                                /* Use legacy (same as mlx4_en driver) */
                                pr_warn("Can't allocate EQ %d; reverting to legacy\n", eq);
                                ibdev->eq_table[eq] =
                                        (eq % dev->caps.num_comp_vectors);
                        }
--->8
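
To illustrate the fallback in the excerpt above: when mlx4_assign_eq fails,
the EQ index is folded back onto the existing completion vectors with a
modulo. A minimal sketch, assuming a hypothetical helper name
(legacy_eq_index) and a vector count that in the driver would come from
dev->caps.num_comp_vectors:

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the driver: mirrors the fallback
 * expression (eq % dev->caps.num_comp_vectors) from the excerpt above.
 */
static int legacy_eq_index(int eq, int num_comp_vectors)
{
	/* A modulo by zero would be undefined, so require at least one vector. */
	assert(num_comp_vectors > 0);
	return eq % num_comp_vectors;
}
```

With an assumed 4 completion vectors, for example, EQ 64 would fall back to
vector 0 and EQ 71 to vector 3, so several EQs end up sharing a vector.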

The problem doesn't appear to be fatal. At this point I am unsure if this is
actually expected behavior, so I'm looking for some insight into the issue.

At first we believed the problem to be with request_irq, but after adding
some debug code we found that mlx4_assign_eq returned -28 (-ENOSPC),
indicating that vec was never assigned:

8<---
@@ -1401,6 +1402,7 @@ int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
        if (vec) {
                *vector = vec;
        } else {
+               pr_crit("!!! debug: mlx4_assign_eq - last err %d\n", err);
                *vector = 0;
                err = (i == dev->caps.comp_pool) ? -ENOSPC : err;
        }
--->8

8<---
 [ 1565.416273] !!! debug: mlx4_assign_eq - last err 0
 [ 1565.416275] <mlx4_ib> mlx4_ib_alloc_eqs: !!! debug: mlx4_assign_eq returned -28
 [ 1565.416277] <mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 64; reverting to legacy
--->8
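
The combination of "last err 0" and a return value of -28 is consistent with
the loop walking the whole pool without finding a free slot, so that the
final ternary converts err to -ENOSPC. A simplified model of that logic,
assuming a hypothetical function name and a plain in-use array (the real
function does more work per slot, which is omitted here):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Simplified model, not the driver code: scan a pool of comp_pool slots and
 * take the first free one. The tail mirrors the if/else from the debug
 * patch above.
 */
static int assign_from_pool(bool *in_use, int comp_pool, int *vector)
{
	int err = 0;
	int vec = 0;
	int i;

	for (i = 0; i < comp_pool; i++) {
		if (!in_use[i]) {
			in_use[i] = true;
			vec = i + 1;	/* nonzero means "assigned" in this model */
			break;
		}
	}

	if (vec) {
		*vector = vec;
	} else {
		*vector = 0;
		/* Loop ran to the end with err still 0: the pool is exhausted. */
		err = (i == comp_pool) ? -ENOSPC : err;
	}
	return err;
}
```

In this model, once every slot is marked in use, each further call returns
-ENOSPC with *vector set to 0, matching the pattern in the log above.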


Any help would be greatly appreciated!

Andrew Banman



* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-10 19:15 [BUG] mellanox IB driver fails to load on large config andrew banman
@ 2015-07-11 20:20 ` Or Gerlitz
  2015-07-14 18:22   ` andrew banman
  0 siblings, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2015-07-11 20:20 UTC (permalink / raw)
  To: andrew banman
  Cc: Linux Kernel, Doug Ledford, Sean Hefty, Hal Rosenstock,
	Or Gerlitz, David S. Miller, Roland Dreier, Matan Barak,
	Moni Shoua, Jack Morgenstein, Yishai Hadas, Eran Ben Elisha,
	Ira Weiny, linux-rdma

On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
> I'm seeing a large number of allocation errors originating from the Mellanox IB
> driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:

Just to make sure, mlx4 works fine on this small (...) system with 4.1
and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
config?


* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-11 20:20 ` Or Gerlitz
@ 2015-07-14 18:22   ` andrew banman
  2015-07-14 18:48     ` Alex Thorlton
  0 siblings, 1 reply; 11+ messages in thread
From: andrew banman @ 2015-07-14 18:22 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: andrew banman, Linux Kernel, Doug Ledford, Sean Hefty,
	Hal Rosenstock, Or Gerlitz, David S. Miller, Roland Dreier,
	Matan Barak, Moni Shoua, Jack Morgenstein, Yishai Hadas,
	Eran Ben Elisha, Ira Weiny, linux-rdma, athorlton

On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
> On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
> > I'm seeing a large number of allocation errors originating from the Mellanox IB
> > driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
> 
> Just to make sure, mlx4 works fine on this small (...) system with 4.1
> and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
> config?

I'll let Alex comment on that; he did some testing there.

-Andrew


* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-14 18:22   ` andrew banman
@ 2015-07-14 18:48     ` Alex Thorlton
  2015-07-14 20:06       ` Or Gerlitz
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Thorlton @ 2015-07-14 18:48 UTC (permalink / raw)
  To: andrew banman
  Cc: Or Gerlitz, Linux Kernel, Doug Ledford, Sean Hefty,
	Hal Rosenstock, Or Gerlitz, David S. Miller, Roland Dreier,
	Matan Barak, Moni Shoua, Jack Morgenstein, Yishai Hadas,
	Eran Ben Elisha, Ira Weiny, linux-rdma, athorlton

On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote:
> On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
> > On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
> > > I'm seeing a large number of allocation errors originating from the Mellanox IB
> > > driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
> > 
> > Just to make sure, mlx4 works fine on this small (...) system with 4.1
> > and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
> > config?
> 
> I'll let Alex comment on that, he did some testing on that.

I started seeing this on a 4.1-rc8 kernel, so it's been around for a
little while.  It may have been around before 4.1-rc8, but I haven't run
any kernels older than that on the big machine for some time.

- Alex


* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-14 18:48     ` Alex Thorlton
@ 2015-07-14 20:06       ` Or Gerlitz
  2015-07-14 20:28         ` Alex Thorlton
  0 siblings, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2015-07-14 20:06 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: andrew banman, Linux Kernel, Doug Ledford, Sean Hefty,
	Hal Rosenstock, Or Gerlitz, David S. Miller, Roland Dreier,
	Matan Barak, Moni Shoua, Jack Morgenstein, Yishai Hadas,
	Eran Ben Elisha, Ira Weiny, linux-rdma

On Tue, Jul 14, 2015 at 9:48 PM, Alex Thorlton <athorlton@sgi.com> wrote:
> On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote:
>> On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
>> > On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
>> > > I'm seeing a large number of allocation errors originating from the Mellanox IB
>> > > driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
>> >
>> > Just to make sure, mlx4 works fine on this small (...) system with 4.1
>> > and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
>> > config?
>>
>> I'll let Alex comment on that, he did some testing on that.
>
> I started seeing this on a 4.1-rc8 kernel, so it's been around for a
> little while.  It may have been around before 4.1-rc8, but I haven't run
> any kernels older than that on the big machine for some time.

To make sure I am following correctly: on 4.1-rc8 you also see
something, right? Are these the same messages or different ones? If
the latter, please send them to us.

Or.


* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-14 20:06       ` Or Gerlitz
@ 2015-07-14 20:28         ` Alex Thorlton
  2015-07-15 11:33           ` Matan Barak
  2015-07-16  6:25           ` Or Gerlitz
  0 siblings, 2 replies; 11+ messages in thread
From: Alex Thorlton @ 2015-07-14 20:28 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alex Thorlton, andrew banman, Linux Kernel, Doug Ledford,
	Sean Hefty, Hal Rosenstock, Or Gerlitz, David S. Miller,
	Roland Dreier, Matan Barak, Moni Shoua, Jack Morgenstein,
	Yishai Hadas, Eran Ben Elisha, Ira Weiny, linux-rdma

On Tue, Jul 14, 2015 at 11:06:26PM +0300, Or Gerlitz wrote:
> On Tue, Jul 14, 2015 at 9:48 PM, Alex Thorlton <athorlton@sgi.com> wrote:
> > On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote:
> >> On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
> >> > On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
> >> > > I'm seeing a large number of allocation errors originating from the Mellanox IB
> >> > > driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
> >> >
> >> > Just to make sure, mlx4 works fine on this small (...) system with 4.1
> >> > and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
> >> > config?
> >>
> >> I'll let Alex comment on that, he did some testing on that.
> >
> > I started seeing this on a 4.1-rc8 kernel, so it's been around for a
> > little while.  It may have been around before 4.1-rc8, but I haven't run
> > any kernels older than that on the big machine for some time.
> 
> To make sure I am correctly following, on 4.1-rc8  you also see
> something, right?

Yes, that's correct.

> are these the same messages or different ones? if the latter send to us.

We see the same exact messages on 4.1-rc8.

Thanks for looking into this!

- Alex


* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-14 20:28         ` Alex Thorlton
@ 2015-07-15 11:33           ` Matan Barak
  2015-07-16  6:25           ` Or Gerlitz
  1 sibling, 0 replies; 11+ messages in thread
From: Matan Barak @ 2015-07-15 11:33 UTC (permalink / raw)
  To: Alex Thorlton, Or Gerlitz
  Cc: andrew banman, Linux Kernel, Doug Ledford, Sean Hefty,
	Hal Rosenstock, Or Gerlitz, David S. Miller, Roland Dreier,
	Moni Shoua, Jack Morgenstein, Yishai Hadas, Eran Ben Elisha,
	Ira Weiny, linux-rdma



On 7/14/2015 11:28 PM, Alex Thorlton wrote:
> On Tue, Jul 14, 2015 at 11:06:26PM +0300, Or Gerlitz wrote:
>> On Tue, Jul 14, 2015 at 9:48 PM, Alex Thorlton <athorlton@sgi.com> wrote:
>>> On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote:
>>>> On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
>>>>> On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
>>>>>> I'm seeing a large number of allocation errors originating from the Mellanox IB
>>>>>> driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
>>>>>
>>>>> Just to make sure, mlx4 works fine on this small (...) system with 4.1
>>>>> and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
>>>>> config?
>>>>
>>>> I'll let Alex comment on that, he did some testing on that.
>>>
>>> I started seeing this on a 4.1-rc8 kernel, so it's been around for a
>>> little while.  It may have been around before 4.1-rc8, but I haven't run
>>> any kernels older than that on the big machine for some time.
>>
>> To make sure I am correctly following, on 4.1-rc8  you also see
>> something, right?
>
> Yes, that's correct.
>
>> are these the same messages or different ones? if the latter send to us.
>
> We see the same exact messages on 4.1-rc8.

Hi,

We don't recall getting those errors with 32-CPU machines, but we'll try
to reproduce this issue.

Matan

>
> Thanks for looking into this!
>
> - Alex
>


* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-14 20:28         ` Alex Thorlton
  2015-07-15 11:33           ` Matan Barak
@ 2015-07-16  6:25           ` Or Gerlitz
  2015-07-20 16:28             ` Alex Thorlton
  1 sibling, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2015-07-16  6:25 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Or Gerlitz, andrew banman, Linux Kernel, Doug Ledford,
	Sean Hefty, Hal Rosenstock, David S. Miller, Roland Dreier,
	Matan Barak, Moni Shoua, Jack Morgenstein, Yishai Hadas,
	Eran Ben Elisha, Ira Weiny, linux-rdma

On 7/14/2015 11:28 PM, Alex Thorlton wrote:
>
> We see the same exact messages on 4.1-rc8.
>
>

Does this solve the problem?


diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index ad31e47..c8ae3b9 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -45,7 +45,7 @@
  #include <linux/timecounter.h>

  #define MAX_MSIX_P_PORT                17
-#define MAX_MSIX               64
+#define MAX_MSIX               1024
  #define MIN_MSIX_P_PORT                5
  #define MLX4_IS_LEGACY_EQ_MODE(dev_cap) ((dev_cap).num_comp_vectors < \
(dev_cap).num_ports * MIN_MSIX_P_PORT)
--



* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-16  6:25           ` Or Gerlitz
@ 2015-07-20 16:28             ` Alex Thorlton
  2015-07-21  2:56               ` Alex Thorlton
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Thorlton @ 2015-07-20 16:28 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alex Thorlton, Or Gerlitz, andrew banman, Linux Kernel,
	Doug Ledford, Sean Hefty, Hal Rosenstock, David S. Miller,
	Roland Dreier, Matan Barak, Moni Shoua, Jack Morgenstein,
	Yishai Hadas, Eran Ben Elisha, Ira Weiny, linux-rdma

On Thu, Jul 16, 2015 at 09:25:37AM +0300, Or Gerlitz wrote:
> On 7/14/2015 11:28 PM, Alex Thorlton wrote:
>>
>> We see the same exact messages on 4.1-rc8.
>>
>>
>
> does this solves the problem?
>
>
> diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
> index ad31e47..c8ae3b9 100644
> --- a/include/linux/mlx4/device.h
> +++ b/include/linux/mlx4/device.h
> @@ -45,7 +45,7 @@
>  #include <linux/timecounter.h>
>
>  #define MAX_MSIX_P_PORT                17
> -#define MAX_MSIX               64
> +#define MAX_MSIX               1024
>  #define MIN_MSIX_P_PORT                5
>  #define MLX4_IS_LEGACY_EQ_MODE(dev_cap) ((dev_cap).num_comp_vectors < \
> (dev_cap).num_ports * MIN_MSIX_P_PORT)
> --
>

I've got some time on the large machine later today.  I'll give this a
try then.

Thanks, Or!

- Alex


* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-20 16:28             ` Alex Thorlton
@ 2015-07-21  2:56               ` Alex Thorlton
  2015-07-21 14:21                 ` Matan Barak
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Thorlton @ 2015-07-21  2:56 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Or Gerlitz, Or Gerlitz, andrew banman, Linux Kernel,
	Doug Ledford, Sean Hefty, Hal Rosenstock, David S. Miller,
	Roland Dreier, Matan Barak, Moni Shoua, Jack Morgenstein,
	Yishai Hadas, Eran Ben Elisha, Ira Weiny, linux-rdma

On Mon, Jul 20, 2015 at 11:28:03AM -0500, Alex Thorlton wrote:
> I've got some time on the large machine later today.  I'll give this a
> try then.

I ran a boot with this patch applied:

diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 83e80ab..c84aea0 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -45,7 +45,7 @@
 #include <linux/timecounter.h>

 #define MAX_MSIX_P_PORT                17
-#define MAX_MSIX               64
+#define MAX_MSIX               8192
 #define MSIX_LEGACY_SZ         4
 #define MIN_MSIX_P_PORT                5

I went for a max of 8192, since I was actually booting the machine with
6144 cores (not 4096) for this run.  It doesn't look like this fixed the
problem.  I still saw the same errors during boot.

FWIW, the module does appear to still successfully load:

8<---
# lsmod | grep mlx
mlx4_ib               151552  0
ib_sa                  32768  1 mlx4_ib
ib_mad                 49152  2 ib_sa,mlx4_ib
ib_core               102400  3 ib_sa,mlx4_ib,ib_mad
mlx4_core             278528  1 mlx4_ib
--->8

If the module loading is good enough, and we should just ignore the
errors, then I'm fine with that.  Just wanting to make sure that
everything is behaving correctly.

- Alex


* Re: [BUG] mellanox IB driver fails to load on large config
  2015-07-21  2:56               ` Alex Thorlton
@ 2015-07-21 14:21                 ` Matan Barak
  0 siblings, 0 replies; 11+ messages in thread
From: Matan Barak @ 2015-07-21 14:21 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Or Gerlitz, Or Gerlitz, andrew banman, Linux Kernel,
	Doug Ledford, Sean Hefty, Hal Rosenstock, David S. Miller,
	Roland Dreier, Matan Barak, Moni Shoua, Jack Morgenstein,
	Yishai Hadas, Eran Ben Elisha, Ira Weiny, linux-rdma

On Tue, Jul 21, 2015 at 5:56 AM, Alex Thorlton <athorlton@sgi.com> wrote:
> On Mon, Jul 20, 2015 at 11:28:03AM -0500, Alex Thorlton wrote:
>> I've got some time on the large machine later today.  I'll give this a
>> try then.
>
> I ran a boot with this patch applied:
>
> diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
> index 83e80ab..c84aea0 100644
> --- a/include/linux/mlx4/device.h
> +++ b/include/linux/mlx4/device.h
> @@ -45,7 +45,7 @@
>  #include <linux/timecounter.h>
>
>  #define MAX_MSIX_P_PORT                17
> -#define MAX_MSIX               64
> +#define MAX_MSIX               8192
>  #define MSIX_LEGACY_SZ         4
>  #define MIN_MSIX_P_PORT                5
>
> I went for a max of 8192, since I was actually booting the machine with
> 6144 cores (not 4096) for this run.  It doesn't look like this fixed the
> problem.  I still saw the same errors during boot.
>
> FWIW, the module does appear to still successfully load:
>
> 8<---
> # lsmod | grep mlx
> mlx4_ib               151552  0
> ib_sa                  32768  1 mlx4_ib
> ib_mad                 49152  2 ib_sa,mlx4_ib
> ib_core               102400  3 ib_sa,mlx4_ib,ib_mad
> mlx4_core             278528  1 mlx4_ib
> --->8
>
> If the module loading is good enough, and we should just ignore the
> errors, then I'm fine with that.  Just wanting to make sure that
> everything is behaving correctly.

It shouldn't be a problem, as all unused/erroneous EQs get "-1".
We'll try to reproduce the problem here; it might take a while, though.

Thanks for checking this,
Matan

>
> - Alex

