From: Chuck Lever III <chuck.lever@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Eli Cohen <elic@nvidia.com>, Leon Romanovsky <leon@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org>
Subject: Re: system hang on start-up (mlx5?)
Date: Tue, 30 May 2023 21:48:32 +0000	[thread overview]
Message-ID: <C34181E7-A515-4BD1-8C38-CB8BCF2D987D@oracle.com> (raw)
In-Reply-To: <87o7m1iov9.ffs@tglx>



> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Chuck!
> 
> On Tue, May 30 2023 at 13:09, Chuck Lever III wrote:
>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> But if you look at the address: 0xffffffffb9ef3f80
>>> 
>>> That one is bogus:
>>> 
>>>    managed_map=ffff9a36efcf0f80
>>>    managed_map=ffff9a36efd30f80
>>>    managed_map=ffff9a3aefc30f80
>>>    managed_map=ffff9a3aefc70f80
>>>    managed_map=ffff9a3aefd30f80
>>>    managed_map=ffff9a3aefd70f80
>>>    managed_map=ffffffffb9ef3f80
>>> 
>>> Can you spot the fail?
>>> 
>>> The first six are in the direct map and the last one is in module map,
>>> which makes no sense at all.
>> 
>> Indeed. The reason for that is that the affinity mask has bits
>> set for CPU IDs that are not present on my system.
> 
> Which I don't buy, but even if so then this should not make
> for_each_cpu() go south. See below.
> 
>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>> that mask is set up like this:
>> 
>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>> {
>>        struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>> -       cpumask_var_t req_mask;
>> +       struct irq_affinity_desc af_desc;
> 
> That's daft. With NR_CPUS=8192 this is a whopping 1KB on stack...
> 
>>        struct mlx5_irq *irq;
>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>> -               return ERR_PTR(-ENOMEM);
>> -       cpumask_copy(req_mask, cpu_online_mask);
>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
>> +       af_desc.is_managed = false;
>> 
>> Which normally works as you would expect. But for some historical
>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>> cpumask_copy() misbehaves.
>> 
>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>> copy, this crash goes away. But mlx5_core crashes during a later
>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>> exactly the same thing (for_each_cpu() on an affinity mask created
>> by copying), and crashes in a very similar fashion.
>> 
>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>> 
>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>> when NR_CPUS is a small value (the default is 8192).
> 
> I don't buy any of this. Here is why:
> 
> cpumask_copy(d, s)
>   bitmap_copy(d, s, nbits = 32)
>     len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
> 
> So it copies as many longs as required to cover nbits, i.e. it copies
> any clobbered bits beyond nbits too. While that looks odd at the first
> glance, that's just an optimization which is harmless.
> 
> for_each_cpu() finds the next set bit in a mask and breaks the loop once
> bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too.
> 
> I just booted a kernel with NR_CPUS=32:

My system has only 12 CPUs. So on your test system every bit in the
mask corresponds to a present CPU, but on mine only the bits in
0x00000fff ever do.

Therefore, on my system, any bit above bit 11 in a CPU mask
references a CPU that is not present.


> [    0.152988] smpboot: 56 Processors exceeds NR_CPUS limit of 32
> [    0.153606] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
> ...
> [    0.173854] setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:1
> 
> and added a function which does:
> 
>    struct cpumask ma, mb;
>    int cpu;
> 
>    memset(&ma, 0xaa, sizeof(ma));
>    cpumask_copy(&mb, &ma);
>    pr_info("MASKBITS: %016lx\n", mb.bits[0]);
>    pr_info("CPUs:");
>    for_each_cpu(cpu, &mb)
>         pr_cont(" %d", cpu);
>    pr_cont("\n");
> 
> [    2.165606] smp: MASKBITS: 0xaaaaaaaaaaaaaaaa
> [    2.166574] smp: CPUs: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
> 
> and the same exercise with a 0x55 filled source bitmap.
> 
> [    2.167595] smp: MASKBITS: 0x5555555555555555
> [    2.168568] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
> 
> So while cpumask_copy copied beyond NR_CPUS bits, for_each_cpu() does
> the right thing simply because of this:
> 
> nr_cpu_ids is 32, right?
> 
> for_each_cpu(cpu, mask)
>   for_each_set_bit(bit = cpu, addr = &mask, size = nr_cpu_ids)
> for ((bit) = 0; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)
> 
> So if find_next_bit() returns a bit after bit 31 the condition (bit) <
> (size) will terminate the loop, right?

Again, you are assuming more CPUs than there are bits in the mask.


> Also in the case of that driver the copy is _NOT_ copying set bits
> beyond bit 31 simply because the source mask is cpu_online_mask which
> definitely does not have a bit set which is greater than 31. As the copy
> copies longs the resulting mask in af_desc.mask cannot have any bit set
> past bit 31 either.
> 
> If the above is not correct, then there is a bigger problem than that
> MLX5 driver crashing.
> 
> So this:
> 
>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>> copy, this crash goes away.
> 
> does not make any sense to me.
> 
> Can you please add after the cpumask_copy() in that mlx5 code:
> 
>    pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
>    pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);

Both are 0000 0000 0000 0fff, as expected on a system
where 12 CPUs are present.


> Please print also in irq_matrix_reserve_managed():
> 
>  - @mask->bits[0]
>  - nr_cpu_ids
>  - the CPU numbers inside the for_each_cpu() loop

Here's where it gets interesting:

+       pr_info("%s: MASKBITS:   %016lx\n", __func__, msk->bits[0]);
+       pr_info("%s: nr_cpu_ids=%u\n", __func__, nr_cpu_ids);

[   70.957400][ T1185] mlx5_core 0000:81:00.0: firmware version: 16.35.2000
[   70.964146][ T1185] mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[   71.260555][    C9] port_module: 1 callbacks suppressed
[   71.260561][    C9] mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
[   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS:   ffffb1a74686bcd8
[   71.274096][   T10] mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
[   71.280844][ T1185] irq_matrix_reserve_managed: nr_cpu_ids=12
[   71.280846][ T1185] irq_matrix_reserve_managed: cm=ffff909aefcf0f48 cm->managed_map=ffff909aefcf0f80 cpu=3
[   71.280849][ T1185] irq_matrix_reserve_managed: cm=ffff909aefd30f48 cm->managed_map=ffff909aefd30f80 cpu=4
[   71.280851][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc30f48 cm->managed_map=ffff909eefc30f80 cpu=6
[   71.280852][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc70f48 cm->managed_map=ffff909eefc70f80 cpu=7
[   71.280854][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd30f48 cm->managed_map=ffff909eefd30f80 cpu=10
[   71.280856][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd70f48 cm->managed_map=ffff909eefd70f80 cpu=11
[   71.280858][ T1185] irq_matrix_reserve_managed: cm=ffffffff98ef3f48 cm->managed_map=ffffffff98ef3f80 cpu=12

Notice that bits higher than bit 11 are in fact set.

The lowest 16 bits of MASKBITS are bcd8, or in binary:

... 1011 1100 1101 1000

Starting from the low-order side, the set bits are 3, 4, 6, 7, 10, 11,
and 12, matching the CPU IDs reported by the loop. At bit 12 we fault,
since there is no CPU 12 on this system.


--
Chuck Lever


