Re: system hang on start-up (mlx5?)

From: Chuck Lever III <chuck.lever@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Eli Cohen <elic@nvidia.com>, Leon Romanovsky <leon@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org>
Subject: Re: system hang on start-up (mlx5?)
Date: Tue, 30 May 2023 13:09:13 +0000
Message-ID: <0C0389AD-5DB9-42A8-993C-2C9DEDC958AC@oracle.com>
In-Reply-To: <875y8altrq.ffs@tglx>

> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
>> I can boot the system with mlx5_core deny-listed. I log in, remove
>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>> reproduce the issue while the system is running.
>> 
>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>> 
>> ###
>> 
>> The fault address is the cm->managed_map for one of the CPUs.
> 
> That does not make any sense at all. The irq matrix is initialized via:
> 
> irq_alloc_matrix()
>  m = kzalloc(sizeof(*m), GFP_KERNEL);
>  m->maps = alloc_percpu(*m->maps);
> 
> So how is any per CPU map which got allocated there supposed to be
> invalid (not mapped):
> 
>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
> 
> But if you look at the address: 0xffffffffb9ef3f80
> 
> That one is bogus:
> 
>     managed_map=ffff9a36efcf0f80
>     managed_map=ffff9a36efd30f80
>     managed_map=ffff9a3aefc30f80
>     managed_map=ffff9a3aefc70f80
>     managed_map=ffff9a3aefd30f80
>     managed_map=ffff9a3aefd70f80
>     managed_map=ffffffffb9ef3f80
> 
> Can you spot the fail?
> 
> The first six are in the direct map and the last one is in the module map,
> which makes no sense at all.

Indeed. The reason for that is that the affinity mask has bits
set for CPU IDs that are not present on my system.
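
A small userspace model of how a stray CPU ID can produce such an
address (this is not kernel code). It assumes a per-CPU pointer is
computed as a per-CPU offset, looked up by CPU ID, plus an allocation
cookie, and that the offset slots of CPUs that were never set up still
hold a default value. Every name and constant here (model_per_cpu_ptr,
MODEL_NR_CPUS, the offsets) is hypothetical; the constants were picked
only so that the result resembles the addresses in the log above.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS      32u  /* hypothetical build-time CPU limit */
#define MODEL_PRESENT_CPUS 12u  /* hypothetical count of CPUs actually set up */

/* Hypothetical values: the default left in unused offset slots, and a
 * pretend per-CPU area for the CPUs that were set up. */
#define MODEL_BOOT_OFFSET  UINT64_C(0xffffffffb9e00000)
#define MODEL_AREA_BASE    UINT64_C(0xffff9a36efc00000)
#define MODEL_AREA_STRIDE  UINT64_C(0x0000000000040000)

static uint64_t model_offset[MODEL_NR_CPUS];

/* Give every slot the default, then overwrite the slots of present CPUs. */
static void model_setup_offsets(void)
{
	for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_offset[cpu] = MODEL_BOOT_OFFSET;
	for (unsigned int cpu = 0; cpu < MODEL_PRESENT_CPUS; cpu++)
		model_offset[cpu] = MODEL_AREA_BASE + cpu * MODEL_AREA_STRIDE;
}

/* Model of a per-CPU pointer: per-CPU offset plus allocation cookie. */
static uint64_t model_per_cpu_ptr(uint64_t cookie, unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	return model_offset[cpu] + cookie;
}

/* True if addr lies in the top 2 GiB of the 64-bit address space. */
static int model_in_top_2g(uint64_t addr)
{
	return addr >= UINT64_C(0xffffffff80000000);
}
```

In this model, a CPU ID that was set up yields an address in the
pretend per-CPU area, while any ID at or above MODEL_PRESENT_CPUS
yields an address in the top 2 GiB, which is where the kernel image
and module mappings live on x86_64.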

After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
that mask is set up like this:

 struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
 {
        struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
-       cpumask_var_t req_mask;
+       struct irq_affinity_desc af_desc;
        struct mlx5_irq *irq;
-       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
-               return ERR_PTR(-ENOMEM);
-       cpumask_copy(req_mask, cpu_online_mask);
+       cpumask_copy(&af_desc.mask, cpu_online_mask);
+       af_desc.is_managed = false;

Normally that works as you would expect. But for historical reasons
I have CONFIG_NR_CPUS=32 on my system, and with that setting
cpumask_copy() appears to leave bits set in the mask for CPUs that
do not exist.

If I change mlx5_ctrl_irq_request() to clear @af_desc before the
copy, this crash goes away. But mlx5_core then crashes later in its
init, in cpu_rmap_update(). That function does exactly the same
thing (for_each_cpu() on an affinity mask created by copying), and
it crashes in a very similar fashion.
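
The general class of bug can be modeled in userspace as follows. This
is only a sketch of the hypothesis, not the kernel's cpumask
implementation, and it has not been confirmed as the actual mechanism:
a copy that writes fewer words than the mask's storage holds leaves
stale bits behind, and an iterator that scans the full storage then
reports them. All names (model_mask, model_copy, model_last_set) and
sizes are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MODEL_WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_MASK_WORDS 2u  /* hypothetical storage size of the mask */

struct model_mask {
	unsigned long bits[MODEL_MASK_WORDS];
};

/* Stand-in for leftover stack contents. Reading truly uninitialized
 * memory is undefined behaviour, so fill with a known pattern. */
static void model_poison(struct model_mask *m)
{
	memset(m, 0xff, sizeof(*m));
}

/* Copy only the first nwords words; later words are left untouched. */
static void model_copy(struct model_mask *dst, const struct model_mask *src,
		       unsigned int nwords)
{
	assert(nwords <= MODEL_MASK_WORDS);
	memcpy(dst->bits, src->bits, nwords * sizeof(unsigned long));
}

/* Return the highest set bit index within the first nbits bits, or -1. */
static int model_last_set(const struct model_mask *m, unsigned int nbits)
{
	int last = -1;

	assert(nbits <= MODEL_MASK_WORDS * MODEL_WORD_BITS);
	for (unsigned int i = 0; i < nbits; i++)
		if (m->bits[i / MODEL_WORD_BITS] & (1UL << (i % MODEL_WORD_BITS)))
			last = (int)i;
	return last;
}
```

With a source mask that has only bits 0-11 set, copying one word into
a poisoned destination and scanning both words reports a set bit in
the second word. Zeroing the destination before the copy, which is
what clearing @af_desc amounts to, leaves bit 11 as the highest one.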

If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
vanishes entirely, and "modprobe mlx5_core" works as expected.

Thus I think the problem is with cpumask_copy() or for_each_cpu()
when NR_CPUS is a small value (the default is 8192).

> Can you please apply the debug patch below and provide the output?
> 
> Thanks,
> 
>        tglx
> ---
> --- a/kernel/irq/matrix.c
> +++ b/kernel/irq/matrix.c
> @@ -51,6 +51,7 @@ struct irq_matrix {
> 					   unsigned int alloc_end)
>  {
>  	struct irq_matrix *m;
> +	unsigned int cpu;
> 
>  	if (matrix_bits > IRQ_MATRIX_BITS)
>  		return NULL;
> @@ -68,6 +69,8 @@ struct irq_matrix {
>  		kfree(m);
>  		return NULL;
>  	}
> +	for_each_possible_cpu(cpu)
> +		pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
>  	return m;
>  }
> 
> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>  		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>  		unsigned int bit;
> 
> +		pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
> +
>  		bit = matrix_alloc_area(m, cm, 1, true);
>  		if (bit >= m->alloc_end)
>  			goto cleanup;

--
Chuck Lever