Re: system hang on start-up (mlx5?)

From: Chuck Lever III <chuck.lever@oracle.com>
To: Shay Drory <shayd@nvidia.com>, Eli Cohen <elic@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: system hang on start-up (mlx5?)
Date: Wed, 31 May 2023 14:15:28 +0000
Message-ID: <B9761A06-C76C-4088-A748-77867C9FF3CD@oracle.com>
In-Reply-To: <9d793d9f-0fca-2b0d-2a2e-abd527ffa8d4@nvidia.com>

> On May 30, 2023, at 11:08 AM, Shay Drory <shayd@nvidia.com> wrote:
> 
> 
> On 30/05/2023 16:54, Eli Cohen wrote:
>>> -----Original Message-----
>>> From: Chuck Lever III <chuck.lever@oracle.com>
>>> Sent: Tuesday, 30 May 2023 16:51
>>> To: Eli Cohen <elic@nvidia.com>
>>> Cc: Shay Drory <shayd@nvidia.com>; Leon Romanovsky <leon@kernel.org>;
>>> Saeed Mahameed <saeedm@nvidia.com>; linux-rdma <linux-rdma@vger.kernel.org>;
>>> open list:NETWORKING [GENERAL] <netdev@vger.kernel.org>;
>>> Thomas Gleixner <tglx@linutronix.de>
>>> Subject: Re: system hang on start-up (mlx5?)
>>> 
>>> 
>>> 
>>>> On May 30, 2023, at 9:48 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>> 
>>>>> From: Chuck Lever III <chuck.lever@oracle.com>
>>>>> Sent: Tuesday, 30 May 2023 16:28
>>>>> To: Eli Cohen <elic@nvidia.com>
>>>>> Cc: Leon Romanovsky <leon@kernel.org>; Saeed Mahameed
>>>>> <saeedm@nvidia.com>; linux-rdma <linux-rdma@vger.kernel.org>; open
>>>>> list:NETWORKING [GENERAL] <netdev@vger.kernel.org>; Thomas Gleixner
>>>>> <tglx@linutronix.de>
>>>>> Subject: Re: system hang on start-up (mlx5?)
>>>>> 
>>>>> 
>>>>> 
>>>>>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>>>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>>>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>>>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>>>>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>>>>>>> reproduce the issue while the system is running.
>>>>>>>> 
>>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
>>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>>> ###
>>>>>>>> 
>>>>>>>> The fault address is the cm->managed_map for one of the CPUs.
>>>>>>> That does not make any sense at all. The irq matrix is initialized via:
>>>>>>> 
>>>>>>> irq_alloc_matrix()
>>>>>>> m = kzalloc(sizeof(*m), GFP_KERNEL);
>>>>>>> m->maps = alloc_percpu(*m->maps);
>>>>>>> 
>>>>>>> So how is any per CPU map which got allocated there supposed to be
>>>>>>> invalid (not mapped):
>>>>>>> 
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>>>>>>> But if you look at the address: 0xffffffffb9ef3f80
>>>>>>> 
>>>>>>> That one is bogus:
>>>>>>> 
>>>>>>>   managed_map=ffff9a36efcf0f80
>>>>>>>   managed_map=ffff9a36efd30f80
>>>>>>>   managed_map=ffff9a3aefc30f80
>>>>>>>   managed_map=ffff9a3aefc70f80
>>>>>>>   managed_map=ffff9a3aefd30f80
>>>>>>>   managed_map=ffff9a3aefd70f80
>>>>>>>   managed_map=ffffffffb9ef3f80
>>>>>>> 
>>>>>>> Can you spot the fail?
>>>>>>> 
>>>>>>> The first six are in the direct map and the last one is in module map,
>>>>>>> which makes no sense at all.
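
The "spot the fail" check above can be sketched in userspace C. This is only
an illustration: the toy_* names are hypothetical, and the single assumption
about the x86-64 layout is that the top 2 GiB of the address space (from
0xffffffff80000000 upward) is where kernel text and module mappings live,
while the direct map sits well below that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: on x86-64 the top 2 GiB of the virtual address space
 * holds kernel text and module mappings, so a per-CPU map pointer
 * landing there is suspicious.
 */
static bool toy_in_top_2gib(uint64_t addr)
{
	return addr >= UINT64_C(0xffffffff80000000);
}

/* Return the index of the first address in the top 2 GiB, or -1 if none. */
static int toy_first_outlier(const uint64_t *addrs, int n)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (toy_in_top_2gib(addrs[i]))
			return i;
	return -1;
}
```

Fed the seven managed_map values listed above, in order, the helper points at
the last entry, ffffffffb9ef3f80.
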
>>>>>> Indeed. The reason for that is that the affinity mask has bits
>>>>>> set for CPU IDs that are not present on my system.
>>>>>> 
>>>>>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>>>>>> that mask is set up like this:
>>>>>> 
>>>>>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>>>>>> {
>>>>>>       struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>>>>>> -       cpumask_var_t req_mask;
>>>>>> +       struct irq_affinity_desc af_desc;
>>>>>>       struct mlx5_irq *irq;
>>>>>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>>>>>> -               return ERR_PTR(-ENOMEM);
>>>>>> -       cpumask_copy(req_mask, cpu_online_mask);
>>>>>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
>>>>>> +       af_desc.is_managed = false;
>>>>> By the way, why is "is_managed" set to false?
>>>>> 
>>>>> This particular system is a NUMA system, and I'd like to be
>>>>> able to set IRQ affinity for the card. Since is_managed is
>>>>> set to false, writing to the /proc/irq files fails with EIO.
>>>>> 
>>>> This is a control irq and is used for issuing configuration commands.
>>>> 
>>>> This commit:
>>>> commit c410abbbacb9b378365ba17a30df08b4b9eec64f
>>>> Author: Dou Liyang <douliyangs@gmail.com>
>>>> Date:   Tue Dec 4 23:51:21 2018 +0800
>>>> 
>>>>    genirq/affinity: Add is_managed to struct irq_affinity_desc
>>>> 
>>>> explains why it should not be managed.
>>> Understood, but what about the other IRQs? I can't set any
>>> of them. All writes to the proc files result in EIO.
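
For anyone wanting to reproduce the EIO from a program rather than the shell,
here is a minimal sketch in C. It assumes the standard procfs file
/proc/irq/<N>/smp_affinity_list; the toy_* helper names are hypothetical and
the IRQ number and CPU list are whatever the caller supplies.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Build the procfs path for an IRQ's affinity list; 0 on success, -1 if buf is too small. */
static int toy_irq_affinity_path(char *buf, size_t len, unsigned int irq)
{
	int n = snprintf(buf, len, "/proc/irq/%u/smp_affinity_list", irq);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a CPU list such as "0-3" to the IRQ; returns 0 or a negative errno. */
static int toy_irq_set_affinity_list(unsigned int irq, const char *cpulist)
{
	char path[64];
	FILE *f;
	int err = 0;

	assert(cpulist != NULL);
	if (toy_irq_affinity_path(path, sizeof(path), irq))
		return -ENAMETOOLONG;
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fputs(cpulist, f) == EOF)
		err = -errno;
	/* stdio buffers the write, so a kernel-side rejection may only show up here. */
	if (fclose(f) == EOF && !err)
		err = -errno;
	return err;
}
```

A return value of -EIO from toy_irq_set_affinity_list() would correspond to
the failure described above.
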
>>> 
>> I think @Shay Drory has a fix for that which should go upstream.
>> Shay, was it sent?
> 
> The fix was sent and merged.
> 
> https://lore.kernel.org/all/20230523054242.21596-15-saeed@kernel.org/

Fwiw, I'm now on v6.4-rc4, and setting IRQ affinity works as expected.
Sorry for the noise and thanks for the fix.

>>>>>> Which normally works as you would expect. But for some historical
>>>>>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>>>>>> cpumask_copy() misbehaves.
>>>>>> 
>>>>>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>>>>>> copy, this crash goes away. But mlx5_core crashes during a later
>>>>>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>>>>>> exactly the same thing (for_each_cpu() on an affinity mask created
>>>>>> by copying), and crashes in a very similar fashion.
>>>>>> 
>>>>>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>>>>>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>>>>>> 
>>>>>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>>>>>> when NR_CPUS is a small value (the default is 8192).
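
The hazard suspected here can be shown with a toy userspace model. This is
not the kernel's cpumask code, and whether cpumask_copy() really behaves this
way for small NR_CPUS is exactly the open question; the sketch only shows why
clearing an on-stack mask before a partial copy matters. All toy_* names and
the 64-bit word size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WORD_BITS 64u

/*
 * Copy only the low nbits of src into *dst. Bits of *dst at positions
 * >= nbits are left untouched, so stale stack contents survive there.
 */
static void toy_copy_low_bits(uint64_t *dst, uint64_t src, unsigned int nbits)
{
	uint64_t keep;

	assert(nbits >= 1 && nbits <= TOY_WORD_BITS);
	keep = (nbits == TOY_WORD_BITS) ? ~UINT64_C(0)
					: ((UINT64_C(1) << nbits) - 1);
	*dst = (*dst & ~keep) | (src & keep);
}

/*
 * Count set bits at positions >= limit: the "CPU IDs" that a walker over
 * the whole word would visit even though no such CPUs exist.
 */
static unsigned int toy_count_bits_at_or_above(uint64_t mask, unsigned int limit)
{
	unsigned int bit, count = 0;

	for (bit = limit; bit < TOY_WORD_BITS; bit++)
		if (mask & (UINT64_C(1) << bit))
			count++;
	return count;
}
```

Starting from an uninitialized-looking word and copying 32 bits leaves the
upper half as it was; zeroing the word first leaves no bits at or above
position 32.
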
>>>>>> 
>>>>>> 
>>>>>>> Can you please apply the debug patch below and provide the output?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>>      tglx
>>>>>>> ---
>>>>>>> --- a/kernel/irq/matrix.c
>>>>>>> +++ b/kernel/irq/matrix.c
>>>>>>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>>>>>> unsigned int alloc_end)
>>>>>>> {
>>>>>>> struct irq_matrix *m;
>>>>>>> + unsigned int cpu;
>>>>>>> 
>>>>>>> if (matrix_bits > IRQ_MATRIX_BITS)
>>>>>>> return NULL;
>>>>>>> @@ -68,6 +69,8 @@ struct irq_matrix {
>>>>>>> kfree(m);
>>>>>>> return NULL;
>>>>>>> }
>>>>>>> + for_each_possible_cpu(cpu)
>>>>>>> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
>>>>>>> return m;
>>>>>>> }
>>>>>>> 
>>>>>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>>>>>>> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>>>>>>> unsigned int bit;
>>>>>>> 
>>>>>>> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
>>>>>>> +
>>>>>>> bit = matrix_alloc_area(m, cm, 1, true);
>>>>>>> if (bit >= m->alloc_end)
>>>>>>> goto cleanup;
>>>>>> --
>>>>>> Chuck Lever

--
Chuck Lever