From: Chuck Lever III <chuck.lever@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Eli Cohen <elic@nvidia.com>, Leon Romanovsky <leon@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: system hang on start-up (mlx5?)
Date: Wed, 31 May 2023 18:52:39 +0000
Message-ID: <7AECA9A6-1F7D-4F82-A4C1-83616D4D6566@oracle.com>
In-Reply-To: <87leh4fmsg.ffs@tglx>



> On May 31, 2023, at 1:11 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, May 31 2023 at 15:06, Chuck Lever III wrote:
>>> On May 31, 2023, at 10:43 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> mlx5_irq_alloc(af_desc)
>>> pci_msix_alloc_irq_at(af_desc)
>>>   msi_domain_alloc_irq_at(af_desc)
>>>     __msi_domain_alloc_irqs(af_desc)
>>> 1)      msidesc->affinity = kmemdup(af_desc);
>>>       __irq_domain_alloc_irqs()
>>>         __irq_domain_alloc_irqs(aff=msidesc->affinity)
>>>           irq_domain_alloc_irqs_locked(aff)
>>>             irq_domain_alloc_irqs_locked(aff)
>>>               irq_domain_alloc_descs(aff)
>>>                 alloc_desc(mask=&aff->mask)
>>>                   desc_smp_init(mask)
>>> 2)                    cpumask_copy(desc->irq_common_data.affinity, mask);
>>>               irq_domain_alloc_irqs_hierarchy()
>>>                 msi_domain_alloc()
>>>                   intel_irq_remapping_alloc()
>>>                     x86_vector_alloc_irqs()
>> 
>> It is x86_vector_alloc_irqs() where the struct irq_data is
>> fabricated that ends up in irq_matrix_reserve_managed().
> 
> Kinda fabricated :)
> 
>     irqd = irq_domain_get_irq_data(domain, virq + i);
> 
> That's finding the irqdata which is associated with the vector domain.
> That has been allocated earlier. The affinity mask is retrieved via:
> 
>    const struct cpumask *affmsk = irq_data_get_affinity_mask(irqd);
> 
> which does:
> 
>      return irqd->common->affinity;
> 
> irqd->common points to desc->irq_common_data. The affinity there was
> copied in #2 above.
> 
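
For anyone following along, here is a simplified sketch of that chain
(paraphrasing <linux/irq.h>; most fields omitted):

        /* Simplified view; see <linux/irq.h> for the real definitions. */
        struct irq_common_data {
                cpumask_var_t           affinity; /* filled by the cpumask_copy() in #2 */
        };

        struct irq_data {
                struct irq_common_data  *common;  /* points at desc->irq_common_data */
        };

        /* irq_data_get_affinity_mask() then reduces to: */
        static inline const struct cpumask *
        irq_data_get_affinity_mask(struct irq_data *d)
        {
                return d->common->affinity;
        }
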
>>> This also ends up in the wrong place. That mlx code does:
>>> 
>>>  af_desc.is_managed = false;
>>> 
>>> but the allocation ends up allocating a managed vector.
>> 
>> That line was changed in 6.4-rc4 to address another bug,
>> and it avoids the crash by not calling into the misbehaving
>> code. It doesn't address the mlx5_core initialization issue
>> though, because as I said before, execution continues and
>> crashes in a similar scenario later on.
> 
> Ok.
> 
>> On my system, I've reverted that fix:
>> 
>> -       af_desc.is_managed = false;
>> +       af_desc.is_managed = 1;
>> 
>> so that we can continue debugging this crash.
> 
> Ah.
> 
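
(For context: struct irq_affinity_desc is just the cpumask plus the
managed bit -- roughly this, from <linux/interrupt.h>:)

        struct irq_affinity_desc {
                struct cpumask  mask;
                unsigned int    is_managed : 1;
        };
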
>>> Can you please instrument this along the call chain so we can see where
>>> or at least when this gets corrupted? Please print the relevant pointer
>>> addresses too so we can see whether that's consistent or not.
>> 
>> I will continue working on this today.
>> 
>>> But that's just the symptom, not the root cause. That code is perfectly
>>> fine when all callers use the proper cpumask functions.
>> 
>> Agreed: we're crashing here because of the extra bits
>> in the affinity mask, but those bits should not be set
>> in the first place.
> 
> Correct.
> 
>> I wasn't sure if for_each_cpu() was supposed to iterate
>> into non-present CPUs -- and I guess the answer
>> is yes, it will iterate the full length of the mask.
>> The caller is responsible for ensuring the mask is valid.
> 
> Yes, that's the assumption behind this constant optimization for the
> small-number-of-CPUs case. All other cases use nr_cpu_ids as the limit
> and won't go into non-possible CPUs. I didn't spot it last night either.
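
For the archives, my (simplified) reading of the v6.4-era
include/linux/cpumask.h, showing why stale bits past nr_cpu_ids still
get walked on small configurations:

        /* Paraphrased from include/linux/cpumask.h; simplified. */
        #if NR_CPUS <= BITS_PER_LONG
        /* Small configs iterate a compile-time constant number of bits,
         * so any stale bits beyond nr_cpu_ids are still visited. */
        #define small_cpumask_bits      ((unsigned int)NR_CPUS)
        #else
        /* Larger configs bound the walk by the runtime CPU count. */
        #define small_cpumask_bits      nr_cpu_ids
        #endif

        #define for_each_cpu(cpu, mask) \
                for_each_set_bit(cpu, cpumask_bits(mask), small_cpumask_bits)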

This addresses the problem for me with both is_managed = 1
and is_managed = false:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index db5687d9fec9..bcf5df316c8f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -570,11 +570,11 @@ int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus, int nirqs,
        af_desc.is_managed = false;
        for (i = 0; i < nirqs; i++) {
+               cpumask_clear(&af_desc.mask);
                cpumask_set_cpu(cpus[i], &af_desc.mask);
                irq = mlx5_irq_request(dev, i + 1, &af_desc, rmap);
                if (IS_ERR(irq))
                        break;
-               cpumask_clear(&af_desc.mask);
                irqs[i] = irq;
        }
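
For the record, my understanding of why this helps: af_desc lives on
the stack, so its mask starts out uninitialized; the old code cleared
it only after a successful request, which left the very first
cpumask_set_cpu() ORing one valid bit into whatever garbage was
already there. A hypothetical distillation (not the mlx5 code itself):

        /* Hypothetical distillation of the bug, not the mlx5 code itself. */
        static void before_fix(u16 cpu)
        {
                struct irq_affinity_desc af_desc;     /* on stack: mask is garbage */

                af_desc.is_managed = false;           /* initializes only one field */
                cpumask_set_cpu(cpu, &af_desc.mask);  /* one valid bit amid garbage */
                /* The mask is later cpumask_copy()ed into desc->irq_common_data,
                 * and for_each_cpu() can walk bits for CPUs that don't exist. */
        }

        static void after_fix(u16 cpu)
        {
                struct irq_affinity_desc af_desc;

                af_desc.is_managed = false;
                cpumask_clear(&af_desc.mask);         /* zero the garbage first */
                cpumask_set_cpu(cpu, &af_desc.mask);  /* exactly one bit set */
        }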

If you agree this looks reasonable, I can package it with a
proper patch description and send it to Eli and Saeed.

--
Chuck Lever




Thread overview: 36+ messages
2023-05-03  1:03 system hang on start-up (mlx5?) Chuck Lever III
2023-05-03  6:34 ` Eli Cohen
2023-05-03 14:02   ` Chuck Lever III
2023-05-04  7:29     ` Leon Romanovsky
2023-05-04 19:02       ` Chuck Lever III
2023-05-04 23:38         ` Jason Gunthorpe
2023-05-07  5:23           ` Eli Cohen
2023-05-07  5:31         ` Eli Cohen
2023-05-27 20:16           ` Chuck Lever III
2023-05-29 21:20             ` Thomas Gleixner
2023-05-30 13:09               ` Chuck Lever III
2023-05-30 13:28                 ` Chuck Lever III
2023-05-30 13:48                   ` Eli Cohen
2023-05-30 13:51                     ` Chuck Lever III
2023-05-30 13:54                       ` Eli Cohen
2023-05-30 15:08                         ` Shay Drory
2023-05-31 14:15                           ` Chuck Lever III
2023-05-30 19:46                 ` Thomas Gleixner
2023-05-30 21:48                   ` Chuck Lever III
2023-05-30 22:17                     ` Thomas Gleixner
2023-05-31 14:43                     ` Thomas Gleixner
2023-05-31 15:06                       ` Chuck Lever III
2023-05-31 17:11                         ` Thomas Gleixner
2023-05-31 18:52                           ` Chuck Lever III [this message]
2023-05-31 19:19                             ` Thomas Gleixner
2023-05-16 19:23         ` Chuck Lever III
2023-05-23 14:20           ` Linux regression tracking (Thorsten Leemhuis)
2023-05-24 14:59             ` Chuck Lever III
2023-05-08 12:29 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-06-02 11:05   ` Linux regression tracking #update (Thorsten Leemhuis)
2023-06-02 13:38     ` Chuck Lever III
2023-06-02 13:55       ` Linux regression tracking (Thorsten Leemhuis)
2023-06-02 14:03         ` Chuck Lever III
2023-06-02 14:29         ` Jason Gunthorpe
2023-06-02 15:58           ` Thorsten Leemhuis
2023-06-02 16:54           ` Jakub Kicinski
