From: Chuck Lever III <chuck.lever@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Eli Cohen <elic@nvidia.com>, Leon Romanovsky <leon@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: system hang on start-up (mlx5?)
Date: Wed, 31 May 2023 15:06:15 +0000	[thread overview]
Message-ID: <48B0BC74-5F5C-4212-BC5A-552356E9FFB1@oracle.com> (raw)
In-Reply-To: <87ttvsftoc.ffs@tglx>



> On May 31, 2023, at 10:43 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, May 30 2023 at 21:48, Chuck Lever III wrote:
>>> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Can you please add after the cpumask_copy() in that mlx5 code:
>>> 
>>>   pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
>>>   pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);
>> 
>> Both are 0000 0000 0000 0fff, as expected on a system
>> where 12 CPUs are present.
> 
> So the non-initialized mask on stack has the online bits correctly
> copied and bits 12-63 are cleared, which is what cpumask_copy()
> achieves by copying longs and not bits.
> 
>> [   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS: ffffb1a74686bcd8
> 
> How can that end up with a completely different content here?
> 
> As I said before that's clearly a direct map address.
> 
> So the call chain is:
> 
> mlx5_irq_alloc(af_desc)
>  pci_msix_alloc_irq_at(af_desc)
>    msi_domain_alloc_irq_at(af_desc)
>      __msi_domain_alloc_irqs(af_desc)
> 1)      msidesc->affinity = kmemdup(af_desc);
>        __irq_domain_alloc_irqs()
>          __irq_domain_alloc_irqs(aff=msidesc->affinity)
>            irq_domain_alloc_irqs_locked(aff)
>              irq_domain_alloc_irqs_locked(aff)
>                irq_domain_alloc_descs(aff)
>                  alloc_desc(mask=&aff->mask)
>                    desc_smp_init(mask)
> 2)                    cpumask_copy(desc->irq_common_data.affinity, mask);
>                irq_domain_alloc_irqs_hierarchy()
>                  msi_domain_alloc()
>                    intel_irq_remapping_alloc()
>                      x86_vector_alloc_irqs()

x86_vector_alloc_irqs() is where the struct irq_data that
ends up in irq_matrix_reserve_managed() is fabricated.


>                        reserve_managed_vector()
>                          mask = desc->irq_common_data.affinity;
>                          irq_matrix_reserve_managed(mask)
> 
> So af_desc is kmemdup'ed at #1 and then the result is copied in #2.
> 
> Anything else just hands pointers around. So where gets either af_desc
> or msidesc->affinity or desc->irq_common_data.affinity overwritten? Or
> one of the pointers mangled. I doubt that it's the latter as this is 99%
> generic code which would end up in random fails all over the place.
> 
> This also ends up in the wrong place. That mlx code does:
> 
>   af_desc.is_managed = false;
> 
> but the allocation ends up allocating a managed vector.

That line was changed in 6.4-rc4 to address another bug,
and it avoids the crash by not calling into the misbehaving
code. It doesn't address the mlx5_core initialization issue,
though: as I said before, execution continues and crashes
in a similar scenario later on.

On my system, I've reverted that fix:

-       af_desc.is_managed = false;
+       af_desc.is_managed = 1;

so that we can continue debugging this crash.


> Can you please instrument this along the call chain so we can see where
> or at least when this gets corrupted? Please print the relevant pointer
> addresses too so we can see whether that's consistent or not.

I will continue working on this today.


>> The lowest 16 bits of MASKBITS are bcd8, or in binary:
>> 
>> ... 1011 1100 1101 1000
>> 
>> Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
>> 12, matching the CPU IDs from the loop. At bit 12, we fault,
>> since there is no CPU 12 on the system.
> 
> That's due to a dubious optimization from Linus:
> 
> #if NR_CPUS <= BITS_PER_LONG
>  #define small_cpumask_bits ((unsigned int)NR_CPUS)
>  #define large_cpumask_bits ((unsigned int)NR_CPUS)
> #elif NR_CPUS <= 4*BITS_PER_LONG
>  #define small_cpumask_bits nr_cpu_ids
> 
> small_cpumask_bits is not nr_cpu_ids(12), it's NR_CPUS(32) which is why
> the loop does not terminate. Bah!
> 
> But that's just the symptom, not the root cause. That code is perfectly
> fine when all callers use the proper cpumask functions.

Agreed: we're crashing here because of the extra bits
in the affinity mask, but those bits should not be set
in the first place.

I wasn't sure whether for_each_cpu() was supposed to
iterate over non-present CPUs -- the answer appears to
be yes: it iterates the full length of the mask, and the
caller is responsible for ensuring the mask is valid.


--
Chuck Lever



