From: Chuck Lever III <chuck.lever@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Eli Cohen <elic@nvidia.com>, Leon Romanovsky <leon@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: system hang on start-up (mlx5?)
Date: Wed, 31 May 2023 15:06:15 +0000	[thread overview]
Message-ID: <48B0BC74-5F5C-4212-BC5A-552356E9FFB1@oracle.com> (raw)
In-Reply-To: <87ttvsftoc.ffs@tglx>



> On May 31, 2023, at 10:43 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, May 30 2023 at 21:48, Chuck Lever III wrote:
>>> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Can you please add after the cpumask_copy() in that mlx5 code:
>>> 
>>>   pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
>>>   pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);
>> 
>> Both are 0000 0000 0000 0fff, as expected on a system
>> where 12 CPUs are present.
> 
> So the non-initialized mask on stack has the online bits correctly
> copied and bits 12-63 are cleared, which is what cpumask_copy()
> achieves by copying longs and not bits.
> 
>> [   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS: ffffb1a74686bcd8
> 
> How can that end up with a completely different content here?
> 
> As I said before that's clearly a direct map address.
> 
> So the call chain is:
> 
> mlx5_irq_alloc(af_desc)
>  pci_msix_alloc_irq_at(af_desc)
>    msi_domain_alloc_irq_at(af_desc)
>      __msi_domain_alloc_irqs(af_desc)
> 1)      msidesc->affinity = kmemdup(af_desc);
>        __irq_domain_alloc_irqs()
>          __irq_domain_alloc_irqs(aff=msidesc->affinity)
>            irq_domain_alloc_irqs_locked(aff)
>              irq_domain_alloc_irqs_locked(aff)
>                irq_domain_alloc_descs(aff)
>                  alloc_desc(mask=&aff->mask)
>                    desc_smp_init(mask)
> 2)                    cpumask_copy(desc->irq_common_data.affinity, mask);
>                irq_domain_alloc_irqs_hierarchy()
>                  msi_domain_alloc()
>                    intel_irq_remapping_alloc()
>                      x86_vector_alloc_irqs()

x86_vector_alloc_irqs() is where the struct irq_data that
ends up in irq_matrix_reserve_managed() is fabricated.


>                        reserve_managed_vector()
>                          mask = desc->irq_common_data.affinity;
>                          irq_matrix_reserve_managed(mask)
> 
> So af_desc is kmemdup'ed at #1 and then the result is copied in #2.
> 
> Anything else just hands pointers around. So where gets either af_desc
> or msidesc->affinity or desc->irq_common_data.affinity overwritten? Or
> one of the pointers mangled. I doubt that it's the latter as this is 99%
> generic code which would end up in random fails all over the place.
> 
> This also ends up in the wrong place. That mlx code does:
> 
>   af_desc.is_managed = false;
> 
> but the allocation ends up allocating a managed vector.

That line was changed in 6.4-rc4 to address another bug,
and it avoids the crash by not calling into the misbehaving
code. It doesn't address the mlx5_core initialization issue,
though: as I said before, execution continues and crashes
in a similar scenario later on.

On my system, I've reverted that fix:

-       af_desc.is_managed = false;
+       af_desc.is_managed = 1;

so that we can continue debugging this crash.


> Can you please instrument this along the call chain so we can see where
> or at least when this gets corrupted? Please print the relevant pointer
> addresses too so we can see whether that's consistent or not.

I will continue working on this today.


>> The lowest 16 bits of MASKBITS are bcd8, or in binary:
>> 
>> ... 1011 1100 1101 1000
>> 
>> Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
>> 12, matching the CPU IDs from the loop. At bit 12, we fault,
>> since there is no CPU 12 on the system.
> 
> That's due to a dubious optimization from Linus:
> 
> #if NR_CPUS <= BITS_PER_LONG
>  #define small_cpumask_bits ((unsigned int)NR_CPUS)
>  #define large_cpumask_bits ((unsigned int)NR_CPUS)
> #elif NR_CPUS <= 4*BITS_PER_LONG
>  #define small_cpumask_bits nr_cpu_ids
> 
> small_cpumask_bits is not nr_cpu_ids(12), it's NR_CPUS(32) which is why
> the loop does not terminate. Bah!
> 
> But that's just the symptom, not the root cause. That code is perfectly
> fine when all callers use the proper cpumask functions.

Agreed: we're crashing here because of the extra bits
in the affinity mask, but those bits should not be set
in the first place.

I wasn't sure whether for_each_cpu() was supposed to
iterate over non-present CPUs -- the answer appears to
be yes: it iterates the full length of the mask, and the
caller is responsible for ensuring the mask is valid.


--
Chuck Lever



