From: Chuck Lever III <chuck.lever@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Eli Cohen <elic@nvidia.com>, Leon Romanovsky <leon@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org>
Subject: Re: system hang on start-up (mlx5?)
Date: Tue, 30 May 2023 21:48:32 +0000	[thread overview]
Message-ID: <C34181E7-A515-4BD1-8C38-CB8BCF2D987D@oracle.com> (raw)
In-Reply-To: <87o7m1iov9.ffs@tglx>



> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Chuck!
> 
> On Tue, May 30 2023 at 13:09, Chuck Lever III wrote:
>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> But if you look at the address: 0xffffffffb9ef3f80
>>> 
>>> That one is bogus:
>>> 
>>>    managed_map=ffff9a36efcf0f80
>>>    managed_map=ffff9a36efd30f80
>>>    managed_map=ffff9a3aefc30f80
>>>    managed_map=ffff9a3aefc70f80
>>>    managed_map=ffff9a3aefd30f80
>>>    managed_map=ffff9a3aefd70f80
>>>    managed_map=ffffffffb9ef3f80
>>> 
>>> Can you spot the fail?
>>> 
>>> The first six are in the direct map and the last one is in module map,
>>> which makes no sense at all.
>> 
>> Indeed. The reason for that is that the affinity mask has bits
>> set for CPU IDs that are not present on my system.
> 
> Which I don't buy, but even if so then this should not make
> for_each_cpu() go south. See below.
> 
>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>> that mask is set up like this:
>> 
>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>> {
>>        struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>> -       cpumask_var_t req_mask;
>> +       struct irq_affinity_desc af_desc;
> 
> That's daft. With NR_CPUS=8192 this is a whopping 1KB on stack...
> 
>>        struct mlx5_irq *irq;
>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>> -               return ERR_PTR(-ENOMEM);
>> -       cpumask_copy(req_mask, cpu_online_mask);
>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
>> +       af_desc.is_managed = false;
>> 
>> Which normally works as you would expect. But for some historical
>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>> cpumask_copy() misbehaves.
>> 
>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>> copy, this crash goes away. But mlx5_core crashes during a later
>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>> exactly the same thing (for_each_cpu() on an affinity mask created
>> by copying), and crashes in a very similar fashion.
>> 
>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>> 
>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>> when NR_CPUS is a small value (the default is 8192).
> 
> I don't buy any of this. Here is why:
> 
> cpumask_copy(d, s)
>   bitmap_copy(d, s, nbits = 32)
>     len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
> 
> So it copies as many longs as required to cover nbits, i.e. it copies
> any clobbered bits beyond nbits too. While that looks odd at the first
> glance, that's just an optimization which is harmless.
> 
> for_each_cpu() finds the next set bit in a mask and breaks the loop once
> bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too.
> 
> I just booted a kernel with NR_CPUS=32:

My system has only 12 CPUs. So on your test system every bit in the
mask corresponds to a present CPU, but on mine only the bits in
0x00000fff ever do.

Therefore, on my system, any bit above bit 11 in a CPU mask
references a CPU that is not present.


> [    0.152988] smpboot: 56 Processors exceeds NR_CPUS limit of 32
> [    0.153606] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
> ...
> [    0.173854] setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:1
> 
> and added a function which does:
> 
>    struct cpumask ma, mb;
>    int cpu;
> 
>    memset(&ma, 0xaa, sizeof(ma));
>    cpumask_copy(&mb, &ma);
>    pr_info("MASKBITS: %016lx\n", mb.bits[0]);
>    pr_info("CPUs:");
>    for_each_cpu(cpu, &mb)
>         pr_cont(" %d", cpu);
>    pr_cont("\n");
> 
> [    2.165606] smp: MASKBITS: 0xaaaaaaaaaaaaaaaa
> [    2.166574] smp: CPUs: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
> 
> and the same exercise with a 0x55 filled source bitmap.
> 
> [    2.167595] smp: MASKBITS: 0x5555555555555555
> [    2.168568] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
> 
> So while cpumask_copy copied beyond NR_CPUS bits, for_each_cpu() does
> the right thing simply because of this:
> 
> nr_cpu_ids is 32, right?
> 
> for_each_cpu(cpu, mask)
>   for_each_set_bit(bit = cpu, addr = &mask, size = nr_cpu_ids)
> for ((bit) = 0; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)
> 
> So if find_next_bit() returns a bit after bit 31 the condition (bit) <
> (size) will terminate the loop, right?

Again, you are assuming more CPUs than there are bits in the mask.


> Also in the case of that driver the copy is _NOT_ copying set bits
> beyond bit 31 simply because the source mask is cpu_online_mask which
> definitely does not have a bit set which is greater than 31. As the copy
> copies longs the resulting mask in af_desc.mask cannot have any bit set
> past bit 31 either.
> 
> If the above is not correct, then there is a bigger problem than that
> MLX5 driver crashing.
> 
> So this:
> 
>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>> copy, this crash goes away.
> 
> does not make any sense to me.
> 
> Can you please add after the cpumask_copy() in that mlx5 code:
> 
>    pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
>    pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);

Both are 0000 0000 0000 0fff, as expected on a system
where 12 CPUs are present.


> Please print also in irq_matrix_reserve_managed():
> 
>  - @mask->bits[0]
>  - nr_cpu_ids
>  - the CPU numbers inside the for_each_cpu() loop

Here's where it gets interesting:

+       pr_info("%s: MASKBITS:   %016lx\n", __func__, msk->bits[0]);
+       pr_info("%s: nr_cpu_ids=%u\n", __func__, nr_cpu_ids);

[   70.957400][ T1185] mlx5_core 0000:81:00.0: firmware version: 16.35.2000
[   70.964146][ T1185] mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[   71.260555][    C9] port_module: 1 callbacks suppressed
[   71.260561][    C9] mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
[   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS:   ffffb1a74686bcd8
[   71.274096][   T10] mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
[   71.280844][ T1185] irq_matrix_reserve_managed: nr_cpu_ids=12
[   71.280846][ T1185] irq_matrix_reserve_managed: cm=ffff909aefcf0f48 cm->managed_map=ffff909aefcf0f80 cpu=3
[   71.280849][ T1185] irq_matrix_reserve_managed: cm=ffff909aefd30f48 cm->managed_map=ffff909aefd30f80 cpu=4
[   71.280851][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc30f48 cm->managed_map=ffff909eefc30f80 cpu=6
[   71.280852][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc70f48 cm->managed_map=ffff909eefc70f80 cpu=7
[   71.280854][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd30f48 cm->managed_map=ffff909eefd30f80 cpu=10
[   71.280856][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd70f48 cm->managed_map=ffff909eefd70f80 cpu=11
[   71.280858][ T1185] irq_matrix_reserve_managed: cm=ffffffff98ef3f48 cm->managed_map=ffffffff98ef3f80 cpu=12

Notice that bits higher than bit 11 are in fact set.

The lowest 16 bits of MASKBITS are bcd8, or in binary:

... 1011 1100 1101 1000

Starting from the low-order side, the set bits are 3, 4, 6, 7, 10, 11,
and 12, matching the CPU IDs reported by the loop. At bit 12 we fault,
since there is no CPU 12 on this system.


--
Chuck Lever


