From: Chuck Lever III <chuck.lever@oracle.com>
To: Shay Drory <shayd@nvidia.com>, Eli Cohen <elic@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>,
	Saeed Mahameed <saeedm@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: system hang on start-up (mlx5?)
Date: Wed, 31 May 2023 14:15:28 +0000
Message-ID: <B9761A06-C76C-4088-A748-77867C9FF3CD@oracle.com>
In-Reply-To: <9d793d9f-0fca-2b0d-2a2e-abd527ffa8d4@nvidia.com>



> On May 30, 2023, at 11:08 AM, Shay Drory <shayd@nvidia.com> wrote:
> 
> 
> On 30/05/2023 16:54, Eli Cohen wrote:
>>> -----Original Message-----
>>> From: Chuck Lever III <chuck.lever@oracle.com>
>>> Sent: Tuesday, 30 May 2023 16:51
>>> To: Eli Cohen <elic@nvidia.com>
>>> Cc: Shay Drory <shayd@nvidia.com>; Leon Romanovsky <leon@kernel.org>;
>>> Saeed Mahameed <saeedm@nvidia.com>;
>>> linux-rdma <linux-rdma@vger.kernel.org>;
>>> open list:NETWORKING [GENERAL] <netdev@vger.kernel.org>;
>>> Thomas Gleixner <tglx@linutronix.de>
>>> Subject: Re: system hang on start-up (mlx5?)
>>> 
>>> 
>>> 
>>>> On May 30, 2023, at 9:48 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>> 
>>>>> From: Chuck Lever III <chuck.lever@oracle.com>
>>>>> Sent: Tuesday, 30 May 2023 16:28
>>>>> To: Eli Cohen <elic@nvidia.com>
>>>>> Cc: Leon Romanovsky <leon@kernel.org>; Saeed Mahameed
>>>>> <saeedm@nvidia.com>; linux-rdma <linux-rdma@vger.kernel.org>; open
>>>>> list:NETWORKING [GENERAL] <netdev@vger.kernel.org>; Thomas Gleixner
>>>>> <tglx@linutronix.de>
>>>>> Subject: Re: system hang on start-up (mlx5?)
>>>>> 
>>>>> 
>>>>> 
>>>>>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>>>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>>>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>>>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>>>>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>>>>>>> reproduce the issue while the system is running.
>>>>>>>> 
>>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
>>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>>> ###
>>>>>>>> 
>>>>>>>> The fault address is the cm->managed_map for one of the CPUs.
>>>>>>> That does not make any sense at all. The irq matrix is initialized via:
>>>>>>> 
>>>>>>> irq_alloc_matrix()
>>>>>>> m = kzalloc(sizeof(*m), GFP_KERNEL);
>>>>>>> m->maps = alloc_percpu(*m->maps);
>>>>>>> 
>>>>>>> So how is any per CPU map which got allocated there supposed to be
>>>>>>> invalid (not mapped):
>>>>>>> 
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>>>>>>> But if you look at the address: 0xffffffffb9ef3f80
>>>>>>> 
>>>>>>> That one is bogus:
>>>>>>> 
>>>>>>>   managed_map=ffff9a36efcf0f80
>>>>>>>   managed_map=ffff9a36efd30f80
>>>>>>>   managed_map=ffff9a3aefc30f80
>>>>>>>   managed_map=ffff9a3aefc70f80
>>>>>>>   managed_map=ffff9a3aefd30f80
>>>>>>>   managed_map=ffff9a3aefd70f80
>>>>>>>   managed_map=ffffffffb9ef3f80
>>>>>>> 
>>>>>>> Can you spot the fail?
>>>>>>> 
>>>>>>> The first six are in the direct map and the last one is in module map,
>>>>>>> which makes no sense at all.
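The outlier in the list above can also be picked out mechanically. The following is a minimal Python sketch, not part of the original exchange. It assumes the documented x86_64 layout in which the top 2 GiB of the address space (from 0xffffffff80000000 upward) holds the kernel text and module mappings; the constant and the helper name find_outliers are illustrative choices, and the exact region boundaries can shift with KASLR.

```python
# Pointer values copied from the managed_map= fields in the log above.
managed_maps = [
    0xffff9a36efcf0f80,
    0xffff9a36efd30f80,
    0xffff9a3aefc30f80,
    0xffff9a3aefc70f80,
    0xffff9a3aefd30f80,
    0xffff9a3aefd70f80,
    0xffffffffb9ef3f80,
]

# Assumption: start of the top 2 GiB on x86_64, which is used for the
# kernel text and module mappings rather than for percpu allocations.
TOP_2GIB_START = 0xffffffff80000000


def find_outliers(addrs, threshold=TOP_2GIB_START):
    """Return the addresses that fall at or above the threshold."""
    return [a for a in addrs if a >= threshold]


for addr in find_outliers(managed_maps):
    print(f"suspicious percpu pointer: {addr:#018x}")
```

On the values above this flags only the last entry, which is also the faulting address reported by the BUG line.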
>>>>>> Indeed. The reason for that is that the affinity mask has bits
>>>>>> set for CPU IDs that are not present on my system.
>>>>>> 
>>>>>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>>>>>> that mask is set up like this:
>>>>>> 
>>>>>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>>>>>> {
>>>>>>       struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>>>>>> -       cpumask_var_t req_mask;
>>>>>> +       struct irq_affinity_desc af_desc;
>>>>>>       struct mlx5_irq *irq;
>>>>>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>>>>>> -               return ERR_PTR(-ENOMEM);
>>>>>> -       cpumask_copy(req_mask, cpu_online_mask);
>>>>>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
>>>>>> +       af_desc.is_managed = false;
>>>>> By the way, why is "is_managed" set to false?
>>>>> 
>>>>> This particular system is a NUMA system, and I'd like to be
>>>>> able to set IRQ affinity for the card. Since is_managed is
>>>>> set to false, writing to the /proc/irq files fails with EIO.
>>>>> 
>>>> This is a control irq and is used for issuing configuration commands.
>>>> 
>>>> This commit:
>>>> commit c410abbbacb9b378365ba17a30df08b4b9eec64f
>>>> Author: Dou Liyang <douliyangs@gmail.com>
>>>> Date:   Tue Dec 4 23:51:21 2018 +0800
>>>> 
>>>>    genirq/affinity: Add is_managed to struct irq_affinity_desc
>>>> 
>>>> explains why it should not be managed.
>>> Understood, but what about the other IRQs? I can't set any
>>> of them. All writes to the proc files result in EIO.
>>> 
>> I think @Shay Drory has a fix for that which should go upstream.
>> Shay, was it sent?
> 
> The fix was sent and merged.
> 
> https://lore.kernel.org/all/20230523054242.21596-15-saeed@kernel.org/

Fwiw, I'm now on v6.4-rc4, and setting IRQ affinity works as expected.
Sorry for the noise and thanks for the fix.
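
For reference, the value written to /proc/irq/<N>/smp_affinity is a hexadecimal CPU bitmask, split into comma-separated 32-bit groups when more than 32 CPUs are involved. Below is a small Python sketch for building such a string. The helper name cpus_to_smp_affinity is hypothetical and the IRQ number is left to the reader; actually writing the file requires root and is not done here.

```python
def cpus_to_smp_affinity(cpus):
    """Format an iterable of CPU numbers as an smp_affinity hex bitmask.

    The mask is emitted as comma-separated 32-bit groups, most
    significant group first.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu

    groups = []
    while True:
        groups.append(f"{mask & 0xffffffff:08x}")
        mask >>= 32
        if not mask:
            break
    return ",".join(reversed(groups))


# CPUs 0-3 share the lowest 32-bit group.
print(cpus_to_smp_affinity([0, 1, 2, 3]))
# CPU 32 needs a second group.
print(cpus_to_smp_affinity([32]))
```

The resulting string would then be written to the smp_affinity file of the IRQ in question; the smp_affinity_list file in the same directory accepts a plain CPU list such as 0-3 instead.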


>>>>>> Which normally works as you would expect. But for some historical
>>>>>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>>>>>> cpumask_copy() misbehaves.
>>>>>> 
>>>>>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>>>>>> copy, this crash goes away. But mlx5_core crashes during a later
>>>>>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>>>>>> exactly the same thing (for_each_cpu() on an affinity mask created
>>>>>> by copying), and crashes in a very similar fashion.
>>>>>> 
>>>>>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>>>>>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>>>>>> 
>>>>>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>>>>>> when NR_CPUS is a small value (the default is 8192).
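The observation that clearing @af_desc before the copy avoids the first crash is consistent with stale bits surviving in storage that the copy does not overwrite. The following Python toy model illustrates that general pattern only. It is not kernel code, and the widths (COPIED_BITS, SCANNED_BITS) and the stale value are hypothetical numbers picked for the demonstration, not values taken from this system.

```python
COPIED_BITS = 12     # hypothetical: number of bits the copy writes
SCANNED_BITS = 32    # hypothetical: number of bits the iterator scans
ONLINE = 0xfff       # hypothetical: CPUs 0..11 online
STALE = 0xdead0000   # hypothetical: leftover contents of the destination


def copy_bits(dst, src, nbits):
    """Copy the low nbits of src into dst; higher bits of dst are kept."""
    low = (1 << nbits) - 1
    return (dst & ~low) | (src & low)


def set_bits(mask, width):
    """List the bit positions below width that are set in mask."""
    return [i for i in range(width) if (mask >> i) & 1]


# Copy into uninitialized storage: stale high bits look like extra CPUs.
uncleared = copy_bits(STALE, ONLINE, COPIED_BITS)
# Clear the destination first: only the copied bits remain.
cleared = copy_bits(0, ONLINE, COPIED_BITS)

print("without clearing:", set_bits(uncleared, SCANNED_BITS))
print("with clearing:   ", set_bits(cleared, SCANNED_BITS))
```

In this model, the uncleared mask reports CPU numbers above 11 that were never copied in, while the cleared mask reports exactly CPUs 0 through 11.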
>>>>>> 
>>>>>> 
>>>>>>> Can you please apply the debug patch below and provide the output?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>>      tglx
>>>>>>> ---
>>>>>>> --- a/kernel/irq/matrix.c
>>>>>>> +++ b/kernel/irq/matrix.c
>>>>>>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>>>>>> unsigned int alloc_end)
>>>>>>> {
>>>>>>> struct irq_matrix *m;
>>>>>>> + unsigned int cpu;
>>>>>>> 
>>>>>>> if (matrix_bits > IRQ_MATRIX_BITS)
>>>>>>> return NULL;
>>>>>>> @@ -68,6 +69,8 @@ struct irq_matrix {
>>>>>>> kfree(m);
>>>>>>> return NULL;
>>>>>>> }
>>>>>>> + for_each_possible_cpu(cpu)
>>>>>>> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
>>>>>>> return m;
>>>>>>> }
>>>>>>> 
>>>>>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>>>>>>> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>>>>>>> unsigned int bit;
>>>>>>> 
>>>>>>> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
>>>>>>> +
>>>>>>> bit = matrix_alloc_area(m, cm, 1, true);
>>>>>>> if (bit >= m->alloc_end)
>>>>>>> goto cleanup;
>>>>>> --
>>>>>> Chuck Lever
>>>>> 
>>>>> --
>>>>> Chuck Lever
>>> 
>>> --
>>> Chuck Lever


--
Chuck Lever


