* system hang on start-up (mlx5?)
@ 2023-05-03  1:03 Chuck Lever III
  2023-05-03  6:34 ` Eli Cohen
  2023-05-08 12:29 ` Linux regression tracking #adding (Thorsten Leemhuis)
  0 siblings, 2 replies; 36+ messages in thread
From: Chuck Lever III @ 2023-05-03  1:03 UTC (permalink / raw)
  To: elic; +Cc: saeedm, Leon Romanovsky, linux-rdma, open list:NETWORKING [GENERAL]

Hi-

I have a Supermicro X10SRA-F/X10SRA-F with a ConnectX®-5 EN network
interface card, 100GbE single-port QSFP28, PCIe3.0 x16, tall bracket;
MCX515A-CCAT

When booting a v6.3+ kernel, the boot process stops cold after a
few seconds. The last message on the console is the MLX5 driver
note about "PCIe slot advertised sufficient power (27W)".

bisect reports that bbac70c74183 ("net/mlx5: Use newer affinity
descriptor") is the first bad commit.

I've trolled lore a couple of times and haven't found any discussion
of this issue.


--
Chuck Lever
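
For context, the bisected commit concerns the affinity-descriptor based,
dynamically allocated MSI-X vectors that show up later in this thread's
crash path (mlx5_irq_alloc -> pci_msix_alloc_irq_at). Below is a minimal,
hypothetical sketch of that style of allocation, written against the
6.3-era API as best understood; it is not mlx5's actual code, and the
helper name is made up.

#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/msi.h>

/*
 * Hypothetical helper: request one dynamically allocated, kernel-managed
 * MSI-X vector whose affinity is pinned to a single CPU.
 */
static int request_managed_vector(struct pci_dev *pdev, unsigned int cpu)
{
	struct irq_affinity_desc af_desc = { .is_managed = 1 };
	struct msi_map map;

	/* Pin the vector's affinity to the requested CPU. */
	cpumask_clear(&af_desc.mask);
	cpumask_set_cpu(cpu, &af_desc.mask);

	/* Let the MSI core pick a free MSI-X table entry for this descriptor. */
	map = pci_msix_alloc_irq_at(pdev, MSI_ANY_INDEX, &af_desc);
	if (map.index < 0)
		return map.index;	/* negative errno on failure */

	return map.virq;		/* Linux interrupt number on success */
}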




* RE: system hang on start-up (mlx5?)
  2023-05-03  1:03 system hang on start-up (mlx5?) Chuck Lever III
@ 2023-05-03  6:34 ` Eli Cohen
  2023-05-03 14:02   ` Chuck Lever III
  2023-05-08 12:29 ` Linux regression tracking #adding (Thorsten Leemhuis)
  1 sibling, 1 reply; 36+ messages in thread
From: Eli Cohen @ 2023-05-03  6:34 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Saeed Mahameed, Leon Romanovsky, linux-rdma,
	open list:NETWORKING [GENERAL]

Hi Chuck,

Just verifying, could you make sure your server and card firmware are up to date?

Will try to see if I can reproduce this here.

> -----Original Message-----
> From: Chuck Lever III <chuck.lever@oracle.com>
> Sent: Wednesday, 3 May 2023 4:03
> To: Eli Cohen <elic@nvidia.com>
> Cc: Saeed Mahameed <saeedm@nvidia.com>; Leon Romanovsky
> <leon@kernel.org>; linux-rdma <linux-rdma@vger.kernel.org>; open
> list:NETWORKING [GENERAL] <netdev@vger.kernel.org>
> Subject: system hang on start-up (mlx5?)
> 
> Hi-
> 
> I have a Supermicro X10SRA-F/X10SRA-F with a ConnectX®-5 EN network
> interface card, 100GbE single-port QSFP28, PCIe3.0 x16, tall bracket;
> MCX515A-CCAT
> 
> When booting a v6.3+ kernel, the boot process stops cold after a
> few seconds. The last message on the console is the MLX5 driver
> note about "PCIe slot advertised sufficient power (27W)".
> 
> bisect reports that bbac70c74183 ("net/mlx5: Use newer affinity
> descriptor") is the first bad commit.
> 
> I've trolled lore a couple of times and haven't found any discussion
> of this issue.
> 
> 
> --
> Chuck Lever
> 



* Re: system hang on start-up (mlx5?)
  2023-05-03  6:34 ` Eli Cohen
@ 2023-05-03 14:02   ` Chuck Lever III
  2023-05-04  7:29     ` Leon Romanovsky
  0 siblings, 1 reply; 36+ messages in thread
From: Chuck Lever III @ 2023-05-03 14:02 UTC (permalink / raw)
  To: Eli Cohen
  Cc: Saeed Mahameed, Leon Romanovsky, linux-rdma,
	open list:NETWORKING [GENERAL]



> On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
> 
> Hi Chuck,
> 
> Just verifying, could you make sure your server and card firmware are up to date?

Device firmware updated to 16.35.2000; no change.

System firmware is dated September 2016. I'll see if I can get
something more recent installed.


> Will try to see if I can reproduce this here.
> 
>> -----Original Message-----
>> From: Chuck Lever III <chuck.lever@oracle.com>
>> Sent: Wednesday, 3 May 2023 4:03
>> To: Eli Cohen <elic@nvidia.com>
>> Cc: Saeed Mahameed <saeedm@nvidia.com>; Leon Romanovsky
>> <leon@kernel.org>; linux-rdma <linux-rdma@vger.kernel.org>; open
>> list:NETWORKING [GENERAL] <netdev@vger.kernel.org>
>> Subject: system hang on start-up (mlx5?)
>> 
>> Hi-
>> 
>> I have a Supermicro X10SRA-F/X10SRA-F with a ConnectX®-5 EN network
>> interface card, 100GbE single-port QSFP28, PCIe3.0 x16, tall bracket;
>> MCX515A-CCAT
>> 
>> When booting a v6.3+ kernel, the boot process stops cold after a
>> few seconds. The last message on the console is the MLX5 driver
>> note about "PCIe slot advertised sufficient power (27W)".
>> 
>> bisect reports that bbac70c74183 ("net/mlx5: Use newer affinity
>> descriptor") is the first bad commit.
>> 
>> I've trolled lore a couple of times and haven't found any discussion
>> of this issue.
>> 
>> 
>> --
>> Chuck Lever
>> 
> 

--
Chuck Lever




* Re: system hang on start-up (mlx5?)
  2023-05-03 14:02   ` Chuck Lever III
@ 2023-05-04  7:29     ` Leon Romanovsky
  2023-05-04 19:02       ` Chuck Lever III
  0 siblings, 1 reply; 36+ messages in thread
From: Leon Romanovsky @ 2023-05-04  7:29 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Eli Cohen, Saeed Mahameed, linux-rdma, open list:NETWORKING [GENERAL]

On Wed, May 03, 2023 at 02:02:33PM +0000, Chuck Lever III wrote:
> 
> 
> > On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
> > 
> > Hi Chuck,
> > 
> > Just verifying, could you make sure your server and card firmware are up to date?
> 
> Device firmware updated to 16.35.2000; no change.
> 
> System firmware is dated September 2016. I'll see if I can get
> something more recent installed.

We are trying to reproduce this issue internally.

Thanks


* Re: system hang on start-up (mlx5?)
  2023-05-04  7:29     ` Leon Romanovsky
@ 2023-05-04 19:02       ` Chuck Lever III
  2023-05-04 23:38         ` Jason Gunthorpe
                           ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Chuck Lever III @ 2023-05-04 19:02 UTC (permalink / raw)
  To: Leon Romanovsky, Eli Cohen
  Cc: Saeed Mahameed, linux-rdma, open list:NETWORKING [GENERAL]



> On May 4, 2023, at 3:29 AM, Leon Romanovsky <leon@kernel.org> wrote:
> 
> On Wed, May 03, 2023 at 02:02:33PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
>>> 
>>> Hi Chuck,
>>> 
>>> Just verifying, could you make sure your server and card firmware are up to date?
>> 
>> Device firmware updated to 16.35.2000; no change.
>> 
>> System firmware is dated September 2016. I'll see if I can get
>> something more recent installed.
> 
> We are trying to reproduce this issue internally.

More information. I captured the serial console during boot.
Here are the last messages:

[    9.837087] mlx5_core 0000:02:00.0: firmware version: 16.35.2000
[    9.843126] mlx5_core 0000:02:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[   10.311515] mlx5_core 0000:02:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[   10.321948] mlx5_core 0000:02:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
[   10.344324] mlx5_core 0000:02:00.0: mlx5_pcie_event:301:(pid 88): PCIe slot advertised sufficient power (27W).
[   10.354339] BUG: unable to handle page fault for address: ffffffff8ff0ade0
[   10.361206] #PF: supervisor read access in kernel mode
[   10.366335] #PF: error_code(0x0000) - not-present page
[   10.371467] PGD 81ec39067 P4D 81ec39067 PUD 81ec3a063 PMD 114b07063 PTE 800ffff7e10f5062
[   10.379544] Oops: 0000 [#1] PREEMPT SMP PTI
[   10.383721] CPU: 0 PID: 117 Comm: kworker/0:6 Not tainted 6.3.0-13028-g7222f123c983 #1
[   10.391625] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
[   10.398750] Workqueue: events work_for_cpu_fn
[   10.403108] RIP: 0010:__bitmap_or+0x10/0x26
[   10.407286] Code: 85 c0 0f 95 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 c9 31 c0 48 83 c1 3f 48 c1 e9 06 39 c8 73 11 <4c> 8b 04 c6 4c 0b 04 c2 4c 89 04 c7 48 ff c0 eb eb c3 cc cc cc cc
[   10.426024] RSP: 0000:ffffb45a0078f7b0 EFLAGS: 00010097
[   10.431240] RAX: 0000000000000000 RBX: ffffffff8ff0adc0 RCX: 0000000000000004
[   10.438365] RDX: ffff9156801967d0 RSI: ffffffff8ff0ade0 RDI: ffff9156801967b0
[   10.445489] RBP: ffffb45a0078f7e8 R08: 0000000000000030 R09: 0000000000000000
[   10.452613] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ec
[   10.459737] R13: ffffffff8ff0ade0 R14: 0000000000000001 R15: 0000000000000020
[   10.466862] FS:  0000000000000000(0000) GS:ffff9165bfc00000(0000) knlGS:0000000000000000
[   10.474936] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.480674] CR2: ffffffff8ff0ade0 CR3: 00000001011ae003 CR4: 00000000003706f0
[   10.487800] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   10.494922] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   10.502046] Call Trace:
[   10.504493]  <TASK>
[   10.506589]  ? matrix_alloc_area.constprop.0+0x43/0x9a
[   10.511729]  ? prepare_namespace+0x84/0x174
[   10.515914]  irq_matrix_reserve_managed+0x56/0x10c
[   10.520699]  x86_vector_alloc_irqs+0x1d2/0x31e
[   10.525146]  irq_domain_alloc_irqs_hierarchy+0x39/0x3f
[   10.530284]  irq_domain_alloc_irqs_parent+0x1a/0x2a
[   10.535155]  intel_irq_remapping_alloc+0x59/0x5e9
[   10.539859]  ? kmem_cache_debug_flags+0x11/0x26
[   10.544383]  ? __radix_tree_lookup+0x39/0xb9
[   10.548649]  irq_domain_alloc_irqs_hierarchy+0x39/0x3f
[   10.553779]  irq_domain_alloc_irqs_parent+0x1a/0x2a
[   10.558650]  msi_domain_alloc+0x8c/0x120
[ rqs_hierarchy+0x39/0x3f
[   10.567697]  irq_domain_alloc_irqs_locked+0x11d/0x286
[   10.572741]  __irq_domain_alloc_irqs+0x72/0x93
[   10.577179]  __msi_domain_alloc_irqs+0x193/0x3f1
[   10.581789]  ? __xa_alloc+0xcf/0xe2
[   10.585273]  msi_domain_alloc_irq_at+0xa8/0xfe
[   10.589711]  pci_msix_alloc_irq_at+0x47/0x5c
[   10.593987]  mlx5_irq_alloc+0x99/0x319 [mlx5_core]
[   10.598881]  ? xa_load+0x5e/0x68
[   10.602112]  irq_pool_request_vector+0x60/0x7d [mlx5_core]
[   10.607668]  mlx5_irq_request+0x26/0x98 [mlx5_core]
[   10.612617]  mlx5_irqs_request_vectors+0x52/0x82 [mlx5_core]
[   10.618345]  mlx5_eq_table_create+0x613/0x8d3 [mlx5_core]
[   10.623806]  ? kmalloc_trace+0x46/0x57
[   10.627549]  mlx5_load+0xb1/0x33e [mlx5_core]
[   10.631971]  mlx5_init_one+0x497/0x514 [mlx5_core]
[   10.636824]  probe_one+0x2fa/0x3f6 [mlx5_core]
[   10.641330]  local_pci_probe+0x47/0x8b
[   10.645073]  work_for_cpu_fn+0x1a/0x25
[   10.648817]  process_one_work+0x1e0/0x2e0
[   10.652822]  process_scheduled_works+0x2c/0x37
[   10.657258]  worker_thread+0x1e2/0x25e
[   10.661003]  ? __pfx_worker_thread+0x10/0x10
[   10.665267]  kthread+0x10d/0x115
[   10.668501]  ? __pfx_kthread+0x10/0x10
[   10.672244]  ret_from_fork+0x2c/0x50
[   10.675824]  </TASK>
[   10.678007] Modules linked in: mlx5_core(+) ast drm_kms_helper crct10dif_pclmul crc32_pclmul drm_shmem_helper crc32c_intel drm ghash_clmulni_intel sha512_ssse3 igb dca i2c_algo_bit mlxfw pci_hyperv_intf pkcs8_key_parser
[   10.697447] CR2: ffffffff8ff0ade0
[   10.700758] ---[ end trace 0000000000000000 ]---
[   10.707706] pstore: backend (erst) writing error (-28)
[   10.712838] RIP: 0010:__bitmap_or+0x10/0x26
[   10.717014] Code: 85 c0 0f 95 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 c9 31 c0 48 83 c1 3f 48 c1 e9 06 39 c8 73 11 <4c> 8b 04 c6 4c 0b 04 c2 4c 89 04 c7 48 ff c0 eb eb c3 cc cc cc cc
[   10.735752] RSP: 0000:ffffb45a0078f7b0 EFLAGS: 00010097
[   10.740969] RAX: 0000000000000000 RBX: ffffffff8ff0adc0 RCX: 0000000000000004
[   10.748093] RDX: ffff9156801967d0 RSI: ffffffff8ff0ade0 RDI: ffff9156801967b0
[   10.755218] RBP: ffffb45a0078f7e8 R08: 0000000000000030 R09: 0000000000000000
[   10.762341] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ec
[   10.769467] R13: ffffffff8ff0ade0 R14: 0000000000000001 R15: 0000000000000020
[   10.776590] FS:  0000000000000000(0000) GS:ffff9165bfc00000(0000) knlGS:0000000000000000
[   10.784666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.790405] CR2: ffffffff8ff0ade0 CR3: 00000001011ae003 CR4: 00000000003706f0
[   10.797529] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   10.804651] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   10.811775] note: kworker/0:6[117] exited with irqs disabled
[   10.817444] note: kworker/0:6[117] exited with preempt_count 1

HTH

--
Chuck Lever




* Re: system hang on start-up (mlx5?)
  2023-05-04 19:02       ` Chuck Lever III
@ 2023-05-04 23:38         ` Jason Gunthorpe
  2023-05-07  5:23           ` Eli Cohen
  2023-05-07  5:31         ` Eli Cohen
  2023-05-16 19:23         ` Chuck Lever III
  2 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2023-05-04 23:38 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Leon Romanovsky, Eli Cohen, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL]

On Thu, May 04, 2023 at 07:02:48PM +0000, Chuck Lever III wrote:
> 
> 
> > On May 4, 2023, at 3:29 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > 
> > On Wed, May 03, 2023 at 02:02:33PM +0000, Chuck Lever III wrote:
> >> 
> >> 
> >>> On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
> >>> 
> >>> Hi Chuck,
> >>> 
> >>> Just verifying, could you make sure your server and card firmware are up to date?
> >> 
> >> Device firmware updated to 16.35.2000; no change.
> >> 
> >> System firmware is dated September 2016. I'll see if I can get
> >> something more recent installed.
> > 
> > We are trying to reproduce this issue internally.
> 
> More information. I captured the serial console during boot.
> Here are the last messages:

Oh I wonder if this is connected to Thomas's recent interrupt and MSI
rework? Might need to do a bisection search around those big series.

Jason


* RE: system hang on start-up (mlx5?)
  2023-05-04 23:38         ` Jason Gunthorpe
@ 2023-05-07  5:23           ` Eli Cohen
  0 siblings, 0 replies; 36+ messages in thread
From: Eli Cohen @ 2023-05-07  5:23 UTC (permalink / raw)
  To: Jason Gunthorpe, Chuck Lever III
  Cc: Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL]

Maybe. And maybe the problem is triggered by the way we use that infrastructure. If I am not mistaken, mlx5_core is the first and only driver to make use of this infrastructure.
 
> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, 5 May 2023 2:38
> To: Chuck Lever III <chuck.lever@oracle.com>
> Cc: Leon Romanovsky <leon@kernel.org>; Eli Cohen <elic@nvidia.com>; Saeed
> Mahameed <saeedm@nvidia.com>; linux-rdma <linux-
> rdma@vger.kernel.org>; open list:NETWORKING [GENERAL]
> <netdev@vger.kernel.org>
> Subject: Re: system hang on start-up (mlx5?)
> 
> On Thu, May 04, 2023 at 07:02:48PM +0000, Chuck Lever III wrote:
> >
> >
> > > On May 4, 2023, at 3:29 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Wed, May 03, 2023 at 02:02:33PM +0000, Chuck Lever III wrote:
> > >>
> > >>
> > >>> On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
> > >>>
> > >>> Hi Chuck,
> > >>>
> > >>> Just verifying, could you make sure your server and card firmware are
> up to date?
> > >>
> > >> Device firmware updated to 16.35.2000; no change.
> > >>
> > >> System firmware is dated September 2016. I'll see if I can get
> > >> something more recent installed.
> > >
> > > We are trying to reproduce this issue internally.
> >
> > More information. I captured the serial console during boot.
> > Here are the last messages:
> 
> Oh I wonder if this is connected to Thomas's recent interrupt and MSI
> rework? Might need to do a bisection search around those big series.
> 
> Jason


* RE: system hang on start-up (mlx5?)
  2023-05-04 19:02       ` Chuck Lever III
  2023-05-04 23:38         ` Jason Gunthorpe
@ 2023-05-07  5:31         ` Eli Cohen
  2023-05-27 20:16           ` Chuck Lever III
  2023-05-16 19:23         ` Chuck Lever III
  2 siblings, 1 reply; 36+ messages in thread
From: Eli Cohen @ 2023-05-07  5:31 UTC (permalink / raw)
  To: Chuck Lever III, Leon Romanovsky, tglx
  Cc: Saeed Mahameed, linux-rdma, open list:NETWORKING [GENERAL]

Hi Thomas,

Do you have insights what could cause this?

> -----Original Message-----
> From: Chuck Lever III <chuck.lever@oracle.com>
> Sent: Thursday, 4 May 2023 22:03
> To: Leon Romanovsky <leon@kernel.org>; Eli Cohen <elic@nvidia.com>
> Cc: Saeed Mahameed <saeedm@nvidia.com>; linux-rdma <linux-
> rdma@vger.kernel.org>; open list:NETWORKING [GENERAL]
> <netdev@vger.kernel.org>
> Subject: Re: system hang on start-up (mlx5?)
> 
> 
> 
> > On May 4, 2023, at 3:29 AM, Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, May 03, 2023 at 02:02:33PM +0000, Chuck Lever III wrote:
> >>
> >>
> >>> On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
> >>>
> >>> Hi Chuck,
> >>>
> >>> Just verifying, could you make sure your server and card firmware are up
> to date?
> >>
> >> Device firmware updated to 16.35.2000; no change.
> >>
> >> System firmware is dated September 2016. I'll see if I can get
> >> something more recent installed.
> >
> > We are trying to reproduce this issue internally.
> 
> More information. I captured the serial console during boot.
> Here are the last messages:
> 
> [    9.837087] mlx5_core 0000:02:00.0: firmware version: 16.35.2000
> [    9.843126] mlx5_core 0000:02:00.0: 126.016 Gb/s available PCIe
> bandwidth (8.0 GT/s PCIe x16 link)
> [   10.311515] mlx5_core 0000:02:00.0: Rate limit: 127 rates are supported,
> range: 0Mbps to 97656Mbps
> [   10.321948] mlx5_core 0000:02:00.0: E-Switch: Total vports 2, per vport:
> max uc(128) max mc(2048)
> [   10.344324] mlx5_core 0000:02:00.0: mlx5_pcie_event:301:(pid 88): PCIe
> slot advertised sufficient power (27W).
> [   10.354339] BUG: unable to handle page fault for address: ffffffff8ff0ade0
> [   10.361206] #PF: supervisor read access in kernel mode
> [   10.366335] #PF: error_code(0x0000) - not-present page
> [   10.371467] PGD 81ec39067 P4D 81ec39067 PUD 81ec3a063 PMD
> 114b07063 PTE 800ffff7e10f5062
> [   10.379544] Oops: 0000 [#1] PREEMPT SMP PTI
> [   10.383721] CPU: 0 PID: 117 Comm: kworker/0:6 Not tainted 6.3.0-13028-
> g7222f123c983 #1
> [   10.391625] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b
> 06/12/2017
> [   10.398750] Workqueue: events work_for_cpu_fn
> [   10.403108] RIP: 0010:__bitmap_or+0x10/0x26
> [   10.407286] Code: 85 c0 0f 95 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 90 89 c9 31 c0 48 83 c1 3f 48 c1 e9 06 39 c8 73 11 <4c>
> 8b 04 c6 4c 0b 04 c2 4c 89 04 c7 48 ff c0 eb eb c3 cc cc cc cc
> [   10.426024] RSP: 0000:ffffb45a0078f7b0 EFLAGS: 00010097
> [   10.431240] RAX: 0000000000000000 RBX: ffffffff8ff0adc0 RCX:
> 0000000000000004
> [   10.438365] RDX: ffff9156801967d0 RSI: ffffffff8ff0ade0 RDI:
> ffff9156801967b0
> [   10.445489] RBP: ffffb45a0078f7e8 R08: 0000000000000030 R09:
> 0000000000000000
> [   10.452613] R10: 0000000000000000 R11: 0000000000000000 R12:
> 00000000000000ec
> [   10.459737] R13: ffffffff8ff0ade0 R14: 0000000000000001 R15:
> 0000000000000020
> [   10.466862] FS:  0000000000000000(0000) GS:ffff9165bfc00000(0000)
> knlGS:0000000000000000
> [   10.474936] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   10.480674] CR2: ffffffff8ff0ade0 CR3: 00000001011ae003 CR4:
> 00000000003706f0
> [   10.487800] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [   10.494922] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [   10.502046] Call Trace:
> [   10.504493]  <TASK>
> [   10.506589]  ? matrix_alloc_area.constprop.0+0x43/0x9a
> [   10.511729]  ? prepare_namespace+0x84/0x174
> [   10.515914]  irq_matrix_reserve_managed+0x56/0x10c
> [   10.520699]  x86_vector_alloc_irqs+0x1d2/0x31e
> [   10.525146]  irq_domain_alloc_irqs_hierarchy+0x39/0x3f
> [   10.530284]  irq_domain_alloc_irqs_parent+0x1a/0x2a
> [   10.535155]  intel_irq_remapping_alloc+0x59/0x5e9
> [   10.539859]  ? kmem_cache_debug_flags+0x11/0x26
> [   10.544383]  ? __radix_tree_lookup+0x39/0xb9
> [   10.548649]  irq_domain_alloc_irqs_hierarchy+0x39/0x3f
> [   10.553779]  irq_domain_alloc_irqs_parent+0x1a/0x2a
> [   10.558650]  msi_domain_alloc+0x8c/0x120
> [ rqs_hierarchy+0x39/0x3f
> [   10.567697]  irq_domain_alloc_irqs_locked+0x11d/0x286
> [   10.572741]  __irq_domain_alloc_irqs+0x72/0x93
> [   10.577179]  __msi_domain_alloc_irqs+0x193/0x3f1
> [   10.581789]  ? __xa_alloc+0xcf/0xe2
> [   10.585273]  msi_domain_alloc_irq_at+0xa8/0xfe
> [   10.589711]  pci_msix_alloc_irq_at+0x47/0x5c
> [   10.593987]  mlx5_irq_alloc+0x99/0x319 [mlx5_core]
> [   10.598881]  ? xa_load+0x5e/0x68
> [   10.602112]  irq_pool_request_vector+0x60/0x7d [mlx5_core]
> [   10.607668]  mlx5_irq_request+0x26/0x98 [mlx5_core]
> [   10.612617]  mlx5_irqs_request_vectors+0x52/0x82 [mlx5_core]
> [   10.618345]  mlx5_eq_table_create+0x613/0x8d3 [mlx5_core]
> [   10.623806]  ? kmalloc_trace+0x46/0x57
> [   10.627549]  mlx5_load+0xb1/0x33e [mlx5_core]
> [   10.631971]  mlx5_init_one+0x497/0x514 [mlx5_core]
> [   10.636824]  probe_one+0x2fa/0x3f6 [mlx5_core]
> [   10.641330]  local_pci_probe+0x47/0x8b
> [   10.645073]  work_for_cpu_fn+0x1a/0x25
> [   10.648817]  process_one_work+0x1e0/0x2e0
> [   10.652822]  process_scheduled_works+0x2c/0x37
> [   10.657258]  worker_thread+0x1e2/0x25e
> [   10.661003]  ? __pfx_worker_thread+0x10/0x10
> [   10.665267]  kthread+0x10d/0x115
> [   10.668501]  ? __pfx_kthread+0x10/0x10
> [   10.672244]  ret_from_fork+0x2c/0x50
> [   10.675824]  </TASK>
> [   10.678007] Modules linked in: mlx5_core(+) ast drm_kms_helper
> crct10dif_pclmul crc32_pclmul drm_shmem_helper crc32c_intel drm
> ghash_clmulni_intel sha512_ssse3 igb dca i2c_algo_bit mlxfw pci_hyperv_intf
> pkcs8_key_parser
> [   10.697447] CR2: ffffffff8ff0ade0
> [   10.700758] ---[ end trace 0000000000000000 ]---
> [   10.707706] pstore: backend (erst) writing error (-28)
> [   10.712838] RIP: 0010:__bitmap_or+0x10/0x26
> [   10.717014] Code: 85 c0 0f 95 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 90 89 c9 31 c0 48 83 c1 3f 48 c1 e9 06 39 c8 73 11 <4c>
> 8b 04 c6 4c 0b 04 c2 4c 89 04 c7 48 ff c0 eb eb c3 cc cc cc cc
> [   10.735752] RSP: 0000:ffffb45a0078f7b0 EFLAGS: 00010097
> [   10.740969] RAX: 0000000000000000 RBX: ffffffff8ff0adc0 RCX:
> 0000000000000004
> [   10.748093] RDX: ffff9156801967d0 RSI: ffffffff8ff0ade0 RDI:
> ffff9156801967b0
> [   10.755218] RBP: ffffb45a0078f7e8 R08: 0000000000000030 R09:
> 0000000000000000
> [   10.762341] R10: 0000000000000000 R11: 0000000000000000 R12:
> 00000000000000ec
> [   10.769467] R13: ffffffff8ff0ade0 R14: 0000000000000001 R15:
> 0000000000000020
> [   10.776590] FS:  0000000000000000(0000) GS:ffff9165bfc00000(0000)
> knlGS:0000000000000000
> [   10.784666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   10.790405] CR2: ffffffff8ff0ade0 CR3: 00000001011ae003 CR4:
> 00000000003706f0
> [   10.797529] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [   10.804651] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [   10.811775] note: kworker/0:6[117] exited with irqs disabled
> [   10.817444] note: kworker/0:6[117] exited with preempt_count 1
> 
> HTH
> 
> --
> Chuck Lever
> 



* Re: system hang on start-up (mlx5?)
  2023-05-03  1:03 system hang on start-up (mlx5?) Chuck Lever III
  2023-05-03  6:34 ` Eli Cohen
@ 2023-05-08 12:29 ` Linux regression tracking #adding (Thorsten Leemhuis)
  2023-06-02 11:05   ` Linux regression tracking #update (Thorsten Leemhuis)
  1 sibling, 1 reply; 36+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-05-08 12:29 UTC (permalink / raw)
  To: Chuck Lever III, elic
  Cc: saeedm, Leon Romanovsky, linux-rdma,
	open list:NETWORKING [GENERAL],
	Linux kernel regressions list

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 03.05.23 03:03, Chuck Lever III wrote:
> Hi-
> 
> I have a Supermicro X10SRA-F/X10SRA-F with a ConnectX®-5 EN network
> interface card, 100GbE single-port QSFP28, PCIe3.0 x16, tall bracket;
> MCX515A-CCAT
> 
> When booting a v6.3+ kernel, the boot process stops cold after a
> few seconds. The last message on the console is the MLX5 driver
> note about "PCIe slot advertised sufficient power (27W)".
> 
> bisect reports that bbac70c74183 ("net/mlx5: Use newer affinity
> descriptor") is the first bad commit.
> 
> I've trolled lore a couple of times and haven't found any discussion
> of this issue.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced bbac70c74183
#regzbot title system hang on start-up (irq or mlx5 problem?)
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


* Re: system hang on start-up (mlx5?)
  2023-05-04 19:02       ` Chuck Lever III
  2023-05-04 23:38         ` Jason Gunthorpe
  2023-05-07  5:31         ` Eli Cohen
@ 2023-05-16 19:23         ` Chuck Lever III
  2023-05-23 14:20           ` Linux regression tracking (Thorsten Leemhuis)
  2 siblings, 1 reply; 36+ messages in thread
From: Chuck Lever III @ 2023-05-16 19:23 UTC (permalink / raw)
  To: Leon Romanovsky, Eli Cohen
  Cc: Saeed Mahameed, linux-rdma, open list:NETWORKING [GENERAL]



> On May 4, 2023, at 3:02 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On May 4, 2023, at 3:29 AM, Leon Romanovsky <leon@kernel.org> wrote:
>> 
>> On Wed, May 03, 2023 at 02:02:33PM +0000, Chuck Lever III wrote:
>>> 
>>> 
>>>> On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>> 
>>>> Hi Chuck,
>>>> 
>>>> Just verifying, could you make sure your server and card firmware are up to date?
>>> 
>>> Device firmware updated to 16.35.2000; no change.
>>> 
>>> System firmware is dated September 2016. I'll see if I can get
>>> something more recent installed.
>> 
>> We are trying to reproduce this issue internally.
> 
> More information. I captured the serial console during boot.
> Here are the last messages:
> 
> [    9.837087] mlx5_core 0000:02:00.0: firmware version: 16.35.2000
> [    9.843126] mlx5_core 0000:02:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> [   10.311515] mlx5_core 0000:02:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
> [   10.321948] mlx5_core 0000:02:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048)
> [   10.344324] mlx5_core 0000:02:00.0: mlx5_pcie_event:301:(pid 88): PCIe slot advertised sufficient power (27W).
> [   10.354339] BUG: unable to handle page fault for address: ffffffff8ff0ade0
> [   10.361206] #PF: supervisor read access in kernel mode
> [   10.366335] #PF: error_code(0x0000) - not-present page
> [   10.371467] PGD 81ec39067 P4D 81ec39067 PUD 81ec3a063 PMD 114b07063 PTE 800ffff7e10f5062
> [   10.379544] Oops: 0000 [#1] PREEMPT SMP PTI
> [   10.383721] CPU: 0 PID: 117 Comm: kworker/0:6 Not tainted 6.3.0-13028-g7222f123c983 #1
> [   10.391625] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
> [   10.398750] Workqueue: events work_for_cpu_fn
> [   10.403108] RIP: 0010:__bitmap_or+0x10/0x26
> [   10.407286] Code: 85 c0 0f 95 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 c9 31 c0 48 83 c1 3f 48 c1 e9 06 39 c8 73 11 <4c> 8b 04 c6 4c 0b 04 c2 4c 89 04 c7 48 ff c0 eb eb c3 cc cc cc cc
> [   10.426024] RSP: 0000:ffffb45a0078f7b0 EFLAGS: 00010097
> [   10.431240] RAX: 0000000000000000 RBX: ffffffff8ff0adc0 RCX: 0000000000000004
> [   10.438365] RDX: ffff9156801967d0 RSI: ffffffff8ff0ade0 RDI: ffff9156801967b0
> [   10.445489] RBP: ffffb45a0078f7e8 R08: 0000000000000030 R09: 0000000000000000
> [   10.452613] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ec
> [   10.459737] R13: ffffffff8ff0ade0 R14: 0000000000000001 R15: 0000000000000020
> [   10.466862] FS:  0000000000000000(0000) GS:ffff9165bfc00000(0000) knlGS:0000000000000000
> [   10.474936] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   10.480674] CR2: ffffffff8ff0ade0 CR3: 00000001011ae003 CR4: 00000000003706f0
> [   10.487800] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   10.494922] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   10.502046] Call Trace:
> [   10.504493]  <TASK>
> [   10.506589]  ? matrix_alloc_area.constprop.0+0x43/0x9a
> [   10.511729]  ? prepare_namespace+0x84/0x174
> [   10.515914]  irq_matrix_reserve_managed+0x56/0x10c
> [   10.520699]  x86_vector_alloc_irqs+0x1d2/0x31e
> [   10.525146]  irq_domain_alloc_irqs_hierarchy+0x39/0x3f
> [   10.530284]  irq_domain_alloc_irqs_parent+0x1a/0x2a
> [   10.535155]  intel_irq_remapping_alloc+0x59/0x5e9
> [   10.539859]  ? kmem_cache_debug_flags+0x11/0x26
> [   10.544383]  ? __radix_tree_lookup+0x39/0xb9
> [   10.548649]  irq_domain_alloc_irqs_hierarchy+0x39/0x3f
> [   10.553779]  irq_domain_alloc_irqs_parent+0x1a/0x2a
> [   10.558650]  msi_domain_alloc+0x8c/0x120
> [ rqs_hierarchy+0x39/0x3f
> [   10.567697]  irq_domain_alloc_irqs_locked+0x11d/0x286
> [   10.572741]  __irq_domain_alloc_irqs+0x72/0x93
> [   10.577179]  __msi_domain_alloc_irqs+0x193/0x3f1
> [   10.581789]  ? __xa_alloc+0xcf/0xe2
> [   10.585273]  msi_domain_alloc_irq_at+0xa8/0xfe
> [   10.589711]  pci_msix_alloc_irq_at+0x47/0x5c
> [   10.593987]  mlx5_irq_alloc+0x99/0x319 [mlx5_core]
> [   10.598881]  ? xa_load+0x5e/0x68
> [   10.602112]  irq_pool_request_vector+0x60/0x7d [mlx5_core]
> [   10.607668]  mlx5_irq_request+0x26/0x98 [mlx5_core]
> [   10.612617]  mlx5_irqs_request_vectors+0x52/0x82 [mlx5_core]
> [   10.618345]  mlx5_eq_table_create+0x613/0x8d3 [mlx5_core]
> [   10.623806]  ? kmalloc_trace+0x46/0x57
> [   10.627549]  mlx5_load+0xb1/0x33e [mlx5_core]
> [   10.631971]  mlx5_init_one+0x497/0x514 [mlx5_core]
> [   10.636824]  probe_one+0x2fa/0x3f6 [mlx5_core]
> [   10.641330]  local_pci_probe+0x47/0x8b
> [   10.645073]  work_for_cpu_fn+0x1a/0x25
> [   10.648817]  process_one_work+0x1e0/0x2e0
> [   10.652822]  process_scheduled_works+0x2c/0x37
> [   10.657258]  worker_thread+0x1e2/0x25e
> [   10.661003]  ? __pfx_worker_thread+0x10/0x10
> [   10.665267]  kthread+0x10d/0x115
> [   10.668501]  ? __pfx_kthread+0x10/0x10
> [   10.672244]  ret_from_fork+0x2c/0x50
> [   10.675824]  </TASK>
> [   10.678007] Modules linked in: mlx5_core(+) ast drm_kms_helper crct10dif_pclmul crc32_pclmul drm_shmem_helper crc32c_intel drm ghash_clmulni_intel sha512_ssse3 igb dca i2c_algo_bit mlxfw pci_hyperv_intf pkcs8_key_parser
> [   10.697447] CR2: ffffffff8ff0ade0
> [   10.700758] ---[ end trace 0000000000000000 ]---
> [   10.707706] pstore: backend (erst) writing error (-28)
> [   10.712838] RIP: 0010:__bitmap_or+0x10/0x26
> [   10.717014] Code: 85 c0 0f 95 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 c9 31 c0 48 83 c1 3f 48 c1 e9 06 39 c8 73 11 <4c> 8b 04 c6 4c 0b 04 c2 4c 89 04 c7 48 ff c0 eb eb c3 cc cc cc cc
> [   10.735752] RSP: 0000:ffffb45a0078f7b0 EFLAGS: 00010097
> [   10.740969] RAX: 0000000000000000 RBX: ffffffff8ff0adc0 RCX: 0000000000000004
> [   10.748093] RDX: ffff9156801967d0 RSI: ffffffff8ff0ade0 RDI: ffff9156801967b0
> [   10.755218] RBP: ffffb45a0078f7e8 R08: 0000000000000030 R09: 0000000000000000
> [   10.762341] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ec
> [   10.769467] R13: ffffffff8ff0ade0 R14: 0000000000000001 R15: 0000000000000020
> [   10.776590] FS:  0000000000000000(0000) GS:ffff9165bfc00000(0000) knlGS:0000000000000000
> [   10.784666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   10.790405] CR2: ffffffff8ff0ade0 CR3: 00000001011ae003 CR4: 00000000003706f0
> [   10.797529] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   10.804651] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   10.811775] note: kworker/0:6[117] exited with irqs disabled
> [   10.817444] note: kworker/0:6[117] exited with preempt_count 1
> 
> HTH
> 
> --
> Chuck Lever

Following up.

Jason shamed me into replacing a working CX-3Pro in one of
my lab systems with a CX-5 VPI, and the same problem occurs.
Removing the CX-5 from the system alleviates the problem.

Supermicro SYS-6028R-T/X10DRi, v6.4-rc2


--
Chuck Lever




* Re: system hang on start-up (mlx5?)
  2023-05-16 19:23         ` Chuck Lever III
@ 2023-05-23 14:20           ` Linux regression tracking (Thorsten Leemhuis)
  2023-05-24 14:59             ` Chuck Lever III
  0 siblings, 1 reply; 36+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-05-23 14:20 UTC (permalink / raw)
  To: Chuck Lever III, Leon Romanovsky, Eli Cohen
  Cc: Saeed Mahameed, linux-rdma, open list:NETWORKING [GENERAL],
	Linux kernel regressions list

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

On 16.05.23 21:23, Chuck Lever III wrote:
>> On May 4, 2023, at 3:02 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>> On May 4, 2023, at 3:29 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>> On Wed, May 03, 2023 at 02:02:33PM +0000, Chuck Lever III wrote:
>>>>> On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>>> Just verifying, could you make sure your server and card firmware are up to date?
>>>> Device firmware updated to 16.35.2000; no change.
>>>> System firmware is dated September 2016. I'll see if I can get
>>>> something more recent installed.
>>> We are trying to reproduce this issue internally.
>> More information. I captured the serial console during boot.
>> Here are the last messages:
>[…]
> Following up.
> 
> Jason shamed me into replacing a working CX-3Pro in one of
> my lab systems with a CX-5 VPI, and the same problem occurs.
> Removing the CX-5 from the system alleviates the problem.
> 
> Supermicro SYS-6028R-T/X10DRi, v6.4-rc2

I wondered what happened to this, as this looks stalled. Or was progress
made toward fixing this regression and I just missed it?

I noticed the patch "net/mlx5: Fix irq affinity management" (
https://lore.kernel.org/all/20230523054242.21596-15-saeed@kernel.org/
) refers to the culprit of this regression. Is that supposed to fix this
issue and just lacks proper tags to indicate that?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke


* Re: system hang on start-up (mlx5?)
  2023-05-23 14:20           ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-05-24 14:59             ` Chuck Lever III
  0 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever III @ 2023-05-24 14:59 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Leon Romanovsky, Eli Cohen, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL]



> On May 23, 2023, at 10:20 AM, Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
> 
> [CCing the regression list, as it should be in the loop for regressions:
> https://docs.kernel.org/admin-guide/reporting-regressions.html]
> 
> On 16.05.23 21:23, Chuck Lever III wrote:
>>> On May 4, 2023, at 3:02 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>> On May 4, 2023, at 3:29 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>> On Wed, May 03, 2023 at 02:02:33PM +0000, Chuck Lever III wrote:
>>>>>> On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>>>> Just verifying, could you make sure your server and card firmware are up to date?
>>>>> Device firmware updated to 16.35.2000; no change.
>>>>> System firmware is dated September 2016. I'll see if I can get
>>>>> something more recent installed.
>>>> We are trying to reproduce this issue internally.
>>> More information. I captured the serial console during boot.
>>> Here are the last messages:
>> […]
>> Following up.
>> 
>> Jason shamed me into replacing a working CX-3Pro in one of
>> my lab systems with a CX-5 VPI, and the same problem occurs.
>> Removing the CX-5 from the system alleviates the problem.
>> 
>> Supermicro SYS-6028R-T/X10DRi, v6.4-rc2
> 
> I wondered what happened to this, as this looks stalled. Or was progress
> made toward fixing this regression and I just missed it?

I have not heard of an available fix for this issue.


> I noticed the patch "net/mlx5: Fix irq affinity management" (
> https://lore.kernel.org/all/20230523054242.21596-15-saeed@kernel.org/
> ) refers to the culprit of this regression. Is that supposed to fix this
> issue and just lacks proper tags to indicate that?

This patch was suggested to me when I initially reported the crash,
and I tried it at that time. It does not address the problem for me.


--
Chuck Lever




* Re: system hang on start-up (mlx5?)
  2023-05-07  5:31         ` Eli Cohen
@ 2023-05-27 20:16           ` Chuck Lever III
  2023-05-29 21:20             ` Thomas Gleixner
  0 siblings, 1 reply; 36+ messages in thread
From: Chuck Lever III @ 2023-05-27 20:16 UTC (permalink / raw)
  To: Eli Cohen, tglx
  Cc: Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL]



> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
> 
> Hi Thomas,
> 
> Do you have insights what could cause this?

Following up. I am not able to reproduce this problem with KASAN
enabled, so I sprinkled a few pr_info() call sites in
kernel/irq/matrix.c.

I can boot the system with mlx5_core deny-listed. I log in, remove
mlx5_core from the deny list, and then "modprobe mlx5_core" to
reproduce the issue while the system is running.

May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80

###

The fault address is the cm->managed_map for one of the CPUs.

###
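
For reference, __bitmap_or(), the faulting frame, just ORs the two source
bitmaps word by word into the destination (essentially as it appears in
lib/bitmap.c), so the very first load through an unmapped cm->managed_map
pointer faults right there:

#include <linux/bitmap.h>

void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
		 const unsigned long *bitmap2, unsigned int bits)
{
	unsigned int k;
	unsigned int nr = BITS_TO_LONGS(bits);

	/* bitmap1/bitmap2 are the source maps handed in by
	 * matrix_alloc_area(); if one of them is not mapped, the
	 * reads below are where the page fault is taken. */
	for (k = 0; k < nr; k++)
		dst[k] = bitmap1[k] | bitmap2[k];
}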

May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
May 27 15:47:47 manet.1015granger.net kernel: Oops: 0000 [#1] PREEMPT SMP PTI
May 27 15:47:47 manet.1015granger.net kernel: CPU: 6 PID: 364 Comm: kworker/6:3 Tainted: G S                 6.4.0-rc3-00014-g7d5f9d35c255 #2 fde923d833042649d4022091376d234db2fe0900
May 27 15:47:47 manet.1015granger.net kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
May 27 15:47:47 manet.1015granger.net kernel: Workqueue: events work_for_cpu_fn
May 27 15:47:47 manet.1015granger.net kernel: RIP: 0010:__bitmap_or+0x11/0x30
May 27 15:47:47 manet.1015granger.net kernel: Code: c6 48 85 f6 0f 95 c0 c3 31 f6 eb cf 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 89 c9 49 89 f8 48 83 c1 3f 48 c1 e9 06 74 17 31 c0 <48> 8b 3c c6 48 0b 3c c2 49 89 3c c0 48 83 c0 01 48 39 c1 75 eb c3
May 27 15:47:47 manet.1015granger.net kernel: RSP: 0018:ffffb6c88493f798 EFLAGS: 00010046
May 27 15:47:47 manet.1015granger.net kernel: RAX: 0000000000000000 RBX: ffff9a33801990b0 RCX: 0000000000000004
May 27 15:47:47 manet.1015granger.net kernel: RDX: ffff9a33801990d0 RSI: ffffffffb9ef3f80 RDI: ffff9a33801990b0
May 27 15:47:47 manet.1015granger.net kernel: RBP: ffffb6c88493f7d8 R08: ffff9a33801990b0 R09: ffffb6c88493f628
May 27 15:47:47 manet.1015granger.net kernel: R10: 0000000000000003 R11: ffffffffb9d30748 R12: ffffffffb9ef3f60
May 27 15:47:47 manet.1015granger.net kernel: R13: 00000000000000ec R14: 0000000000000020 R15: ffffffffb9ef3f80
May 27 15:47:47 manet.1015granger.net kernel: FS:  0000000000000000(0000) GS:ffff9a3aefc00000(0000) knlGS:0000000000000000
May 27 15:47:47 manet.1015granger.net kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 27 15:47:47 manet.1015granger.net kernel: CR2: ffffffffb9ef3f80 CR3: 000000054ec16005 CR4: 00000000001706e0
May 27 15:47:47 manet.1015granger.net kernel: Call Trace:
May 27 15:47:47 manet.1015granger.net kernel:  <TASK>
May 27 15:47:47 manet.1015granger.net kernel:  ? matrix_alloc_area.constprop.0+0x66/0xe0
May 27 15:47:47 manet.1015granger.net kernel:  ? unpack_to_rootfs+0x178/0x380
May 27 15:47:47 manet.1015granger.net kernel:  irq_matrix_reserve_managed+0x55/0x170
May 27 15:47:47 manet.1015granger.net kernel:  x86_vector_alloc_irqs.part.0+0x2bb/0x3a0
May 27 15:47:47 manet.1015granger.net kernel:  x86_vector_alloc_irqs+0x23/0x40
May 27 15:47:47 manet.1015granger.net kernel:  irq_domain_alloc_irqs_parent+0x24/0x50
May 27 15:47:47 manet.1015granger.net kernel:  intel_irq_remapping_alloc+0x59/0x650
May 27 15:47:47 manet.1015granger.net kernel:  irq_domain_alloc_irqs_parent+0x24/0x50
May 27 15:47:47 manet.1015granger.net kernel:  msi_domain_alloc+0x74/0x130
May 27 15:47:47 manet.1015granger.net kernel:  irq_domain_alloc_irqs_hierarchy+0x18/0x40
May 27 15:47:47 manet.1015granger.net kernel:  irq_domain_alloc_irqs_locked+0xce/0x370
May 27 15:47:47 manet.1015granger.net kernel:  __irq_domain_alloc_irqs+0x57/0xa0
May 27 15:47:47 manet.1015granger.net kernel:  __msi_domain_alloc_irqs+0x1ca/0x3f0
May 27 15:47:47 manet.1015granger.net kernel:  msi_domain_alloc_irq_at+0xef/0x140
May 27 15:47:47 manet.1015granger.net kernel:  ? vprintk+0x4b/0x60
May 27 15:47:47 manet.1015granger.net kernel:  pci_msix_alloc_irq_at+0x5c/0x70
May 27 15:47:47 manet.1015granger.net kernel:  mlx5_irq_alloc+0x22f/0x3d0 [mlx5_core 11229579f576884e9585dbff83cf9d4f8d975d71]

[ snip ]


>> -----Original Message-----
>> From: Chuck Lever III <chuck.lever@oracle.com>
>> Sent: Thursday, 4 May 2023 22:03
>> To: Leon Romanovsky <leon@kernel.org>; Eli Cohen <elic@nvidia.com>
>> Cc: Saeed Mahameed <saeedm@nvidia.com>; linux-rdma <linux-
>> rdma@vger.kernel.org>; open list:NETWORKING [GENERAL]
>> <netdev@vger.kernel.org>
>> Subject: Re: system hang on start-up (mlx5?)
>> 
>> 
>> 
>>> On May 4, 2023, at 3:29 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>> 
>>> On Wed, May 03, 2023 at 02:02:33PM +0000, Chuck Lever III wrote:
>>>> 
>>>> 
>>>>> On May 3, 2023, at 2:34 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>>> 
>>>>> Hi Chuck,
>>>>> 
>>>>> Just verifying, could you make sure your server and card firmware are up
>> to date?
>>>> 
>>>> Device firmware updated to 16.35.2000; no change.
>>>> 
>>>> System firmware is dated September 2016. I'll see if I can get
>>>> something more recent installed.
>>> 
>>> We are trying to reproduce this issue internally.
>> 
>> More information. I captured the serial console during boot.
>> Here are the last messages:
>> 
>> [    9.837087] mlx5_core 0000:02:00.0: firmware version: 16.35.2000
>> [    9.843126] mlx5_core 0000:02:00.0: 126.016 Gb/s available PCIe
>> bandwidth (8.0 GT/s PCIe x16 link)
>> [   10.311515] mlx5_core 0000:02:00.0: Rate limit: 127 rates are supported,
>> range: 0Mbps to 97656Mbps
>> [   10.321948] mlx5_core 0000:02:00.0: E-Switch: Total vports 2, per vport:
>> max uc(128) max mc(2048)
>> [   10.344324] mlx5_core 0000:02:00.0: mlx5_pcie_event:301:(pid 88): PCIe
>> slot advertised sufficient power (27W).
>> [   10.354339] BUG: unable to handle page fault for address: ffffffff8ff0ade0
>> [   10.361206] #PF: supervisor read access in kernel mode
>> [   10.366335] #PF: error_code(0x0000) - not-present page
>> [   10.371467] PGD 81ec39067 P4D 81ec39067 PUD 81ec3a063 PMD
>> 114b07063 PTE 800ffff7e10f5062
>> [   10.379544] Oops: 0000 [#1] PREEMPT SMP PTI
>> [   10.383721] CPU: 0 PID: 117 Comm: kworker/0:6 Not tainted 6.3.0-13028-
>> g7222f123c983 #1
>> [   10.391625] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b
>> 06/12/2017
>> [   10.398750] Workqueue: events work_for_cpu_fn
>> [   10.403108] RIP: 0010:__bitmap_or+0x10/0x26
>> [   10.407286] Code: 85 c0 0f 95 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90
>> 90 90 90 90 90 90 90 89 c9 31 c0 48 83 c1 3f 48 c1 e9 06 39 c8 73 11 <4c>
>> 8b 04 c6 4c 0b 04 c2 4c 89 04 c7 48 ff c0 eb eb c3 cc cc cc cc
>> [   10.426024] RSP: 0000:ffffb45a0078f7b0 EFLAGS: 00010097
>> [   10.431240] RAX: 0000000000000000 RBX: ffffffff8ff0adc0 RCX:
>> 0000000000000004
>> [   10.438365] RDX: ffff9156801967d0 RSI: ffffffff8ff0ade0 RDI:
>> ffff9156801967b0
>> [   10.445489] RBP: ffffb45a0078f7e8 R08: 0000000000000030 R09:
>> 0000000000000000
>> [   10.452613] R10: 0000000000000000 R11: 0000000000000000 R12:
>> 00000000000000ec
>> [   10.459737] R13: ffffffff8ff0ade0 R14: 0000000000000001 R15:
>> 0000000000000020
>> [   10.466862] FS:  0000000000000000(0000) GS:ffff9165bfc00000(0000)
>> knlGS:0000000000000000
>> [   10.474936] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   10.480674] CR2: ffffffff8ff0ade0 CR3: 00000001011ae003 CR4:
>> 00000000003706f0
>> [   10.487800] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [   10.494922] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> 0000000000000400
>> [   10.502046] Call Trace:
>> [   10.504493]  <TASK>
>> [   10.506589]  ? matrix_alloc_area.constprop.0+0x43/0x9a
>> [   10.511729]  ? prepare_namespace+0x84/0x174
>> [   10.515914]  irq_matrix_reserve_managed+0x56/0x10c
>> [   10.520699]  x86_vector_alloc_irqs+0x1d2/0x31e
>> [   10.525146]  irq_domain_alloc_irqs_hierarchy+0x39/0x3f
>> [   10.530284]  irq_domain_alloc_irqs_parent+0x1a/0x2a
>> [   10.535155]  intel_irq_remapping_alloc+0x59/0x5e9
>> [   10.539859]  ? kmem_cache_debug_flags+0x11/0x26
>> [   10.544383]  ? __radix_tree_lookup+0x39/0xb9
>> [   10.548649]  irq_domain_alloc_irqs_hierarchy+0x39/0x3f
>> [   10.553779]  irq_domain_alloc_irqs_parent+0x1a/0x2a
>> [   10.558650]  msi_domain_alloc+0x8c/0x120
>> [ rqs_hierarchy+0x39/0x3f
>> [   10.567697]  irq_domain_alloc_irqs_locked+0x11d/0x286
>> [   10.572741]  __irq_domain_alloc_irqs+0x72/0x93
>> [   10.577179]  __msi_domain_alloc_irqs+0x193/0x3f1
>> [   10.581789]  ? __xa_alloc+0xcf/0xe2
>> [   10.585273]  msi_domain_alloc_irq_at+0xa8/0xfe
>> [   10.589711]  pci_msix_alloc_irq_at+0x47/0x5c
>> [   10.593987]  mlx5_irq_alloc+0x99/0x319 [mlx5_core]
>> [   10.598881]  ? xa_load+0x5e/0x68
>> [   10.602112]  irq_pool_request_vector+0x60/0x7d [mlx5_core]
>> [   10.607668]  mlx5_irq_request+0x26/0x98 [mlx5_core]
>> [   10.612617]  mlx5_irqs_request_vectors+0x52/0x82 [mlx5_core]
>> [   10.618345]  mlx5_eq_table_create+0x613/0x8d3 [mlx5_core]
>> [   10.623806]  ? kmalloc_trace+0x46/0x57
>> [   10.627549]  mlx5_load+0xb1/0x33e [mlx5_core]
>> [   10.631971]  mlx5_init_one+0x497/0x514 [mlx5_core]
>> [   10.636824]  probe_one+0x2fa/0x3f6 [mlx5_core]
>> [   10.641330]  local_pci_probe+0x47/0x8b
>> [   10.645073]  work_for_cpu_fn+0x1a/0x25
>> [   10.648817]  process_one_work+0x1e0/0x2e0
>> [   10.652822]  process_scheduled_works+0x2c/0x37
>> [   10.657258]  worker_thread+0x1e2/0x25e
>> [   10.661003]  ? __pfx_worker_thread+0x10/0x10
>> [   10.665267]  kthread+0x10d/0x115
>> [   10.668501]  ? __pfx_kthread+0x10/0x10
>> [   10.672244]  ret_from_fork+0x2c/0x50
>> [   10.675824]  </TASK>
>> [   10.678007] Modules linked in: mlx5_core(+) ast drm_kms_helper
>> crct10dif_pclmul crc32_pclmul drm_shmem_helper crc32c_intel drm
>> ghash_clmulni_intel sha512_ssse3 igb dca i2c_algo_bit mlxfw pci_hyperv_intf
>> pkcs8_key_parser
>> [   10.697447] CR2: ffffffff8ff0ade0
>> [   10.700758] ---[ end trace 0000000000000000 ]---
>> [   10.707706] pstore: backend (erst) writing error (-28)
>> [   10.712838] RIP: 0010:__bitmap_or+0x10/0x26
>> [   10.717014] Code: 85 c0 0f 95 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90
>> 90 90 90 90 90 90 90 89 c9 31 c0 48 83 c1 3f 48 c1 e9 06 39 c8 73 11 <4c>
>> 8b 04 c6 4c 0b 04 c2 4c 89 04 c7 48 ff c0 eb eb c3 cc cc cc cc
>> [   10.735752] RSP: 0000:ffffb45a0078f7b0 EFLAGS: 00010097
>> [   10.740969] RAX: 0000000000000000 RBX: ffffffff8ff0adc0 RCX:
>> 0000000000000004
>> [   10.748093] RDX: ffff9156801967d0 RSI: ffffffff8ff0ade0 RDI:
>> ffff9156801967b0
>> [   10.755218] RBP: ffffb45a0078f7e8 R08: 0000000000000030 R09:
>> 0000000000000000
>> [   10.762341] R10: 0000000000000000 R11: 0000000000000000 R12:
>> 00000000000000ec
>> [   10.769467] R13: ffffffff8ff0ade0 R14: 0000000000000001 R15:
>> 0000000000000020
>> [   10.776590] FS:  0000000000000000(0000) GS:ffff9165bfc00000(0000)
>> knlGS:0000000000000000
>> [   10.784666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   10.790405] CR2: ffffffff8ff0ade0 CR3: 00000001011ae003 CR4:
>> 00000000003706f0
>> [   10.797529] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [   10.804651] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> 0000000000000400
>> [   10.811775] note: kworker/0:6[117] exited with irqs disabled
>> [   10.817444] note: kworker/0:6[117] exited with preempt_count 1
>> 
>> HTH
>> 
>> --
>> Chuck Lever
>> 
> 

--
Chuck Lever




* Re: system hang on start-up (mlx5?)
  2023-05-27 20:16           ` Chuck Lever III
@ 2023-05-29 21:20             ` Thomas Gleixner
  2023-05-30 13:09               ` Chuck Lever III
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2023-05-29 21:20 UTC (permalink / raw)
  To: Chuck Lever III, Eli Cohen
  Cc: Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL]

On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
> I can boot the system with mlx5_core deny-listed. I log in, remove
> mlx5_core from the deny list, and then "modprobe mlx5_core" to
> reproduce the issue while the system is running.
>
> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>
> ###
>
> The fault address is the cm->managed_map for one of the CPUs.

That does not make any sense at all. The irq matrix is initialized via:

irq_alloc_matrix()
  m = kzalloc(sizeof(*m));
  m->maps = alloc_percpu(*m->maps);

So how is any per CPU map which got allocated there supposed to be
invalid (not mapped):

> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062

But if you look at the address: 0xffffffffb9ef3f80

That one is bogus:

     managed_map=ffff9a36efcf0f80
     managed_map=ffff9a36efd30f80
     managed_map=ffff9a3aefc30f80
     managed_map=ffff9a3aefc70f80
     managed_map=ffff9a3aefd30f80
     managed_map=ffff9a3aefd70f80
     managed_map=ffffffffb9ef3f80

Can you spot the fail?

The first six are in the direct map and the last one is in module map,
which makes no sense at all.

Can you please apply the debug patch below and provide the output?

Thanks,

        tglx
---
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -51,6 +51,7 @@ struct irq_matrix {
 					   unsigned int alloc_end)
 {
 	struct irq_matrix *m;
+	unsigned int cpu;
 
 	if (matrix_bits > IRQ_MATRIX_BITS)
 		return NULL;
@@ -68,6 +69,8 @@ struct irq_matrix {
 		kfree(m);
 		return NULL;
 	}
+	for_each_possible_cpu(cpu)
+		pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
 	return m;
 }
 
@@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
 		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
 		unsigned int bit;
 
+		pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
+
 		bit = matrix_alloc_area(m, cm, 1, true);
 		if (bit >= m->alloc_end)
 			goto cleanup;

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-29 21:20             ` Thomas Gleixner
@ 2023-05-30 13:09               ` Chuck Lever III
  2023-05-30 13:28                 ` Chuck Lever III
  2023-05-30 19:46                 ` Thomas Gleixner
  0 siblings, 2 replies; 36+ messages in thread
From: Chuck Lever III @ 2023-05-30 13:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Eli Cohen, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL]



> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
>> I can boot the system with mlx5_core deny-listed. I log in, remove
>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>> reproduce the issue while the system is running.
>> 
>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>> 
>> ###
>> 
>> The fault address is the cm->managed_map for one of the CPUs.
> 
> That does not make any sense at all. The irq matrix is initialized via:
> 
> irq_alloc_matrix()
>  m = kzalloc(sizeof(matric);
>  m->maps = alloc_percpu(*m->maps);
> 
> So how is any per CPU map which got allocated there supposed to be
> invalid (not mapped):
> 
>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
> 
> But if you look at the address: 0xffffffffb9ef3f80
> 
> That one is bogus:
> 
>     managed_map=ffff9a36efcf0f80
>     managed_map=ffff9a36efd30f80
>     managed_map=ffff9a3aefc30f80
>     managed_map=ffff9a3aefc70f80
>     managed_map=ffff9a3aefd30f80
>     managed_map=ffff9a3aefd70f80
>     managed_map=ffffffffb9ef3f80
> 
> Can you spot the fail?
> 
> The first six are in the direct map and the last one is in module map,
> which makes no sense at all.

Indeed. The reason for that is that the affinity mask has bits
set for CPU IDs that are not present on my system.

After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
that mask is set up like this:

 struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
 {
        struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
-       cpumask_var_t req_mask;
+       struct irq_affinity_desc af_desc;
        struct mlx5_irq *irq;
-       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
-               return ERR_PTR(-ENOMEM);
-       cpumask_copy(req_mask, cpu_online_mask);
+       cpumask_copy(&af_desc.mask, cpu_online_mask);
+       af_desc.is_managed = false;

Which normally works as you would expect. But for some historical
reason, I have CONFIG_NR_CPUS=32 on my system, and the
cpumask_copy() misbehaves.

If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
copy, this crash goes away. But mlx5_core crashes during a later
part of its init, in cpu_rmap_update(). cpu_rmap_update() does
exactly the same thing (for_each_cpu() on an affinity mask created
by copying), and crashes in a very similar fashion.

If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
vanishes entirely, and "modprobe mlx5_core" works as expected.

Thus I think the problem is with cpumask_copy() or for_each_cpu()
when NR_CPUS is a small value (the default is 8192).
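
For the archive, the "clear @af_desc before the copy" experiment above
amounts to something like the following in mlx5_ctrl_irq_request()
(drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c in my tree) -- a
minimal sketch of the workaround I tested, not a proposed fix:

	struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
	{
		struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
		/* zero the on-stack descriptor before copying into it */
		struct irq_affinity_desc af_desc = {};
		struct mlx5_irq *irq;

		cpumask_copy(&af_desc.mask, cpu_online_mask);
		af_desc.is_managed = false;
		...
	}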


> Can you please apply the debug patch below and provide the output?
> 
> Thanks,
> 
>        tglx
> ---
> --- a/kernel/irq/matrix.c
> +++ b/kernel/irq/matrix.c
> @@ -51,6 +51,7 @@ struct irq_matrix {
>   unsigned int alloc_end)
> {
> struct irq_matrix *m;
> + unsigned int cpu;
> 
> if (matrix_bits > IRQ_MATRIX_BITS)
> return NULL;
> @@ -68,6 +69,8 @@ struct irq_matrix {
> kfree(m);
> return NULL;
> }
> + for_each_possible_cpu(cpu)
> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
> return m;
> }
> 
> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
> unsigned int bit;
> 
> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
> +
> bit = matrix_alloc_area(m, cm, 1, true);
> if (bit >= m->alloc_end)
> goto cleanup;

--
Chuck Lever



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-30 13:09               ` Chuck Lever III
@ 2023-05-30 13:28                 ` Chuck Lever III
  2023-05-30 13:48                   ` Eli Cohen
  2023-05-30 19:46                 ` Thomas Gleixner
  1 sibling, 1 reply; 36+ messages in thread
From: Chuck Lever III @ 2023-05-30 13:28 UTC (permalink / raw)
  To: Eli Cohen
  Cc: Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Thomas Gleixner



> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> 
>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>> reproduce the issue while the system is running.
>>> 
>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>> 
>>> ###
>>> 
>>> The fault address is the cm->managed_map for one of the CPUs.
>> 
>> That does not make any sense at all. The irq matrix is initialized via:
>> 
>> irq_alloc_matrix()
>> m = kzalloc(sizeof(matric);
>> m->maps = alloc_percpu(*m->maps);
>> 
>> So how is any per CPU map which got allocated there supposed to be
>> invalid (not mapped):
>> 
>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>> 
>> But if you look at the address: 0xffffffffb9ef3f80
>> 
>> That one is bogus:
>> 
>>    managed_map=ffff9a36efcf0f80
>>    managed_map=ffff9a36efd30f80
>>    managed_map=ffff9a3aefc30f80
>>    managed_map=ffff9a3aefc70f80
>>    managed_map=ffff9a3aefd30f80
>>    managed_map=ffff9a3aefd70f80
>>    managed_map=ffffffffb9ef3f80
>> 
>> Can you spot the fail?
>> 
>> The first six are in the direct map and the last one is in module map,
>> which makes no sense at all.
> 
> Indeed. The reason for that is that the affinity mask has bits
> set for CPU IDs that are not present on my system.
> 
> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
> that mask is set up like this:
> 
> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
> {
>        struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
> -       cpumask_var_t req_mask;
> +       struct irq_affinity_desc af_desc;
>        struct mlx5_irq *irq;
> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
> -               return ERR_PTR(-ENOMEM);
> -       cpumask_copy(req_mask, cpu_online_mask);
> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
> +       af_desc.is_managed = false;

By the way, why is "is_managed" set to false?

This particular system is a NUMA system, and I'd like to be
able to set IRQ affinity for the card. Since is_managed is
set to false, writing to the /proc/irq files fails with EIO.


> Which normally works as you would expect. But for some historical
> reason, I have CONFIG_NR_CPUS=32 on my system, and the
> cpumask_copy() misbehaves.
> 
> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
> copy, this crash goes away. But mlx5_core crashes during a later
> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
> exactly the same thing (for_each_cpu() on an affinity mask created
> by copying), and crashes in a very similar fashion.
> 
> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
> vanishes entirely, and "modprobe mlx5_core" works as expected.
> 
> Thus I think the problem is with cpumask_copy() or for_each_cpu()
> when NR_CPUS is a small value (the default is 8192).
> 
> 
>> Can you please apply the debug patch below and provide the output?
>> 
>> Thanks,
>> 
>>       tglx
>> ---
>> --- a/kernel/irq/matrix.c
>> +++ b/kernel/irq/matrix.c
>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>  unsigned int alloc_end)
>> {
>> struct irq_matrix *m;
>> + unsigned int cpu;
>> 
>> if (matrix_bits > IRQ_MATRIX_BITS)
>> return NULL;
>> @@ -68,6 +69,8 @@ struct irq_matrix {
>> kfree(m);
>> return NULL;
>> }
>> + for_each_possible_cpu(cpu)
>> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
>> return m;
>> }
>> 
>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>> unsigned int bit;
>> 
>> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
>> +
>> bit = matrix_alloc_area(m, cm, 1, true);
>> if (bit >= m->alloc_end)
>> goto cleanup;
> 
> --
> Chuck Lever


--
Chuck Lever



^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: system hang on start-up (mlx5?)
  2023-05-30 13:28                 ` Chuck Lever III
@ 2023-05-30 13:48                   ` Eli Cohen
  2023-05-30 13:51                     ` Chuck Lever III
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Cohen @ 2023-05-30 13:48 UTC (permalink / raw)
  To: Chuck Lever III, Shay Drory
  Cc: Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Thomas Gleixner

> From: Chuck Lever III <chuck.lever@oracle.com>
> Sent: Tuesday, 30 May 2023 16:28
> To: Eli Cohen <elic@nvidia.com>
> Cc: Leon Romanovsky <leon@kernel.org>; Saeed Mahameed
> <saeedm@nvidia.com>; linux-rdma <linux-rdma@vger.kernel.org>; open
> list:NETWORKING [GENERAL] <netdev@vger.kernel.org>; Thomas Gleixner
> <tglx@linutronix.de>
> Subject: Re: system hang on start-up (mlx5?)
> 
> 
> 
> > On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@oracle.com>
> wrote:
> >
> >> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de>
> wrote:
> >>
> >> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
> >>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
> >>> I can boot the system with mlx5_core deny-listed. I log in, remove
> >>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
> >>> reproduce the issue while the system is running.
> >>>
> >>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0:
> firmware version: 16.35.2000
> >>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0:
> 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> >>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
> pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
> >>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m-
> >system_map=ffff9a33801990d0 end=236
> >>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
> >>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0:
> Port module event: module 0, Cable plugged
> >>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
> pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
> >>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0:
> mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
> >>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m-
> >system_map=ffff9a33801990d0 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m-
> >system_map=ffff9a33801990d0 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m-
> >system_map=ffff9a33801990d0 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m-
> >system_map=ffff9a33801990d0 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m-
> >system_map=ffff9a33801990d0 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m-
> >system_map=ffff9a33801990d0 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m-
> >system_map=ffff9a33801990d0 end=236
> >>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
> page fault for address: ffffffffb9ef3f80
> >>>
> >>> ###
> >>>
> >>> The fault address is the cm->managed_map for one of the CPUs.
> >>
> >> That does not make any sense at all. The irq matrix is initialized via:
> >>
> >> irq_alloc_matrix()
> >> m = kzalloc(sizeof(matric);
> >> m->maps = alloc_percpu(*m->maps);
> >>
> >> So how is any per CPU map which got allocated there supposed to be
> >> invalid (not mapped):
> >>
> >>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
> page fault for address: ffffffffb9ef3f80
> >>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read
> access in kernel mode
> >>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000)
> - not-present page
> >>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D
> 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
> >>
> >> But if you look at the address: 0xffffffffb9ef3f80
> >>
> >> That one is bogus:
> >>
> >>    managed_map=ffff9a36efcf0f80
> >>    managed_map=ffff9a36efd30f80
> >>    managed_map=ffff9a3aefc30f80
> >>    managed_map=ffff9a3aefc70f80
> >>    managed_map=ffff9a3aefd30f80
> >>    managed_map=ffff9a3aefd70f80
> >>    managed_map=ffffffffb9ef3f80
> >>
> >> Can you spot the fail?
> >>
> >> The first six are in the direct map and the last one is in module map,
> >> which makes no sense at all.
> >
> > Indeed. The reason for that is that the affinity mask has bits
> > set for CPU IDs that are not present on my system.
> >
> > After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
> > that mask is set up like this:
> >
> > struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
> > {
> >        struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
> > -       cpumask_var_t req_mask;
> > +       struct irq_affinity_desc af_desc;
> >        struct mlx5_irq *irq;
> > -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
> > -               return ERR_PTR(-ENOMEM);
> > -       cpumask_copy(req_mask, cpu_online_mask);
> > +       cpumask_copy(&af_desc.mask, cpu_online_mask);
> > +       af_desc.is_managed = false;
> 
> By the way, why is "is_managed" set to false?
> 
> This particular system is a NUMA system, and I'd like to be
> able to set IRQ affinity for the card. Since is_managed is
> set to false, writing to the /proc/irq files fails with EIO.
>
This is a control irq and is used for issuing configuration commands.

This commit:
commit c410abbbacb9b378365ba17a30df08b4b9eec64f
Author: Dou Liyang <douliyangs@gmail.com>
Date:   Tue Dec 4 23:51:21 2018 +0800

    genirq/affinity: Add is_managed to struct irq_affinity_desc

explains why it should not be managed.
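
For reference, the descriptor that commit introduced pairs the mask with
the managed flag; from memory it looks roughly like this in
include/linux/interrupt.h (check your tree for the exact definition):

	struct irq_affinity_desc {
		struct cpumask	mask;
		unsigned int	is_managed : 1;
	};

A managed vector has its affinity owned by the kernel for the lifetime
of the interrupt, so userspace writes to its affinity files are rejected.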
 
> 
> > Which normally works as you would expect. But for some historical
> > reason, I have CONFIG_NR_CPUS=32 on my system, and the
> > cpumask_copy() misbehaves.
> >
> > If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
> > copy, this crash goes away. But mlx5_core crashes during a later
> > part of its init, in cpu_rmap_update(). cpu_rmap_update() does
> > exactly the same thing (for_each_cpu() on an affinity mask created
> > by copying), and crashes in a very similar fashion.
> >
> > If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
> > vanishes entirely, and "modprobe mlx5_core" works as expected.
> >
> > Thus I think the problem is with cpumask_copy() or for_each_cpu()
> > when NR_CPUS is a small value (the default is 8192).
> >
> >
> >> Can you please apply the debug patch below and provide the output?
> >>
> >> Thanks,
> >>
> >>       tglx
> >> ---
> >> --- a/kernel/irq/matrix.c
> >> +++ b/kernel/irq/matrix.c
> >> @@ -51,6 +51,7 @@ struct irq_matrix {
> >>  unsigned int alloc_end)
> >> {
> >> struct irq_matrix *m;
> >> + unsigned int cpu;
> >>
> >> if (matrix_bits > IRQ_MATRIX_BITS)
> >> return NULL;
> >> @@ -68,6 +69,8 @@ struct irq_matrix {
> >> kfree(m);
> >> return NULL;
> >> }
> >> + for_each_possible_cpu(cpu)
> >> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned
> long)per_cpu_ptr(m->maps, cpu));
> >> return m;
> >> }
> >>
> >> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
> >> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
> >> unsigned int bit;
> >>
> >> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned
> long)cm);
> >> +
> >> bit = matrix_alloc_area(m, cm, 1, true);
> >> if (bit >= m->alloc_end)
> >> goto cleanup;
> >
> > --
> > Chuck Lever
> 
> 
> --
> Chuck Lever
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-30 13:48                   ` Eli Cohen
@ 2023-05-30 13:51                     ` Chuck Lever III
  2023-05-30 13:54                       ` Eli Cohen
  0 siblings, 1 reply; 36+ messages in thread
From: Chuck Lever III @ 2023-05-30 13:51 UTC (permalink / raw)
  To: Eli Cohen
  Cc: Shay Drory, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Thomas Gleixner



> On May 30, 2023, at 9:48 AM, Eli Cohen <elic@nvidia.com> wrote:
> 
>> From: Chuck Lever III <chuck.lever@oracle.com>
>> Sent: Tuesday, 30 May 2023 16:28
>> To: Eli Cohen <elic@nvidia.com>
>> Cc: Leon Romanovsky <leon@kernel.org>; Saeed Mahameed
>> <saeedm@nvidia.com>; linux-rdma <linux-rdma@vger.kernel.org>; open
>> list:NETWORKING [GENERAL] <netdev@vger.kernel.org>; Thomas Gleixner
>> <tglx@linutronix.de>
>> Subject: Re: system hang on start-up (mlx5?)
>> 
>> 
>> 
>>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@oracle.com>
>> wrote:
>>> 
>>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de>
>> wrote:
>>>> 
>>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>>>> reproduce the issue while the system is running.
>>>>> 
>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0:
>> firmware version: 16.35.2000
>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0:
>> 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
>> pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m-
>>> system_map=ffff9a33801990d0 end=236
>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0:
>> Port module event: module 0, Cable plugged
>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
>> pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0:
>> mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m-
>>> system_map=ffff9a33801990d0 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m-
>>> system_map=ffff9a33801990d0 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m-
>>> system_map=ffff9a33801990d0 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m-
>>> system_map=ffff9a33801990d0 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m-
>>> system_map=ffff9a33801990d0 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m-
>>> system_map=ffff9a33801990d0 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m-
>>> system_map=ffff9a33801990d0 end=236
>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
>> page fault for address: ffffffffb9ef3f80
>>>>> 
>>>>> ###
>>>>> 
>>>>> The fault address is the cm->managed_map for one of the CPUs.
>>>> 
>>>> That does not make any sense at all. The irq matrix is initialized via:
>>>> 
>>>> irq_alloc_matrix()
>>>> m = kzalloc(sizeof(matric);
>>>> m->maps = alloc_percpu(*m->maps);
>>>> 
>>>> So how is any per CPU map which got allocated there supposed to be
>>>> invalid (not mapped):
>>>> 
>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
>> page fault for address: ffffffffb9ef3f80
>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read
>> access in kernel mode
>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000)
>> - not-present page
>>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D
>> 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>>>> 
>>>> But if you look at the address: 0xffffffffb9ef3f80
>>>> 
>>>> That one is bogus:
>>>> 
>>>>   managed_map=ffff9a36efcf0f80
>>>>   managed_map=ffff9a36efd30f80
>>>>   managed_map=ffff9a3aefc30f80
>>>>   managed_map=ffff9a3aefc70f80
>>>>   managed_map=ffff9a3aefd30f80
>>>>   managed_map=ffff9a3aefd70f80
>>>>   managed_map=ffffffffb9ef3f80
>>>> 
>>>> Can you spot the fail?
>>>> 
>>>> The first six are in the direct map and the last one is in module map,
>>>> which makes no sense at all.
>>> 
>>> Indeed. The reason for that is that the affinity mask has bits
>>> set for CPU IDs that are not present on my system.
>>> 
>>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>>> that mask is set up like this:
>>> 
>>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>>> {
>>>       struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>>> -       cpumask_var_t req_mask;
>>> +       struct irq_affinity_desc af_desc;
>>>       struct mlx5_irq *irq;
>>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>>> -               return ERR_PTR(-ENOMEM);
>>> -       cpumask_copy(req_mask, cpu_online_mask);
>>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
>>> +       af_desc.is_managed = false;
>> 
>> By the way, why is "is_managed" set to false?
>> 
>> This particular system is a NUMA system, and I'd like to be
>> able to set IRQ affinity for the card. Since is_managed is
>> set to false, writing to the /proc/irq files fails with EIO.
>> 
> This is a control irq and is used for issuing configuration commands.
> 
> This commit:
> commit c410abbbacb9b378365ba17a30df08b4b9eec64f
> Author: Dou Liyang <douliyangs@gmail.com>
> Date:   Tue Dec 4 23:51:21 2018 +0800
> 
>    genirq/affinity: Add is_managed to struct irq_affinity_desc
> 
> explains why it should not be managed.

Understood, but what about the other IRQs? I can't set any
of them. All writes to the proc files result in EIO.


>>> Which normally works as you would expect. But for some historical
>>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>>> cpumask_copy() misbehaves.
>>> 
>>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>>> copy, this crash goes away. But mlx5_core crashes during a later
>>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>>> exactly the same thing (for_each_cpu() on an affinity mask created
>>> by copying), and crashes in a very similar fashion.
>>> 
>>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>>> 
>>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>>> when NR_CPUS is a small value (the default is 8192).
>>> 
>>> 
>>>> Can you please apply the debug patch below and provide the output?
>>>> 
>>>> Thanks,
>>>> 
>>>>      tglx
>>>> ---
>>>> --- a/kernel/irq/matrix.c
>>>> +++ b/kernel/irq/matrix.c
>>>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>>> unsigned int alloc_end)
>>>> {
>>>> struct irq_matrix *m;
>>>> + unsigned int cpu;
>>>> 
>>>> if (matrix_bits > IRQ_MATRIX_BITS)
>>>> return NULL;
>>>> @@ -68,6 +69,8 @@ struct irq_matrix {
>>>> kfree(m);
>>>> return NULL;
>>>> }
>>>> + for_each_possible_cpu(cpu)
>>>> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned
>> long)per_cpu_ptr(m->maps, cpu));
>>>> return m;
>>>> }
>>>> 
>>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>>>> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>>>> unsigned int bit;
>>>> 
>>>> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned
>> long)cm);
>>>> +
>>>> bit = matrix_alloc_area(m, cm, 1, true);
>>>> if (bit >= m->alloc_end)
>>>> goto cleanup;
>>> 
>>> --
>>> Chuck Lever
>> 
>> 
>> --
>> Chuck Lever


--
Chuck Lever



^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: system hang on start-up (mlx5?)
  2023-05-30 13:51                     ` Chuck Lever III
@ 2023-05-30 13:54                       ` Eli Cohen
  2023-05-30 15:08                         ` Shay Drory
  0 siblings, 1 reply; 36+ messages in thread
From: Eli Cohen @ 2023-05-30 13:54 UTC (permalink / raw)
  To: Chuck Lever III, Shay Drory
  Cc: Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Thomas Gleixner

> -----Original Message-----
> From: Chuck Lever III <chuck.lever@oracle.com>
> Sent: Tuesday, 30 May 2023 16:51
> To: Eli Cohen <elic@nvidia.com>
> Cc: Shay Drory <shayd@nvidia.com>; Leon Romanovsky <leon@kernel.org>;
> Saeed Mahameed <saeedm@nvidia.com>; linux-rdma <linux-
> rdma@vger.kernel.org>; open list:NETWORKING [GENERAL]
> <netdev@vger.kernel.org>; Thomas Gleixner <tglx@linutronix.de>
> Subject: Re: system hang on start-up (mlx5?)
> 
> 
> 
> > On May 30, 2023, at 9:48 AM, Eli Cohen <elic@nvidia.com> wrote:
> >
> >> From: Chuck Lever III <chuck.lever@oracle.com>
> >> Sent: Tuesday, 30 May 2023 16:28
> >> To: Eli Cohen <elic@nvidia.com>
> >> Cc: Leon Romanovsky <leon@kernel.org>; Saeed Mahameed
> >> <saeedm@nvidia.com>; linux-rdma <linux-rdma@vger.kernel.org>; open
> >> list:NETWORKING [GENERAL] <netdev@vger.kernel.org>; Thomas Gleixner
> >> <tglx@linutronix.de>
> >> Subject: Re: system hang on start-up (mlx5?)
> >>
> >>
> >>
> >>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@oracle.com>
> >> wrote:
> >>>
> >>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de>
> >> wrote:
> >>>>
> >>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
> >>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
> >>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
> >>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
> >>>>> reproduce the issue while the system is running.
> >>>>>
> >>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core
> 0000:81:00.0:
> >> firmware version: 16.35.2000
> >>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core
> 0000:81:00.0:
> >> 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
> >> pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m-
> >>> system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core
> 0000:81:00.0:
> >> Port module event: module 0, Cable plugged
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
> >> pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core
> 0000:81:00.0:
> >> mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
> >>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m-
> >>> system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m-
> >>> system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60
> end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m-
> >>> system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60
> end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m-
> >>> system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60
> end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m-
> >>> system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60
> end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m-
> >>> system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60
> end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
> >>> scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m-
> >>> system_map=ffff9a33801990d0 end=236
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
> >> page fault for address: ffffffffb9ef3f80
> >>>>>
> >>>>> ###
> >>>>>
> >>>>> The fault address is the cm->managed_map for one of the CPUs.
> >>>>
> >>>> That does not make any sense at all. The irq matrix is initialized via:
> >>>>
> >>>> irq_alloc_matrix()
> >>>> m = kzalloc(sizeof(matric);
> >>>> m->maps = alloc_percpu(*m->maps);
> >>>>
> >>>> So how is any per CPU map which got allocated there supposed to be
> >>>> invalid (not mapped):
> >>>>
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
> >> page fault for address: ffffffffb9ef3f80
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read
> >> access in kernel mode
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF:
> error_code(0x0000)
> >> - not-present page
> >>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D
> >> 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
> >>>>
> >>>> But if you look at the address: 0xffffffffb9ef3f80
> >>>>
> >>>> That one is bogus:
> >>>>
> >>>>   managed_map=ffff9a36efcf0f80
> >>>>   managed_map=ffff9a36efd30f80
> >>>>   managed_map=ffff9a3aefc30f80
> >>>>   managed_map=ffff9a3aefc70f80
> >>>>   managed_map=ffff9a3aefd30f80
> >>>>   managed_map=ffff9a3aefd70f80
> >>>>   managed_map=ffffffffb9ef3f80
> >>>>
> >>>> Can you spot the fail?
> >>>>
> >>>> The first six are in the direct map and the last one is in module map,
> >>>> which makes no sense at all.
> >>>
> >>> Indeed. The reason for that is that the affinity mask has bits
> >>> set for CPU IDs that are not present on my system.
> >>>
> >>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
> >>> that mask is set up like this:
> >>>
> >>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
> >>> {
> >>>       struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
> >>> -       cpumask_var_t req_mask;
> >>> +       struct irq_affinity_desc af_desc;
> >>>       struct mlx5_irq *irq;
> >>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
> >>> -               return ERR_PTR(-ENOMEM);
> >>> -       cpumask_copy(req_mask, cpu_online_mask);
> >>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
> >>> +       af_desc.is_managed = false;
> >>
> >> By the way, why is "is_managed" set to false?
> >>
> >> This particular system is a NUMA system, and I'd like to be
> >> able to set IRQ affinity for the card. Since is_managed is
> >> set to false, writing to the /proc/irq files fails with EIO.
> >>
> > This is a control irq and is used for issuing configuration commands.
> >
> > This commit:
> > commit c410abbbacb9b378365ba17a30df08b4b9eec64f
> > Author: Dou Liyang <douliyangs@gmail.com>
> > Date:   Tue Dec 4 23:51:21 2018 +0800
> >
> >    genirq/affinity: Add is_managed to struct irq_affinity_desc
> >
> > explains why it should not be managed.
> 
> Understood, but what about the other IRQs? I can't set any
> of them. All writes to the proc files result in EIO.
> 
I think @Shay Drory has a fix for that which should go upstream.
Shay, was it sent?
> 
> >>> Which normally works as you would expect. But for some historical
> >>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
> >>> cpumask_copy() misbehaves.
> >>>
> >>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
> >>> copy, this crash goes away. But mlx5_core crashes during a later
> >>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
> >>> exactly the same thing (for_each_cpu() on an affinity mask created
> >>> by copying), and crashes in a very similar fashion.
> >>>
> >>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
> >>> vanishes entirely, and "modprobe mlx5_core" works as expected.
> >>>
> >>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
> >>> when NR_CPUS is a small value (the default is 8192).
> >>>
> >>>
> >>>> Can you please apply the debug patch below and provide the output?
> >>>>
> >>>> Thanks,
> >>>>
> >>>>      tglx
> >>>> ---
> >>>> --- a/kernel/irq/matrix.c
> >>>> +++ b/kernel/irq/matrix.c
> >>>> @@ -51,6 +51,7 @@ struct irq_matrix {
> >>>> unsigned int alloc_end)
> >>>> {
> >>>> struct irq_matrix *m;
> >>>> + unsigned int cpu;
> >>>>
> >>>> if (matrix_bits > IRQ_MATRIX_BITS)
> >>>> return NULL;
> >>>> @@ -68,6 +69,8 @@ struct irq_matrix {
> >>>> kfree(m);
> >>>> return NULL;
> >>>> }
> >>>> + for_each_possible_cpu(cpu)
> >>>> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned
> >> long)per_cpu_ptr(m->maps, cpu));
> >>>> return m;
> >>>> }
> >>>>
> >>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
> >>>> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
> >>>> unsigned int bit;
> >>>>
> >>>> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned
> >> long)cm);
> >>>> +
> >>>> bit = matrix_alloc_area(m, cm, 1, true);
> >>>> if (bit >= m->alloc_end)
> >>>> goto cleanup;
> >>>
> >>> --
> >>> Chuck Lever
> >>
> >>
> >> --
> >> Chuck Lever
> 
> 
> --
> Chuck Lever
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-30 13:54                       ` Eli Cohen
@ 2023-05-30 15:08                         ` Shay Drory
  2023-05-31 14:15                           ` Chuck Lever III
  0 siblings, 1 reply; 36+ messages in thread
From: Shay Drory @ 2023-05-30 15:08 UTC (permalink / raw)
  To: Eli Cohen, Chuck Lever III
  Cc: Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Thomas Gleixner


On 30/05/2023 16:54, Eli Cohen wrote:
>> -----Original Message-----
>> From: Chuck Lever III <chuck.lever@oracle.com>
>> Sent: Tuesday, 30 May 2023 16:51
>> To: Eli Cohen <elic@nvidia.com>
>> Cc: Shay Drory <shayd@nvidia.com>; Leon Romanovsky <leon@kernel.org>;
>> Saeed Mahameed <saeedm@nvidia.com>; linux-rdma <linux-
>> rdma@vger.kernel.org>; open list:NETWORKING [GENERAL]
>> <netdev@vger.kernel.org>; Thomas Gleixner <tglx@linutronix.de>
>> Subject: Re: system hang on start-up (mlx5?)
>>
>>
>>
>>> On May 30, 2023, at 9:48 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>
>>>> From: Chuck Lever III <chuck.lever@oracle.com>
>>>> Sent: Tuesday, 30 May 2023 16:28
>>>> To: Eli Cohen <elic@nvidia.com>
>>>> Cc: Leon Romanovsky <leon@kernel.org>; Saeed Mahameed
>>>> <saeedm@nvidia.com>; linux-rdma <linux-rdma@vger.kernel.org>; open
>>>> list:NETWORKING [GENERAL] <netdev@vger.kernel.org>; Thomas Gleixner
>>>> <tglx@linutronix.de>
>>>> Subject: Re: system hang on start-up (mlx5?)
>>>>
>>>>
>>>>
>>>>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@oracle.com>
>>>> wrote:
>>>>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de>
>>>> wrote:
>>>>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>>>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>>>>>> reproduce the issue while the system is running.
>>>>>>>
>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core
>> 0000:81:00.0:
>>>> firmware version: 16.35.2000
>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core
>> 0000:81:00.0:
>>>> 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
>>>> pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m-
>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core
>> 0000:81:00.0:
>>>> Port module event: module 0, Cable plugged
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
>>>> pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core
>> 0000:81:00.0:
>>>> mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m-
>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m-
>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60
>> end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m-
>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60
>> end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m-
>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60
>> end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m-
>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60
>> end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m-
>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60
>> end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m-
>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
>>>> page fault for address: ffffffffb9ef3f80
>>>>>>> ###
>>>>>>>
>>>>>>> The fault address is the cm->managed_map for one of the CPUs.
>>>>>> That does not make any sense at all. The irq matrix is initialized via:
>>>>>>
>>>>>> irq_alloc_matrix()
>>>>>> m = kzalloc(sizeof(matric);
>>>>>> m->maps = alloc_percpu(*m->maps);
>>>>>>
>>>>>> So how is any per CPU map which got allocated there supposed to be
>>>>>> invalid (not mapped):
>>>>>>
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
>>>> page fault for address: ffffffffb9ef3f80
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read
>>>> access in kernel mode
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF:
>> error_code(0x0000)
>>>> - not-present page
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D
>>>> 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>>>>>> But if you look at the address: 0xffffffffb9ef3f80
>>>>>>
>>>>>> That one is bogus:
>>>>>>
>>>>>>    managed_map=ffff9a36efcf0f80
>>>>>>    managed_map=ffff9a36efd30f80
>>>>>>    managed_map=ffff9a3aefc30f80
>>>>>>    managed_map=ffff9a3aefc70f80
>>>>>>    managed_map=ffff9a3aefd30f80
>>>>>>    managed_map=ffff9a3aefd70f80
>>>>>>    managed_map=ffffffffb9ef3f80
>>>>>>
>>>>>> Can you spot the fail?
>>>>>>
>>>>>> The first six are in the direct map and the last one is in module map,
>>>>>> which makes no sense at all.
>>>>> Indeed. The reason for that is that the affinity mask has bits
>>>>> set for CPU IDs that are not present on my system.
>>>>>
>>>>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>>>>> that mask is set up like this:
>>>>>
>>>>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>>>>> {
>>>>>        struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>>>>> -       cpumask_var_t req_mask;
>>>>> +       struct irq_affinity_desc af_desc;
>>>>>        struct mlx5_irq *irq;
>>>>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>>>>> -               return ERR_PTR(-ENOMEM);
>>>>> -       cpumask_copy(req_mask, cpu_online_mask);
>>>>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
>>>>> +       af_desc.is_managed = false;
>>>> By the way, why is "is_managed" set to false?
>>>>
>>>> This particular system is a NUMA system, and I'd like to be
>>>> able to set IRQ affinity for the card. Since is_managed is
>>>> set to false, writing to the /proc/irq files fails with EIO.
>>>>
>>> This is a control irq and is used for issuing configuration commands.
>>>
>>> This commit:
>>> commit c410abbbacb9b378365ba17a30df08b4b9eec64f
>>> Author: Dou Liyang <douliyangs@gmail.com>
>>> Date:   Tue Dec 4 23:51:21 2018 +0800
>>>
>>>     genirq/affinity: Add is_managed to struct irq_affinity_desc
>>>
>>> explains why it should not be managed.
>> Understood, but what about the other IRQs? I can't set any
>> of them. All writes to the proc files result in EIO.
>>
> I think @Shay Drory has a fix for that should go upstream.
> Shay was it sent?

The fix was sent and merged.

https://lore.kernel.org/all/20230523054242.21596-15-saeed@kernel.org/
>>>>> Which normally works as you would expect. But for some historical
>>>>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>>>>> cpumask_copy() misbehaves.
>>>>>
>>>>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>>>>> copy, this crash goes away. But mlx5_core crashes during a later
>>>>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>>>>> exactly the same thing (for_each_cpu() on an affinity mask created
>>>>> by copying), and crashes in a very similar fashion.
>>>>>
>>>>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>>>>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>>>>>
>>>>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>>>>> when NR_CPUS is a small value (the default is 8192).
>>>>>
>>>>>
>>>>>> Can you please apply the debug patch below and provide the output?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>       tglx
>>>>>> ---
>>>>>> --- a/kernel/irq/matrix.c
>>>>>> +++ b/kernel/irq/matrix.c
>>>>>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>>>>> unsigned int alloc_end)
>>>>>> {
>>>>>> struct irq_matrix *m;
>>>>>> + unsigned int cpu;
>>>>>>
>>>>>> if (matrix_bits > IRQ_MATRIX_BITS)
>>>>>> return NULL;
>>>>>> @@ -68,6 +69,8 @@ struct irq_matrix {
>>>>>> kfree(m);
>>>>>> return NULL;
>>>>>> }
>>>>>> + for_each_possible_cpu(cpu)
>>>>>> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned
>>>> long)per_cpu_ptr(m->maps, cpu));
>>>>>> return m;
>>>>>> }
>>>>>>
>>>>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>>>>>> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>>>>>> unsigned int bit;
>>>>>>
>>>>>> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned
>>>> long)cm);
>>>>>> +
>>>>>> bit = matrix_alloc_area(m, cm, 1, true);
>>>>>> if (bit >= m->alloc_end)
>>>>>> goto cleanup;
>>>>> --
>>>>> Chuck Lever
>>>>
>>>> --
>>>> Chuck Lever
>>
>> --
>> Chuck Lever
>>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-30 13:09               ` Chuck Lever III
  2023-05-30 13:28                 ` Chuck Lever III
@ 2023-05-30 19:46                 ` Thomas Gleixner
  2023-05-30 21:48                   ` Chuck Lever III
  1 sibling, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2023-05-30 19:46 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Eli Cohen, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL]

Chuck!

On Tue, May 30 2023 at 13:09, Chuck Lever III wrote:
>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> But if you look at the address: 0xffffffffb9ef3f80
>> 
>> That one is bogus:
>> 
>>     managed_map=ffff9a36efcf0f80
>>     managed_map=ffff9a36efd30f80
>>     managed_map=ffff9a3aefc30f80
>>     managed_map=ffff9a3aefc70f80
>>     managed_map=ffff9a3aefd30f80
>>     managed_map=ffff9a3aefd70f80
>>     managed_map=ffffffffb9ef3f80
>> 
>> Can you spot the fail?
>> 
>> The first six are in the direct map and the last one is in module map,
>> which makes no sense at all.
>
> Indeed. The reason for that is that the affinity mask has bits
> set for CPU IDs that are not present on my system.

Which I don't buy, but even if so then this should not make
for_each_cpu() go south. See below.

> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
> that mask is set up like this:
>
>  struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>  {
>         struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
> -       cpumask_var_t req_mask;
> +       struct irq_affinity_desc af_desc;

That's daft. With NR_CPUS=8192 this is a whopping 1KB on stack...
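(struct cpumask alone is NR_CPUS / 8 = 8192 / 8 = 1024 bytes with the
default config, plus the is_managed flag.)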

>         struct mlx5_irq *irq;
> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
> -               return ERR_PTR(-ENOMEM);
> -       cpumask_copy(req_mask, cpu_online_mask);
> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
> +       af_desc.is_managed = false;
>
> Which normally works as you would expect. But for some historical
> reason, I have CONFIG_NR_CPUS=32 on my system, and the
> cpumask_copy() misbehaves.
>
> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
> copy, this crash goes away. But mlx5_core crashes during a later
> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
> exactly the same thing (for_each_cpu() on an affinity mask created
> by copying), and crashes in a very similar fashion.
>
> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
> vanishes entirely, and "modprobe mlx5_core" works as expected.
>
> Thus I think the problem is with cpumask_copy() or for_each_cpu()
> when NR_CPUS is a small value (the default is 8192).

I don't buy any of this. Here is why:

cpumask_copy(d, s)
   bitmap_copy(d, s, nbits = 32)
     len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);

So it copies as many longs as required to cover nbits, i.e. it copies
any clobbered bits beyond nbits too. While that looks odd at first
glance, that's just an optimization which is harmless.

for_each_cpu() finds the next set bit in a mask and breaks the loop once
bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too.

I just booted a kernel with NR_CPUS=32:

[    0.152988] smpboot: 56 Processors exceeds NR_CPUS limit of 32
[    0.153606] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
...
[    0.173854] setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:1

and added a function which does:

    struct cpumask ma, mb;
    int cpu;

    memset(&ma, 0xaa, sizeof(ma));
    cpumask_copy(&mb, &ma);
    pr_info("MASKBITS: %016lx\n", mb.bits[0]);
    pr_info("CPUs:");
    for_each_cpu(cpu, &mb)
         pr_cont(" %d", cpu);
    pr_cont("\n");

[    2.165606] smp: MASKBITS: 0xaaaaaaaaaaaaaaaa
[    2.166574] smp: CPUs: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

and the same exercise with a 0x55 filled source bitmap.

[    2.167595] smp: MASKBITS: 0x5555555555555555
[    2.168568] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

So while cpumask_copy copied beyond NR_CPUS bits, for_each_cpu() does
the right thing simply because of this:

nr_cpu_ids is 32, right?

for_each_cpu(cpu, mask)
   for_each_set_bit(bit = cpu, addr = &mask, size = nr_cpu_ids)
	for ((bit) = 0; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)

So if find_next_bit() returns a bit after bit 31 the condition (bit) <
(size) will terminate the loop, right?

Also in the case of that driver the copy is _NOT_ copying set bits
beyond bit 31 simply because the source mask is cpu_online_mask which
definitely does not have a bit set which is greater than 31. As the copy
copies longs the resulting mask in af_desc.mask cannot have any bit set
past bit 31 either.

If the above is not correct, then there is a bigger problem than that
MLX5 driver crashing.

So this:

> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
> copy, this crash goes away.

does not make any sense to me.

Can you please add after the cpumask_copy() in that mlx5 code:

    pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
    pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);

Please print also in irq_matrix_reserve_managed():

  - @mask->bits[0]
  - nr_cpu_ids
  - the CPU numbers inside the for_each_cpu() loop
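
Something along these lines would do (untested sketch; the exact placement
of the new lines is from memory):

    pr_info("%s: MASKBITS:   %016lx\n", __func__, msk->bits[0]);
    pr_info("%s: nr_cpu_ids=%u\n", __func__, nr_cpu_ids);

    for_each_cpu(cpu, msk) {
        struct cpumap *cm = per_cpu_ptr(m->maps, cpu);

        pr_info("%s: cpu=%u cm=%016lx\n",
                __func__, cpu, (unsigned long)cm);
        ...
    }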

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-30 19:46                 ` Thomas Gleixner
@ 2023-05-30 21:48                   ` Chuck Lever III
  2023-05-30 22:17                     ` Thomas Gleixner
  2023-05-31 14:43                     ` Thomas Gleixner
  0 siblings, 2 replies; 36+ messages in thread
From: Chuck Lever III @ 2023-05-30 21:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Eli Cohen, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL]



> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Chuck!
> 
> On Tue, May 30 2023 at 13:09, Chuck Lever III wrote:
>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> But if you look at the address: 0xffffffffb9ef3f80
>>> 
>>> That one is bogus:
>>> 
>>>    managed_map=ffff9a36efcf0f80
>>>    managed_map=ffff9a36efd30f80
>>>    managed_map=ffff9a3aefc30f80
>>>    managed_map=ffff9a3aefc70f80
>>>    managed_map=ffff9a3aefd30f80
>>>    managed_map=ffff9a3aefd70f80
>>>    managed_map=ffffffffb9ef3f80
>>> 
>>> Can you spot the fail?
>>> 
>>> The first six are in the direct map and the last one is in module map,
>>> which makes no sense at all.
>> 
>> Indeed. The reason for that is that the affinity mask has bits
>> set for CPU IDs that are not present on my system.
> 
> Which I don't buy, but even if so then this should not make
> for_each_cpu() go south. See below.
> 
>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>> that mask is set up like this:
>> 
>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>> {
>>        struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>> -       cpumask_var_t req_mask;
>> +       struct irq_affinity_desc af_desc;
> 
> That's daft. With NR_CPUS=8192 this is a whopping 1KB on stack...
> 
>>        struct mlx5_irq *irq;
>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>> -               return ERR_PTR(-ENOMEM);
>> -       cpumask_copy(req_mask, cpu_online_mask);
>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
>> +       af_desc.is_managed = false;
>> 
>> Which normally works as you would expect. But for some historical
>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>> cpumask_copy() misbehaves.
>> 
>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>> copy, this crash goes away. But mlx5_core crashes during a later
>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>> exactly the same thing (for_each_cpu() on an affinity mask created
>> by copying), and crashes in a very similar fashion.
>> 
>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>> 
>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>> when NR_CPUS is a small value (the default is 8192).
> 
> I don't buy any of this. Here is why:
> 
> cpumask_copy(d, s)
>   bitmap_copy(d, s, nbits = 32)
>     len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
> 
> So it copies as many longs as required to cover nbits, i.e. it copies
> any clobbered bits beyond nbits too. While that looks odd at the first
> glance, that's just an optimization which is harmless.
> 
> for_each_cpu() finds the next set bit in a mask and breaks the loop once
> bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too.
> 
> I just booted a kernel with NR_CPUS=32:

My system has only 12 CPUs. So every bit in your mask represents
a present CPU, but on my system, only 0x00000fff are ever present.

Therefore, on my system, any bit higher than bit 11 in a CPU mask
will reference a CPU that is not present.


> [    0.152988] smpboot: 56 Processors exceeds NR_CPUS limit of 32
> [    0.153606] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
> ...
> [    0.173854] setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:1
> 
> and added a function which does:
> 
>    struct cpumask ma, mb;
>    int cpu;
> 
>    memset(&ma, 0xaa, sizeof(ma));
>    cpumask_copy(&mb, &ma);
>    pr_info("MASKBITS: %016lx\n", mb.bits[0]);
>    pr_info("CPUs:");
>    for_each_cpu(cpu, &mb)
>         pr_cont(" %d", cpu);
>    pr_cont("\n");
> 
> [    2.165606] smp: MASKBITS: 0xaaaaaaaaaaaaaaaa
> [    2.166574] smp: CPUs: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
> 
> and the same exercise with a 0x55 filled source bitmap.
> 
> [    2.167595] smp: MASKBITS: 0x5555555555555555
> [    2.168568] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
> 
> So while cpumask_copy copied beyond NR_CPUS bits, for_each_cpu() does
> the right thing simply because of this:
> 
> nr_cpu_ids is 32, right?
> 
> for_each_cpu(cpu, mask)
>   for_each_set_bit(bit = cpu, addr = &mask, size = nr_cpu_ids)
> for ((bit) = 0; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)
> 
> So if find_next_bit() returns a bit after bit 31 the condition (bit) <
> (size) will terminate the loop, right?

Again, you are assuming there are at least as many present CPUs as there
are bits in the mask.


> Also in the case of that driver the copy is _NOT_ copying set bits
> beyond bit 31 simply because the source mask is cpu_online_mask which
> definitely does not have a bit set which is greater than 31. As the copy
> copies longs the resulting mask in af_desc.mask cannot have any bit set
> past bit 31 either.
> 
> If the above is not correct, then there is a bigger problem than that
> MLX5 driver crashing.
> 
> So this:
> 
>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>> copy, this crash goes away.
> 
> does not make any sense to me.
> 
> Can you please add after the cpumask_copy() in that mlx5 code:
> 
>    pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
>    pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);

Both are 0000 0000 0000 0fff, as expected on a system
where 12 CPUs are present.


> Please print also in irq_matrix_reserve_managed():
> 
>  - @mask->bits[0]
>  - nr_cpu_ids
>  - the CPU numbers inside the for_each_cpu() loop

Here's where it gets interesting:

+       pr_info("%s: MASKBITS:   %016lx\n", __func__, msk->bits[0]);
+       pr_info("%s: nr_cpu_ids=%u\n", __func__, nr_cpu_ids);

[   70.957400][ T1185] mlx5_core 0000:81:00.0: firmware version: 16.35.2000
[   70.964146][ T1185] mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[   71.260555][    C9] port_module: 1 callbacks suppressed
[   71.260561][    C9] mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
[   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS:   ffffb1a74686bcd8
[   71.274096][   T10] mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
[   71.280844][ T1185] irq_matrix_reserve_managed: nr_cpu_ids=12
[   71.280846][ T1185] irq_matrix_reserve_managed: cm=ffff909aefcf0f48 cm->managed_map=ffff909aefcf0f80 cpu=3
[   71.280849][ T1185] irq_matrix_reserve_managed: cm=ffff909aefd30f48 cm->managed_map=ffff909aefd30f80 cpu=4
[   71.280851][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc30f48 cm->managed_map=ffff909eefc30f80 cpu=6
[   71.280852][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc70f48 cm->managed_map=ffff909eefc70f80 cpu=7
[   71.280854][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd30f48 cm->managed_map=ffff909eefd30f80 cpu=10
[   71.280856][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd70f48 cm->managed_map=ffff909eefd70f80 cpu=11
[   71.280858][ T1185] irq_matrix_reserve_managed: cm=ffffffff98ef3f48 cm->managed_map=ffffffff98ef3f80 cpu=12

Notice that there are in fact higher bits set than bit 11.

The lowest 16 bits of MASKBITS are bcd8, or in binary:

... 1011 1100 1101 1000

Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
12, matching the CPU IDs from the loop. At bit 12, we fault,
since there is no CPU 12 on the system.


--
Chuck Lever



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-30 21:48                   ` Chuck Lever III
@ 2023-05-30 22:17                     ` Thomas Gleixner
  2023-05-31 14:43                     ` Thomas Gleixner
  1 sibling, 0 replies; 36+ messages in thread
From: Thomas Gleixner @ 2023-05-30 22:17 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Eli Cohen, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL]

On Tue, May 30 2023 at 21:48, Chuck Lever III wrote:
>> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> cpumask_copy(d, s)
>>   bitmap_copy(d, s, nbits = 32)
>>     len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
>> 
>> So it copies as many longs as required to cover nbits, i.e. it copies
>> any clobbered bits beyond nbits too. While that looks odd at the first
>> glance, that's just an optimization which is harmless.
>> 
>> for_each_cpu() finds the next set bit in a mask and breaks the loop once
>> bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too.
>> 
>> I just booted a kernel with NR_CPUS=32:
>
> My system has only 12 CPUs. So every bit in your mask represents
> a present CPU, but on my system, only 0x00000fff are ever present.
>
> Therefore, on my system, any bit higher than bit 11 in a CPU mask
> will reference a CPU that is not present.

Correct....

Sorry, I missed the part that your machine has only 12 CPUs....

Now I can reproduce the wreckage even with that trivial test I did:

[    0.210089] setup_percpu: NR_CPUS:32 nr_cpumask_bits:12 nr_cpu_ids:12 nr_node_ids:1
...
[    0.606591] smp: MASKBITS: 5555555555555555
[    0.607026] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

I'm way too tired to make sense of that right now. Will have a look at
it tomorrow with brain awake unless you beat me to it.

That's one mystery but the other one is this:

[   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS:   ffffb1a74686bcd8

That's clearly a kernel address within the direct map. How does that end
up as content of a cpumask?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-30 15:08                         ` Shay Drory
@ 2023-05-31 14:15                           ` Chuck Lever III
  0 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever III @ 2023-05-31 14:15 UTC (permalink / raw)
  To: Shay Drory, Eli Cohen
  Cc: Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Thomas Gleixner



> On May 30, 2023, at 11:08 AM, Shay Drory <shayd@nvidia.com> wrote:
> 
> 
> On 30/05/2023 16:54, Eli Cohen wrote:
>>> -----Original Message-----
>>> From: Chuck Lever III <chuck.lever@oracle.com>
>>> Sent: Tuesday, 30 May 2023 16:51
>>> To: Eli Cohen <elic@nvidia.com>
>>> Cc: Shay Drory <shayd@nvidia.com>; Leon Romanovsky <leon@kernel.org>;
>>> Saeed Mahameed <saeedm@nvidia.com>; linux-rdma <linux-
>>> rdma@vger.kernel.org>; open list:NETWORKING [GENERAL]
>>> <netdev@vger.kernel.org>; Thomas Gleixner <tglx@linutronix.de>
>>> Subject: Re: system hang on start-up (mlx5?)
>>> 
>>> 
>>> 
>>>> On May 30, 2023, at 9:48 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>> 
>>>>> From: Chuck Lever III <chuck.lever@oracle.com>
>>>>> Sent: Tuesday, 30 May 2023 16:28
>>>>> To: Eli Cohen <elic@nvidia.com>
>>>>> Cc: Leon Romanovsky <leon@kernel.org>; Saeed Mahameed
>>>>> <saeedm@nvidia.com>; linux-rdma <linux-rdma@vger.kernel.org>; open
>>>>> list:NETWORKING [GENERAL] <netdev@vger.kernel.org>; Thomas Gleixner
>>>>> <tglx@linutronix.de>
>>>>> Subject: Re: system hang on start-up (mlx5?)
>>>>> 
>>>>> 
>>>>> 
>>>>>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@oracle.com>
>>>>> wrote:
>>>>>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@linutronix.de>
>>>>> wrote:
>>>>>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>>>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@nvidia.com> wrote:
>>>>>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>>>>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>>>>>>> reproduce the issue while the system is running.
>>>>>>>> 
>>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core
>>> 0000:81:00.0:
>>>>> firmware version: 16.35.2000
>>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core
>>> 0000:81:00.0:
>>>>> 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
>>>>> pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m-
>>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core
>>> 0000:81:00.0:
>>>>> Port module event: module 0, Cable plugged
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc:
>>>>> pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core
>>> 0000:81:00.0:
>>>>> mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m-
>>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m-
>>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60
>>> end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m-
>>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60
>>> end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m-
>>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60
>>> end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m-
>>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60
>>> end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m-
>>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60
>>> end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m-
>>>>>> scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m-
>>>>>> system_map=ffff9a33801990d0 end=236
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
>>>>> page fault for address: ffffffffb9ef3f80
>>>>>>>> ###
>>>>>>>> 
>>>>>>>> The fault address is the cm->managed_map for one of the CPUs.
>>>>>>> That does not make any sense at all. The irq matrix is initialized via:
>>>>>>> 
>>>>>>> irq_alloc_matrix()
>>>>>>> m = kzalloc(sizeof(matric);
>>>>>>> m->maps = alloc_percpu(*m->maps);
>>>>>>> 
>>>>>>> So how is any per CPU map which got allocated there supposed to be
>>>>>>> invalid (not mapped):
>>>>>>> 
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle
>>>>> page fault for address: ffffffffb9ef3f80
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read
>>>>> access in kernel mode
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF:
>>> error_code(0x0000)
>>>>> - not-present page
>>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D
>>>>> 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>>>>>>> But if you look at the address: 0xffffffffb9ef3f80
>>>>>>> 
>>>>>>> That one is bogus:
>>>>>>> 
>>>>>>>   managed_map=ffff9a36efcf0f80
>>>>>>>   managed_map=ffff9a36efd30f80
>>>>>>>   managed_map=ffff9a3aefc30f80
>>>>>>>   managed_map=ffff9a3aefc70f80
>>>>>>>   managed_map=ffff9a3aefd30f80
>>>>>>>   managed_map=ffff9a3aefd70f80
>>>>>>>   managed_map=ffffffffb9ef3f80
>>>>>>> 
>>>>>>> Can you spot the fail?
>>>>>>> 
>>>>>>> The first six are in the direct map and the last one is in module map,
>>>>>>> which makes no sense at all.
>>>>>> Indeed. The reason for that is that the affinity mask has bits
>>>>>> set for CPU IDs that are not present on my system.
>>>>>> 
>>>>>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>>>>>> that mask is set up like this:
>>>>>> 
>>>>>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>>>>>> {
>>>>>>       struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>>>>>> -       cpumask_var_t req_mask;
>>>>>> +       struct irq_affinity_desc af_desc;
>>>>>>       struct mlx5_irq *irq;
>>>>>> -       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>>>>>> -               return ERR_PTR(-ENOMEM);
>>>>>> -       cpumask_copy(req_mask, cpu_online_mask);
>>>>>> +       cpumask_copy(&af_desc.mask, cpu_online_mask);
>>>>>> +       af_desc.is_managed = false;
>>>>> By the way, why is "is_managed" set to false?
>>>>> 
>>>>> This particular system is a NUMA system, and I'd like to be
>>>>> able to set IRQ affinity for the card. Since is_managed is
>>>>> set to false, writing to the /proc/irq files fails with EIO.
>>>>> 
>>>> This is a control irq and is used for issuing configuration commands.
>>>> 
>>>> This commit:
>>>> commit c410abbbacb9b378365ba17a30df08b4b9eec64f
>>>> Author: Dou Liyang <douliyangs@gmail.com>
>>>> Date:   Tue Dec 4 23:51:21 2018 +0800
>>>> 
>>>>    genirq/affinity: Add is_managed to struct irq_affinity_desc
>>>> 
>>>> explains why it should not be managed.
>>> Understood, but what about the other IRQs? I can't set any
>>> of them. All writes to the proc files result in EIO.
>>> 
>> I think @Shay Drory has a fix for that which should go upstream.
>> Shay was it sent?
> 
> The fix was sent and merged.
> 
> https://lore.kernel.org/all/20230523054242.21596-15-saeed@kernel.org/

Fwiw, I'm now on v6.4-rc4, and setting IRQ affinity works as expected.
Sorry for the noise and thanks for the fix.


>>>>>> Which normally works as you would expect. But for some historical
>>>>>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>>>>>> cpumask_copy() misbehaves.
>>>>>> 
>>>>>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>>>>>> copy, this crash goes away. But mlx5_core crashes during a later
>>>>>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>>>>>> exactly the same thing (for_each_cpu() on an affinity mask created
>>>>>> by copying), and crashes in a very similar fashion.
>>>>>> 
>>>>>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>>>>>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>>>>>> 
>>>>>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>>>>>> when NR_CPUS is a small value (the default is 8192).
>>>>>> 
>>>>>> 
>>>>>>> Can you please apply the debug patch below and provide the output?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>>      tglx
>>>>>>> ---
>>>>>>> --- a/kernel/irq/matrix.c
>>>>>>> +++ b/kernel/irq/matrix.c
>>>>>>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>>>>>> unsigned int alloc_end)
>>>>>>> {
>>>>>>> struct irq_matrix *m;
>>>>>>> + unsigned int cpu;
>>>>>>> 
>>>>>>> if (matrix_bits > IRQ_MATRIX_BITS)
>>>>>>> return NULL;
>>>>>>> @@ -68,6 +69,8 @@ struct irq_matrix {
>>>>>>> kfree(m);
>>>>>>> return NULL;
>>>>>>> }
>>>>>>> + for_each_possible_cpu(cpu)
>>>>>>> + pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned
>>>>> long)per_cpu_ptr(m->maps, cpu));
>>>>>>> return m;
>>>>>>> }
>>>>>>> 
>>>>>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>>>>>>> struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>>>>>>> unsigned int bit;
>>>>>>> 
>>>>>>> + pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned
>>>>> long)cm);
>>>>>>> +
>>>>>>> bit = matrix_alloc_area(m, cm, 1, true);
>>>>>>> if (bit >= m->alloc_end)
>>>>>>> goto cleanup;
>>>>>> --
>>>>>> Chuck Lever
>>>>> 
>>>>> --
>>>>> Chuck Lever
>>> 
>>> --
>>> Chuck Lever


--
Chuck Lever



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-30 21:48                   ` Chuck Lever III
  2023-05-30 22:17                     ` Thomas Gleixner
@ 2023-05-31 14:43                     ` Thomas Gleixner
  2023-05-31 15:06                       ` Chuck Lever III
  1 sibling, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2023-05-31 14:43 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Eli Cohen, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Peter Zijlstra

On Tue, May 30 2023 at 21:48, Chuck Lever III wrote:
>> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Can you please add after the cpumask_copy() in that mlx5 code:
>> 
>>    pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
>>    pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);
>
> Both are 0000 0000 0000 0fff, as expected on a system
> where 12 CPUs are present.

So the non-initialized mask on stack has the online bits correctly
copied and bits 12-63 are cleared, which is what cpumask_copy()
achieves by copying longs and not bits.

> [   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS: ffffb1a74686bcd8

How can that end up with a completely different content here?

As I said before that's clearly a direct map address.

So the call chain is:

mlx5_irq_alloc(af_desc)
  pci_msix_alloc_irq_at(af_desc)
    msi_domain_alloc_irq_at(af_desc)
      __msi_domain_alloc_irqs(af_desc)
1)      msidesc->affinity = kmemdup(af_desc);
        __irq_domain_alloc_irqs()
          __irq_domain_alloc_irqs(aff=msidesc->affinity)
            irq_domain_alloc_irqs_locked(aff)
              irq_domain_alloc_irqs_locked(aff)
                irq_domain_alloc_descs(aff)
                  alloc_desc(mask=&aff->mask)
                    desc_smp_init(mask)
2)                    cpumask_copy(desc->irq_common_data.affinity, mask);
                irq_domain_alloc_irqs_hierarchy()
                  msi_domain_alloc()
                    intel_irq_remapping_alloc()
                      x86_vector_alloc_irqs()
                        reserve_managed_vector()
                          mask = desc->irq_common_data.affinity;
                          irq_matrix_reserve_managed(mask)

So af_desc is kmemdup'ed at #1 and then the result is copied in #2.

Anything else just hands pointers around. So where gets either af_desc
or msidesc->affinity or desc->irq_common_data.affinity overwritten? Or
one of the pointers mangled. I doubt that it's the latter as this is 99%
generic code which would end up in random fails all over the place.

This also ends up in the wrong place. That mlx code does:

   af_desc.is_managed = false;

but the allocation ends up allocating a managed vector.

This screams memory corruption ....

Can you please instrument this along the call chain so we can see where
or at least when this gets corrupted? Please print the relevant pointer
addresses too so we can see whether that's consistent or not.
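
Something like this (a sketch only, using the names from the chain above):

    pr_info("%s: aff=%px bits=%016lx\n", __func__, aff, aff->mask.bits[0]);

after the kmemdup() at #1, after the cpumask_copy() at #2 (there with
desc->irq_common_data.affinity instead of aff->mask) and once more right
before the irq_matrix_reserve_managed() call, so we can compare both the
pointers and the mask content at each step.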

> The lowest 16 bits of MASKBITS are bcd8, or in binary:
>
> ... 1011 1100 1101 1000
>
> Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
> 12, matching the CPU IDs from the loop. At bit 12, we fault,
> since there is no CPU 12 on the system.

That's due to a dubious optimization from Linus:

#if NR_CPUS <= BITS_PER_LONG
  #define small_cpumask_bits ((unsigned int)NR_CPUS)
  #define large_cpumask_bits ((unsigned int)NR_CPUS)
#elif NR_CPUS <= 4*BITS_PER_LONG
  #define small_cpumask_bits nr_cpu_ids

small_cpumask_bits is not nr_cpu_ids(12), it's NR_CPUS(32) which is why
the loop does not terminate. Bah!

But that's just the symptom, not the root cause. That code is perfectly
fine when all callers use the proper cpumask functions.
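
To spell out what "proper" means (a sketch, not pointing at any particular
caller yet): an on-stack mask has to be cleared before individual bits are
set in it,

    struct irq_affinity_desc af_desc;

    cpumask_clear(&af_desc.mask);       /* without this, stale stack bits survive */
    cpumask_set_cpu(cpu, &af_desc.mask);

while cpumask_copy() over uninitialized storage is fine because it
overwrites the full longs, as explained above.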

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-31 14:43                     ` Thomas Gleixner
@ 2023-05-31 15:06                       ` Chuck Lever III
  2023-05-31 17:11                         ` Thomas Gleixner
  0 siblings, 1 reply; 36+ messages in thread
From: Chuck Lever III @ 2023-05-31 15:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Eli Cohen, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Peter Zijlstra



> On May 31, 2023, at 10:43 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, May 30 2023 at 21:48, Chuck Lever III wrote:
>>> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Can you please add after the cpumask_copy() in that mlx5 code:
>>> 
>>>   pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
>>>   pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);
>> 
>> Both are 0000 0000 0000 0fff, as expected on a system
>> where 12 CPUs are present.
> 
> So the non-initialized mask on stack has the online bits correctly
> copied and bits 12-63 are cleared, which is what cpumask_copy()
> achieves by copying longs and not bits.
> 
>> [   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS: ffffb1a74686bcd8
> 
> How can that end up with a completely different content here?
> 
> As I said before that's clearly a direct map address.
> 
> So the call chain is:
> 
> mlx5_irq_alloc(af_desc)
>  pci_msix_alloc_irq_at(af_desc)
>    msi_domain_alloc_irq_at(af_desc)
>      __msi_domain_alloc_irqs(af_desc)
> 1)      msidesc->affinity = kmemdup(af_desc);
>        __irq_domain_alloc_irqs()
>          __irq_domain_alloc_irqs(aff=msidesc->affinity)
>            irq_domain_alloc_irqs_locked(aff)
>              irq_domain_alloc_irqs_locked(aff)
>                irq_domain_alloc_descs(aff)
>                  alloc_desc(mask=&aff->mask)
>                    desc_smp_init(mask)
> 2)                    cpumask_copy(desc->irq_common_data.affinity, mask);
>                irq_domain_alloc_irqs_hierarchy()
>                  msi_domain_alloc()
>                    intel_irq_remapping_alloc()
>                      x86_vector_alloc_irqs()

It is x86_vector_alloc_irqs() where the struct irq_data is
fabricated that ends up in irq_matrix_reserve_managed().


>                        reserve_managed_vector()
>                          mask = desc->irq_common_data.affinity;
>                          irq_matrix_reserve_managed(mask)
> 
> So af_desc is kmemdup'ed at #1 and then the result is copied in #2.
> 
> Anything else just hands pointers around. So where gets either af_desc
> or msidesc->affinity or desc->irq_common_data.affinity overwritten? Or
> one of the pointers mangled. I doubt that it's the latter as this is 99%
> generic code which would end up in random fails all over the place.
> 
> This also ends up in the wrong place. That mlx code does:
> 
>   af_desc.is_managed = false;
> 
> but the allocation ends up allocating a managed vector.

That line was changed in 6.4-rc4 to address another bug,
and it avoids the crash by not calling into the misbehaving
code. It doesn't address the mlx5_core initialization issue
though, because as I said before, execution continues and
crashes in a similar scenario later on.

On my system, I've reverted that fix:

-       af_desc.is_managed = false;
+       af_desc.is_managed = 1;

so that we can continue debugging this crash.


> Can you please instrument this along the call chain so we can see where
> or at least when this gets corrupted? Please print the relevant pointer
> addresses too so we can see whether that's consistent or not.

I will continue working on this today.


>> The lowest 16 bits of MASKBITS are bcd8, or in binary:
>> 
>> ... 1011 1100 1101 1000
>> 
>> Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
>> 12, matching the CPU IDs from the loop. At bit 12, we fault,
>> since there is no CPU 12 on the system.
> 
> That's due to a dubious optimization from Linus:
> 
> #if NR_CPUS <= BITS_PER_LONG
>  #define small_cpumask_bits ((unsigned int)NR_CPUS)
>  #define large_cpumask_bits ((unsigned int)NR_CPUS)
> #elif NR_CPUS <= 4*BITS_PER_LONG
>  #define small_cpumask_bits nr_cpu_ids
> 
> small_cpumask_bits is not nr_cpu_ids(12), it's NR_CPUS(32) which is why
> the loop does not terminate. Bah!
> 
> But that's just the symptom, not the root cause. That code is perfectly
> fine when all callers use the proper cpumask functions.

Agreed: we're crashing here because of the extra bits
in the affinity mask, but those bits should not be set
in the first place.

I wasn't sure if for_each_cpu() was supposed to iterate
into non-present CPUs -- and I guess the answer
is yes, it will iterate the full length of the mask.
The caller is responsible for ensuring the mask is valid.
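
(If the irq core ever wanted to be defensive here, I suppose a sanity
check along these lines at the top of irq_matrix_reserve_managed() would
have turned this crash into a loud warning; just a sketch, not something
I'm proposing for this fix:

    if (WARN_ON_ONCE(!cpumask_subset(msk, cpu_possible_mask)))
        return -EINVAL;

But I agree the real bug is the caller handing over a garbage mask.)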


--
Chuck Lever



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-31 15:06                       ` Chuck Lever III
@ 2023-05-31 17:11                         ` Thomas Gleixner
  2023-05-31 18:52                           ` Chuck Lever III
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2023-05-31 17:11 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Eli Cohen, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Peter Zijlstra

On Wed, May 31 2023 at 15:06, Chuck Lever III wrote:
>> On May 31, 2023, at 10:43 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> 
>> mlx5_irq_alloc(af_desc)
>>  pci_msix_alloc_irq_at(af_desc)
>>    msi_domain_alloc_irq_at(af_desc)
>>      __msi_domain_alloc_irqs(af_desc)
>> 1)      msidesc->affinity = kmemdup(af_desc);
>>        __irq_domain_alloc_irqs()
>>          __irq_domain_alloc_irqs(aff=msidesc->affinity)
>>            irq_domain_alloc_irqs_locked(aff)
>>              irq_domain_alloc_irqs_locked(aff)
>>                irq_domain_alloc_descs(aff)
>>                  alloc_desc(mask=&aff->mask)
>>                    desc_smp_init(mask)
>> 2)                    cpumask_copy(desc->irq_common_data.affinity, mask);
>>                irq_domain_alloc_irqs_hierarchy()
>>                  msi_domain_alloc()
>>                    intel_irq_remapping_alloc()
>>                      x86_vector_alloc_irqs()
>
> It is x86_vector_alloc_irqs() where the struct irq_data is
> fabricated that ends up in irq_matrix_reserve_managed().

Kinda fabricated :)
     
     irqd = irq_domain_get_irq_data(domain, virq + i);

That's finding the irqdata which is associated with the vector domain. That
has been allocated earlier. The affinity mask is retrieved via:

    const struct cpumask *affmsk = irq_data_get_affinity_mask(irqd);

which does:

      return irqd->common->affinity;

irqd->common points to desc->irq_common_data. The affinity there was
copied in #2 above.

>> This also ends up in the wrong place. That mlx code does:
>> 
>>   af_desc.is_managed = false;
>> 
>> but the allocation ends up allocating a managed vector.
>
> That line was changed in 6.4-rc4 to address another bug,
> and it avoids the crash by not calling into the misbehaving
> code. It doesn't address the mlx5_core initialization issue
> though, because as I said before, execution continues and
> crashes in a similar scenario later on.

Ok.

> On my system, I've reverted that fix:
>
> -       af_desc.is_managed = false;
> +       af_desc.is_managed = 1;
>
> so that we can continue debugging this crash.

Ah.

>> Can you please instrument this along the call chain so we can see where
>> or at least when this gets corrupted? Please print the relevant pointer
>> addresses too so we can see whether that's consistent or not.
>
> I will continue working on this today.
>> But that's just the symptom, not the root cause. That code is perfectly
>> fine when all callers use the proper cpumask functions.
>
> Agreed: we're crashing here because of the extra bits
> in the affinity mask, but those bits should not be set
> in the first place.

Correct.

> I wasn't sure if for_each_cpu() was supposed to iterate
> into non-present CPUs -- and I guess the answer
> is yes, it will iterate the full length of the mask.
> The caller is responsible for ensuring the mask is valid.

Yes, that's the assumption of this constant optimization for the small
number of CPUs case. All other cases use nr_cpu_ids as the limit and won't
go into non-possible CPUs. I didn't spot it last night either.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-31 17:11                         ` Thomas Gleixner
@ 2023-05-31 18:52                           ` Chuck Lever III
  2023-05-31 19:19                             ` Thomas Gleixner
  0 siblings, 1 reply; 36+ messages in thread
From: Chuck Lever III @ 2023-05-31 18:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Eli Cohen, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Peter Zijlstra



> On May 31, 2023, at 1:11 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, May 31 2023 at 15:06, Chuck Lever III wrote:
>>> On May 31, 2023, at 10:43 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> mlx5_irq_alloc(af_desc)
>>> pci_msix_alloc_irq_at(af_desc)
>>>   msi_domain_alloc_irq_at(af_desc)
>>>     __msi_domain_alloc_irqs(af_desc)
>>> 1)      msidesc->affinity = kmemdup(af_desc);
>>>       __irq_domain_alloc_irqs()
>>>         __irq_domain_alloc_irqs(aff=msidesc->affinity)
>>>           irq_domain_alloc_irqs_locked(aff)
>>>             irq_domain_alloc_irqs_locked(aff)
>>>               irq_domain_alloc_descs(aff)
>>>                 alloc_desc(mask=&aff->mask)
>>>                   desc_smp_init(mask)
>>> 2)                    cpumask_copy(desc->irq_common_data.affinity, mask);
>>>               irq_domain_alloc_irqs_hierarchy()
>>>                 msi_domain_alloc()
>>>                   intel_irq_remapping_alloc()
>>>                     x86_vector_alloc_irqs()
>> 
>> It is x86_vector_alloc_irqs() where the struct irq_data is
>> fabricated that ends up in irq_matrix_reserve_managed().
> 
> Kinda fabricated :)
> 
>     irqd = irq_domain_get_irq_data(domain, virq + i);
> 
> Thats finding the irqdata which is associated to the vector domain. That
> has been allocated earlier. The affinity mask is retrieved via:
> 
>    const struct cpumask *affmsk = irq_data_get_affinity_mask(irqd);
> 
> which does:
> 
>      return irqd->common->affinity;
> 
> irqd->common points to desc->irq_common_data. The affinity there was
> copied in #2 above.
> 
>>> This also ends up in the wrong place. That mlx code does:
>>> 
>>>  af_desc.is_managed = false;
>>> 
>>> but the allocation ends up allocating a managed vector.
>> 
>> That line was changed in 6.4-rc4 to address another bug,
>> and it avoids the crash by not calling into the misbehaving
>> code. It doesn't address the mlx5_core initialization issue
>> though, because as I said before, execution continues and
>> crashes in a similar scenario later on.
> 
> Ok.
> 
>> On my system, I've reverted that fix:
>> 
>> -       af_desc.is_managed = false;
>> +       af_desc.is_managed = 1;
>> 
>> so that we can continue debugging this crash.
> 
> Ah.
> 
>>> Can you please instrument this along the call chain so we can see where
>>> or at least when this gets corrupted? Please print the relevant pointer
>>> addresses too so we can see whether that's consistent or not.
>> 
>> I will continue working on this today.
>>> But that's just the symptom, not the root cause. That code is perfectly
>>> fine when all callers use the proper cpumask functions.
>> 
>> Agreed: we're crashing here because of the extra bits
>> in the affinity mask, but those bits should not be set
>> in the first place.
> 
> Correct.
> 
>> I wasn't sure if for_each_cpu() was supposed to iterate
>> into non-present CPUs -- and I guess the answer
>> is yes, it will iterate the full length of the mask.
>> The caller is responsible for ensuring the mask is valid.
> 
> Yes, that's the assumption of this constant optimization for the small
> number of CPUs case. All other cases use nr_cpu_ids as limit and won't
> go into non-possible CPUs. I didn't spot it yesterday night either.

This addresses the problem for me with both is_managed = 1
and is_managed = false:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index db5687d9fec9..bcf5df316c8f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -570,11 +570,11 @@ int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus, int nirqs,
        af_desc.is_managed = false;
        for (i = 0; i < nirqs; i++) {
+               cpumask_clear(&af_desc.mask);
                cpumask_set_cpu(cpus[i], &af_desc.mask);
                irq = mlx5_irq_request(dev, i + 1, &af_desc, rmap);
                if (IS_ERR(irq))
                        break;
-               cpumask_clear(&af_desc.mask);
                irqs[i] = irq;
        }

If you agree this looks reasonable, I can package it with a
proper patch description and send it to Eli and Saeed.

--
Chuck Lever



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-31 18:52                           ` Chuck Lever III
@ 2023-05-31 19:19                             ` Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Gleixner @ 2023-05-31 19:19 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Eli Cohen, Leon Romanovsky, Saeed Mahameed, linux-rdma,
	open list:NETWORKING [GENERAL],
	Peter Zijlstra

On Wed, May 31 2023 at 18:52, Chuck Lever III wrote:
>> On May 31, 2023, at 1:11 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> This addresses the problem for me with both is_managed = 1
> and is_managed = false:
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> index db5687d9fec9..bcf5df316c8f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> @@ -570,11 +570,11 @@ int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus, int nirqs,
>         af_desc.is_managed = false;
>         for (i = 0; i < nirqs; i++) {
> +               cpumask_clear(&af_desc.mask);
>                 cpumask_set_cpu(cpus[i], &af_desc.mask);
>                 irq = mlx5_irq_request(dev, i + 1, &af_desc, rmap);
>                 if (IS_ERR(irq))
>                         break;
> -               cpumask_clear(&af_desc.mask);
>                 irqs[i] = irq;
>         }
>
> If you agree this looks reasonable, I can package it with a
> proper patch description and send it to Eli and Saeed.

It does. I clearly missed that function when going through the possible
callchains. Yes, that's definitely broken and the fix is correct.

bbac70c74183 ("net/mlx5: Use newer affinity descriptor") is the culprit.

Feel free to add:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>





^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-05-08 12:29 ` Linux regression tracking #adding (Thorsten Leemhuis)
@ 2023-06-02 11:05   ` Linux regression tracking #update (Thorsten Leemhuis)
  2023-06-02 13:38     ` Chuck Lever III
  0 siblings, 1 reply; 36+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-06-02 11:05 UTC (permalink / raw)
  To: Chuck Lever III, elic
  Cc: saeedm, Leon Romanovsky, linux-rdma,
	open list:NETWORKING [GENERAL],
	Linux kernel regressions list

[TLDR: This mail is primarily relevant for Linux regression tracking. A
change or fix related to the regression discussed in this thread was
posted or applied, but it did not use a Link: tag to point to the
report, as Linus and the documentation call for. Things happen, no
worries -- but now the regression tracking bot needs to be told manually
about the fix. See link in footer if these mails annoy you.]

On 08.05.23 14:29, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 03.05.23 03:03, Chuck Lever III wrote:
>>
>> I have a Supermicro X10SRA-F/X10SRA-F with a ConnectX®-5 EN network
>> interface card, 100GbE single-port QSFP28, PCIe3.0 x16, tall bracket;
>> MCX515A-CCAT
>>
>> When booting a v6.3+ kernel, the boot process stops cold after a
>> few seconds. The last message on the console is the MLX5 driver
>> note about "PCIe slot advertised sufficient power (27W)".
>>
>> bisect reports that bbac70c74183 ("net/mlx5: Use newer affinity
>> descriptor") is the first bad commit.
>>
>> I've trolled lore a couple of times and haven't found any discussion
>> of this issue.
> 
> #regzbot ^introduced bbac70c74183
> #regzbot title system hang on start-up (irq or mlx5 problem?)
> #regzbot ignore-activity

#regzbot fix: 368591995d010e6
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-06-02 11:05   ` Linux regression tracking #update (Thorsten Leemhuis)
@ 2023-06-02 13:38     ` Chuck Lever III
  2023-06-02 13:55       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 36+ messages in thread
From: Chuck Lever III @ 2023-06-02 13:38 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: elic, saeedm, Leon Romanovsky, linux-rdma,
	open list:NETWORKING [GENERAL]

Hi Thorsten -

> On Jun 2, 2023, at 7:05 AM, Linux regression tracking #update (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
> 
> [TLDR: This mail is primarily relevant for Linux regression tracking. A
> change or fix related to the regression discussed in this thread was
> posted or applied, but it did not use a Link: tag to point to the
> report, as Linus and the documentation call for.

Linus recently stated he did not like Link: tags pointing to an
email thread on lore.

Also, checkpatch.pl is now complaining about Closes: tags instead
of Link: tags. A bug was never opened for this issue.

I did check the regzbot docs on how to mark this issue closed,
but didn't find a ready answer. Thank you for following up.


> Things happen, no
> worries -- but now the regression tracking bot needs to be told manually
> about the fix. See link in footer if these mails annoy you.]
> 
> On 08.05.23 14:29, Linux regression tracking #adding (Thorsten Leemhuis)
> wrote:
>> On 03.05.23 03:03, Chuck Lever III wrote:
>>> 
>>> I have a Supermicro X10SRA-F/X10SRA-F with a ConnectX®-5 EN network
>>> interface card, 100GbE single-port QSFP28, PCIe3.0 x16, tall bracket;
>>> MCX515A-CCAT
>>> 
>>> When booting a v6.3+ kernel, the boot process stops cold after a
>>> few seconds. The last message on the console is the MLX5 driver
>>> note about "PCIe slot advertised sufficient power (27W)".
>>> 
>>> bisect reports that bbac70c74183 ("net/mlx5: Use newer affinity
>>> descriptor") is the first bad commit.
>>> 
>>> I've trolled lore a couple of times and haven't found any discussion
>>> of this issue.
>> 
>> #regzbot ^introduced bbac70c74183
>> #regzbot title system hang on start-up (irq or mlx5 problem?)
>> #regzbot ignore-activity
> 
> #regzbot fix: 368591995d010e6
> #regzbot ignore-activity
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> That page also explains what to do if mails like this annoy you.

--
Chuck Lever



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-06-02 13:38     ` Chuck Lever III
@ 2023-06-02 13:55       ` Linux regression tracking (Thorsten Leemhuis)
  2023-06-02 14:03         ` Chuck Lever III
  2023-06-02 14:29         ` Jason Gunthorpe
  0 siblings, 2 replies; 36+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-06-02 13:55 UTC (permalink / raw)
  To: Chuck Lever III, Linux regressions mailing list
  Cc: elic, saeedm, Leon Romanovsky, linux-rdma,
	open list:NETWORKING [GENERAL]

On 02.06.23 15:38, Chuck Lever III wrote:
> 
>> On Jun 2, 2023, at 7:05 AM, Linux regression tracking #update (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
>>
>> [TLDR: This mail is primarily relevant for Linux regression tracking. A
>> change or fix related to the regression discussed in this thread was
>> posted or applied, but it did not use a Link: tag to point to the
>> report, as Linus and the documentation call for.
> 
> Linus recently stated he did not like Link: tags pointing to an
> email thread on lore.

Afaik he strongly dislikes them when a Link: tag just points to the
submission of the patch being applied; at the same time he *really
wants* those links if they tell the backstory of how a fix came into being,
which definitely includes the report about the issue being fixed (side
note: without those links regression tracking becomes so hard that it's
basically not feasible).

If my knowledge is not up to date, please do me a favor, if you have a
minute, and point me to the Linus statement you refer to.

> Also, checkpatch.pl is now complaining about Closes: tags instead
> of Link: tags. A bug was never opened for this issue.

That was a change by somebody else, but FWIW, just use Closes: (instead
of Link:) with a link to the report on lore; that tag is not reserved
for bugs.

/me will go and update his boilerplate text used above

> I did check the regzbot docs on how to mark this issue closed,
> but didn't find a ready answer. Thank you for following up.

yw, but no worries, that's what I'm here for. :-D

Ciao, Thorsten

>> Things happen, no
>> worries -- but now the regression tracking bot needs to be told manually
>> about the fix. See link in footer if these mails annoy you.]
>>
>> On 08.05.23 14:29, Linux regression tracking #adding (Thorsten Leemhuis)
>> wrote:
>>> On 03.05.23 03:03, Chuck Lever III wrote:
>>>>
>>>> I have a Supermicro X10SRA-F/X10SRA-F with a ConnectX®-5 EN network
>>>> interface card, 100GbE single-port QSFP28, PCIe3.0 x16, tall bracket;
>>>> MCX515A-CCAT
>>>>
>>>> When booting a v6.3+ kernel, the boot process stops cold after a
>>>> few seconds. The last message on the console is the MLX5 driver
>>>> note about "PCIe slot advertised sufficient power (27W)".
>>>>
>>>> bisect reports that bbac70c74183 ("net/mlx5: Use newer affinity
>>>> descriptor") is the first bad commit.
>>>>
>>>> I've trolled lore a couple of times and haven't found any discussion
>>>> of this issue.
>>>
>>> #regzbot ^introduced bbac70c74183
>>> #regzbot title system hang on start-up (irq or mlx5 problem?)
>>> #regzbot ignore-activity
>>
>> #regzbot fix: 368591995d010e6
>> #regzbot ignore-activity
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> That page also explains what to do if mails like this annoy you.
> 
> --
> Chuck Lever
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-06-02 13:55       ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-06-02 14:03         ` Chuck Lever III
  2023-06-02 14:29         ` Jason Gunthorpe
  1 sibling, 0 replies; 36+ messages in thread
From: Chuck Lever III @ 2023-06-02 14:03 UTC (permalink / raw)
  To: Linux regressions mailing list, saeedm
  Cc: elic, Leon Romanovsky, linux-rdma, open list:NETWORKING [GENERAL]



> On Jun 2, 2023, at 9:55 AM, Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
> 
> On 02.06.23 15:38, Chuck Lever III wrote:
>> 
>>> On Jun 2, 2023, at 7:05 AM, Linux regression tracking #update (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
>>> 
>>> [TLDR: This mail is primarily relevant for Linux regression tracking. A
>>> change or fix related to the regression discussed in this thread was
>>> posted or applied, but it did not use a Link: tag to point to the
>>> report, as Linus and the documentation call for.
>> 
>> Linus recently stated he did not like Link: tags pointing to an
>> email thread on lore.
> 
> Afaik he strongly dislikes them when a Link: tag just points to the
> submission of the patch being applied; at the same time he *really
> wants* those links if they tell the backstory how a fix came into being,
> which definitely includes the report about the issue being fixed (side
> note: without those links regression tracking becomes so hard that it's
> basically no feasible).

I certainly appreciate having that information available.
I must have misunderstood Linus' comment.


> If my knowledge is not up to date, please if you have a minute do me a
> favor and point me to Linus statement your refer to.
> 
>> Also, checkpatch.pl is now complaining about Closes: tags instead
>> of Link: tags. A bug was never opened for this issue.
> 
> That was a change by somebody else, but FWIW, just use Closes: (instead
> of Link:) with a link to the report on lore, that tag is not reserved
> for bugs.
> 
> /me will go and update his boilerplate text used above

The specific complaint is about the ordering of Reported-by:
and Link: or Closes: tags.

Saeed, if it is still possible, you can add:

Closes: https://lore.kernel.org/netdev/bb2df75d-05be-3f7b-693a-84be195dc2f1@leemhuis.info/T/#m49b88941c8dc5be42fa960f84ecda680ddb1a778

To my patch.


>> I did check the regzbot docs on how to mark this issue closed,
>> but didn't find a ready answer. Thank you for following up.
> 
> yw, but no worries, that's what I'm here for. :-D
> 
> Ciao, Thorsten
> 
>>> Things happen, no
>>> worries -- but now the regression tracking bot needs to be told manually
>>> about the fix. See link in footer if these mails annoy you.]
>>> 
>>> On 08.05.23 14:29, Linux regression tracking #adding (Thorsten Leemhuis)
>>> wrote:
>>>> On 03.05.23 03:03, Chuck Lever III wrote:
>>>>> 
>>>>> I have a Supermicro X10SRA-F/X10SRA-F with a ConnectX®-5 EN network
>>>>> interface card, 100GbE single-port QSFP28, PCIe3.0 x16, tall bracket;
>>>>> MCX515A-CCAT
>>>>> 
>>>>> When booting a v6.3+ kernel, the boot process stops cold after a
>>>>> few seconds. The last message on the console is the MLX5 driver
>>>>> note about "PCIe slot advertised sufficient power (27W)".
>>>>> 
>>>>> bisect reports that bbac70c74183 ("net/mlx5: Use newer affinity
>>>>> descriptor") is the first bad commit.
>>>>> 
>>>>> I've trolled lore a couple of times and haven't found any discussion
>>>>> of this issue.
>>>> 
>>>> #regzbot ^introduced bbac70c74183
>>>> #regzbot title system hang on start-up (irq or mlx5 problem?)
>>>> #regzbot ignore-activity
>>> 
>>> #regzbot fix: 368591995d010e6
>>> #regzbot ignore-activity
>>> 
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> That page also explains what to do if mails like this annoy you.
>> 
>> --
>> Chuck Lever
>> 
>> 

--
Chuck Lever



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-06-02 13:55       ` Linux regression tracking (Thorsten Leemhuis)
  2023-06-02 14:03         ` Chuck Lever III
@ 2023-06-02 14:29         ` Jason Gunthorpe
  2023-06-02 15:58           ` Thorsten Leemhuis
  2023-06-02 16:54           ` Jakub Kicinski
  1 sibling, 2 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2023-06-02 14:29 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Chuck Lever III, elic, saeedm, Leon Romanovsky, linux-rdma,
	open list:NETWORKING [GENERAL]

On Fri, Jun 02, 2023 at 03:55:43PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 02.06.23 15:38, Chuck Lever III wrote:
> > 
> >> On Jun 2, 2023, at 7:05 AM, Linux regression tracking #update (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
> >>
> >> [TLDR: This mail is primarily relevant for Linux regression tracking. A
> >> change or fix related to the regression discussed in this thread was
> >> posted or applied, but it did not use a Link: tag to point to the
> >> report, as Linus and the documentation call for.
> > 
> > Linus recently stated he did not like Link: tags pointing to an
> > email thread on lore.
> 
> Afaik he strongly dislikes them when a Link: tag just points to the
> submission of the patch being applied;

He has said that, but AFAICT enough maintainers disagree that we are
still adding Link tags to the submission as a glorified Change-Id.

When done well these do provide information because the cover letter
should back link to all prior versions of the series and you can then
capture the entire discussion, albeit manually.

> at the same time he *really wants* those links if they tell the
> backstory how a fix came into being, which definitely includes the
> report about the issue being fixed (side note: without those links
> regression tracking becomes so hard that it's basically not
> feasible).

Yes, but this started to get a bit redundant as we now have

 Reported-by:  xx@syzkaller

Which does identify the original bug and all its places, and now
people are adding links to the syzkaller email too because checkpatch
is complaining.
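
To make the redundancy concrete (the address and URL below are made-up
placeholders), such a fix now tends to carry both of these, pointing at
the same report:

 Reported-by: syzbot+0123456789abcdef@syzkaller.appspotmail.com
 Closes: https://lore.kernel.org/all/<syzbot-report-message-id>/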

> > Also, checkpatch.pl is now complaining about Closes: tags instead
> > of Link: tags. A bug was never opened for this issue.
> 
> That was a change by somebody else, but FWIW, just use Closes: (instead
> of Link:) with a link to the report on lore, that tag is not reserved
> for bugs.
> 
> /me will go and update his boilerplate text used above

And now you say they should be closes not link?

Oy it makes my head hurt all these rules.

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-06-02 14:29         ` Jason Gunthorpe
@ 2023-06-02 15:58           ` Thorsten Leemhuis
  2023-06-02 16:54           ` Jakub Kicinski
  1 sibling, 0 replies; 36+ messages in thread
From: Thorsten Leemhuis @ 2023-06-02 15:58 UTC (permalink / raw)
  To: Jason Gunthorpe, Linux regressions mailing list
  Cc: Chuck Lever III, elic, saeedm, Leon Romanovsky, linux-rdma,
	open list:NETWORKING [GENERAL]

On 02.06.23 16:29, Jason Gunthorpe wrote:
> On Fri, Jun 02, 2023 at 03:55:43PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 02.06.23 15:38, Chuck Lever III wrote:
>>>
>>>> On Jun 2, 2023, at 7:05 AM, Linux regression tracking #update (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
>>>>
>>>> [TLDR: This mail is primarily relevant for Linux regression tracking. A
>>>> change or fix related to the regression discussed in this thread was
>>>> posted or applied, but it did not use a Link: tag to point to the
>>>> report, as Linus and the documentation call for.
>>>
>>> Linus recently stated he did not like Link: tags pointing to an
>>> email thread on lore.
>>
>> Afaik he strongly dislikes them when a Link: tag just points to the
>> submission of the patch being applied;
> 
> He has said that, but AFAICT enough maintainers disagree that we are
> still adding Link tags to the submission as a glorified Change-Id
> [...]

Which is totally fine with me; I only want the links to the reports, too.
And for now I don't even care if the latter are added using Closes: or
Link:.

> When done well these do provide information because the cover letter
> should back link to all prior versions of the series and you can then
> capture the entire discussion, albeit manually.

I kinda agree. OTOH I like it even more when subsystems put the cover letter
text in a merge commit, *if* the cover letter contains important details.

>> at the same time he *really wants* those links if they tell the
>> backstory how a fix came into being, which definitely includes the
>> report about the issue being fixed (side note: without those links
>> regression tracking becomes so hard that it's basically not
>> feasible).
> 
> Yes, but this started to get a bit redundant as we now have
> 
>  Reported-by:  xx@syzkaller
> 
> Which does identify the original bug and all its places, and now
> people are adding links to the syzkaller email too because checkpatch
> is complaining.

For syzkaller it's redundant, yes, but for some other CIs and manual
reports it's useful and nothing new afaics (a lot of people just were
not aware of it). And FWIW, it's a warning, not an error, to indicate:
there are situations where this can be ignored.

>>> Also, checkpatch.pl is now complaining about Closes: tags instead
>>> of Link: tags. A bug was never opened for this issue.
>>
>> That was a change by somebody else, but FWIW, just use Closes: (instead
>> of Link:) with a link to the report on lore, that tag is not reserved
>> for bugs.
>>
>> /me will go and update his boilerplate text used above
> 
> And now you say they should be closes not link?
> 
> Oy it makes my head hurt all these rules.

In case you want the backstory (which I doubt :-D ), see here:

https://lore.kernel.org/lkml/20230314-doc-checkpatch-closes-tag-v4-0-d26d1fa66f9f@tessares.net/

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: system hang on start-up (mlx5?)
  2023-06-02 14:29         ` Jason Gunthorpe
  2023-06-02 15:58           ` Thorsten Leemhuis
@ 2023-06-02 16:54           ` Jakub Kicinski
  1 sibling, 0 replies; 36+ messages in thread
From: Jakub Kicinski @ 2023-06-02 16:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Linux regressions mailing list, Chuck Lever III, elic, saeedm,
	Leon Romanovsky, linux-rdma, open list:NETWORKING [GENERAL]

On Fri, 2 Jun 2023 11:29:24 -0300 Jason Gunthorpe wrote:
> > > Also, checkpatch.pl is now complaining about Closes: tags instead
> > > of Link: tags. A bug was never opened for this issue.  
> > 
> > That was a change by somebody else, but FWIW, just use Closes: (instead
> > of Link:) with a link to the report on lore, that tag is not reserved
> > for bugs.
> > 
> > /me will go and update his boilerplate text used above  
> 
> And now you say they should be closes not link?
> 
> Oy it makes my head hurt all these rules.

+1

I don't understand why the Closes tags were accepted.
I may be misremembering but I thought Linus wanted Link tags:

Link: https://bla/bla

optionally with a trailer:

Link: https://bla/bla # closes

The checkpatch warning is just adding an annoying amount of noise
for all of us who don't use Closes tags.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2023-06-02 16:54 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-03  1:03 system hang on start-up (mlx5?) Chuck Lever III
2023-05-03  6:34 ` Eli Cohen
2023-05-03 14:02   ` Chuck Lever III
2023-05-04  7:29     ` Leon Romanovsky
2023-05-04 19:02       ` Chuck Lever III
2023-05-04 23:38         ` Jason Gunthorpe
2023-05-07  5:23           ` Eli Cohen
2023-05-07  5:31         ` Eli Cohen
2023-05-27 20:16           ` Chuck Lever III
2023-05-29 21:20             ` Thomas Gleixner
2023-05-30 13:09               ` Chuck Lever III
2023-05-30 13:28                 ` Chuck Lever III
2023-05-30 13:48                   ` Eli Cohen
2023-05-30 13:51                     ` Chuck Lever III
2023-05-30 13:54                       ` Eli Cohen
2023-05-30 15:08                         ` Shay Drory
2023-05-31 14:15                           ` Chuck Lever III
2023-05-30 19:46                 ` Thomas Gleixner
2023-05-30 21:48                   ` Chuck Lever III
2023-05-30 22:17                     ` Thomas Gleixner
2023-05-31 14:43                     ` Thomas Gleixner
2023-05-31 15:06                       ` Chuck Lever III
2023-05-31 17:11                         ` Thomas Gleixner
2023-05-31 18:52                           ` Chuck Lever III
2023-05-31 19:19                             ` Thomas Gleixner
2023-05-16 19:23         ` Chuck Lever III
2023-05-23 14:20           ` Linux regression tracking (Thorsten Leemhuis)
2023-05-24 14:59             ` Chuck Lever III
2023-05-08 12:29 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-06-02 11:05   ` Linux regression tracking #update (Thorsten Leemhuis)
2023-06-02 13:38     ` Chuck Lever III
2023-06-02 13:55       ` Linux regression tracking (Thorsten Leemhuis)
2023-06-02 14:03         ` Chuck Lever III
2023-06-02 14:29         ` Jason Gunthorpe
2023-06-02 15:58           ` Thorsten Leemhuis
2023-06-02 16:54           ` Jakub Kicinski
