All of lore.kernel.org
 help / color / mirror / Atom feed
From: Keith Busch <keith.busch@intel.com>
To: LKML <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: [BUG 4.15-rc7] IRQ matrix management errors
Date: Sun, 14 Jan 2018 20:02:55 -0700	[thread overview]
Message-ID: <20180115030255.GA13921@localhost.localdomain> (raw)
In-Reply-To: <20180115025759.GG13580@localhost.localdomain>

I hoped to have a better report before the weekend, but I've run out of
time and without my machine till next week, so sending what I have and
praying someone more in the know will have a better clue.

I've a few NVMe drives and occasionally the IRQ teardown and bring-up
is failing. Resetting the controllers to re-run interrupt allocation
results in device interrupts from not occuring at all.

It appears I need at least 2 NVMe drives for the problem. The NVMe
driver initializes controllers in parallel, and multiple threads calling
pci_alloc_irq_vectors and/or pci_free_vectors at about the same time
seems required to trigger the issue.

Here are the relavent warnings. I also have trace events from irq_matrix
and irq_vectors, but from a different run that won't match up to the
timestamps below, but can send if that's helpful.

[  288.519216] WARNING: CPU: 28 PID: 1420 at kernel/irq/matrix.c:222 irq_matrix_remove_managed+0x10f/0x120
[  288.519218] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[  288.519286]  ptp nvme_core pps_core [last unloaded: nvme]
[  288.519294] CPU: 28 PID: 1420 Comm: kworker/u674:2 Not tainted 4.15.0-rc7+ #4
[  288.519296] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[  288.519305] Workqueue: nvme-wq nvme_reset_work [nvme]
[  288.519310] RIP: 0010:irq_matrix_remove_managed+0x10f/0x120
[  288.519312] RSP: 0018:ffffb01f0aec7a88 EFLAGS: 00010046
[  288.519315] RAX: 00000000000000ee RBX: ffff9e387d824900 RCX: 0000000000000000
[  288.519317] RDX: 0000000000000100 RSI: 00000000000000ee RDI: ffff9e387d410c50
[  288.519319] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000000
[  288.519320] R10: 0000000000000018 R11: 0000000000000003 R12: ffff9e387d410c00
[  288.519322] R13: ffff9e387d410c30 R14: 00000000000000ee R15: 00000000000000ee
[  288.519324] FS:  0000000000000000(0000) GS:ffff9e487cc00000(0000) knlGS:0000000000000000
[  288.519326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  288.519328] CR2: 00007fbc1c2321d4 CR3: 00000010f5209003 CR4: 00000000007606e0
[  288.519330] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  288.519332] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  288.519333] PKRU: 55555554
[  288.519334] Call Trace:
[  288.519345]  x86_vector_free_irqs+0xa1/0x180
[  288.519349]  x86_vector_alloc_irqs+0x1e4/0x3a0
[  288.519354]  msi_domain_alloc+0x62/0x130
[  288.519363]  ? kmem_cache_alloc_node_trace+0x1ac/0x1d0
[  288.519366]  __irq_domain_alloc_irqs+0x121/0x300
[  288.519370]  msi_domain_alloc_irqs+0x99/0x2e0
[  288.519376]  native_setup_msi_irqs+0x54/0x90
[  288.519383]  __pci_enable_msix+0xfb/0x4e0
[  288.519389]  pci_alloc_irq_vectors_affinity+0x8e/0x130
[  288.519394]  nvme_reset_work+0x919/0x153b [nvme]
[  288.519404]  ? sched_clock+0x5/0x10
[  288.519407]  ? sched_clock+0x5/0x10
[  288.519414]  ? sched_clock_cpu+0xc/0xb0
[  288.519420]  ? pick_next_task_fair+0x4d5/0x5f0
[  288.519427]  ? __switch_to+0xa2/0x430
[  288.519431]  ? put_prev_entity+0x1e/0xe0
[  288.519438]  process_one_work+0x182/0x370
[  288.519441]  worker_thread+0x2e/0x380
[  288.519444]  ? process_one_work+0x370/0x370
[  288.519449]  kthread+0x111/0x130
[  288.519453]  ? kthread_create_worker_on_cpu+0x70/0x70
[  288.519460]  ret_from_fork+0x1f/0x30
[  288.519464] Code: 89 ea 44 89 f6 41 ff d1 4d 8b 0f 4d 85 c9 75 e2 e9 2a ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f ff e9 14 ff ff ff <0f> ff e9 0d ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
[  288.519517] ---[ end trace 8fcc70570a780634 ]---
[  310.959112] nvme nvme0: pci function 0000:5e:00.0
[  310.959333] nvme nvme1: pci function 0000:60:00.0
[  310.959675] nvme nvme2: pci function 0000:86:00.0
[  311.167369] WARNING: CPU: 1 PID: 898 at kernel/irq/matrix.c:215 irq_matrix_remove_managed+0x108/0x120
[  311.167371] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[  311.167440]  ptp nvme_core pps_core [last unloaded: nvme]
[  311.167448] CPU: 1 PID: 898 Comm: kworker/u673:9 Tainted: G        W        4.15.0-rc7+ #4
[  311.167450] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[  311.167459] Workqueue: nvme-wq nvme_reset_work [nvme]
[  311.167464] RIP: 0010:irq_matrix_remove_managed+0x108/0x120
[  311.167466] RSP: 0018:ffffb01f0f387a88 EFLAGS: 00010046
[  311.167469] RAX: 0000000000000000 RBX: ffff9e387df24900 RCX: 0000000000000038
[  311.167471] RDX: 0000000000000038 RSI: 0000000000000000 RDI: ffff9e3871bcd3c0
[  311.167472] RBP: 0000000000000038 R08: fffffffffffffffc R09: 0300000000000003
[  311.167474] R10: ffff9e387d402480 R11: ffff9e3838149d1c R12: ffff9e387d410c00
[  311.167476] R13: ffff9e387d410c30 R14: 0000000000000021 R15: 00000000000000ee
[  311.167478] FS:  0000000000000000(0000) GS:ffff9e387d840000(0000) knlGS:0000000000000000
[  311.167481] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  311.167483] CR2: 00007fbe60000b20 CR3: 00000010f5209001 CR4: 00000000007606e0
[  311.167486] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  311.167489] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  311.167490] PKRU: 55555554
[  311.167491] Call Trace:
[  311.167505]  x86_vector_free_irqs+0xa1/0x180
[  311.167515]  ? cpumask_next+0x17/0x20
[  311.167520]  x86_vector_alloc_irqs+0x1e4/0x3a0
[  311.167525]  msi_domain_alloc+0x62/0x130
[  311.167532]  ? kmem_cache_alloc_node_trace+0x1ac/0x1d0
[  311.167536]  __irq_domain_alloc_irqs+0x121/0x300
[  311.167540]  msi_domain_alloc_irqs+0x99/0x2e0
[  311.167547]  native_setup_msi_irqs+0x54/0x90
[  311.167556]  __pci_enable_msix+0xfb/0x4e0
[  311.167562]  pci_alloc_irq_vectors_affinity+0x8e/0x130
[  311.167568]  nvme_reset_work+0x919/0x153b [nvme]
[  311.167576]  ? update_curr+0xe4/0x1d0
[  311.167580]  ? account_entity_dequeue+0xa4/0xd0
[  311.167584]  ? dequeue_entity+0xd5/0x430
[  311.167589]  ? pick_next_task_fair+0x14f/0x5f0
[  311.167597]  ? __switch_to+0xa2/0x430
[  311.167603]  ? sched_clock+0x5/0x10
[  311.167607]  ? put_prev_entity+0x1e/0xe0
[  311.167612]  process_one_work+0x182/0x370
[  311.167616]  worker_thread+0x2e/0x380
[  311.167619]  ? process_one_work+0x370/0x370
[  311.167623]  kthread+0x111/0x130
[  311.167628]  ? kthread_create_worker_on_cpu+0x70/0x70
[  311.167633]  ret_from_fork+0x1f/0x30
[  311.167637] Code: 18 49 89 d8 4c 89 e1 89 ea 44 89 f6 41 ff d1 4d 8b 0f 4d 85 c9 75 e2 e9 2a ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> ff e9 14 ff ff ff 0f ff e9 0d ff ff ff 66 2e 0f 1f 84 00 00 
[  311.167691] ---[ end trace 8fcc70570a780635 ]---
[  311.172737] WARNING: CPU: 0 PID: 894 at arch/x86/kernel/apic/vector.c:404 x86_vector_activate+0x160/0x280
[  311.172739] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[  311.172801]  ptp nvme_core pps_core [last unloaded: nvme]
[  311.172808] CPU: 0 PID: 894 Comm: kworker/u673:5 Tainted: G        W        4.15.0-rc7+ #4
[  311.172809] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[  311.172817] Workqueue: nvme-wq nvme_reset_work [nvme]
[  311.172822] RIP: 0010:x86_vector_activate+0x160/0x280
[  311.172824] RSP: 0018:ffffb01f0f367b88 EFLAGS: 00010086
[  311.172827] RAX: 00000000ffffffe4 RBX: ffff9e386a90e380 RCX: 0000000000000000
[  311.172829] RDX: 0000000000000100 RSI: 00000000000000ee RDI: ffff9e387d410c50
[  311.172830] RBP: 0000000000000150 R08: 0000000000000100 R09: 0000000000000000
[  311.172832] R10: 0000000000000018 R11: 0000000000000003 R12: 00000000ffffffe4
[  311.172833] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9e386a0b70a4
[  311.172835] FS:  0000000000000000(0000) GS:ffff9e387d800000(0000) knlGS:0000000000000000
[  311.172837] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  311.172839] CR2: 00007f268d3d5000 CR3: 00000010f5209001 CR4: 00000000007606f0
[  311.172841] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  311.172842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  311.172843] PKRU: 55555554
[  311.172844] Call Trace:
[  311.172855]  ? kernfs_activate+0x63/0x80
[  311.172864]  __irq_domain_activate_irq+0x43/0x90
[  311.172869]  ? __kernfs_create_file+0x7d/0xb0
[  311.172874]  __irq_domain_activate_irq+0x7a/0x90
[  311.172877]  irq_domain_activate_irq+0x25/0x40
[  311.172882]  irq_startup+0xb2/0xe0
[  311.172887]  __setup_irq+0x66d/0x6e0
[  311.172891]  ? request_threaded_irq+0xad/0x160
[  311.172894]  request_threaded_irq+0xf5/0x160
[  311.172900]  ? nvme_del_cq_end+0x70/0x70 [nvme]
[  311.172906]  pci_request_irq+0x8f/0xd0
[  311.172912]  queue_request_irq+0x4d/0x70 [nvme]
[  311.172917]  nvme_reset_work+0x934/0x153b [nvme]
[  311.172924]  ? update_curr+0xe4/0x1d0
[  311.172928]  ? account_entity_dequeue+0xa4/0xd0
[  311.172932]  ? dequeue_entity+0xd5/0x430
[  311.172937]  ? pick_next_task_fair+0x14f/0x5f0
[  311.172943]  ? __switch_to+0xa2/0x430
[  311.172947]  ? sched_clock+0x5/0x10
[  311.172951]  ? put_prev_entity+0x1e/0xe0
[  311.172956]  process_one_work+0x182/0x370
[  311.172959]  worker_thread+0x2e/0x380
[  311.172962]  ? process_one_work+0x370/0x370
[  311.172966]  kthread+0x111/0x130
[  311.172971]  ? kthread_create_worker_on_cpu+0x70/0x70
[  311.172975]  ret_from_fork+0x1f/0x30
[  311.172978] Code: 4b c2 3d 01 48 8b 3d a0 8a 9c 01 48 89 ee e8 88 33 3d 00 48 39 c5 0f 84 05 01 00 00 48 89 df e8 07 fd ff ff 85 c0 41 89 c4 79 a1 <0f> ff 8b 73 04 48 c7 c7 38 27 08 92 e8 f9 15 0b 00 eb 8e 45 31 
[  311.173032] ---[ end trace 8fcc70570a780636 ]---
[  311.173034] Managed startup irq 41, no vector available
[  311.173090] WARNING: CPU: 0 PID: 894 at kernel/irq/chip.c:226 irq_startup+0xd6/0xe0
[  311.173090] Modules linked in: nvme ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_uncore intel_rapl_perf iTCO_wdt joydev iTCO_vendor_support ipmi_ssif ipmi_si mei_me mei shpchp i2c_i801 wmi lpc_ich ipmi_devintf ioatdma ipmi_msghandler dca acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_kms_helper ttm drm i40e crc32c_intel
[  311.173140]  ptp nvme_core pps_core [last unloaded: nvme]
[  311.173145] CPU: 0 PID: 894 Comm: kworker/u673:5 Tainted: G        W        4.15.0-rc7+ #4
[  311.173147] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.2001.062220170731 06/22/2017
[  311.173151] Workqueue: nvme-wq nvme_reset_work [nvme]
[  311.173156] RIP: 0010:irq_startup+0xd6/0xe0
[  311.173158] RSP: 0018:ffffb01f0f367c28 EFLAGS: 00010086
[  311.173160] RAX: 00000000ffffffe4 RBX: ffff9e386a0b7000 RCX: 0000000000000006
[  311.173162] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[  311.173163] RBP: ffff9e386a90e580 R08: 0000000000000720 R09: 0000000000000000
[  311.173164] R10: 0000000000000018 R11: ffff9e386c83c790 R12: 0000000000000001
[  311.173166] R13: ffff9e386a0b7028 R14: ffff9e386a0b7160 R15: ffff9e386a0b70a4
[  311.173168] FS:  0000000000000000(0000) GS:ffff9e387d800000(0000) knlGS:0000000000000000
[  311.173169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  311.173171] CR2: 00007f268d3d5000 CR3: 00000010f5209001 CR4: 00000000007606f0
[  311.173172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  311.173174] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  311.173175] PKRU: 55555554
[  311.173176] Call Trace:
[  311.173180]  __setup_irq+0x66d/0x6e0
[  311.173183]  ? request_threaded_irq+0xad/0x160
[  311.173187]  request_threaded_irq+0xf5/0x160
[  311.173191]  ? nvme_del_cq_end+0x70/0x70 [nvme]
[  311.173194]  pci_request_irq+0x8f/0xd0
[  311.173199]  queue_request_irq+0x4d/0x70 [nvme]
[  311.173203]  nvme_reset_work+0x934/0x153b [nvme]
[  311.173207]  ? update_curr+0xe4/0x1d0
[  311.173211]  ? account_entity_dequeue+0xa4/0xd0
[  311.173215]  ? dequeue_entity+0xd5/0x430
[  311.173220]  ? pick_next_task_fair+0x14f/0x5f0
[  311.173223]  ? __switch_to+0xa2/0x430
[  311.173227]  ? sched_clock+0x5/0x10
[  311.173231]  ? put_prev_entity+0x1e/0xe0
[  311.173234]  process_one_work+0x182/0x370
[  311.173237]  worker_thread+0x2e/0x380
[  311.173240]  ? process_one_work+0x370/0x370
[  311.173243]  kthread+0x111/0x130
[  311.173248]  ? kthread_create_worker_on_cpu+0x70/0x70
[  311.173250]  ret_from_fork+0x1f/0x30
[  311.173253] Code: 89 ef e8 0e 2f 00 00 85 c0 75 20 48 89 ee 31 d2 4c 89 ef e8 ad da ff ff 48 89 df e8 d5 fe ff ff 89 c5 e9 65 ff ff ff 0f ff eb b9 <0f> ff eb b5 66 0f 1f 44 00 00 0f 1f 44 00 00 55 53 48 89 fd 48 
[  311.173306] ---[ end trace 8fcc70570a780637 ]---
[  371.677929] nvme nvme0: I/O 21 QID 0 timeout, completion polled
[  431.709906] nvme nvme0: I/O 21 QID 0 timeout, completion polled
[  491.997931] nvme nvme0: I/O 5 QID 0 timeout, completion polled
[  552.733905] nvme nvme0: I/O 25 QID 0 timeout, completion polled
[  552.733990] Managed startup irq 43, no vector available

       reply	other threads:[~2018-01-15  2:59 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20180115025759.GG13580@localhost.localdomain>
2018-01-15  3:02 ` Keith Busch [this message]
2018-01-15  9:13   ` Thomas Gleixner
2018-01-16  6:16     ` Keith Busch
2018-01-16  7:11       ` Keith Busch
2018-01-16 10:33         ` Thomas Gleixner
2018-01-16 11:20           ` Thomas Gleixner
2018-01-16 14:26             ` Keith Busch
2018-01-17  2:25             ` Keith Busch
2018-01-17  7:34               ` Thomas Gleixner
2018-01-17  7:55                 ` Keith Busch
2018-01-17  9:24                   ` Thomas Gleixner
2018-01-17  9:32                     ` Thomas Gleixner
2018-01-17 14:24                       ` Keith Busch
2018-01-17 15:01                         ` Thomas Gleixner
2018-01-18  2:37                           ` Keith Busch
2018-01-18  8:10                             ` Thomas Gleixner
2018-01-18  8:48                               ` Keith Busch
2018-01-18  9:06                                 ` Thomas Gleixner
2018-01-18 10:43                           ` [tip:irq/urgent] irq/matrix: Spread interrupts on allocation tip-bot for Thomas Gleixner
2018-01-17 11:15             ` [tip:x86/urgent] x86/apic/vector: Fix off by one in error path tip-bot for Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180115030255.GA13921@localhost.localdomain \
    --to=keith.busch@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --subject='Re: [BUG 4.15-rc7] IRQ matrix management errors' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.