* Review Request: nvme-pci: Fix multiple races in nvme_setup_io_queues()
@ 2021-06-15 21:57 Casey Chen
2021-06-15 21:57 ` [PATCH] " Casey Chen
0 siblings, 1 reply; 3+ messages in thread
From: Casey Chen @ 2021-06-15 21:57 UTC (permalink / raw)
To: linux-nvme; +Cc: yzhong, ashishk
All tests and fix are based on tag nvme-5.14-2021-06-08 of repo
http://git.infradead.org/nvme.git
Testing method
while :; power off one drive;
sleep $((RANDOM%3)).$((RANDOM%10));
power on the same drive;
sleep $((RANDOM%3)).$((RANDOM%10));
done
Sample crash call trace
[11668.533431] pcieport 0000:87:08.0: pciehp: Slot(402): Card present
...
[11668.681298] nvme nvme12: pci function 0000:8c:00.0
[11668.681354] nvme 0000:8c:00.0: enabling device (0100 -> 0102)
[11669.046142] pcieport 0000:87:08.0: pciehp: Slot(402): Link Down
[11669.046146] pcieport 0000:87:08.0: pciehp: Slot(402): Card not present
[11669.077428] ------------[ cut here ]------------
[11669.077431] kernel BUG at drivers/pci/msi.c:348!
[11669.077555] invalid opcode: 0000 [#1] SMP KASAN
[11669.077658] CPU: 31 PID: 716 Comm: irq/127-pciehp Not tainted 5.13.0-rc3+
[11669.077869] Hardware name: <MASKED OFF>
[11669.078022] RIP: 0010:free_msi_irqs+0x28a/0x2d0
...
[11669.093982] Call Trace:
[11669.096850] pci_free_irq_vectors+0xe/0x20
[11669.099695] nvme_dev_disable+0x140/0x760 [nvme]
[11669.102503] ? _raw_spin_lock_irqsave+0x9c/0x100
[11669.105271] ? trace_hardirqs_on+0x2c/0xe0
[11669.107994] nvme_remove+0x191/0x1e0 [nvme]
[11669.110689] pci_device_remove+0x6b/0x110
[11669.113316] device_release_driver_internal+0x14f/0x280
[11669.115939] pci_stop_bus_device+0xcb/0x100
[11669.118515] pci_stop_and_remove_bus_device+0xe/0x20
[11669.121079] pciehp_unconfigure_device+0xfa/0x200
[11669.123597] ? pciehp_configure_device+0x1c0/0x1c0
[11669.126049] ? trace_hardirqs_on+0x2c/0xe0
[11669.128444] pciehp_disable_slot+0xc4/0x1a0
[11669.130771] ? pciehp_runtime_suspend+0x40/0x40
[11669.133054] ? __mutex_lock_slowpath+0x10/0x10
[11669.135289] ? trace_hardirqs_on+0x2c/0xe0
[11669.137462] pciehp_handle_presence_or_link_change+0x15c/0x4f0
[11669.139632] ? down_read+0x11f/0x1a0
[11669.141731] ? pciehp_handle_disable_request+0x80/0x80
[11669.143817] ? rwsem_down_read_slowpath+0x600/0x600
[11669.145851] ? __radix_tree_lookup+0xb2/0x130
[11669.147842] pciehp_ist+0x19d/0x1a0
[11669.149790] ? pciehp_set_indicators+0xe0/0xe0
[11669.151704] ? irq_finalize_oneshot.part.46+0x1d0/0x1d0
[11669.153588] irq_thread_fn+0x3f/0xa0
[11669.155407] irq_thread+0x195/0x290
[11669.157147] ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.158883] ? _raw_read_lock_irq+0x50/0x50
[11669.160611] ? _raw_read_lock_irq+0x50/0x50
[11669.162320] ? irq_forced_thread_fn+0xf0/0xf0
[11669.164032] ? trace_hardirqs_on+0x2c/0xe0
[11669.165731] ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.167461] kthread+0x1c8/0x1f0
[11669.169173] ? kthread_parkme+0x40/0x40
[11669.170883] ret_from_fork+0x22/0x30
From: Casey Chen <cachen@purestorage.com>
Reply-To:
Subject: Review Request: nvme-pci: Fix multiple races in nvme_setup_io_queues()
In-Reply-To:
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 3+ messages in thread
* [PATCH] nvme-pci: Fix multiple races in nvme_setup_io_queues()
2021-06-15 21:57 Review Request: nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
@ 2021-06-15 21:57 ` Casey Chen
2021-06-16 16:20 ` Keith Busch
0 siblings, 1 reply; 3+ messages in thread
From: Casey Chen @ 2021-06-15 21:57 UTC (permalink / raw)
To: linux-nvme; +Cc: yzhong, ashishk, Casey Chen
Below two paths could overlap each other if we power off a drive quickly
after power it on. There are multiple races in nvme_setup_io_queues()
because of shutdown_lock missing and improper use of NVMEQ_ENABLED bit.
nvme_reset_work() nvme_remove()
nvme_setup_io_queues() nvme_dev_disable()
... ...
A1 clear NVMEQ_ENABLED bit for admin queue lock
retry: B1 nvme_suspend_io_queues()
A2 pci_free_irq() admin queue B2 nvme_suspend_queue() admin queue
A3 pci_free_irq_vectors() nvme_pci_disable()
A4 nvme_setup_irqs(); B3 pci_free_irq_vectors()
... unlock
A5 queue_request_irq() for admin queue
set NVMEQ_ENABLED bit
...
nvme_create_io_queues()
A6 result = queue_request_irq();
set NVMEQ_ENABLED bit
...
fail to allocate enough IO queues:
A7 nvme_suspend_io_queues()
goto retry
1). If B3 runs in between A1 and A2, it will crash if irqaction haven't
been freed by A2. B2 is supposed to free admin queue IRQ but it simply
can't fulfill the job as A1 has cleared NVMEQ_ENABLED bit.
Fix: combine A1 A2 so IRQ get freed as soon as NVMEQ_ENABLED bit get cleared.
2). After solved #1, A2 could race with B3 if A2 is freeing IRQ while B3
is checking irqaction. A3 also could race with B2 if B2 is freeing
IRQ while A3 is checking irqaction.
Fix: A2 and A3 take lock for mutual exclusion.
3) A3 could race with B3 since they could run free_msi_irqs() in parallel.
Fix: A3 takes lock for mutual exclusion.
4). A4 could fail to allocate all needed IRQ vectors if A3 and A4 are
interrupted by B3.
Fix: A4 takes lock for mutual exclusion.
5). If A5/A6 happened after B2/B1, B3 will crash since irqaction is not NULL.
They are just allocated by A5/A6.
Fix: Lock queue_request_irq() and setting of NVMEQ_ENABLED bit.
6). A7 could get chance to pci_free_irq() for certain IO queue while B3
is checking irqaction.
Fix: A7 takes lock.
7). nvme_dev->online_queues need to be protected by shutdown_lock. Since it
is not atomic, both paths could modify it using its own copy.
Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Casey Chen <cachen@purestorage.com>
---
drivers/nvme/host/pci.c | 33 +++++++++++++++++++++++++++------
1 file changed, 27 insertions(+), 6 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aa7245a505f..81e53aaaa77c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1590,8 +1590,9 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
goto release_cq;
nvmeq->cq_vector = vector;
- nvme_init_queue(nvmeq, qid);
+ mutex_lock(&dev->shutdown_lock);
+ nvme_init_queue(nvmeq, qid);
if (!polled) {
result = queue_request_irq(nvmeq);
if (result < 0)
@@ -1599,10 +1600,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
}
set_bit(NVMEQ_ENABLED, &nvmeq->flags);
+ mutex_unlock(&dev->shutdown_lock);
return result;
release_sq:
dev->online_queues--;
+ mutex_unlock(&dev->shutdown_lock);
adapter_delete_sq(dev, qid);
release_cq:
adapter_delete_cq(dev, qid);
@@ -2176,7 +2179,16 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (nr_io_queues == 0)
return 0;
- clear_bit(NVMEQ_ENABLED, &adminq->flags);
+ /*
+ * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
+ * from set to unset. If there is a window to it is truely freed,
+ * pci_free_irq_vectors() jumping into this window will crash.
+ * And take lock to avoid racing with pci_free_irq_vectors() in
+ * nvme_dev_disable() path.
+ */
+ mutex_lock(&dev->shutdown_lock);
+ if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+ pci_free_irq(pdev, 0, adminq);
if (dev->cmb_use_sqes) {
result = nvme_cmb_qdepth(dev, nr_io_queues,
@@ -2192,14 +2204,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
result = nvme_remap_bar(dev, size);
if (!result)
break;
- if (!--nr_io_queues)
+ if (!--nr_io_queues) {
+ mutex_unlock(&dev->shutdown_lock);
return -ENOMEM;
+ }
} while (1);
adminq->q_db = dev->dbs;
retry:
/* Deregister the admin queue's interrupt */
- pci_free_irq(pdev, 0, adminq);
+ if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+ pci_free_irq(pdev, 0, adminq);
/*
* If we enable msix early due to not intx, disable it again before
@@ -2208,8 +2223,10 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
pci_free_irq_vectors(pdev);
result = nvme_setup_irqs(dev, nr_io_queues);
- if (result <= 0)
+ if (result <= 0) {
+ mutex_unlock(&dev->shutdown_lock);
return -EIO;
+ }
dev->num_vecs = result;
result = max(result - 1, 1);
@@ -2222,9 +2239,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
* number of interrupts.
*/
result = queue_request_irq(adminq);
- if (result)
+ if (result) {
+ mutex_unlock(&dev->shutdown_lock);
return result;
+ }
set_bit(NVMEQ_ENABLED, &adminq->flags);
+ mutex_unlock(&dev->shutdown_lock);
result = nvme_create_io_queues(dev);
if (result || dev->online_queues < 2)
@@ -2233,6 +2253,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (dev->online_queues - 1 < dev->max_qid) {
nr_io_queues = dev->online_queues - 1;
nvme_disable_io_queues(dev);
+ mutex_lock(&dev->shutdown_lock);
nvme_suspend_io_queues(dev);
goto retry;
}
--
2.17.1
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH] nvme-pci: Fix multiple races in nvme_setup_io_queues()
2021-06-15 21:57 ` [PATCH] " Casey Chen
@ 2021-06-16 16:20 ` Keith Busch
0 siblings, 0 replies; 3+ messages in thread
From: Keith Busch @ 2021-06-16 16:20 UTC (permalink / raw)
To: Casey Chen; +Cc: linux-nvme, yzhong, ashishk
On Tue, Jun 15, 2021 at 02:57:41PM -0700, Casey Chen wrote:
> @@ -1590,8 +1590,9 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
> goto release_cq;
>
> nvmeq->cq_vector = vector;
> - nvme_init_queue(nvmeq, qid);
>
> + mutex_lock(&dev->shutdown_lock);
I really think you want to use mutex_trylock() here and return early if
locking fails since that must mean the device is being disabled.
Waiting to acquire the lock in this path doesn't do any good on a
disabled device.
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2021-06-16 16:21 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-15 21:57 Review Request: nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
2021-06-15 21:57 ` [PATCH] " Casey Chen
2021-06-16 16:20 ` Keith Busch
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.