* Review Request: nvme-pci: Fix multiple races in nvme_setup_io_queues()
@ 2021-06-15 21:57 Casey Chen
  2021-06-15 21:57 ` [PATCH] " Casey Chen
  0 siblings, 1 reply; 3+ messages in thread
From: Casey Chen @ 2021-06-15 21:57 UTC (permalink / raw)
  To: linux-nvme; +Cc: yzhong, ashishk

All tests and the fix are based on tag nvme-5.14-2021-06-08 of the repo
http://git.infradead.org/nvme.git

Testing method
while :; do
    power off one drive
    sleep $((RANDOM%3)).$((RANDOM%10))
    power on the same drive
    sleep $((RANDOM%3)).$((RANDOM%10))
done

Sample crash call trace
[11668.533431] pcieport 0000:87:08.0: pciehp: Slot(402): Card present
...
[11668.681298] nvme nvme12: pci function 0000:8c:00.0
[11668.681354] nvme 0000:8c:00.0: enabling device (0100 -> 0102)
[11669.046142] pcieport 0000:87:08.0: pciehp: Slot(402): Link Down
[11669.046146] pcieport 0000:87:08.0: pciehp: Slot(402): Card not present
[11669.077428] ------------[ cut here ]------------
[11669.077431] kernel BUG at drivers/pci/msi.c:348!
[11669.077555] invalid opcode: 0000 [#1] SMP KASAN
[11669.077658] CPU: 31 PID: 716 Comm: irq/127-pciehp Not tainted 5.13.0-rc3+
[11669.077869] Hardware name: <MASKED OFF>
[11669.078022] RIP: 0010:free_msi_irqs+0x28a/0x2d0
...
[11669.093982] Call Trace:
[11669.096850]  pci_free_irq_vectors+0xe/0x20
[11669.099695]  nvme_dev_disable+0x140/0x760 [nvme]
[11669.102503]  ? _raw_spin_lock_irqsave+0x9c/0x100
[11669.105271]  ? trace_hardirqs_on+0x2c/0xe0
[11669.107994]  nvme_remove+0x191/0x1e0 [nvme]
[11669.110689]  pci_device_remove+0x6b/0x110
[11669.113316]  device_release_driver_internal+0x14f/0x280
[11669.115939]  pci_stop_bus_device+0xcb/0x100
[11669.118515]  pci_stop_and_remove_bus_device+0xe/0x20
[11669.121079]  pciehp_unconfigure_device+0xfa/0x200
[11669.123597]  ? pciehp_configure_device+0x1c0/0x1c0
[11669.126049]  ? trace_hardirqs_on+0x2c/0xe0
[11669.128444]  pciehp_disable_slot+0xc4/0x1a0
[11669.130771]  ? pciehp_runtime_suspend+0x40/0x40
[11669.133054]  ? __mutex_lock_slowpath+0x10/0x10
[11669.135289]  ? trace_hardirqs_on+0x2c/0xe0
[11669.137462]  pciehp_handle_presence_or_link_change+0x15c/0x4f0
[11669.139632]  ? down_read+0x11f/0x1a0
[11669.141731]  ? pciehp_handle_disable_request+0x80/0x80
[11669.143817]  ? rwsem_down_read_slowpath+0x600/0x600
[11669.145851]  ? __radix_tree_lookup+0xb2/0x130
[11669.147842]  pciehp_ist+0x19d/0x1a0
[11669.149790]  ? pciehp_set_indicators+0xe0/0xe0
[11669.151704]  ? irq_finalize_oneshot.part.46+0x1d0/0x1d0
[11669.153588]  irq_thread_fn+0x3f/0xa0
[11669.155407]  irq_thread+0x195/0x290
[11669.157147]  ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.158883]  ? _raw_read_lock_irq+0x50/0x50
[11669.160611]  ? _raw_read_lock_irq+0x50/0x50
[11669.162320]  ? irq_forced_thread_fn+0xf0/0xf0
[11669.164032]  ? trace_hardirqs_on+0x2c/0xe0
[11669.165731]  ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.167461]  kthread+0x1c8/0x1f0
[11669.169173]  ? kthread_parkme+0x40/0x40
[11669.170883]  ret_from_fork+0x22/0x30


* [PATCH] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-15 21:57 Review Request: nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
@ 2021-06-15 21:57 ` Casey Chen
  2021-06-16 16:20   ` Keith Busch
  0 siblings, 1 reply; 3+ messages in thread
From: Casey Chen @ 2021-06-15 21:57 UTC (permalink / raw)
  To: linux-nvme; +Cc: yzhong, ashishk, Casey Chen

The two paths below can overlap if a drive is powered off quickly after
being powered on. There are multiple races in nvme_setup_io_queues()
because shutdown_lock is not held where needed and the NVMEQ_ENABLED bit
is used improperly.

nvme_reset_work()                                nvme_remove()
  nvme_setup_io_queues()                           nvme_dev_disable()
  ...                                              ...
A1  clear NVMEQ_ENABLED bit for admin queue          lock
    retry:                                       B1  nvme_suspend_io_queues()
A2    pci_free_irq() admin queue                 B2  nvme_suspend_queue() admin queue
A3    pci_free_irq_vectors()                         nvme_pci_disable()
A4    nvme_setup_irqs();                         B3    pci_free_irq_vectors()
      ...                                            unlock
A5    queue_request_irq() for admin queue
      set NVMEQ_ENABLED bit
      ...
      nvme_create_io_queues()
A6      result = queue_request_irq();
        set NVMEQ_ENABLED bit
      ...
      fail to allocate enough IO queues:
A7      nvme_suspend_io_queues()
        goto retry

1) If B3 runs between A1 and A2, it crashes if the irqaction has not yet
  been freed by A2. B2 is supposed to free the admin queue IRQ, but it
  cannot do so because A1 has already cleared the NVMEQ_ENABLED bit.

Fix: combine A1 and A2 so the IRQ is freed as soon as the NVMEQ_ENABLED
bit is cleared.

2) Even with #1 solved, A2 can race with B3 if A2 is freeing the IRQ
  while B3 is checking the irqaction. A3 can likewise race with B2 if
  B2 is freeing the IRQ while A3 is checking the irqaction.

Fix: A2 and A3 take the lock for mutual exclusion.

3) A3 can race with B3 since both can run free_msi_irqs() in parallel.

Fix: A3 takes the lock for mutual exclusion.

4) A4 can fail to allocate all needed IRQ vectors if A3 and A4 are
  interrupted by B3.

Fix: A4 takes the lock for mutual exclusion.

5) If A5/A6 happens after B2/B1, B3 crashes because the irqaction is not
  NULL; it has just been allocated by A5/A6.

Fix: hold the lock around queue_request_irq() and the setting of the
NVMEQ_ENABLED bit.

6) A7 can get a chance to pci_free_irq() an IO queue while B3 is
  checking the irqaction.

Fix: A7 takes the lock.

7) nvme_dev->online_queues needs to be protected by shutdown_lock.
  Since it is not atomic, each path could otherwise modify it based on
  its own stale copy.
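
Condensed from the diff below (labels refer to the diagram above;
error-path unlocks and unrelated code are elided), the reworked section
of nvme_setup_io_queues() ends up looking roughly like this:

	mutex_lock(&dev->shutdown_lock);		/* exclude B1-B3 */
	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
		pci_free_irq(pdev, 0, adminq);		/* A1 + A2 combined */
	...
 retry:
	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
		pci_free_irq(pdev, 0, adminq);		/* A2 on retry */
	pci_free_irq_vectors(pdev);			/* A3 */
	result = nvme_setup_irqs(dev, nr_io_queues);	/* A4 */
	...
	result = queue_request_irq(adminq);		/* A5 */
	set_bit(NVMEQ_ENABLED, &adminq->flags);
	mutex_unlock(&dev->shutdown_lock);

	result = nvme_create_io_queues(dev);		/* A6 locks per queue */
	...
	if (dev->online_queues - 1 < dev->max_qid) {
		nr_io_queues = dev->online_queues - 1;
		nvme_disable_io_queues(dev);
		mutex_lock(&dev->shutdown_lock);	/* A7 under the lock */
		nvme_suspend_io_queues(dev);
		goto retry;
	}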

Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Casey Chen <cachen@purestorage.com>
---
 drivers/nvme/host/pci.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aa7245a505f..81e53aaaa77c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1590,8 +1590,9 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 		goto release_cq;
 
 	nvmeq->cq_vector = vector;
-	nvme_init_queue(nvmeq, qid);
 
+	mutex_lock(&dev->shutdown_lock);
+	nvme_init_queue(nvmeq, qid);
 	if (!polled) {
 		result = queue_request_irq(nvmeq);
 		if (result < 0)
@@ -1599,10 +1600,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	}
 
 	set_bit(NVMEQ_ENABLED, &nvmeq->flags);
+	mutex_unlock(&dev->shutdown_lock);
 	return result;
 
 release_sq:
 	dev->online_queues--;
+	mutex_unlock(&dev->shutdown_lock);
 	adapter_delete_sq(dev, qid);
 release_cq:
 	adapter_delete_cq(dev, qid);
@@ -2176,7 +2179,16 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (nr_io_queues == 0)
 		return 0;
 
-	clear_bit(NVMEQ_ENABLED, &adminq->flags);
+	/*
+	 * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
+	 * from set to unset. If there is a window before the IRQ is
+	 * truly freed, pci_free_irq_vectors() jumping into this window
+	 * will crash. Also take the lock to avoid racing with
+	 * pci_free_irq_vectors() in the nvme_dev_disable() path.
+	 */
+	mutex_lock(&dev->shutdown_lock);
+	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+		pci_free_irq(pdev, 0, adminq);
 
 	if (dev->cmb_use_sqes) {
 		result = nvme_cmb_qdepth(dev, nr_io_queues,
@@ -2192,14 +2204,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 		result = nvme_remap_bar(dev, size);
 		if (!result)
 			break;
-		if (!--nr_io_queues)
+		if (!--nr_io_queues) {
+			mutex_unlock(&dev->shutdown_lock);
 			return -ENOMEM;
+		}
 	} while (1);
 	adminq->q_db = dev->dbs;
 
  retry:
 	/* Deregister the admin queue's interrupt */
-	pci_free_irq(pdev, 0, adminq);
+	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+		pci_free_irq(pdev, 0, adminq);
 
 	/*
 	 * If we enable msix early due to not intx, disable it again before
@@ -2208,8 +2223,10 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	pci_free_irq_vectors(pdev);
 
 	result = nvme_setup_irqs(dev, nr_io_queues);
-	if (result <= 0)
+	if (result <= 0) {
+		mutex_unlock(&dev->shutdown_lock);
 		return -EIO;
+	}
 
 	dev->num_vecs = result;
 	result = max(result - 1, 1);
@@ -2222,9 +2239,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * number of interrupts.
 	 */
 	result = queue_request_irq(adminq);
-	if (result)
+	if (result) {
+		mutex_unlock(&dev->shutdown_lock);
 		return result;
+	}
 	set_bit(NVMEQ_ENABLED, &adminq->flags);
+	mutex_unlock(&dev->shutdown_lock);
 
 	result = nvme_create_io_queues(dev);
 	if (result || dev->online_queues < 2)
@@ -2233,6 +2253,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (dev->online_queues - 1 < dev->max_qid) {
 		nr_io_queues = dev->online_queues - 1;
 		nvme_disable_io_queues(dev);
+		mutex_lock(&dev->shutdown_lock);
 		nvme_suspend_io_queues(dev);
 		goto retry;
 	}
-- 
2.17.1



* Re: [PATCH] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-15 21:57 ` [PATCH] " Casey Chen
@ 2021-06-16 16:20   ` Keith Busch
  0 siblings, 0 replies; 3+ messages in thread
From: Keith Busch @ 2021-06-16 16:20 UTC (permalink / raw)
  To: Casey Chen; +Cc: linux-nvme, yzhong, ashishk

On Tue, Jun 15, 2021 at 02:57:41PM -0700, Casey Chen wrote:
> @@ -1590,8 +1590,9 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
>  		goto release_cq;
>  
>  	nvmeq->cq_vector = vector;
> -	nvme_init_queue(nvmeq, qid);
>  
> +	mutex_lock(&dev->shutdown_lock);

I really think you want to use mutex_trylock() here and return early if
locking fails since that must mean the device is being disabled.
Waiting to acquire the lock in this path doesn't do any good on a
disabled device.
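
For reference, a minimal sketch of that suggestion applied to the same
hunk (the placement mirrors the quoted patch; the -ENODEV return value
is an assumption, not something stated in the thread):

	nvmeq->cq_vector = vector;

	/*
	 * Per the review comment: don't sleep waiting for shutdown_lock.
	 * If it is already held, the device is being disabled, so bail
	 * out instead of initializing the queue. -ENODEV is assumed.
	 */
	if (!mutex_trylock(&dev->shutdown_lock))
		return -ENODEV;
	nvme_init_queue(nvmeq, qid);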

