linux-kernel.vger.kernel.org archive mirror
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Casey Chen <cachen@purestorage.com>,
	Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>,
	Sasha Levin <sashal@kernel.org>,
	Yuanyuan Zhong <yzhong@purestorage.com>
Subject: [PATCH 5.13 11/22] nvme-pci: fix multiple races in nvme_setup_io_queues
Date: Thu, 29 Jul 2021 15:54:42 +0200
Message-ID: <20210729135137.691986338@linuxfoundation.org>
In-Reply-To: <20210729135137.336097792@linuxfoundation.org>

From: Casey Chen <cachen@purestorage.com>

[ Upstream commit e4b9852a0f4afe40604afb442e3af4452722050a ]

The two paths below can overlap if a drive is powered off shortly after
being powered on. nvme_setup_io_queues() has multiple races because
shutdown_lock is not taken and the NVMEQ_ENABLED bit is used improperly.

nvme_reset_work()                                nvme_remove()
  nvme_setup_io_queues()                           nvme_dev_disable()
  ...                                              ...
A1  clear NVMEQ_ENABLED bit for admin queue          lock
    retry:                                       B1  nvme_suspend_io_queues()
A2    pci_free_irq() admin queue                 B2  nvme_suspend_queue() admin queue
A3    pci_free_irq_vectors()                         nvme_pci_disable()
A4    nvme_setup_irqs();                         B3    pci_free_irq_vectors()
      ...                                            unlock
A5    queue_request_irq() for admin queue
      set NVMEQ_ENABLED bit
      ...
      nvme_create_io_queues()
A6      result = queue_request_irq();
        set NVMEQ_ENABLED bit
      ...
      fail to allocate enough IO queues:
A7      nvme_suspend_io_queues()
        goto retry

If B3 runs in between A1 and A2, it will crash if irqaction hasn't
been freed by A2. B2 is supposed to free the admin queue IRQ, but it
can't do so because A1 has already cleared the NVMEQ_ENABLED bit.

Fix: combine A1 and A2 so the IRQ is freed as soon as the NVMEQ_ENABLED
bit gets cleared.
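
The combined step can be sketched in userspace C. This is only an
illustration under stated assumptions, not the driver code: struct queue,
QUEUE_ENABLED and irq_frees are hypothetical stand-ins for struct
nvme_queue, NVMEQ_ENABLED and the pci_free_irq() call, and C11
atomic_fetch_and() plays the role of the kernel's test_and_clear_bit().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QUEUE_ENABLED 0x1u	/* hypothetical stand-in for NVMEQ_ENABLED */

struct queue {
	atomic_uint flags;
	int irq_frees;		/* counts how often the IRQ was released */
};

/* Atomically clear the flag and report whether it was set before. */
static bool queue_test_and_clear_enabled(struct queue *q)
{
	unsigned int old = atomic_fetch_and(&q->flags, ~QUEUE_ENABLED);

	return (old & QUEUE_ENABLED) != 0;
}

/*
 * Release the IRQ only on the set -> clear transition, so that the
 * flag never reads "disabled" while the IRQ is still registered by
 * this path. In the patch this runs with shutdown_lock held; the
 * plain increment below relies on the same assumption.
 */
static void queue_release_irq(struct queue *q)
{
	if (queue_test_and_clear_enabled(q))
		q->irq_frees++;	/* stand-in for pci_free_irq() */
}
```

Calling queue_release_irq() twice on an enabled queue releases the IRQ
once; the second call sees the bit already cleared and does nothing.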

With the first race fixed, A2 could still race with B3 if A2 is freeing
the IRQ while B3 is checking irqaction. A3 could also race with B2 if B2
is freeing the IRQ while A3 is checking irqaction.

Fix: A2 and A3 take lock for mutual exclusion.

A3 could race with B3 since they could run free_msi_irqs() in parallel.

Fix: A3 takes lock for mutual exclusion.

A4 could fail to allocate all needed IRQ vectors if A3 and A4 are
interrupted by B3.

Fix: A4 takes lock for mutual exclusion.

If A5/A6 happen after B2/B1, B3 will crash since irqaction is not NULL:
the IRQs have just been allocated by A5/A6.

Fix: Lock queue_request_irq() and setting of NVMEQ_ENABLED bit.

A7 could get a chance to call pci_free_irq() for an IO queue while B3 is
checking irqaction.

Fix: A7 takes lock.

nvme_dev->online_queues needs to be protected by shutdown_lock. Since it
is not atomic, each path could otherwise modify it using its own copy.
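
The locking pattern used by the fixes above (try the lock, re-check the
controller state, and only then touch shared state) can be sketched in
userspace C with pthreads. This is a minimal sketch, not the driver
code: struct dev, enum ctrl_state, setup_trylock() and
bring_queue_online() are hypothetical names that loosely mirror
nvme_setup_io_queues_trylock() from the diff below.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for the controller states used by the driver. */
enum ctrl_state { CTRL_CONNECTING, CTRL_DELETING };

struct dev {
	pthread_mutex_t shutdown_lock;
	enum ctrl_state state;
	int online_queues;	/* only modified with shutdown_lock held */
};

/* Returns 0 with the lock held, or -ENODEV with the lock not held. */
static int setup_trylock(struct dev *d)
{
	/* Give up instead of waiting if the teardown path owns the lock. */
	if (pthread_mutex_trylock(&d->shutdown_lock) != 0)
		return -ENODEV;

	/* Wrong state: drop the lock again and fail early. */
	if (d->state != CTRL_CONNECTING) {
		pthread_mutex_unlock(&d->shutdown_lock);
		return -ENODEV;
	}

	return 0;
}

/* Update the shared counter only while holding the lock. */
static int bring_queue_online(struct dev *d)
{
	int ret = setup_trylock(d);

	if (ret)
		return ret;
	d->online_queues++;
	pthread_mutex_unlock(&d->shutdown_lock);
	return 0;
}
```

Using a trylock rather than a blocking lock means the setup path backs
off with an error when teardown is in progress, instead of waiting and
then working on a device that is going away.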

Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Casey Chen <cachen@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/nvme/host/pci.c | 66 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fb1c5ae0da39..d963f25fc7ae 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1562,6 +1562,28 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	wmb(); /* ensure the first interrupt sees the initialization */
 }
 
+/*
+ * Try getting shutdown_lock while setting up IO queues.
+ */
+static int nvme_setup_io_queues_trylock(struct nvme_dev *dev)
+{
+	/*
+	 * Give up if the lock is being held by nvme_dev_disable.
+	 */
+	if (!mutex_trylock(&dev->shutdown_lock))
+		return -ENODEV;
+
+	/*
+	 * Controller is in wrong state, fail early.
+	 */
+	if (dev->ctrl.state != NVME_CTRL_CONNECTING) {
+		mutex_unlock(&dev->shutdown_lock);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
 static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 {
 	struct nvme_dev *dev = nvmeq->dev;
@@ -1590,8 +1612,11 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 		goto release_cq;
 
 	nvmeq->cq_vector = vector;
-	nvme_init_queue(nvmeq, qid);
 
+	result = nvme_setup_io_queues_trylock(dev);
+	if (result)
+		return result;
+	nvme_init_queue(nvmeq, qid);
 	if (!polled) {
 		result = queue_request_irq(nvmeq);
 		if (result < 0)
@@ -1599,10 +1624,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	}
 
 	set_bit(NVMEQ_ENABLED, &nvmeq->flags);
+	mutex_unlock(&dev->shutdown_lock);
 	return result;
 
 release_sq:
 	dev->online_queues--;
+	mutex_unlock(&dev->shutdown_lock);
 	adapter_delete_sq(dev, qid);
 release_cq:
 	adapter_delete_cq(dev, qid);
@@ -2176,7 +2203,18 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (nr_io_queues == 0)
 		return 0;
 
-	clear_bit(NVMEQ_ENABLED, &adminq->flags);
+	/*
+	 * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
+	 * from set to unset. If there is a window before it is truly freed,
+	 * pci_free_irq_vectors() jumping into this window will crash.
+	 * And take lock to avoid racing with pci_free_irq_vectors() in
+	 * nvme_dev_disable() path.
+	 */
+	result = nvme_setup_io_queues_trylock(dev);
+	if (result)
+		return result;
+	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+		pci_free_irq(pdev, 0, adminq);
 
 	if (dev->cmb_use_sqes) {
 		result = nvme_cmb_qdepth(dev, nr_io_queues,
@@ -2192,14 +2230,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 		result = nvme_remap_bar(dev, size);
 		if (!result)
 			break;
-		if (!--nr_io_queues)
-			return -ENOMEM;
+		if (!--nr_io_queues) {
+			result = -ENOMEM;
+			goto out_unlock;
+		}
 	} while (1);
 	adminq->q_db = dev->dbs;
 
  retry:
 	/* Deregister the admin queue's interrupt */
-	pci_free_irq(pdev, 0, adminq);
+	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+		pci_free_irq(pdev, 0, adminq);
 
 	/*
 	 * If we enable msix early due to not intx, disable it again before
@@ -2208,8 +2249,10 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	pci_free_irq_vectors(pdev);
 
 	result = nvme_setup_irqs(dev, nr_io_queues);
-	if (result <= 0)
-		return -EIO;
+	if (result <= 0) {
+		result = -EIO;
+		goto out_unlock;
+	}
 
 	dev->num_vecs = result;
 	result = max(result - 1, 1);
@@ -2223,8 +2266,9 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 */
 	result = queue_request_irq(adminq);
 	if (result)
-		return result;
+		goto out_unlock;
 	set_bit(NVMEQ_ENABLED, &adminq->flags);
+	mutex_unlock(&dev->shutdown_lock);
 
 	result = nvme_create_io_queues(dev);
 	if (result || dev->online_queues < 2)
@@ -2233,6 +2277,9 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (dev->online_queues - 1 < dev->max_qid) {
 		nr_io_queues = dev->online_queues - 1;
 		nvme_disable_io_queues(dev);
+		result = nvme_setup_io_queues_trylock(dev);
+		if (result)
+			return result;
 		nvme_suspend_io_queues(dev);
 		goto retry;
 	}
@@ -2241,6 +2288,9 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 					dev->io_queues[HCTX_TYPE_READ],
 					dev->io_queues[HCTX_TYPE_POLL]);
 	return 0;
+out_unlock:
+	mutex_unlock(&dev->shutdown_lock);
+	return result;
 }
 
 static void nvme_del_queue_end(struct request *req, blk_status_t error)
-- 
2.30.2



