* rr: nvme-pci: fix races in nvme_setup_io_queues() and UAF introduced by nvme_dev_remove_admin()
@ 2021-06-22  0:27 Casey Chen
  2021-06-22  0:27 ` [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
  2021-06-22  0:27 ` [PATCH 2/2] nvme-pci: Fix UAF introduced by nvme_dev_remove_admin() Casey Chen
  0 siblings, 2 replies; 12+ messages in thread
From: Casey Chen @ 2021-06-22  0:27 UTC (permalink / raw)
  To: linux-nvme; +Cc: yzhong, ashishk

Found two bugs while power-cycling a PCIe NVMe drive for hours:
- Races in nvme_setup_io_queues()
- A use-after-free (UAF) introduced by nvme_dev_remove_admin(), which was
  found only after fixing the races. Without the race fixes, the system
  simply crashes and cannot survive long enough to reproduce the UAF.

The proposed fixes have been tested for several days for correctness.

0. Code baseline

Tag nvme-5.14-2021-06-08 of repo http://git.infradead.org/nvme.git

1. Testing method

while :; do
    power off one drive
    sleep $((RANDOM%3)).$((RANDOM%10))
    power on the same drive
    sleep $((RANDOM%3)).$((RANDOM%10))
done
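
For reference, one way to drive such a loop is through the pciehp slot's
sysfs power control; a minimal sketch, assuming the drive sits in slot 402
(the slot number seen in the logs below) and that the platform exposes
/sys/bus/pci/slots/402/power:

SLOT=/sys/bus/pci/slots/402
while :; do
    echo 0 > "$SLOT/power"                # power off the drive's slot
    sleep $((RANDOM%3)).$((RANDOM%10))
    echo 1 > "$SLOT/power"                # power it back on
    sleep $((RANDOM%3)).$((RANDOM%10))
done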

2. Sample crash trace due to races in nvme_setup_io_queues()
(task ID shown after the timestamps)

[11668.533431][  T716] pcieport 0000:87:08.0: pciehp: Slot(402): Card present
...
[11668.681298][T251231] nvme nvme12: pci function 0000:8c:00.0
[11668.681354][T26714] nvme 0000:8c:00.0: enabling device (0100 -> 0102)
[11669.046119][   C31] pcieport 0000:87:08.0: pciehp: pending interrupts 0x0108 from Slot Status
[11669.046142][  T716] pcieport 0000:87:08.0: pciehp: Slot(402): Link Down
[11669.046146][  T716] pcieport 0000:87:08.0: pciehp: Slot(402): Card not present
[11669.046149][  T716] pcieport 0000:87:08.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:8c:00
[11669.077428][  T716] ------------[ cut here ]------------
[11669.077431][  T716] kernel BUG at drivers/pci/msi.c:348!
[11669.077555][  T716] invalid opcode: 0000 [#1] SMP KASAN
[11669.077658][  T716] CPU: 31 PID: 716 Comm: irq/127-pciehp Not tainted 5.13.0-rc3+
[11669.078022][  T716] RIP: 0010:free_msi_irqs+0x28a/0x2d0
...
[11669.093982][  T716] Call Trace:
[11669.096850][  T716]  pci_free_irq_vectors+0xe/0x20
[11669.099695][  T716]  nvme_dev_disable+0x140/0x760 [nvme]
[11669.102503][  T716]  ? _raw_spin_lock_irqsave+0x9c/0x100
[11669.105271][  T716]  ? trace_hardirqs_on+0x2c/0xe0
[11669.107994][  T716]  nvme_remove+0x191/0x1e0 [nvme]
[11669.110689][  T716]  pci_device_remove+0x6b/0x110
[11669.113316][  T716]  device_release_driver_internal+0x14f/0x280
[11669.115939][  T716]  pci_stop_bus_device+0xcb/0x100
[11669.118515][  T716]  pci_stop_and_remove_bus_device+0xe/0x20
[11669.121079][  T716]  pciehp_unconfigure_device+0xfa/0x200
[11669.123597][  T716]  ? pciehp_configure_device+0x1c0/0x1c0
[11669.126049][  T716]  ? trace_hardirqs_on+0x2c/0xe0
[11669.128444][  T716]  pciehp_disable_slot+0xc4/0x1a0
[11669.130771][  T716]  ? pciehp_runtime_suspend+0x40/0x40
[11669.133054][  T716]  ? __mutex_lock_slowpath+0x10/0x10
[11669.135289][  T716]  ? trace_hardirqs_on+0x2c/0xe0
[11669.137462][  T716]  pciehp_handle_presence_or_link_change+0x15c/0x4f0
[11669.139632][  T716]  ? down_read+0x11f/0x1a0
[11669.141731][  T716]  ? pciehp_handle_disable_request+0x80/0x80
[11669.143817][  T716]  ? rwsem_down_read_slowpath+0x600/0x600
[11669.145851][  T716]  ? __radix_tree_lookup+0xb2/0x130
[11669.147842][  T716]  pciehp_ist+0x19d/0x1a0
[11669.149790][  T716]  ? pciehp_set_indicators+0xe0/0xe0
[11669.151704][  T716]  ? irq_finalize_oneshot.part.46+0x1d0/0x1d0
[11669.153588][  T716]  irq_thread_fn+0x3f/0xa0
[11669.155407][  T716]  irq_thread+0x195/0x290
[11669.157147][  T716]  ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.158883][  T716]  ? _raw_read_lock_irq+0x50/0x50
[11669.160611][  T716]  ? _raw_read_lock_irq+0x50/0x50
[11669.162320][  T716]  ? irq_forced_thread_fn+0xf0/0xf0
[11669.164032][  T716]  ? trace_hardirqs_on+0x2c/0xe0
[11669.165731][  T716]  ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.167461][  T716]  kthread+0x1c8/0x1f0
[11669.169173][  T716]  ? kthread_parkme+0x40/0x40
[11669.170883][  T716]  ret_from_fork+0x22/0x30

3. KASAN report for the UAF introduced by nvme_dev_remove_admin()
(task ID is shown after the timestamp)

[18319.015748][T246989] nvme nvme13: pci function 0000:8c:00.0
[18319.015795][T215541] nvme 0000:8c:00.0: enabling device (0100 -> 0102)
[18319.369086][   C31] pcieport 0000:87:08.0: pciehp: pending interrupts 0x0108 from Slot Status
[18319.369107][  T716] pcieport 0000:87:08.0: pciehp: Slot(402): Link Down
[18319.369111][  T716] pcieport 0000:87:08.0: pciehp: Slot(402): Card not present
[18319.369116][  T716] pcieport 0000:87:08.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:8c:00
[18320.452045][T215541] nvme nvme13: 88/0/0 default/read/poll queues
[18320.469475][T215541] nvme nvme13: failed to mark controller live state
[18320.469483][T215541] nvme nvme13: Removing after probe failure status: -19
[18320.551295][T215541] ==================================================================
[18320.551299][T215541] BUG: KASAN: use-after-free in __blk_mq_all_tag_iter+0x9c/0x3f0
[18320.551311][T215541] Read of size 4 at addr ffff888897904d04 by task kworker/u178:2/215541
[18320.551315][T215541] 
[18320.551318][T215541] CPU: 86 PID: 215541 Comm: kworker/u178:2 Not tainted 5.13.0-rc3+
[18320.551327][T215541] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[18320.551339][T215541] Call Trace:
[18320.551343][T215541]  dump_stack+0xa4/0xdb
[18320.551354][T215541]  ? __blk_mq_all_tag_iter+0x9c/0x3f0
[18320.551359][T215541]  print_address_description.constprop.10+0x3a/0x60
[18320.551366][T215541]  ? __blk_mq_all_tag_iter+0x9c/0x3f0
[18320.551372][T215541]  ? __blk_mq_all_tag_iter+0x9c/0x3f0
[18320.551377][T215541]  ? blk_mq_update_nr_requests+0x2a0/0x2a0
[18320.551382][T215541]  kasan_report.cold.15+0x7c/0xd8
[18320.551390][T215541]  ? __blk_mq_all_tag_iter+0x9c/0x3f0
[18320.551395][T215541]  __blk_mq_all_tag_iter+0x9c/0x3f0
[18320.551401][T215541]  ? blk_mq_update_nr_requests+0x2a0/0x2a0
[18320.551407][T215541]  ? bt_iter+0xf0/0xf0
[18320.551412][T215541]  ? __blk_mq_all_tag_iter+0x2c9/0x3f0
[18320.551417][T215541]  ? blk_mq_update_nr_requests+0x2a0/0x2a0
[18320.551422][T215541]  ? bt_iter+0xf0/0xf0
[18320.551427][T215541]  ? dev_printk_emit+0x95/0xbb
[18320.551436][T215541]  blk_mq_tagset_busy_iter+0x75/0xa0
[18320.551441][T215541]  ? blk_mq_update_nr_requests+0x2a0/0x2a0
[18320.551446][T215541]  ? blk_mq_update_nr_requests+0x2a0/0x2a0
[18320.551451][T215541]  blk_mq_tagset_wait_completed_request+0x86/0xc0
[18320.551457][T215541]  ? blk_mq_tagset_busy_iter+0xa0/0xa0
[18320.551463][T215541]  ? blk_mq_tagset_busy_iter+0x80/0xa0
[18320.551469][T215541]  ? trace_event_raw_event_nvme_setup_cmd+0x2d0/0x2d0 [nvme_core]
[18320.551493][T215541]  nvme_dev_disable+0x4f6/0x760 [nvme]
[18320.551502][T215541]  ? trace_hardirqs_on+0x2c/0xe0
[18320.551510][T215541]  nvme_reset_work+0x226/0x2060 [nvme]
[18320.551520][T215541]  ? nvme_remove+0x1e0/0x1e0 [nvme]
[18320.551528][T215541]  ? __update_load_avg_cfs_rq+0x1d8/0x550
[18320.551537][T215541]  ? down_read+0x11f/0x1a0
[18320.551545][T215541]  ? newidle_balance+0x444/0x690
[18320.551552][T215541]  ? update_load_avg+0x626/0xbe0
[18320.551557][T215541]  ? update_cfs_group+0x1e/0x150
[18320.551562][T215541]  ? load_balance+0x11d0/0x11d0
[18320.551567][T215541]  ? dequeue_entity+0x150/0x730
[18320.551573][T215541]  ? nvme_irq_check+0x60/0x60 [nvme]
[18320.551581][T215541]  ? finish_task_switch+0x101/0x3d0
[18320.551588][T215541]  ? read_word_at_a_time+0xe/0x20
[18320.551594][T215541]  ? strscpy+0xc1/0x1d0
[18320.551598][T215541]  process_one_work+0x4b9/0x7b0
[18320.551604][T215541]  worker_thread+0x72/0x710
[18320.551610][T215541]  ? process_one_work+0x7b0/0x7b0
[18320.551614][T215541]  kthread+0x1c8/0x1f0
[18320.551618][T215541]  ? kthread_parkme+0x40/0x40
[18320.551622][T215541]  ret_from_fork+0x22/0x30
[18320.551630][T215541] 
[18320.551632][T215541] Allocated by task 215541:
[18320.551635][T215541]  kasan_save_stack+0x19/0x40
[18320.551639][T215541]  __kasan_kmalloc+0x7f/0xa0
[18320.551642][T215541]  kmem_cache_alloc_node_trace+0x187/0x2b0
[18320.551648][T215541]  blk_mq_init_tags+0x47/0x100
[18320.551651][T215541]  blk_mq_alloc_rq_map+0x44/0xf0
[18320.551656][T215541]  __blk_mq_alloc_map_and_request+0x7f/0x140
[18320.551661][T215541]  blk_mq_alloc_tag_set+0x25e/0x510
[18320.551666][T215541]  nvme_reset_work+0x14f9/0x2060 [nvme]
[18320.551674][T215541]  process_one_work+0x4b9/0x7b0
[18320.551678][T215541]  worker_thread+0x72/0x710
[18320.551682][T215541]  kthread+0x1c8/0x1f0
[18320.551685][T215541]  ret_from_fork+0x22/0x30
[18320.551689][T215541] 
[18320.551690][T215541] Freed by task 716:
[18320.551693][T215541]  kasan_save_stack+0x19/0x40
[18320.551696][T215541]  kasan_set_track+0x1c/0x30
[18320.551699][T215541]  kasan_set_free_info+0x20/0x30
[18320.551704][T215541]  __kasan_slab_free+0xec/0x130
[18320.551707][T215541]  kfree+0xa8/0x460
[18320.551712][T215541]  blk_mq_free_map_and_requests+0x8d/0xc0
[18320.551717][T215541]  blk_mq_free_tag_set+0x30/0xf0
[18320.551721][T215541]  nvme_remove+0x199/0x1e0 [nvme]
[18320.551729][T215541]  pci_device_remove+0x6b/0x110
[18320.551735][T215541]  device_release_driver_internal+0x14f/0x280
[18320.551744][T215541]  pci_stop_bus_device+0xcb/0x100
[18320.551750][T215541]  pci_stop_and_remove_bus_device+0xe/0x20
[18320.551754][T215541]  pciehp_unconfigure_device+0xfa/0x200
[18320.551761][T215541]  pciehp_disable_slot+0xc4/0x1a0
[18320.551765][T215541]  pciehp_handle_presence_or_link_change+0x15c/0x4f0
[18320.551770][T215541]  pciehp_ist+0x19d/0x1a0
[18320.551774][T215541]  irq_thread_fn+0x3f/0xa0
[18320.551780][T215541]  irq_thread+0x195/0x290
[18320.551783][T215541]  kthread+0x1c8/0x1f0
[18320.551786][T215541]  ret_from_fork+0x22/0x30
[18320.551791][T215541] 
[18320.551792][T215541] The buggy address belongs to the object at ffff888897904d00
[18320.551792][T215541]  which belongs to the cache kmalloc-192 of size 192
[18320.551795][T215541] The buggy address is located 4 bytes inside of
[18320.551795][T215541]  192-byte region [ffff888897904d00, ffff888897904dc0)
[18320.551800][T215541] The buggy address belongs to the page:
[18320.551802][T215541] page:000000002f3df664 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x897904
[18320.551807][T215541] head:000000002f3df664 order:1 compound_mapcount:0
[18320.551810][T215541] flags: 0x200000000010200(slab|head|node=0|zone=2)
[18320.551819][T215541] raw: 0200000000010200 dead000000000100 dead000000000122 ffff88810004ca00
[18320.551824][T215541] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[18320.551826][T215541] page dumped because: kasan: bad access detected
[18320.551828][T215541] 
[18320.551829][T215541] Memory state around the buggy address:
[18320.551832][T215541]  ffff888897904c00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[18320.551835][T215541]  ffff888897904c80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[18320.551838][T215541] >ffff888897904d00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[18320.551841][T215541]                    ^
[18320.551843][T215541]  ffff888897904d80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[18320.551846][T215541]  ffff888897904e00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[18320.551848][T215541] ==================================================================
[18320.551850][T215541] Disabling lock debugging due to kernel taint




* [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-22  0:27 rr: nvme-pci: fix races in nvme_setup_io_queues() and UAF introduced by nvme_dev_remove_admin() Casey Chen
@ 2021-06-22  0:27 ` Casey Chen
  2021-06-22 15:06   ` Keith Busch
  2021-06-22  0:27 ` [PATCH 2/2] nvme-pci: Fix UAF introduced by nvme_dev_remove_admin() Casey Chen
  1 sibling, 1 reply; 12+ messages in thread
From: Casey Chen @ 2021-06-22  0:27 UTC (permalink / raw)
  To: linux-nvme; +Cc: yzhong, ashishk, Casey Chen

The two paths below could overlap if we power off a drive quickly after
powering it on. There are multiple races in nvme_setup_io_queues() because
shutdown_lock is not held where needed and the NVMEQ_ENABLED bit is used
improperly.

nvme_reset_work()                                nvme_remove()
  nvme_setup_io_queues()                           nvme_dev_disable()
  ...                                              ...
A1  clear NVMEQ_ENABLED bit for admin queue          lock
    retry:                                       B1  nvme_suspend_io_queues()
A2    pci_free_irq() admin queue                 B2  nvme_suspend_queue() admin queue
A3    pci_free_irq_vectors()                         nvme_pci_disable()
A4    nvme_setup_irqs();                         B3    pci_free_irq_vectors()
      ...                                            unlock
A5    queue_request_irq() for admin queue
      set NVMEQ_ENABLED bit
      ...
      nvme_create_io_queues()
A6      result = queue_request_irq();
        set NVMEQ_ENABLED bit
      ...
      fail to allocate enough IO queues:
A7      nvme_suspend_io_queues()
        goto retry

1). If B3 runs in between A1 and A2, it will crash because the irqaction
  has not yet been freed by A2. B2 is supposed to free the admin queue IRQ
  but it cannot do the job because A1 has already cleared the NVMEQ_ENABLED
  bit.

Fix: combine A1 and A2 so the IRQ is freed as soon as the NVMEQ_ENABLED bit
is cleared.

2). After fixing #1, A2 could race with B3 if A2 is freeing the IRQ while
  B3 is checking the irqaction. A3 could also race with B2 if B2 is freeing
  the IRQ while A3 is checking the irqaction.

Fix: A2 and A3 take the lock for mutual exclusion.

3). A3 could race with B3 since both could run free_msi_irqs() in parallel.

Fix: A3 takes the lock for mutual exclusion.

4). A4 could fail to allocate all of the needed IRQ vectors if A3 and A4
  are interrupted by B3.

Fix: A4 takes the lock for mutual exclusion.

5). If A5/A6 happens after B2/B1, B3 will crash because the irqaction is
  not NULL; it has just been allocated by A5/A6.

Fix: take the lock around queue_request_irq() and the setting of the
NVMEQ_ENABLED bit.

6). A7 could get a chance to pci_free_irq() a certain IO queue while B3
  is checking the irqaction.

Fix: A7 takes the lock.

7). nvme_dev->online_queues needs to be protected by shutdown_lock. Since
  it is not atomic, both paths could otherwise modify it using their own
  copies.

Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Casey Chen <cachen@purestorage.com>
---
 drivers/nvme/host/pci.c | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aa7245a505f..5abc4c3be454 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1590,8 +1590,13 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 		goto release_cq;
 
 	nvmeq->cq_vector = vector;
-	nvme_init_queue(nvmeq, qid);
 
+	/*
+	 * Return early if locking fails, as the device is being disabled.
+	 */
+	if (!mutex_trylock(&dev->shutdown_lock))
+		return -ENODEV;
+	nvme_init_queue(nvmeq, qid);
 	if (!polled) {
 		result = queue_request_irq(nvmeq);
 		if (result < 0)
@@ -1599,10 +1604,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	}
 
 	set_bit(NVMEQ_ENABLED, &nvmeq->flags);
+	mutex_unlock(&dev->shutdown_lock);
 	return result;
 
 release_sq:
 	dev->online_queues--;
+	mutex_unlock(&dev->shutdown_lock);
 	adapter_delete_sq(dev, qid);
 release_cq:
 	adapter_delete_cq(dev, qid);
@@ -2176,7 +2183,16 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (nr_io_queues == 0)
 		return 0;
 
-	clear_bit(NVMEQ_ENABLED, &adminq->flags);
+	/*
+	 * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
+	 * from set to unset. If there is a window before it is truly freed,
+	 * pci_free_irq_vectors() jumping into this window will crash.
+	 * And take lock to avoid racing with pci_free_irq_vectors() in
+	 * nvme_dev_disable() path.
+	 */
+	mutex_lock(&dev->shutdown_lock);
+	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+		pci_free_irq(pdev, 0, adminq);
 
 	if (dev->cmb_use_sqes) {
 		result = nvme_cmb_qdepth(dev, nr_io_queues,
@@ -2192,14 +2208,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 		result = nvme_remap_bar(dev, size);
 		if (!result)
 			break;
-		if (!--nr_io_queues)
+		if (!--nr_io_queues) {
+			mutex_unlock(&dev->shutdown_lock);
 			return -ENOMEM;
+		}
 	} while (1);
 	adminq->q_db = dev->dbs;
 
  retry:
 	/* Deregister the admin queue's interrupt */
-	pci_free_irq(pdev, 0, adminq);
+	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+		pci_free_irq(pdev, 0, adminq);
 
 	/*
 	 * If we enable msix early due to not intx, disable it again before
@@ -2208,8 +2227,10 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	pci_free_irq_vectors(pdev);
 
 	result = nvme_setup_irqs(dev, nr_io_queues);
-	if (result <= 0)
+	if (result <= 0) {
+		mutex_unlock(&dev->shutdown_lock);
 		return -EIO;
+	}
 
 	dev->num_vecs = result;
 	result = max(result - 1, 1);
@@ -2222,9 +2243,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * number of interrupts.
 	 */
 	result = queue_request_irq(adminq);
-	if (result)
+	if (result) {
+		mutex_unlock(&dev->shutdown_lock);
 		return result;
+	}
 	set_bit(NVMEQ_ENABLED, &adminq->flags);
+	mutex_unlock(&dev->shutdown_lock);
 
 	result = nvme_create_io_queues(dev);
 	if (result || dev->online_queues < 2)
@@ -2233,6 +2257,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (dev->online_queues - 1 < dev->max_qid) {
 		nr_io_queues = dev->online_queues - 1;
 		nvme_disable_io_queues(dev);
+		mutex_lock(&dev->shutdown_lock);
 		nvme_suspend_io_queues(dev);
 		goto retry;
 	}
-- 
2.17.1



* [PATCH 2/2] nvme-pci: Fix UAF introduced by nvme_dev_remove_admin()
  2021-06-22  0:27 rr: nvme-pci: fix races in nvme_setup_io_queues() and UAF introduced by nvme_dev_remove_admin() Casey Chen
  2021-06-22  0:27 ` [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
@ 2021-06-22  0:27 ` Casey Chen
  2021-06-22 19:16   ` Keith Busch
  1 sibling, 1 reply; 12+ messages in thread
From: Casey Chen @ 2021-06-22  0:27 UTC (permalink / raw)
  To: linux-nvme; +Cc: yzhong, ashishk, Casey Chen

nvme_dev_remove_admin() could free admin_q and admin_tagset while
they are being accessed by nvme_dev_disable(), which can be called from
nvme_remove_dead_ctrl() by nvme_reset_work() during cleanup.

Commit cb4bfda62afa ("nvme-pci: fix hot removal during error handling") was
meant to avoid requests getting stuck on a removed controller by killing the
admin queue. But the later fix c8e9e9b7646e ("nvme-pci: unquiesce admin queue
on shutdown"), together with the nvme_dev_disable(dev, true) right before
nvme_dev_remove_admin(), helps dispatch requests and fail them early, so we
do not need nvme_dev_remove_admin() any more.

Fixes: cb4bfda62afa ("nvme-pci: fix hot removal during error handling")

upstream_status: upstream
---
 drivers/nvme/host/pci.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5abc4c3be454..49ef2027bcb0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3023,7 +3023,6 @@ static void nvme_remove(struct pci_dev *pdev)
 	if (!pci_device_is_present(pdev)) {
 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
 		nvme_dev_disable(dev, true);
-		nvme_dev_remove_admin(dev);
 	}
 
 	flush_work(&dev->ctrl.reset_work);
-- 
2.17.1



* Re: [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-22  0:27 ` [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
@ 2021-06-22 15:06   ` Keith Busch
  2021-06-23  0:59     ` Casey Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2021-06-22 15:06 UTC (permalink / raw)
  To: Casey Chen; +Cc: linux-nvme, yzhong, ashishk

On Mon, Jun 21, 2021 at 05:27:09PM -0700, Casey Chen wrote:
> +	/*
> +	 * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
> +	 * from set to unset. If there is a window before it is truly freed,
> +	 * pci_free_irq_vectors() jumping into this window will crash.
> +	 * And take lock to avoid racing with pci_free_irq_vectors() in
> +	 * nvme_dev_disable() path.
> +	 */
> +	mutex_lock(&dev->shutdown_lock)

Sorry, I wasn't clear in my previous review. All of the shutdown_locks
taken after initialization need to be mutex_trylock()'s. If you have to
wait to get the lock, the device is not going to be suitable for
continuing initialization after the lock is available.

And looking at this again, if trylock is successful, I think you need to
verify that the controller state is suitable for continuing the
initialization.


* Re: [PATCH 2/2] nvme-pci: Fix UAF introduced by nvme_dev_remove_admin()
  2021-06-22  0:27 ` [PATCH 2/2] nvme-pci: Fix UAF introduced by nvme_dev_remove_admin() Casey Chen
@ 2021-06-22 19:16   ` Keith Busch
  0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2021-06-22 19:16 UTC (permalink / raw)
  To: Casey Chen; +Cc: linux-nvme, yzhong, ashishk

On Mon, Jun 21, 2021 at 05:27:10PM -0700, Casey Chen wrote:
> nvme_dev_remove_admin() could free admin_q and admin_tagset while
> they are being accessed by nvme_dev_disable(), which could come from
> nvme_remove_dead_ctrl() by nvme_reset_work() during cleanup.
> 
> Commit cb4bfda62afa ("nvme-pci: fix hot removal during error handling") was
> to avoid requests being stuck on a removed controller by killing admin queue.
> But the later fix c8e9e9b7646e ("nvme-pci: unquiesce admin queue on shutdown"),
> together with nvme_dev_disable(dev, true) right before nvme_dev_remove_admin()
> could help dispatch requests and fail them early, so we don't need
> nvme_dev_remove_admin() any more.
> 
> Fixes: cb4bfda62afa ("nvme-pci: fix hot removal during error handling")

Thanks, this change looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>

> upstream_status: upstream
> ---
>  drivers/nvme/host/pci.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 5abc4c3be454..49ef2027bcb0 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3023,7 +3023,6 @@ static void nvme_remove(struct pci_dev *pdev)
>  	if (!pci_device_is_present(pdev)) {
>  		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>  		nvme_dev_disable(dev, true);
> -		nvme_dev_remove_admin(dev);
>  	}
>  
>  	flush_work(&dev->ctrl.reset_work);
> -- 
> 2.17.1


* Re: [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-22 15:06   ` Keith Busch
@ 2021-06-23  0:59     ` Casey Chen
  2021-06-23  1:53       ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Casey Chen @ 2021-06-23  0:59 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, yzhong, ashishk

On 6/22/21 8:06 AM, Keith Busch wrote:
> On Mon, Jun 21, 2021 at 05:27:09PM -0700, Casey Chen wrote:
>> +	/*
>> +	 * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
>> +	 * from set to unset. If there is a window before it is truly freed,
>> +	 * pci_free_irq_vectors() jumping into this window will crash.
>> +	 * And take lock to avoid racing with pci_free_irq_vectors() in
>> +	 * nvme_dev_disable() path.
>> +	 */
>> +	mutex_lock(&dev->shutdown_lock)
> Sorry, I wasn't clear in previous review. All of the shutdown_locks
> taken after initialization need to be mutex_trylock()'s. If you have to
> wait to get the lock, the device is not going to be suitable for
> continuing initialization after the lock is available.
>
> And looking at this again, if trylock is successful, I think you need to
> verify controller state is suitable for continuing the initialization.
I assume you mean all the mutex_lock() calls except the one in
nvme_dev_disable(), right? I agree with replacing all mutex_lock() calls
with mutex_trylock() except the one in nvme_dev_disable(). But I have
doubts about checking the controller state after the lock is acquired,
for two reasons:

1. What state can we use for the check? Suppose a state is set on
surprise removal; what if the state check happens before it is set while
the surprise removal is on its way?

2. Even if we don't check the controller state after acquiring the lock,
initialization still fails fairly soon if a surprise removal happens. We
don't sacrifice any performance.

Casey



* Re: [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-23  0:59     ` Casey Chen
@ 2021-06-23  1:53       ` Keith Busch
  2021-06-23  4:00         ` Casey Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2021-06-23  1:53 UTC (permalink / raw)
  To: Casey Chen; +Cc: linux-nvme, yzhong, ashishk

On Tue, Jun 22, 2021 at 05:59:30PM -0700, Casey Chen wrote:
> On 6/22/21 8:06 AM, Keith Busch wrote:
> > On Mon, Jun 21, 2021 at 05:27:09PM -0700, Casey Chen wrote:
> > > +	/*
> > > +	 * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
> > > +	 * from set to unset. If there is a window before it is truly freed,
> > > +	 * pci_free_irq_vectors() jumping into this window will crash.
> > > +	 * And take lock to avoid racing with pci_free_irq_vectors() in
> > > +	 * nvme_dev_disable() path.
> > > +	 */
> > > +	mutex_lock(&dev->shutdown_lock)
> > Sorry, I wasn't clear in previous review. All of the shutdown_locks
> > taken after initialization need to be mutex_trylock()'s. If you have to
> > wait to get the lock, the device is not going to be suitable for
> > continuing initialization after the lock is available.
> > 
> > And looking at this again, if trylock is successful, I think you need to
> > verify controller state is suitable for continuing the initialization.
> I assume you mean all the mutex_lock() except the one in nvme_dev_disable(),
> right ? I agree with replacing all mutex_lock() with mutex_trylock() except
> the one in nvme_dev_disable(). 

I only mean the locks you are adding. The existing usages are correct
as-is.

> But I doubt about checking controller state
> after lock being acquired. Two reasons:
>
> 1. What state can use to check ?

If the state is not NVME_CTRL_CONNECTING after successfully taking the
shutdown_lock, then the controller is not in a state that can proceed
with IO queue setup.

> Suppose there is a state set on surprise
> removal, what if the state check happen before it is set while surprise
> removal is on its way ?
> 
> 2. Even we don't check controller state after acquiring lock, the
> initialization still fail fairly soon if surprise removal happens. We don't
> sacrifice any performance.

Nothing to do with performance; we're not in the IO path. The suggestion is
about avoiding access to a non-existent device. The timing of a surprise
removal can always defeat such a check, so my suggestion is not critical
since we expect that to be handled anyway, but in my experience,
platforms work better if the OS prevents such access when it knows it
will fail. It's not foolproof.

But as I said, everything should work even without the check, so feel
free to omit it.


* Re: [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-23  1:53       ` Keith Busch
@ 2021-06-23  4:00         ` Casey Chen
  2021-06-23  4:14           ` Casey Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Casey Chen @ 2021-06-23  4:00 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, yzhong, ashishk


On 6/22/21 6:53 PM, Keith Busch wrote:
> On Tue, Jun 22, 2021 at 05:59:30PM -0700, Casey Chen wrote:
>> On 6/22/21 8:06 AM, Keith Busch wrote:
>>> On Mon, Jun 21, 2021 at 05:27:09PM -0700, Casey Chen wrote:
>>>> +	/*
>>>> +	 * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
>>>> +	 * from set to unset. If there is a window to it is truely freed,
>>>> +	 * pci_free_irq_vectors() jumping into this window will crash.
>>>> +	 * And take lock to avoid racing with pci_free_irq_vectors() in
>>>> +	 * nvme_dev_disable() path.
>>>> +	 */
>>>> +	mutex_lock(&dev->shutdown_lock)
>>> Sorry, I wasn't clear in previous review. All of the shutdown_locks
>>> taken after initialization need to be mutex_trylock()'s. If you have to
>>> wait to get the lock, the device is not going to be suitable for
>>> continuing initialization after the lock is available.
>>>
>>> And looking at this again, if trylock is successful, I think you need to
>>> verify controller state is suitable for continuing the initialization.
>> I assume you mean all the mutex_lock() except the one in nvme_dev_disable(),
>> right ? I agree with replacing all mutex_lock() with mutex_trylock() except
>> the one in nvme_dev_disable().
> I only mean the locks you are adding. The existing usages are correct
> as-is.
>
>> But I doubt about checking controller state
>> after lock being acquired. Two reasons:
>>
>> 1. What state can use to check ?
> If the state is not NVME_CTRL_CONNECTING after successfully taking the
> shutdown_lock, then the controller is not in a state that can proceed
> with IO queue setup.
>
>> Suppose there is a state set on surprise
>> removal, what if the state check happen before it is set while surprise
>> removal is on its way ?
>>
>> 2. Even we don't check controller state after acquiring lock, the
>> initialization still fail fairly soon if surprise removal happens. We don't
>> sacrifice any performance.
> Nothing to do with performance; we're not in io path. The suggestion is
> about avoiding access to a non-existent device. The timing of a surprise
> removal can always defeat such a check, so my suggestion is not critical
> since we expect that to be handled anyway, but in my experience,
> platforms work better if the OS prevents such access when it knows it
> will fail. It's not full-proof.
>
> But as I said, everything should work even without the check, so feel
> free to omit it.

Thanks for the explanation. I prefer not having the check, even though I
had come up with a solution, shown below.

This looks cumbersome. If you can make it better, I might adopt it. 
Otherwise, I will send new patches without the check. :)

+/*
+ * Try getting shutdown_lock while setting up IO queues.
+ * Caller must remember to unlock.
+ */
+static int nvme_setup_io_queues_trylock(struct nvme_dev *dev)
+{
+       /*
+       * Lock is being held by nvme_dev_disable(), fail early.
+       */
+       if (!mutex_trylock(&dev->shutdown_lock))
+               return -ENODEV;
+
+       /*
+       * Controller is in wrong state, fail early.
+        */
+       if (dev->ctrl.state == NVME_CTRL_CONNECTING)
+               return -ENODEV;
+
+       return 0;
+}
+

Sample call:

-       mutex_lock(&dev->shutdown_lock);
+       if ((result = nvme_setup_io_queues_trylock(dev)))
+               return result;



* Re: [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-23  4:00         ` Casey Chen
@ 2021-06-23  4:14           ` Casey Chen
  2021-06-23 14:50             ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Casey Chen @ 2021-06-23  4:14 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, yzhong, ashishk


On 6/22/21 9:00 PM, Casey Chen wrote:
>
> On 6/22/21 6:53 PM, Keith Busch wrote:
>> On Tue, Jun 22, 2021 at 05:59:30PM -0700, Casey Chen wrote:
>>> On 6/22/21 8:06 AM, Keith Busch wrote:
>>>> On Mon, Jun 21, 2021 at 05:27:09PM -0700, Casey Chen wrote:
>>>>> +    /*
>>>>> +     * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
>>>>> +     * from set to unset. If there is a window before it is truly 
>>>>> freed,
>>>>> +     * pci_free_irq_vectors() jumping into this window will crash.
>>>>> +     * And take lock to avoid racing with pci_free_irq_vectors() in
>>>>> +     * nvme_dev_disable() path.
>>>>> +     */
>>>>> +    mutex_lock(&dev->shutdown_lock)
>>>> Sorry, I wasn't clear in previous review. All of the shutdown_locks
>>>> taken after initialization need to be mutex_trylock()'s. If you 
>>>> have to
>>>> wait to get the lock, the device is not going to be suitable for
>>>> continuing initialization after the lock is available.
>>>>
>>>> And looking at this again, if trylock is successful, I think you 
>>>> need to
>>>> verify controller state is suitable for continuing the initialization.
>>> I assume you mean all the mutex_lock() except the one in 
>>> nvme_dev_disable(),
>>> right ? I agree with replacing all mutex_lock() with mutex_trylock() 
>>> except
>>> the one in nvme_dev_disable().
>> I only mean the locks you are adding. The existing usages are correct
>> as-is.
>>
>>> But I doubt about checking controller state
>>> after lock being acquired. Two reasons:
>>>
>>> 1. What state can use to check ?
>> If the state is not NVME_CTRL_CONNECTING after successfully taking the
>> shutdown_lock, then the controller is not in a state that can proceed
>> with IO queue setup.
>>
>>> Suppose there is a state set on surprise
>>> removal, what if the state check happen before it is set while surprise
>>> removal is on its way ?
>>>
>>> 2. Even we don't check controller state after acquiring lock, the
>>> initialization still fail fairly soon if surprise removal happens. 
>>> We don't
>>> sacrifice any performance.
>> Nothing to do with performance; we're not in io path. The suggestion is
>> about avoiding access to a non-existent device. The timing of a surprise
>> removal can always defeat such a check, so my suggestion is not critical
>> since we expect that to be handled anyway, but in my experience,
>> platforms work better if the OS prevents such access when it knows it
>> will fail. It's not foolproof.
>>
>> But as I said, everything should work even without the check, so feel
>> free to omit it.
>
> Thanks for explanation. I prefer not having the check even I had came 
> up with a solution as below.
>
> This looks cumbersome. If you can make it better, I might adopt it. 
> Otherwise, I will send new patches without the check. :)
>
> +/*
> + * Try getting shutdown_lock while setting up IO queues.
> + * Caller remember to unlock.
> + */
> +static int nvme_setup_io_queues_trylock(struct nvme_dev *dev)
> +{
> +       /*
> +       * Lock is being held by nvme_dev_disable(), fail early.
> +       */
> +       if (!mutex_trylock(&dev->shutdown_lock))
> +               return -ENODEV;
> +
> +       /*
> +       * Controller is in wrong state, fail early.
> +        */
> +       if (dev->ctrl.state == NVME_CTRL_CONNECTING)
> +               return -ENODEV;
> +
> +       return 0;
> +}
> +
>
> Sample call:
>
> -       mutex_lock(&dev->shutdown_lock);
> +       if ((result = nvme_setup_io_queues_trylock(dev)))
> +               return result;
>

Sorry, I sent that in a rush; it should be:

+/*
+ * Try getting shutdown_lock while setting up IO queues.
+ * Caller must remember to unlock on success.
+ */
+static int nvme_setup_io_queues_trylock(struct nvme_dev *dev)
+{
+       /*
+       * Lock is being held by nvme_dev_disable(), fail early.
+       */
+       if (!mutex_trylock(&dev->shutdown_lock))
+               return -ENODEV;
+
+       /*
+       * Controller is in wrong state, fail early.
+        */
+       if (dev->ctrl.state != NVME_CTRL_CONNECTING) {
+               mutex_unlock(&dev->shutdown_lock);
+               return -ENODEV;
+       }
+
+       return 0;
+}
+



* Re: [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-23  4:14           ` Casey Chen
@ 2021-06-23 14:50             ` Keith Busch
  0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2021-06-23 14:50 UTC (permalink / raw)
  To: Casey Chen; +Cc: linux-nvme, yzhong, ashishk

On Tue, Jun 22, 2021 at 09:14:21PM -0700, Casey Chen wrote:
> +/*
> + * Try getting shutdown_lock while setting up IO queues.
> + * Caller must remember to unlock on success.
> + */
> +static int nvme_setup_io_queues_trylock(struct nvme_dev *dev)
> +{
> +       /*
> +       * Lock is being held by nvme_dev_disable(), fail early.
> +       */
> +       if (!mutex_trylock(&dev->shutdown_lock))
> +               return -ENODEV;
> +
> +       /*
> +       * Controller is in wrong state, fail early.
> +        */
> +       if (dev->ctrl.state != NVME_CTRL_CONNECTING) {
> +               mutex_unlock(&dev->shutdown_lock);
> +               return -ENODEV;
> +       }
> +
> +       return 0;
> +}
> +

Yes, thank you, this is aligned with what I suggested.


* Re: [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-24 17:31 ` [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
@ 2021-07-06 18:37   ` Keith Busch
  0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2021-07-06 18:37 UTC (permalink / raw)
  To: Casey Chen; +Cc: linux-nvme, yzhong, ashishk

On Thu, Jun 24, 2021 at 10:31:24AM -0700, Casey Chen wrote:
>  static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
>  {
>  	struct nvme_dev *dev = nvmeq->dev;
> @@ -1590,8 +1613,10 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
>  		goto release_cq;
>  
>  	nvmeq->cq_vector = vector;
> -	nvme_init_queue(nvmeq, qid);
>  
> +	if ((result = nvme_setup_io_queues_trylock(dev)))

Please separate the assignment to 'result' from the 'if' checking the
result. That way it will be easier to read and it matches the rest of this
driver's style.
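
For example, the separated form would look like this (just a sketch of the
requested style, not a tested diff):

	result = nvme_setup_io_queues_trylock(dev);
	if (result)
		return result;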


* [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-24 17:31 [PATCH 0/2] rr: nvme-pci: fix races and UAF Casey Chen
@ 2021-06-24 17:31 ` Casey Chen
  2021-07-06 18:37   ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Casey Chen @ 2021-06-24 17:31 UTC (permalink / raw)
  To: linux-nvme; +Cc: yzhong, ashishk, Casey Chen

The two paths below could overlap if we power off a drive quickly after
powering it on. There are multiple races in nvme_setup_io_queues() because
shutdown_lock is not held where needed and the NVMEQ_ENABLED bit is used
improperly.

nvme_reset_work()                                nvme_remove()
  nvme_setup_io_queues()                           nvme_dev_disable()
  ...                                              ...
A1  clear NVMEQ_ENABLED bit for admin queue          lock
    retry:                                       B1  nvme_suspend_io_queues()
A2    pci_free_irq() admin queue                 B2  nvme_suspend_queue() admin queue
A3    pci_free_irq_vectors()                         nvme_pci_disable()
A4    nvme_setup_irqs();                         B3    pci_free_irq_vectors()
      ...                                            unlock
A5    queue_request_irq() for admin queue
      set NVMEQ_ENABLED bit
      ...
      nvme_create_io_queues()
A6      result = queue_request_irq();
        set NVMEQ_ENABLED bit
      ...
      fail to allocate enough IO queues:
A7      nvme_suspend_io_queues()
        goto retry

1). If B3 runs in between A1 and A2, it will crash because the irqaction
  has not yet been freed by A2. B2 is supposed to free the admin queue IRQ
  but it cannot do the job because A1 has already cleared the NVMEQ_ENABLED
  bit.

Fix: combine A1 and A2 so the IRQ is freed as soon as the NVMEQ_ENABLED bit
is cleared.

2). After fixing #1, A2 could race with B3 if A2 is freeing the IRQ while
  B3 is checking the irqaction. A3 could also race with B2 if B2 is freeing
  the IRQ while A3 is checking the irqaction.

Fix: A2 and A3 take the lock for mutual exclusion.

3). A3 could race with B3 since both could run free_msi_irqs() in parallel.

Fix: A3 takes the lock for mutual exclusion.

4). A4 could fail to allocate all of the needed IRQ vectors if A3 and A4
  are interrupted by B3.

Fix: A4 takes the lock for mutual exclusion.

5). If A5/A6 happens after B2/B1, B3 will crash because the irqaction is
  not NULL; it has just been allocated by A5/A6.

Fix: take the lock around queue_request_irq() and the setting of the
NVMEQ_ENABLED bit.

6). A7 could get a chance to pci_free_irq() a certain IO queue while B3
  is checking the irqaction.

Fix: A7 takes the lock.

7). nvme_dev->online_queues needs to be protected by shutdown_lock. Since
  it is not atomic, both paths could otherwise modify it using their own
  copies.

Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Casey Chen <cachen@purestorage.com>
---
 drivers/nvme/host/pci.c | 59 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aa7245a505f..378ce9c060af 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1562,6 +1562,29 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	wmb(); /* ensure the first interrupt sees the initialization */
 }
 
+/*
+ * Try getting shutdown_lock while setting up IO queues.
+ * Caller must remember to unlock on success.
+ */
+static int nvme_setup_io_queues_trylock(struct nvme_dev *dev)
+{
+	/*
+	 * Lock is being held by nvme_dev_disable(), fail early.
+	 */
+	if (!mutex_trylock(&dev->shutdown_lock))
+		return -ENODEV;
+
+	/*
+	 * Controller is in wrong state, fail early.
+	 */
+	if (dev->ctrl.state != NVME_CTRL_CONNECTING) {
+		mutex_unlock(&dev->shutdown_lock);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
 static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 {
 	struct nvme_dev *dev = nvmeq->dev;
@@ -1590,8 +1613,10 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 		goto release_cq;
 
 	nvmeq->cq_vector = vector;
-	nvme_init_queue(nvmeq, qid);
 
+	if ((result = nvme_setup_io_queues_trylock(dev)))
+		return result;
+	nvme_init_queue(nvmeq, qid);
 	if (!polled) {
 		result = queue_request_irq(nvmeq);
 		if (result < 0)
@@ -1599,10 +1624,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	}
 
 	set_bit(NVMEQ_ENABLED, &nvmeq->flags);
+	mutex_unlock(&dev->shutdown_lock);
 	return result;
 
 release_sq:
 	dev->online_queues--;
+	mutex_unlock(&dev->shutdown_lock);
 	adapter_delete_sq(dev, qid);
 release_cq:
 	adapter_delete_cq(dev, qid);
@@ -2176,7 +2203,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (nr_io_queues == 0)
 		return 0;
 
-	clear_bit(NVMEQ_ENABLED, &adminq->flags);
+	/*
+	 * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
+	 * from set to unset. If there is a window before it is truly freed,
+	 * pci_free_irq_vectors() jumping into this window will crash.
+	 * And take lock to avoid racing with pci_free_irq_vectors() in
+	 * nvme_dev_disable() path.
+	 */
+	if ((result = nvme_setup_io_queues_trylock(dev)))
+		return result;
+	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+		pci_free_irq(pdev, 0, adminq);
 
 	if (dev->cmb_use_sqes) {
 		result = nvme_cmb_qdepth(dev, nr_io_queues,
@@ -2192,14 +2229,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 		result = nvme_remap_bar(dev, size);
 		if (!result)
 			break;
-		if (!--nr_io_queues)
+		if (!--nr_io_queues) {
+			mutex_unlock(&dev->shutdown_lock);
 			return -ENOMEM;
+		}
 	} while (1);
 	adminq->q_db = dev->dbs;
 
  retry:
 	/* Deregister the admin queue's interrupt */
-	pci_free_irq(pdev, 0, adminq);
+	if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+		pci_free_irq(pdev, 0, adminq);
 
 	/*
 	 * If we enable msix early due to not intx, disable it again before
@@ -2208,8 +2248,10 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	pci_free_irq_vectors(pdev);
 
 	result = nvme_setup_irqs(dev, nr_io_queues);
-	if (result <= 0)
+	if (result <= 0) {
+		mutex_unlock(&dev->shutdown_lock);
 		return -EIO;
+	}
 
 	dev->num_vecs = result;
 	result = max(result - 1, 1);
@@ -2222,9 +2264,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * number of interrupts.
 	 */
 	result = queue_request_irq(adminq);
-	if (result)
+	if (result) {
+		mutex_unlock(&dev->shutdown_lock);
 		return result;
+	}
 	set_bit(NVMEQ_ENABLED, &adminq->flags);
+	mutex_unlock(&dev->shutdown_lock);
 
 	result = nvme_create_io_queues(dev);
 	if (result || dev->online_queues < 2)
@@ -2233,6 +2278,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (dev->online_queues - 1 < dev->max_qid) {
 		nr_io_queues = dev->online_queues - 1;
 		nvme_disable_io_queues(dev);
+		if ((result = nvme_setup_io_queues_trylock(dev)))
+			return result;
 		nvme_suspend_io_queues(dev);
 		goto retry;
 	}
-- 
2.17.1




Thread overview: 12+ messages
2021-06-22  0:27 rr: nvme-pci: fix races in nvme_setup_io_queues() and UAF introduced by nvme_dev_remove_admin() Casey Chen
2021-06-22  0:27 ` [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
2021-06-22 15:06   ` Keith Busch
2021-06-23  0:59     ` Casey Chen
2021-06-23  1:53       ` Keith Busch
2021-06-23  4:00         ` Casey Chen
2021-06-23  4:14           ` Casey Chen
2021-06-23 14:50             ` Keith Busch
2021-06-22  0:27 ` [PATCH 2/2] nvme-pci: Fix UAF introduced by nvme_dev_remove_admin() Casey Chen
2021-06-22 19:16   ` Keith Busch
2021-06-24 17:31 [PATCH 0/2] rr: nvme-pci: fix races and UAF Casey Chen
2021-06-24 17:31 ` [PATCH 1/2] nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
2021-07-06 18:37   ` Keith Busch
