* nvme-pci: Fix multiple races in nvme_setup_io_queues()
@ 2021-06-15  6:17 Casey Chen
  2021-06-15 16:39 ` Christoph Hellwig
  0 siblings, 1 reply; 5+ messages in thread
From: Casey Chen @ 2021-06-15  6:17 UTC (permalink / raw)
  To: linux-nvme

Hi,

We found multiple races in nvme_setup_io_queues() by power-cycling a
PCIe NVMe drive for hours. nvme_setup_io_queues() could overlap with
nvme_dev_disable() if we power off a drive quickly after powering it
on. Multiple races exist because shutdown_lock is not held and the
NVMEQ_ENABLED bit is used improperly. All problematic places and their
fixes are explained in the commit message. This is also reproducible on
an Intel Optane drive. The fix has been tested for a week.

All tests and the fix are based on tag nvme-5.14-2021-06-08 of the repo
http://git.infradead.org/nvme.git

1. Testing Method
while :; do power off drive X; sleep $((RANDOM%3)).$((RANDOM%10)); power on drive X; sleep $((RANDOM%3)).$((RANDOM%10)); done
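
For reference, a concrete sketch of such a loop (assuming the drive sits in
hotplug slot 402, as in the log below, and that pciehp exposes the usual
/sys/bus/pci/slots/<slot>/power attribute; adapt to however drive power is
actually toggled in your setup):

#!/bin/bash
# Hypothetical power-cycle loop; run as root. The slot number is an
# assumption taken from the pciehp messages in the crash log below.
SLOT=402
while :; do
        echo 0 > /sys/bus/pci/slots/$SLOT/power    # power the drive off
        sleep $((RANDOM%3)).$((RANDOM%10))
        echo 1 > /sys/bus/pci/slots/$SLOT/power    # power it back on
        sleep $((RANDOM%3)).$((RANDOM%10))
done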

2. Crash Call Trace
[11668.533392] pcieport 0000:87:08.0: pciehp: pending interrupts
0x0008 from Slot Status
[11668.533431] pcieport 0000:87:08.0: pciehp: Slot(402): Card present
...
[11668.681298] nvme nvme12: pci function 0000:8c:00.0
[11668.681354] nvme 0000:8c:00.0: enabling device (0100 -> 0102)
[11669.046119] pcieport 0000:87:08.0: pciehp: pending interrupts
0x0108 from Slot Status
[11669.046142] pcieport 0000:87:08.0: pciehp: Slot(402): Link Down
[11669.046146] pcieport 0000:87:08.0: pciehp: Slot(402): Card not present
[11669.046149] pcieport 0000:87:08.0: pciehp:
pciehp_unconfigure_device: domain:bus:dev = 0000:8c:00
[11669.077428] ------------[ cut here ]------------
[11669.077431] kernel BUG at drivers/pci/msi.c:348!
[11669.077555] invalid opcode: 0000 [#1] SMP KASAN
[11669.077658] CPU: 31 PID: 716 Comm: irq/127-pciehp Not tainted 5.13.0-rc3+
[11669.077869] Hardware name: <MASKED OFF>
[11669.078022] RIP: 0010:free_msi_irqs+0x28a/0x2d0
[11669.078129] Code: f7 83 c5 01 e8 87 44 c7 ff 39 6b 14 0f 86 d6 fd
ff ff 4c 89 ef e8 76 44 c7 ff 8b 43 10 8d 7c 05 00 e8 8a 0d a7 ff 84
c0 74 d4 <0f> 0b 48 8d 7b 60 e8 5b 45 c7 ff 48 8b 7b 60 e8 52 46 9b ff
48 89
[11669.078453] RSP: 0018:ffff88813917fa38 EFLAGS: 00010202
[11669.078564] RAX: 0000000000000001 RBX: ffff88e042875800 RCX: ffffffff81160812
[11669.078703] RDX: dffffc0000000000 RSI: ffff88e54fc86740 RDI: ffff892fe1627ca0
[11669.078843] RBP: 0000000000000000 R08: ffffed278641a4fd R09: ffffed278641a4fd
[11669.078982] R10: ffff893c320d27e5 R11: ffffed278641a4fc R12: ffff893c320d22e8
[11669.079121] R13: ffff88e042875810 R14: ffff88e042875814 R15: ffff893c320d2000
[11669.079260] FS:  0000000000000000(0000) GS:ffff893ebe640000(0000)
knlGS:0000000000000000
[11669.079415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11669.079531] CR2: 00007f1fc13fa000 CR3: 00000063a2de2002 CR4: 00000000005706e0
[11669.079670] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11669.085209] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[11669.091054] PKRU: 55555554
[11669.093982] Call Trace:
[11669.096850]  pci_free_irq_vectors+0xe/0x20
[11669.099695]  nvme_dev_disable+0x140/0x760 [nvme]
[11669.102503]  ? _raw_spin_lock_irqsave+0x9c/0x100
[11669.105271]  ? trace_hardirqs_on+0x2c/0xe0
[11669.107994]  nvme_remove+0x191/0x1e0 [nvme]
[11669.110689]  pci_device_remove+0x6b/0x110
[11669.113316]  device_release_driver_internal+0x14f/0x280
[11669.115939]  pci_stop_bus_device+0xcb/0x100
[11669.118515]  pci_stop_and_remove_bus_device+0xe/0x20
[11669.121079]  pciehp_unconfigure_device+0xfa/0x200
[11669.123597]  ? pciehp_configure_device+0x1c0/0x1c0
[11669.126049]  ? trace_hardirqs_on+0x2c/0xe0
[11669.128444]  pciehp_disable_slot+0xc4/0x1a0
[11669.130771]  ? pciehp_runtime_suspend+0x40/0x40
[11669.133054]  ? __mutex_lock_slowpath+0x10/0x10
[11669.135289]  ? trace_hardirqs_on+0x2c/0xe0
[11669.137462]  pciehp_handle_presence_or_link_change+0x15c/0x4f0
[11669.139632]  ? down_read+0x11f/0x1a0
[11669.141731]  ? pciehp_handle_disable_request+0x80/0x80
[11669.143817]  ? rwsem_down_read_slowpath+0x600/0x600
[11669.145851]  ? __radix_tree_lookup+0xb2/0x130
[11669.147842]  pciehp_ist+0x19d/0x1a0
[11669.149790]  ? pciehp_set_indicators+0xe0/0xe0
[11669.151704]  ? irq_finalize_oneshot.part.46+0x1d0/0x1d0
[11669.153588]  irq_thread_fn+0x3f/0xa0
[11669.155407]  irq_thread+0x195/0x290
[11669.157147]  ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.158883]  ? _raw_read_lock_irq+0x50/0x50
[11669.160611]  ? _raw_read_lock_irq+0x50/0x50
[11669.162320]  ? irq_forced_thread_fn+0xf0/0xf0
[11669.164032]  ? trace_hardirqs_on+0x2c/0xe0
[11669.165731]  ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.167461]  kthread+0x1c8/0x1f0
[11669.169173]  ? kthread_parkme+0x40/0x40
[11669.170883]  ret_from_fork+0x22/0x30
[11669.172592] Modules linked in: iptable_nat nf_nat ip6table_raw
xt_CT iptable_raw ip6t_REJECT nf_reject_ipv6 ip6table_mangle
ipt_REJECT nf_reject_ipv4 xt_owner xt_conntrack iptable_mangle xt_ipvs
ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype
br_netfilter xt_mark xt_tcpudp ip6table_filter ip6_tables
iptable_filter nvme ip6_gre ip6_tunnel tunnel6 gre macvlan bridge stp
llc nvme_rdma nvme_fabrics nvme_core vfio_pci vfio_virqfd
vfio_iommu_type1 vfio qede qed crc8 scsi_dh_alua intel_rapl_msr
intel_rapl_common skx_edac x86_pkg_temp_thermal coretemp kvm_intel kvm
irqbypass ntb_hw_intel ntb intel_pch_thermal hed sunrpc msr bonding
ip_tables x_tables autofs4 rdma_ucm rdma_cm mlx4_ib mlx4_core iw_cm
ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core ghash_clmulni_intel
aesni_intel crypto_simd cryptd mlx5_core mlxfw pci_hyperv_intf mpt3sas
raid_class scsi_transport_sas i2c_i801 i2c_smbus pure_gpio
pinctrl_lewisburg pinctrl_intel hid_generic usbhid hid
[11669.172735]  [last unloaded: qla2xxx]
[11669.201990] Dumping ftrace buffer:
[11669.204376]    (ftrace buffer empty)
[11669.206766] ---[ end trace 07b68fb4b0c1a54c ]---
[11669.652093] RIP: 0010:free_msi_irqs+0x28a/0x2d0
[11669.654461] Code: f7 83 c5 01 e8 87 44 c7 ff 39 6b 14 0f 86 d6 fd
ff ff 4c 89 ef e8 76 44 c7 ff 8b 43 10 8d 7c 05 00 e8 8a 0d a7 ff 84
c0 74 d4 <0f> 0b 48 8d 7b 60 e8 5b 45 c7 ff 48 8b 7b 60 e8 52 46 9b ff
48 89
[11669.661764] RSP: 0018:ffff88813917fa38 EFLAGS: 00010202
[11669.664271] RAX: 0000000000000001 RBX: ffff88e042875800 RCX: ffffffff81160812
[11669.669382] RDX: dffffc0000000000 RSI: ffff88e54fc86740 RDI: ffff892fe1627ca0
[11669.674692] RBP: 0000000000000000 R08: ffffed278641a4fd R09: ffffed278641a4fd
[11669.680179] R10: ffff893c320d27e5 R11: ffffed278641a4fc R12: ffff893c320d22e8
[11669.685880] R13: ffff88e042875810 R14: ffff88e042875814 R15: ffff893c320d2000
[11669.691587] FS:  0000000000000000(0000) GS:ffff893ebe640000(0000)
knlGS:0000000000000000
[11669.697296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11669.700174] CR2: 00007f1fc13fa000 CR3: 00000063a2de2002 CR4: 00000000005706e0

3. Proposed Fix
From b609626a69d584ce36bad4787ea6b4c6e6dba453 Mon Sep 17 00:00:00 2001
From: Casey Chen <cachen@purestorage.com>
Date: Mon, 31 Aug 2020 18:14:23 -0600
Subject: [PATCH] nvme-pci: Fix multiple races in nvme_setup_io_queues()

The two paths below could overlap if we power off a drive quickly after
powering it on. There are multiple races in nvme_setup_io_queues() because
shutdown_lock is not held and the NVMEQ_ENABLED bit is used improperly.

nvme_reset_work()                                nvme_remove()
  nvme_setup_io_queues()                           nvme_dev_disable()
  ...                                              ...
A1  clear NVMEQ_ENABLED bit for admin queue          lock
    retry:                                       B1  nvme_suspend_io_queues()
A2    pci_free_irq() admin queue                 B2    nvme_suspend_queue() admin queue
A3    pci_free_irq_vectors()                         nvme_pci_disable()
A4    nvme_setup_irqs();                         B3      pci_free_irq_vectors()
      ...                                            unlock
A5    queue_request_irq() for admin queue
      set NVMEQ_ENABLED bit
      ...
      nvme_create_io_queues()
A6      result = queue_request_irq();
        set NVMEQ_ENABLED bit
      ...
      fail to allocate enough IO queues:
A7      nvme_suspend_io_queues()
        goto retry

1) If B3 runs in between A1 and A2, it will crash because the irqaction has
   not yet been freed by A2. B2 is supposed to free the admin queue IRQ, but
   it cannot do so because A1 has already cleared the NVMEQ_ENABLED bit.

Fix: combine A1 and A2 so the IRQ is freed as soon as the NVMEQ_ENABLED bit
is cleared.

2) With #1 solved, A2 could still race with B3 if A2 is freeing the IRQ
   while B3 is checking the irqaction. A3 could likewise race with B2 if B2
   is freeing the IRQ while A3 is checking the irqaction.

Fix: A2 and A3 take the lock for mutual exclusion.

3) A3 could race with B3 since they could run free_msi_irqs() in parallel.

Fix: A3 takes the lock for mutual exclusion.

4) A4 could fail to allocate all needed IRQ vectors if A3 and A4 are
   interrupted by B3.

Fix: A4 takes the lock for mutual exclusion.

5) If A5/A6 runs after B2/B1, B3 will crash because the irqaction is not
   NULL; it has just been allocated by A5/A6.

Fix: hold the lock across queue_request_irq() and the setting of the
NVMEQ_ENABLED bit.

6) A7 could call pci_free_irq() for a certain I/O queue while B3 is checking
   the irqaction.

Fix: A7 takes the lock.

7) nvme_dev->online_queues needs to be protected by shutdown_lock. Since it
   is not atomic, both paths could otherwise modify it based on their own
   stale copies.

Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Casey Chen <cachen@purestorage.com>
---
 drivers/nvme/host/pci.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aa7245a505f..81e53aaaa77c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1590,8 +1590,9 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
                 goto release_cq;
 
         nvmeq->cq_vector = vector;
-        nvme_init_queue(nvmeq, qid);
 
+        mutex_lock(&dev->shutdown_lock);
+        nvme_init_queue(nvmeq, qid);
         if (!polled) {
                 result = queue_request_irq(nvmeq);
                 if (result < 0)
@@ -1599,10 +1600,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
         }
 
         set_bit(NVMEQ_ENABLED, &nvmeq->flags);
+        mutex_unlock(&dev->shutdown_lock);
         return result;
 
 release_sq:
         dev->online_queues--;
+        mutex_unlock(&dev->shutdown_lock);
         adapter_delete_sq(dev, qid);
 release_cq:
         adapter_delete_cq(dev, qid);
@@ -2176,7 +2179,16 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
         if (nr_io_queues == 0)
                 return 0;
 
-        clear_bit(NVMEQ_ENABLED, &adminq->flags);
+        /*
+         * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
+         * from set to unset. If there is a window before the IRQ is
+         * truly freed, pci_free_irq_vectors() jumping into that window
+         * will crash. Also take the lock to avoid racing with
+         * pci_free_irq_vectors() in the nvme_dev_disable() path.
+         */
+        mutex_lock(&dev->shutdown_lock);
+        if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+                pci_free_irq(pdev, 0, adminq);
 
         if (dev->cmb_use_sqes) {
                 result = nvme_cmb_qdepth(dev, nr_io_queues,
@@ -2192,14 +2204,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
                 result = nvme_remap_bar(dev, size);
                 if (!result)
                         break;
-                if (!--nr_io_queues)
+                if (!--nr_io_queues) {
+                        mutex_unlock(&dev->shutdown_lock);
                         return -ENOMEM;
+                }
         } while (1);
         adminq->q_db = dev->dbs;
 
  retry:
         /* Deregister the admin queue's interrupt */
-        pci_free_irq(pdev, 0, adminq);
+        if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+                pci_free_irq(pdev, 0, adminq);
 
         /*
          * If we enable msix early due to not intx, disable it again before
@@ -2208,8 +2223,10 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
         pci_free_irq_vectors(pdev);
 
         result = nvme_setup_irqs(dev, nr_io_queues);
-        if (result <= 0)
+        if (result <= 0) {
+                mutex_unlock(&dev->shutdown_lock);
                 return -EIO;
+        }
 
         dev->num_vecs = result;
         result = max(result - 1, 1);
@@ -2222,9 +2239,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
          * number of interrupts.
          */
         result = queue_request_irq(adminq);
-        if (result)
+        if (result) {
+                mutex_unlock(&dev->shutdown_lock);
                 return result;
+        }
         set_bit(NVMEQ_ENABLED, &adminq->flags);
+        mutex_unlock(&dev->shutdown_lock);
 
         result = nvme_create_io_queues(dev);
         if (result || dev->online_queues < 2)
@@ -2233,6 +2253,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
         if (dev->online_queues - 1 < dev->max_qid) {
                 nr_io_queues = dev->online_queues - 1;
                 nvme_disable_io_queues(dev);
+                mutex_lock(&dev->shutdown_lock);
                 nvme_suspend_io_queues(dev);
                 goto retry;
         }
--
2.17.1

* Re: nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-15  6:17 nvme-pci: Fix multiple races in nvme_setup_io_queues() Casey Chen
@ 2021-06-15 16:39 ` Christoph Hellwig
  2021-06-15 22:33   ` Casey Chen
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2021-06-15 16:39 UTC (permalink / raw)
  To: Casey Chen; +Cc: linux-nvme

The patch is so wrapped that it not only doesn't apply, but that I also
have a hard time reading it.  Can you resend it using git-send-email?

* Re: nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-15 16:39 ` Christoph Hellwig
@ 2021-06-15 22:33   ` Casey Chen
  0 siblings, 0 replies; 5+ messages in thread
From: Casey Chen @ 2021-06-15 22:33 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch, linux-nvme

Sorry for the previous bad emails I have sent.

Please discuss and comment on the latest emails I sent using git
send-email at 2:58 PM PDT. There were two emails: one is a summary with
the subject 'Review Request: nvme-pci: Fix multiple races in
nvme_setup_io_queues()'; the other is the patch with the subject
'[PATCH] nvme-pci: Fix multiple races in nvme_setup_io_queues()'.

I will wait for more comments and then resolve them together. Thanks!


On Tue, Jun 15, 2021 at 9:40 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> The patch is so wrapped that it not only doesn't apply, but that I also
> have a hard time reading it.  Can you resend it using git-send-email?

* Re: nvme-pci: Fix multiple races in nvme_setup_io_queues()
  2021-06-15  6:26 Casey Chen
@ 2021-06-15 20:53 ` Keith Busch
  0 siblings, 0 replies; 5+ messages in thread
From: Keith Busch @ 2021-06-15 20:53 UTC (permalink / raw)
  To: Casey Chen; +Cc: linux-nvme, Yuanyuan Zhong, Ashish Karkare

On Mon, Jun 14, 2021 at 11:26:55PM -0700, Casey Chen wrote:
> (Please ignore the previous email, the call paths comparison shown in
> commit message should look better if you copy then paste in a code
> editor)

Your email client also mangles your patch, making it impossible to apply.
If you're able to set up 'git send-email' on your development machine,
it will always format patches correctly for mailing-list consumption.

I am aware of the race condition you've described. I never bothered with
it because of the unusual circumstances required to hit it, but since you
have identified a test case, I agree it's time we address it.

> ---
>  drivers/nvme/host/pci.c | 33 +++++++++++++++++++++++++++------
>  1 file changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3aa7245a505f..81e53aaaa77c 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1590,8 +1590,9 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
>          goto release_cq;
> 
>   nvmeq->cq_vector = vector;
> - nvme_init_queue(nvmeq, qid);
> 
> + mutex_lock(&dev->shutdown_lock);

There doesn't seem to be a reason to wait for the lock here. A
mutex_trylock() should be fine; simply abandon queue initialization
if locking fails.
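
For illustration, a rough sketch of that suggestion against the hunk quoted
above (hypothetical, not the submitted patch; the -ENODEV return value and
the exact placement are assumptions):

        nvmeq->cq_vector = vector;

        /*
         * Give up on creating this queue if the controller is already
         * being torn down, instead of blocking on shutdown_lock.
         */
        if (!mutex_trylock(&dev->shutdown_lock))
                return -ENODEV;

        nvme_init_queue(nvmeq, qid);
        if (!polled) {
                result = queue_request_irq(nvmeq);
                if (result < 0)
                        goto release_sq;
        }

        set_bit(NVMEQ_ENABLED, &nvmeq->flags);
        mutex_unlock(&dev->shutdown_lock);
        return result;

The release_sq error path would still need to drop shutdown_lock, as in the
submitted patch.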

* nvme-pci: Fix multiple races in nvme_setup_io_queues()
@ 2021-06-15  6:26 Casey Chen
  2021-06-15 20:53 ` Keith Busch
  0 siblings, 1 reply; 5+ messages in thread
From: Casey Chen @ 2021-06-15  6:26 UTC (permalink / raw)
  To: linux-nvme; +Cc: Yuanyuan Zhong, Ashish Karkare

(Please ignore the previous email; the call-path comparison shown in the
commit message should look better if you copy and paste it into a code
editor.)

Hi,

We found multiple races in nvme_setup_io_queues() by power-cycling a
PCIe NVMe drive for hours. nvme_setup_io_queues() could overlap with
nvme_dev_disable() if we power off a drive quickly after powering it
on. Multiple races exist because shutdown_lock is not held and the
NVMEQ_ENABLED bit is used improperly. All problematic places and their
fixes are explained in the commit message. This is also reproducible on
an Intel Optane drive. The fix has been tested for a week.

All tests and the fix are based on tag nvme-5.14-2021-06-08 of the repo
http://git.infradead.org/nvme.git

1. Testing Method
while :; do power off drive X; sleep $((RANDOM%3)).$((RANDOM%10)); power on drive X; sleep $((RANDOM%3)).$((RANDOM%10)); done

2. Crash Call Trace
[11668.533392] pcieport 0000:87:08.0: pciehp: pending interrupts
0x0008 from Slot Status
[11668.533431] pcieport 0000:87:08.0: pciehp: Slot(402): Card present
...
[11668.681298] nvme nvme12: pci function 0000:8c:00.0
[11668.681354] nvme 0000:8c:00.0: enabling device (0100 -> 0102)
[11669.046119] pcieport 0000:87:08.0: pciehp: pending interrupts
0x0108 from Slot Status
[11669.046142] pcieport 0000:87:08.0: pciehp: Slot(402): Link Down
[11669.046146] pcieport 0000:87:08.0: pciehp: Slot(402): Card not present
[11669.046149] pcieport 0000:87:08.0: pciehp:
pciehp_unconfigure_device: domain:bus:dev = 0000:8c:00
[11669.077428] ------------[ cut here ]------------
[11669.077431] kernel BUG at drivers/pci/msi.c:348!
[11669.077555] invalid opcode: 0000 [#1] SMP KASAN
[11669.077658] CPU: 31 PID: 716 Comm: irq/127-pciehp Not tainted 5.13.0-rc3+
[11669.077869] Hardware name: <MASKED OFF>
[11669.078022] RIP: 0010:free_msi_irqs+0x28a/0x2d0
[11669.078129] Code: f7 83 c5 01 e8 87 44 c7 ff 39 6b 14 0f 86 d6 fd
ff ff 4c 89 ef e8 76 44 c7 ff 8b 43 10 8d 7c 05 00 e8 8a 0d a7 ff 84
c0 74 d4 <0f> 0b 48 8d 7b 60 e8 5b 45 c7 ff 48 8b 7b 60 e8 52 46 9b ff
48 89
[11669.078453] RSP: 0018:ffff88813917fa38 EFLAGS: 00010202
[11669.078564] RAX: 0000000000000001 RBX: ffff88e042875800 RCX: ffffffff81160812
[11669.078703] RDX: dffffc0000000000 RSI: ffff88e54fc86740 RDI: ffff892fe1627ca0
[11669.078843] RBP: 0000000000000000 R08: ffffed278641a4fd R09: ffffed278641a4fd
[11669.078982] R10: ffff893c320d27e5 R11: ffffed278641a4fc R12: ffff893c320d22e8
[11669.079121] R13: ffff88e042875810 R14: ffff88e042875814 R15: ffff893c320d2000
[11669.079260] FS:  0000000000000000(0000) GS:ffff893ebe640000(0000)
knlGS:0000000000000000
[11669.079415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11669.079531] CR2: 00007f1fc13fa000 CR3: 00000063a2de2002 CR4: 00000000005706e0
[11669.079670] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11669.085209] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[11669.091054] PKRU: 55555554
[11669.093982] Call Trace:
[11669.096850]  pci_free_irq_vectors+0xe/0x20
[11669.099695]  nvme_dev_disable+0x140/0x760 [nvme]
[11669.102503]  ? _raw_spin_lock_irqsave+0x9c/0x100
[11669.105271]  ? trace_hardirqs_on+0x2c/0xe0
[11669.107994]  nvme_remove+0x191/0x1e0 [nvme]
[11669.110689]  pci_device_remove+0x6b/0x110
[11669.113316]  device_release_driver_internal+0x14f/0x280
[11669.115939]  pci_stop_bus_device+0xcb/0x100
[11669.118515]  pci_stop_and_remove_bus_device+0xe/0x20
[11669.121079]  pciehp_unconfigure_device+0xfa/0x200
[11669.123597]  ? pciehp_configure_device+0x1c0/0x1c0
[11669.126049]  ? trace_hardirqs_on+0x2c/0xe0
[11669.128444]  pciehp_disable_slot+0xc4/0x1a0
[11669.130771]  ? pciehp_runtime_suspend+0x40/0x40
[11669.133054]  ? __mutex_lock_slowpath+0x10/0x10
[11669.135289]  ? trace_hardirqs_on+0x2c/0xe0
[11669.137462]  pciehp_handle_presence_or_link_change+0x15c/0x4f0
[11669.139632]  ? down_read+0x11f/0x1a0
[11669.141731]  ? pciehp_handle_disable_request+0x80/0x80
[11669.143817]  ? rwsem_down_read_slowpath+0x600/0x600
[11669.145851]  ? __radix_tree_lookup+0xb2/0x130
[11669.147842]  pciehp_ist+0x19d/0x1a0
[11669.149790]  ? pciehp_set_indicators+0xe0/0xe0
[11669.151704]  ? irq_finalize_oneshot.part.46+0x1d0/0x1d0
[11669.153588]  irq_thread_fn+0x3f/0xa0
[11669.155407]  irq_thread+0x195/0x290
[11669.157147]  ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.158883]  ? _raw_read_lock_irq+0x50/0x50
[11669.160611]  ? _raw_read_lock_irq+0x50/0x50
[11669.162320]  ? irq_forced_thread_fn+0xf0/0xf0
[11669.164032]  ? trace_hardirqs_on+0x2c/0xe0
[11669.165731]  ? irq_thread_check_affinity.part.49+0xe0/0xe0
[11669.167461]  kthread+0x1c8/0x1f0
[11669.169173]  ? kthread_parkme+0x40/0x40
[11669.170883]  ret_from_fork+0x22/0x30
[11669.172592] Modules linked in: iptable_nat nf_nat ip6table_raw
xt_CT iptable_raw ip6t_REJECT nf_reject_ipv6 ip6table_mangle
ipt_REJECT nf_reject_ipv4 xt_owner xt_conntrack iptable_mangle xt_ipvs
ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype
br_netfilter xt_mark xt_tcpudp ip6table_filter ip6_tables
iptable_filter nvme ip6_gre ip6_tunnel tunnel6 gre macvlan bridge stp
llc nvme_rdma nvme_fabrics nvme_core vfio_pci vfio_virqfd
vfio_iommu_type1 vfio qede qed crc8 scsi_dh_alua intel_rapl_msr
intel_rapl_common skx_edac x86_pkg_temp_thermal coretemp kvm_intel kvm
irqbypass ntb_hw_intel ntb intel_pch_thermal hed sunrpc msr bonding
ip_tables x_tables autofs4 rdma_ucm rdma_cm mlx4_ib mlx4_core iw_cm
ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core ghash_clmulni_intel
aesni_intel crypto_simd cryptd mlx5_core mlxfw pci_hyperv_intf mpt3sas
raid_class scsi_transport_sas i2c_i801 i2c_smbus pure_gpio
pinctrl_lewisburg pinctrl_intel hid_generic usbhid hid
[11669.172735]  [last unloaded: qla2xxx]
[11669.201990] Dumping ftrace buffer:
[11669.204376]    (ftrace buffer empty)
[11669.206766] ---[ end trace 07b68fb4b0c1a54c ]---
[11669.652093] RIP: 0010:free_msi_irqs+0x28a/0x2d0
[11669.654461] Code: f7 83 c5 01 e8 87 44 c7 ff 39 6b 14 0f 86 d6 fd
ff ff 4c 89 ef e8 76 44 c7 ff 8b 43 10 8d 7c 05 00 e8 8a 0d a7 ff 84
c0 74 d4 <0f> 0b 48 8d 7b 60 e8 5b 45 c7 ff 48 8b 7b 60 e8 52 46 9b ff
48 89
[11669.661764] RSP: 0018:ffff88813917fa38 EFLAGS: 00010202
[11669.664271] RAX: 0000000000000001 RBX: ffff88e042875800 RCX: ffffffff81160812
[11669.669382] RDX: dffffc0000000000 RSI: ffff88e54fc86740 RDI: ffff892fe1627ca0
[11669.674692] RBP: 0000000000000000 R08: ffffed278641a4fd R09: ffffed278641a4fd
[11669.680179] R10: ffff893c320d27e5 R11: ffffed278641a4fc R12: ffff893c320d22e8
[11669.685880] R13: ffff88e042875810 R14: ffff88e042875814 R15: ffff893c320d2000
[11669.691587] FS:  0000000000000000(0000) GS:ffff893ebe640000(0000)
knlGS:0000000000000000
[11669.697296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11669.700174] CR2: 00007f1fc13fa000 CR3: 00000063a2de2002 CR4: 00000000005706e0

3. Proposed Fix
From b609626a69d584ce36bad4787ea6b4c6e6dba453 Mon Sep 17 00:00:00 2001
From: Casey Chen <cachen@purestorage.com>
Date: Mon, 31 Aug 2020 18:14:23 -0600
Subject: [PATCH] nvme-pci: Fix multiple races in nvme_setup_io_queues()

The two paths below could overlap if we power off a drive quickly after
powering it on. There are multiple races in nvme_setup_io_queues() because
shutdown_lock is not held and the NVMEQ_ENABLED bit is used improperly.

nvme_reset_work()                                nvme_remove()
  nvme_setup_io_queues()                           nvme_dev_disable()
  ...                                              ...
A1  clear NVMEQ_ENABLED bit for admin queue          lock
    retry:                                       B1  nvme_suspend_io_queues()
A2    pci_free_irq() admin queue                 B2    nvme_suspend_queue() admin queue
A3    pci_free_irq_vectors()                         nvme_pci_disable()
A4    nvme_setup_irqs();                         B3      pci_free_irq_vectors()
      ...                                            unlock
A5    queue_request_irq() for admin queue
      set NVMEQ_ENABLED bit
      ...
      nvme_create_io_queues()
A6      result = queue_request_irq();
        set NVMEQ_ENABLED bit
      ...
      fail to allocate enough IO queues:
A7      nvme_suspend_io_queues()
        goto retry

1) If B3 runs in between A1 and A2, it will crash because the irqaction has
   not yet been freed by A2. B2 is supposed to free the admin queue IRQ, but
   it cannot do so because A1 has already cleared the NVMEQ_ENABLED bit.

Fix: combine A1 and A2 so the IRQ is freed as soon as the NVMEQ_ENABLED bit
is cleared.

2) With #1 solved, A2 could still race with B3 if A2 is freeing the IRQ
   while B3 is checking the irqaction. A3 could likewise race with B2 if B2
   is freeing the IRQ while A3 is checking the irqaction.

Fix: A2 and A3 take the lock for mutual exclusion.

3) A3 could race with B3 since they could run free_msi_irqs() in parallel.

Fix: A3 takes the lock for mutual exclusion.

4) A4 could fail to allocate all needed IRQ vectors if A3 and A4 are
   interrupted by B3.

Fix: A4 takes the lock for mutual exclusion.

5) If A5/A6 runs after B2/B1, B3 will crash because the irqaction is not
   NULL; it has just been allocated by A5/A6.

Fix: hold the lock across queue_request_irq() and the setting of the
NVMEQ_ENABLED bit.

6) A7 could call pci_free_irq() for a certain I/O queue while B3 is checking
   the irqaction.

Fix: A7 takes the lock.

7) nvme_dev->online_queues needs to be protected by shutdown_lock. Since it
   is not atomic, both paths could otherwise modify it based on their own
   stale copies.

Co-developed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Casey Chen <cachen@purestorage.com>
---
 drivers/nvme/host/pci.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3aa7245a505f..81e53aaaa77c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1590,8 +1590,9 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
                 goto release_cq;
 
         nvmeq->cq_vector = vector;
-        nvme_init_queue(nvmeq, qid);
 
+        mutex_lock(&dev->shutdown_lock);
+        nvme_init_queue(nvmeq, qid);
         if (!polled) {
                 result = queue_request_irq(nvmeq);
                 if (result < 0)
@@ -1599,10 +1600,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
         }
 
         set_bit(NVMEQ_ENABLED, &nvmeq->flags);
+        mutex_unlock(&dev->shutdown_lock);
         return result;
 
 release_sq:
         dev->online_queues--;
+        mutex_unlock(&dev->shutdown_lock);
         adapter_delete_sq(dev, qid);
 release_cq:
         adapter_delete_cq(dev, qid);
@@ -2176,7 +2179,16 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
         if (nr_io_queues == 0)
                 return 0;
 
-        clear_bit(NVMEQ_ENABLED, &adminq->flags);
+        /*
+         * Free IRQ resources as soon as NVMEQ_ENABLED bit transitions
+         * from set to unset. If there is a window before the IRQ is
+         * truly freed, pci_free_irq_vectors() jumping into that window
+         * will crash. Also take the lock to avoid racing with
+         * pci_free_irq_vectors() in the nvme_dev_disable() path.
+         */
+        mutex_lock(&dev->shutdown_lock);
+        if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+                pci_free_irq(pdev, 0, adminq);
 
         if (dev->cmb_use_sqes) {
                 result = nvme_cmb_qdepth(dev, nr_io_queues,
@@ -2192,14 +2204,17 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
                 result = nvme_remap_bar(dev, size);
                 if (!result)
                         break;
-                if (!--nr_io_queues)
+                if (!--nr_io_queues) {
+                        mutex_unlock(&dev->shutdown_lock);
                         return -ENOMEM;
+                }
         } while (1);
         adminq->q_db = dev->dbs;
 
  retry:
         /* Deregister the admin queue's interrupt */
-        pci_free_irq(pdev, 0, adminq);
+        if (test_and_clear_bit(NVMEQ_ENABLED, &adminq->flags))
+                pci_free_irq(pdev, 0, adminq);
 
         /*
          * If we enable msix early due to not intx, disable it again before
@@ -2208,8 +2223,10 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
         pci_free_irq_vectors(pdev);
 
         result = nvme_setup_irqs(dev, nr_io_queues);
-        if (result <= 0)
+        if (result <= 0) {
+                mutex_unlock(&dev->shutdown_lock);
                 return -EIO;
+        }
 
         dev->num_vecs = result;
         result = max(result - 1, 1);
@@ -2222,9 +2239,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
          * number of interrupts.
          */
         result = queue_request_irq(adminq);
-        if (result)
+        if (result) {
+                mutex_unlock(&dev->shutdown_lock);
                 return result;
+        }
         set_bit(NVMEQ_ENABLED, &adminq->flags);
+        mutex_unlock(&dev->shutdown_lock);
 
         result = nvme_create_io_queues(dev);
         if (result || dev->online_queues < 2)
@@ -2233,6 +2253,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
         if (dev->online_queues - 1 < dev->max_qid) {
                 nr_io_queues = dev->online_queues - 1;
                 nvme_disable_io_queues(dev);
+                mutex_lock(&dev->shutdown_lock);
                 nvme_suspend_io_queues(dev);
                 goto retry;
         }
--
2.17.1
