linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>,
	James Puthukattukaran <james.puthukattukaran@oracle.com>,
	Keith Busch <kbusch@kernel.org>, Sasha Levin <sashal@kernel.org>,
	linux-nvme@lists.infradead.org
Subject: [PATCH AUTOSEL 5.2 14/44] nvme: fix a possible deadlock when passthru commands sent to a multipath device
Date: Tue, 20 Aug 2019 09:39:58 -0400	[thread overview]
Message-ID: <20190820134028.10829-14-sashal@kernel.org> (raw)
In-Reply-To: <20190820134028.10829-1-sashal@kernel.org>

From: Sagi Grimberg <sagi@grimberg.me>

[ Upstream commit b9156daeb1601d69007b7e50efcf89d69d72ec1d ]

When the user issues a command with side effects, we will end up freezing
the namespace request queue when updating disk info (and the same for
the corresponding mpath disk node).

However, we are not freezing the mpath node request queue,
which means that mpath I/O can still come in and block on blk_queue_enter
(called from nvme_ns_head_make_request -> direct_make_request).

This is a deadlock, because blk_queue_enter will block until the inner
namespace request queue is unfroze, but that process is blocked because
the namespace revalidation is trying to update the mpath disk info
and freeze its request queue (which will never complete because
of the I/O that is blocked on blk_queue_enter).

Fix this by freezing all the subsystem nsheads request queues before
executing the passthru command. Given that these commands are infrequent
we should not worry about this temporary I/O freeze to keep things sane.

Here is the matching hang traces:
--
[ 374.465002] INFO: task systemd-udevd:17994 blocked for more than 122 seconds.
[ 374.472975] Not tainted 5.2.0-rc3-mpdebug+ #42
[ 374.478522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 374.487274] systemd-udevd D 0 17994 1 0x00000000
[ 374.493407] Call Trace:
[ 374.496145] __schedule+0x2ef/0x620
[ 374.500047] schedule+0x38/0xa0
[ 374.503569] blk_queue_enter+0x139/0x220
[ 374.507959] ? remove_wait_queue+0x60/0x60
[ 374.512540] direct_make_request+0x60/0x130
[ 374.517219] nvme_ns_head_make_request+0x11d/0x420 [nvme_core]
[ 374.523740] ? generic_make_request_checks+0x307/0x6f0
[ 374.529484] generic_make_request+0x10d/0x2e0
[ 374.534356] submit_bio+0x75/0x140
[ 374.538163] ? guard_bio_eod+0x32/0xe0
[ 374.542361] submit_bh_wbc+0x171/0x1b0
[ 374.546553] block_read_full_page+0x1ed/0x330
[ 374.551426] ? check_disk_change+0x70/0x70
[ 374.556008] ? scan_shadow_nodes+0x30/0x30
[ 374.560588] blkdev_readpage+0x18/0x20
[ 374.564783] do_read_cache_page+0x301/0x860
[ 374.569463] ? blkdev_writepages+0x10/0x10
[ 374.574037] ? prep_new_page+0x88/0x130
[ 374.578329] ? get_page_from_freelist+0xa2f/0x1280
[ 374.583688] ? __alloc_pages_nodemask+0x179/0x320
[ 374.588947] read_cache_page+0x12/0x20
[ 374.593142] read_dev_sector+0x2d/0xd0
[ 374.597337] read_lba+0x104/0x1f0
[ 374.601046] find_valid_gpt+0xfa/0x720
[ 374.605243] ? string_nocheck+0x58/0x70
[ 374.609534] ? find_valid_gpt+0x720/0x720
[ 374.614016] efi_partition+0x89/0x430
[ 374.618113] ? string+0x48/0x60
[ 374.621632] ? snprintf+0x49/0x70
[ 374.625339] ? find_valid_gpt+0x720/0x720
[ 374.629828] check_partition+0x116/0x210
[ 374.634214] rescan_partitions+0xb6/0x360
[ 374.638699] __blkdev_reread_part+0x64/0x70
[ 374.643377] blkdev_reread_part+0x23/0x40
[ 374.647860] blkdev_ioctl+0x48c/0x990
[ 374.651956] block_ioctl+0x41/0x50
[ 374.655766] do_vfs_ioctl+0xa7/0x600
[ 374.659766] ? locks_lock_inode_wait+0xb1/0x150
[ 374.664832] ksys_ioctl+0x67/0x90
[ 374.668539] __x64_sys_ioctl+0x1a/0x20
[ 374.672732] do_syscall_64+0x5a/0x1c0
[ 374.676828] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 374.738474] INFO: task nvmeadm:49141 blocked for more than 123 seconds.
[ 374.745871] Not tainted 5.2.0-rc3-mpdebug+ #42
[ 374.751419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 374.760170] nvmeadm D 0 49141 36333 0x00004080
[ 374.766301] Call Trace:
[ 374.769038] __schedule+0x2ef/0x620
[ 374.772939] schedule+0x38/0xa0
[ 374.776452] blk_mq_freeze_queue_wait+0x59/0x100
[ 374.781614] ? remove_wait_queue+0x60/0x60
[ 374.786192] blk_mq_freeze_queue+0x1a/0x20
[ 374.790773] nvme_update_disk_info.isra.57+0x5f/0x350 [nvme_core]
[ 374.797582] ? nvme_identify_ns.isra.50+0x71/0xc0 [nvme_core]
[ 374.804006] __nvme_revalidate_disk+0xe5/0x110 [nvme_core]
[ 374.810139] nvme_revalidate_disk+0xa6/0x120 [nvme_core]
[ 374.816078] ? nvme_submit_user_cmd+0x11e/0x320 [nvme_core]
[ 374.822299] nvme_user_cmd+0x264/0x370 [nvme_core]
[ 374.827661] nvme_dev_ioctl+0x112/0x1d0 [nvme_core]
[ 374.833114] do_vfs_ioctl+0xa7/0x600
[ 374.837117] ? __audit_syscall_entry+0xdd/0x130
[ 374.842184] ksys_ioctl+0x67/0x90
[ 374.845891] __x64_sys_ioctl+0x1a/0x20
[ 374.850082] do_syscall_64+0x5a/0x1c0
[ 374.854178] entry_SYSCALL_64_after_hwframe+0x44/0xa9
--

Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Tested-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/nvme/host/core.c      |  5 +++++
 drivers/nvme/host/multipath.c | 30 ++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      | 12 ++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dfbd5872f4422..05301b94e2fa0 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1268,6 +1268,9 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	 */
 	if (effects & (NVME_CMD_EFFECTS_LBCC | NVME_CMD_EFFECTS_CSE_MASK)) {
 		mutex_lock(&ctrl->scan_lock);
+		mutex_lock(&ctrl->subsys->lock);
+		nvme_mpath_start_freeze(ctrl->subsys);
+		nvme_mpath_wait_freeze(ctrl->subsys);
 		nvme_start_freeze(ctrl);
 		nvme_wait_freeze(ctrl);
 	}
@@ -1298,6 +1301,8 @@ static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
 		nvme_update_formats(ctrl);
 	if (effects & (NVME_CMD_EFFECTS_LBCC | NVME_CMD_EFFECTS_CSE_MASK)) {
 		nvme_unfreeze(ctrl);
+		nvme_mpath_unfreeze(ctrl->subsys);
+		mutex_unlock(&ctrl->subsys->lock);
 		mutex_unlock(&ctrl->scan_lock);
 	}
 	if (effects & NVME_CMD_EFFECTS_CCC)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index e942b3e840687..dafb9e4aa1237 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -12,6 +12,36 @@ module_param(multipath, bool, 0444);
 MODULE_PARM_DESC(multipath,
 	"turn on native support for multiple controllers per subsystem");
 
+void nvme_mpath_unfreeze(struct nvme_subsystem *subsys)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+	list_for_each_entry(h, &subsys->nsheads, entry)
+		if (h->disk)
+			blk_mq_unfreeze_queue(h->disk->queue);
+}
+
+void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+	list_for_each_entry(h, &subsys->nsheads, entry)
+		if (h->disk)
+			blk_mq_freeze_queue_wait(h->disk->queue);
+}
+
+void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+	list_for_each_entry(h, &subsys->nsheads, entry)
+		if (h->disk)
+			blk_freeze_queue_start(h->disk->queue);
+}
+
 /*
  * If multipathing is enabled we need to always use the subsystem instance
  * number for numbering our devices to avoid conflicts between subsystems that
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7391cd0a7739e..b8b45822f7be0 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -477,6 +477,9 @@ static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
 	return ctrl->ana_log_buf != NULL;
 }
 
+void nvme_mpath_unfreeze(struct nvme_subsystem *subsys);
+void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys);
+void nvme_mpath_start_freeze(struct nvme_subsystem *subsys);
 void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
 			struct nvme_ctrl *ctrl, int *flags);
 void nvme_failover_req(struct request *req);
@@ -555,6 +558,15 @@ static inline void nvme_mpath_uninit(struct nvme_ctrl *ctrl)
 static inline void nvme_mpath_stop(struct nvme_ctrl *ctrl)
 {
 }
+static inline void nvme_mpath_unfreeze(struct nvme_subsystem *subsys)
+{
+}
+static inline void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys)
+{
+}
+static inline void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
+{
+}
 #endif /* CONFIG_NVME_MULTIPATH */
 
 #ifdef CONFIG_NVM
-- 
2.20.1


  parent reply	other threads:[~2019-08-20 13:47 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-20 13:39 [PATCH AUTOSEL 5.2 01/44] dmaengine: ste_dma40: fix unneeded variable warning Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 02/44] nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 03/44] afs: Fix the CB.ProbeUuid service handler to reply correctly Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 04/44] afs: Fix loop index mixup in afs_deliver_vl_get_entry_by_name_u() Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 05/44] fs: afs: Fix a possible null-pointer dereference in afs_put_read() Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 06/44] afs: Fix off-by-one in afs_rename() expected data version calculation Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 07/44] afs: Only update d_fsdata if different in afs_d_revalidate() Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 08/44] afs: Fix missing dentry data version updating Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 09/44] intel_th: Use the correct style for SPDX License Identifier Sasha Levin
2019-08-20 14:27   ` Greg Kroah-Hartman
2019-08-20 20:03     ` Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 10/44] nvmet: Fix use-after-free bug when a port is removed Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 11/44] nvmet-loop: Flush nvme_delete_wq when removing the port Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 12/44] nvmet-file: fix nvmet_file_flush() always returning an error Sasha Levin
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 13/44] nvme-core: Fix extra device_put() call on error path Sasha Levin
2019-08-20 13:39 ` Sasha Levin [this message]
2019-08-20 13:39 ` [PATCH AUTOSEL 5.2 15/44] nvme-rdma: fix possible use-after-free in connect error flow Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 16/44] nvme: fix controller removal race with scan work Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 17/44] nvme-pci: Fix async probe remove race Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 18/44] soundwire: cadence_master: fix register definition for SLAVE_STATE Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 19/44] soundwire: cadence_master: fix definitions for INTSTAT0/1 Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 20/44] iio: adc: max9611: Fix temperature reading in probe Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 21/44] auxdisplay: panel: need to delete scan_timer when misc_register fails in panel_attach Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 22/44] btrfs: trim: Check the range passed into to prevent overflow Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 23/44] IB/mlx5: Fix implicit MR release flow Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 24/44] dmaengine: stm32-mdma: Fix a possible null-pointer dereference in stm32_mdma_irq_handler() Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 25/44] omap-dma/omap_vout_vrfb: fix off-by-one fi value Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 26/44] iommu/dma: Handle SG length overflow better Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 27/44] dma-direct: don't truncate dma_required_mask to bus addressing capabilities Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 28/44] usb: gadget: composite: Clear "suspended" on reset/disconnect Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 29/44] usb: gadget: mass_storage: Fix races between fsg_disable and fsg_set_alt Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 30/44] habanalabs: fix DRAM usage accounting on context tear down Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 31/44] habanalabs: fix endianness handling for packets from user Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 32/44] habanalabs: fix completion queue handling when host is BE Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 33/44] habanalabs: fix endianness handling for internal QMAN submission Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 34/44] habanalabs: fix device IRQ unmasking for BE host Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 35/44] xen/blkback: fix memory leaks Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 36/44] arm64: cpufeature: Don't treat granule sizes as strict Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 37/44] riscv: fix flush_tlb_range() end address for flush_tlb_page() Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 38/44] i2c: rcar: avoid race when unregistering slave client Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 39/44] i2c: emev2: " Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 40/44] drm/scheduler: use job count instead of peek Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 41/44] drm/ast: Fixed reboot test may cause system hanged Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 42/44] usb: host: fotg2: restart hcd after port reset Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 43/44] tools: hv: fixed Python pep8/flake8 warnings for lsvmbus Sasha Levin
2019-08-20 13:40 ` [PATCH AUTOSEL 5.2 44/44] tools: hv: fix KVP and VSS daemons exit code Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190820134028.10829-14-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=james.puthukattukaran@oracle.com \
    --cc=kbusch@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=sagi@grimberg.me \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).