All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Anton Eidelman <anton.eidelman@gmail.com>,
	Anton Eidelman <anton@lightbitslabs.com>,
	Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>,
	Sasha Levin <sashal@kernel.org>,
	kbusch@kernel.org, axboe@fb.com, linux-nvme@lists.infradead.org
Subject: [PATCH AUTOSEL 5.15 19/27] nvme-multipath: fix hang when disk goes live over reconnect
Date: Wed,  6 Apr 2022 21:12:49 -0400	[thread overview]
Message-ID: <20220407011257.114287-19-sashal@kernel.org> (raw)
In-Reply-To: <20220407011257.114287-1-sashal@kernel.org>

From: Anton Eidelman <anton.eidelman@gmail.com>

[ Upstream commit a4a6f3c8f61c3cfbda4998ad94596059ad7e4332 ]

nvme_mpath_init_identify() invoked from nvme_init_identify() fetches a
fresh ANA log from the ctrl.  This is essential to have an up to date
path states for both existing namespaces and for those scan_work may
discover once the ctrl is up.

This happens in the following cases:
  1) A new ctrl is being connected.
  2) An existing ctrl is successfully reconnected.
  3) An existing ctrl is being reset.

While in (1) ctrl->namespaces is empty, (2 & 3) may have namespaces, and
nvme_read_ana_log() may call nvme_update_ns_ana_state().

This result in a hang when the ANA state of an existing namespace changes
and makes the disk live: nvme_mpath_set_live() issues IO to the namespace
through the ctrl, which does NOT have IO queues yet.

See sample hang below.

Solution:
- nvme_update_ns_ana_state() to call set_live only if ctrl is live
- nvme_read_ana_log() call from nvme_mpath_init_identify()
  therefore only fetches and parses the ANA log;
  any erros in this process will fail the ctrl setup as appropriate;
- a separate function nvme_mpath_update()
  is called in nvme_start_ctrl();
  this parses the ANA log without fetching it.
  At this point the ctrl is live,
  therefore, disks can be set live normally.

Sample failure:
    nvme nvme0: starting error recovery
    nvme nvme0: Reconnecting in 10 seconds...
    block nvme0n6: no usable path - requeuing I/O
    INFO: task kworker/u8:3:312 blocked for more than 122 seconds.
          Tainted: G            E     5.14.5-1.el7.elrepo.x86_64 #1
    Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
    Call Trace:
     __schedule+0x2a2/0x7e0
     schedule+0x4e/0xb0
     io_schedule+0x16/0x40
     wait_on_page_bit_common+0x15c/0x3e0
     do_read_cache_page+0x1e0/0x410
     read_cache_page+0x12/0x20
     read_part_sector+0x46/0x100
     read_lba+0x121/0x240
     efi_partition+0x1d2/0x6a0
     bdev_disk_changed.part.0+0x1df/0x430
     bdev_disk_changed+0x18/0x20
     blkdev_get_whole+0x77/0xe0
     blkdev_get_by_dev+0xd2/0x3a0
     __device_add_disk+0x1ed/0x310
     device_add_disk+0x13/0x20
     nvme_mpath_set_live+0x138/0x1b0 [nvme_core]
     nvme_update_ns_ana_state+0x2b/0x30 [nvme_core]
     nvme_update_ana_state+0xca/0xe0 [nvme_core]
     nvme_parse_ana_log+0xac/0x170 [nvme_core]
     nvme_read_ana_log+0x7d/0xe0 [nvme_core]
     nvme_mpath_init_identify+0x105/0x150 [nvme_core]
     nvme_init_identify+0x2df/0x4d0 [nvme_core]
     nvme_init_ctrl_finish+0x8d/0x3b0 [nvme_core]
     nvme_tcp_setup_ctrl+0x337/0x390 [nvme_tcp]
     nvme_tcp_reconnect_ctrl_work+0x24/0x40 [nvme_tcp]
     process_one_work+0x1bd/0x360
     worker_thread+0x50/0x3d0

Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/nvme/host/core.c      |  1 +
 drivers/nvme/host/multipath.c | 25 +++++++++++++++++++++++--
 drivers/nvme/host/nvme.h      |  4 ++++
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d5d5d035d677..17df1c783be3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4338,6 +4338,7 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
 	if (ctrl->queue_count > 1) {
 		nvme_queue_scan(ctrl);
 		nvme_start_queues(ctrl);
+		nvme_mpath_update(ctrl);
 	}
 }
 EXPORT_SYMBOL_GPL(nvme_start_ctrl);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 727520c39710..2105fead04aa 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -573,8 +573,17 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
 	ns->ana_grpid = le32_to_cpu(desc->grpid);
 	ns->ana_state = desc->state;
 	clear_bit(NVME_NS_ANA_PENDING, &ns->flags);
-
-	if (nvme_state_is_live(ns->ana_state))
+	/*
+	 * nvme_mpath_set_live() will trigger I/O to the multipath path device
+	 * and in turn to this path device.  However we cannot accept this I/O
+	 * if the controller is not live.  This may deadlock if called from
+	 * nvme_mpath_init_identify() and the ctrl will never complete
+	 * initialization, preventing I/O from completing.  For this case we
+	 * will reprocess the ANA log page in nvme_mpath_update() once the
+	 * controller is ready.
+	 */
+	if (nvme_state_is_live(ns->ana_state) &&
+	    ns->ctrl->state == NVME_CTRL_LIVE)
 		nvme_mpath_set_live(ns);
 }
 
@@ -661,6 +670,18 @@ static void nvme_ana_work(struct work_struct *work)
 	nvme_read_ana_log(ctrl);
 }
 
+void nvme_mpath_update(struct nvme_ctrl *ctrl)
+{
+	u32 nr_change_groups = 0;
+
+	if (!ctrl->ana_log_buf)
+		return;
+
+	mutex_lock(&ctrl->ana_lock);
+	nvme_parse_ana_log(ctrl, &nr_change_groups, nvme_update_ana_state);
+	mutex_unlock(&ctrl->ana_lock);
+}
+
 static void nvme_anatt_timeout(struct timer_list *t)
 {
 	struct nvme_ctrl *ctrl = from_timer(ctrl, t, anatt_timer);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ed79a6c7e804..c1442e6f1b87 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -752,6 +752,7 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id);
 void nvme_mpath_remove_disk(struct nvme_ns_head *head);
 int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
 void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl);
+void nvme_mpath_update(struct nvme_ctrl *ctrl);
 void nvme_mpath_uninit(struct nvme_ctrl *ctrl);
 void nvme_mpath_stop(struct nvme_ctrl *ctrl);
 bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
@@ -826,6 +827,9 @@ static inline int nvme_mpath_init_identify(struct nvme_ctrl *ctrl,
 "Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.\n");
 	return 0;
 }
+static inline void nvme_mpath_update(struct nvme_ctrl *ctrl)
+{
+}
 static inline void nvme_mpath_uninit(struct nvme_ctrl *ctrl)
 {
 }
-- 
2.35.1


  parent reply	other threads:[~2022-04-07  1:22 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-07  1:12 [PATCH AUTOSEL 5.15 01/27] gfs2: assign rgrp glock before compute_bitstructs Sasha Levin
2022-04-07  1:12 ` [Cluster-devel] " Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 02/27] rtc: fix use-after-free on device removal Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 03/27] rtc: pcf2127: fix bug when reading alarm registers Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 04/27] um: Cleanup syscall_handler_t definition/cast, fix warning Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 05/27] um: port_user: Improve error handling when port-helper is not found Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 06/27] Input: add bounds checking to input_set_capability() Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 07/27] Input: stmfts - fix reference leak in stmfts_input_open Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 08/27] nvme-pci: add quirks for Samsung X5 SSDs Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 09/27] gfs2: Disable page faults during lockless buffered reads Sasha Levin
2022-04-07  1:12   ` [Cluster-devel] " Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 10/27] rtc: sun6i: Fix time overflow handling Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 11/27] crypto: stm32 - fix reference leak in stm32_crc_remove Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 12/27] crypto: x86/chacha20 - Avoid spurious jumps to other functions Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 13/27] ALSA: hda/realtek: Enable headset mic on Lenovo P360 Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 14/27] s390/traps: improve panic message for translation-specification exception Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 15/27] s390/pci: improve zpci_dev reference counting Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 16/27] vhost_vdpa: don't setup irq offloading when irq_num < 0 Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 17/27] tools/virtio: compile with -pthread Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 18/27] nvmet: use a private workqueue instead of the system workqueue Sasha Levin
2022-04-07  1:12 ` Sasha Levin [this message]
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 20/27] rtc: mc146818-lib: Fix the AltCentury for AMD platforms Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 21/27] fs: fix an infinite loop in iomap_fiemap Sasha Levin
2022-04-07 11:28   ` Amir Goldstein
2022-04-07 11:30     ` Amir Goldstein
2022-05-17 17:33       ` Amir Goldstein
2022-05-18 12:24         ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 22/27] MIPS: lantiq: check the return value of kzalloc() Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 23/27] drbd: remove usage of list iterator variable after loop Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 24/27] platform/chrome: cros_ec_debugfs: detach log reader wq from devm Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 25/27] ARM: 9191/1: arm/stacktrace, kasan: Silence KASAN warnings in unwind_frame() Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 26/27] nilfs2: fix lockdep warnings in page operations for btree nodes Sasha Levin
2022-04-07  1:12   ` Sasha Levin
2022-04-07  1:12 ` [PATCH AUTOSEL 5.15 27/27] nilfs2: fix lockdep warnings during disk space reclamation Sasha Levin
2022-04-07  1:12   ` Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220407011257.114287-19-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=anton.eidelman@gmail.com \
    --cc=anton@lightbitslabs.com \
    --cc=axboe@fb.com \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=sagi@grimberg.me \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.