* [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
@ 2021-03-15 22:27 Sagi Grimberg
From: Sagi Grimberg
To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

The below patches caused a regression in a multipath setup:
Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")

These patches on their own are correct because they fixed a controller reset
regression.

When we reset/teardown a controller, we must freeze and quiesce the namespaces
request queues to make sure that we safely stop inflight I/O submissions.
Freeze is mandatory because if our hctx map changed between reconnects,
blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
if it still has pending submissions (that are still quiesced) it will hang.
This is what the above patches fixed.

However, by freezing the namespaces request queues, and only unfreezing them
when we successfully reconnect, inflight submissions that are running
concurrently can now block grabbing the nshead srcu until either we
successfully reconnect, ctrl_loss_tmo expires, or the user explicitly
disconnects.

This caused a deadlock [1] when a different controller (a different path on
the same subsystem) became live (i.e. optimized/non-optimized). This is
because nvme_mpath_set_live needs to synchronize the nshead srcu before
requeueing I/O, in order to make sure that current_path is visible to future
(re)submissions. However, the srcu lock is held by a submission blocked on a
frozen request queue, and we have a deadlock.

For multipath, we obviously cannot allow that, as we want to fail over I/O as
soon as possible. However, for non-mpath we do not want to fail I/O (at least
until the controller FAILFAST timeout expires, and that is disabled by
default, btw).

This creates a non-symmetric behavior in how the driver should behave in the
presence or absence of multipath.

[1]:
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
 __schedule+0x293/0x730
 schedule+0x33/0xa0
 schedule_timeout+0x1d3/0x2f0
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.21+0x91/0xc0
 synchronize_srcu_expedited+0x27/0x30
 synchronize_srcu+0xce/0xe0
 nvme_mpath_set_live+0x64/0x130 [nvme_core]
 nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
 nvme_update_ana_state+0xcd/0xe0 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_read_ana_log+0x76/0x100 [nvme_core]
 nvme_mpath_init+0x122/0x180 [nvme_core]
 nvme_init_identify+0x80e/0xe20 [nvme_core]
 nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
 nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]

In order to fix this, we recognize the different behavior a driver needs to
take in error recovery for mpath and non-mpath scenarios, expose this
awareness with a new helper, nvme_ctrl_is_mpath, and use that to know what
needs to be done.

Sagi Grimberg (3):
  nvme: introduce nvme_ctrl_is_mpath helper
  nvme-tcp: fix possible hang when trying to set a live path during I/O
  nvme-rdma: fix possible hang when trying to set a live path during I/O

 drivers/nvme/host/multipath.c |  5 +++--
 drivers/nvme/host/nvme.h      | 15 +++++++++++++++
 drivers/nvme/host/rdma.c      | 29 +++++++++++++++++------------
 drivers/nvme/host/tcp.c       | 30 +++++++++++++++++-------------
 4 files changed, 52 insertions(+), 27 deletions(-)

-- 
2.27.0

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
* [PATCH 1/3] nvme: introduce nvme_ctrl_is_mpath helper
@ 2021-03-15 22:27 Sagi Grimberg
To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

Given that our error recovery and I/O failover semantics are different for
multipath vs. non-multipath controllers, transports need a helper to
distinguish how to react upon error recovery or controller resets (i.e. fail
I/O, or keep the queue frozen/quiesced).

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
[sagi: added the fixes tags so this can also go to stable with its
 respective fixes]
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/multipath.c |  5 +++--
 drivers/nvme/host/nvme.h      | 15 +++++++++++++++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..5c67a5e96738 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -8,10 +8,11 @@
 #include <trace/events/block.h>
 #include "nvme.h"
 
-static bool multipath = true;
+bool multipath = true;
 module_param(multipath, bool, 0444);
 MODULE_PARM_DESC(multipath,
 	"turn on native support for multiple controllers per subsystem");
+EXPORT_SYMBOL_GPL(multipath);
 
 void nvme_mpath_unfreeze(struct nvme_subsystem *subsys)
 {
@@ -372,7 +373,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
	 * We also do this for private namespaces as the namespace sharing data could
	 * change after a rescan.
	 */
-	if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) || !multipath)
+	if (!nvme_ctrl_is_mpath(ctrl))
 		return 0;
 
 	q = blk_alloc_queue(ctrl->numa_node);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 07b34175c6ce..4f7c9970e3fc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -840,4 +840,19 @@ struct nvme_ctrl *nvme_ctrl_from_file(struct file *file);
 struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid);
 void nvme_put_ns(struct nvme_ns *ns);
 
+extern bool multipath;
+#ifdef CONFIG_NVME_MULTIPATH
+static inline bool nvme_ctrl_is_mpath(struct nvme_ctrl *ctrl)
+{
+	return !(!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) || !multipath);
+
+}
+#else
+static inline bool nvme_ctrl_is_mpath(struct nvme_ctrl *ctrl)
+{
+	return false;
+
+}
+#endif
+
 #endif /* _NVME_H */
-- 
2.27.0
* [PATCH 2/3] nvme-tcp: fix possible hang when trying to set a live path during I/O
@ 2021-03-15 22:27 Sagi Grimberg
To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

When we tear down a controller, we first freeze the queue to prevent request
submissions, and quiesce the queue to prevent request queueing, and we only
unfreeze/unquiesce when we successfully reconnect the controller.

In case we attempt to set a live path (optimized/non-optimized) and update the
current_path reference, we first need to wait for any ongoing dispatches
(synchronize the head->srcu). However, bio submissions _can_ block because the
underlying controller queue is frozen, which creates the below deadlock [1].
So it is clear that the namespaces request queues must be unfrozen and
unquiesced asap when we tear down the controller.

However, when we are not in a multipath environment (!multipath, or cmic
indicates the ns isn't shared), we don't want to fail-fast the I/O, hence we
must keep the namespaces request queues frozen and quiesced and only
unfreeze/unquiesce them when the controller successfully reconnects (FAILFAST
may still fail the I/O sooner).

[1]:
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
 __schedule+0x293/0x730
 schedule+0x33/0xa0
 schedule_timeout+0x1d3/0x2f0
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.21+0x91/0xc0
 synchronize_srcu_expedited+0x27/0x30
 synchronize_srcu+0xce/0xe0
 nvme_mpath_set_live+0x64/0x130 [nvme_core]
 nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
 nvme_update_ana_state+0xcd/0xe0 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_read_ana_log+0x76/0x100 [nvme_core]
 nvme_mpath_init+0x122/0x180 [nvme_core]
 nvme_init_identify+0x80e/0xe20 [nvme_core]
 nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
 nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]

Fix this by checking the newly introduced nvme_ctrl_is_mpath and
unquiescing/unfreezing the namespaces request queues accordingly (in the
teardown for mpath, and after a successful reconnect for non-mpath). Also, we
no longer need the explicit nvme_start_queues in the error recovery work.

Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a0f00cb8f9f3..b81649d0c12c 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1803,19 +1803,22 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 		goto out_cleanup_connect_q;
 
 	if (!new) {
-		nvme_start_queues(ctrl);
-		if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
-			/*
-			 * If we timed out waiting for freeze we are likely to
-			 * be stuck.  Fail the controller initialization just
-			 * to be safe.
-			 */
-			ret = -ENODEV;
-			goto out_wait_freeze_timed_out;
+		if (!nvme_ctrl_is_mpath(ctrl)) {
+			nvme_start_queues(ctrl);
+			if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
+				/*
+				 * If we timed out waiting for freeze we are
+				 * likely to be stuck.  Fail the controller
+				 * initialization just to be safe.
+				 */
+				ret = -ENODEV;
+				goto out_wait_freeze_timed_out;
+			}
 		}
 		blk_mq_update_nr_hw_queues(ctrl->tagset,
 			ctrl->queue_count - 1);
-		nvme_unfreeze(ctrl);
+		if (!nvme_ctrl_is_mpath(ctrl))
+			nvme_unfreeze(ctrl);
 	}
 
 	return 0;
@@ -1934,8 +1937,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	nvme_sync_io_queues(ctrl);
 	nvme_tcp_stop_io_queues(ctrl);
 	nvme_cancel_tagset(ctrl);
-	if (remove)
+	if (nvme_ctrl_is_mpath(ctrl)) {
 		nvme_start_queues(ctrl);
+		nvme_wait_freeze(ctrl);
+		nvme_unfreeze(ctrl);
+	}
 	nvme_tcp_destroy_io_queues(ctrl, remove);
 }
 
@@ -2056,8 +2062,6 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 
 	nvme_stop_keep_alive(ctrl);
 	nvme_tcp_teardown_io_queues(ctrl, false);
-	/* unquiesce to fail fast pending requests */
-	nvme_start_queues(ctrl);
 	nvme_tcp_teardown_admin_queue(ctrl, false);
 	blk_mq_unquiesce_queue(ctrl->admin_q);
-- 
2.27.0
* [PATCH 3/3] nvme-rdma: fix possible hang when trying to set a live path during I/O
@ 2021-03-15 22:27 Sagi Grimberg
To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

When we tear down a controller, we first freeze the queue to prevent request
submissions, and quiesce the queue to prevent request queueing, and we only
unfreeze/unquiesce when we successfully reconnect the controller.

In case we attempt to set a live path (optimized/non-optimized) and update the
current_path reference, we first need to wait for any ongoing dispatches
(synchronize the head->srcu). However, bio submissions _can_ block because the
underlying controller queue is frozen, which creates the below deadlock [1].
So it is clear that the namespaces request queues must be unfrozen and
unquiesced asap when we tear down the controller.

However, when we are not in a multipath environment (!multipath, or cmic
indicates the ns isn't shared), we don't want to fail-fast the I/O, hence we
must keep the namespaces request queues frozen and quiesced and only
unfreeze/unquiesce them when the controller successfully reconnects (FAILFAST
may still fail the I/O sooner).

[1] (happened in nvme-tcp, but works the same in nvme-rdma):
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
 __schedule+0x293/0x730
 schedule+0x33/0xa0
 schedule_timeout+0x1d3/0x2f0
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.21+0x91/0xc0
 synchronize_srcu_expedited+0x27/0x30
 synchronize_srcu+0xce/0xe0
 nvme_mpath_set_live+0x64/0x130 [nvme_core]
 nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
 nvme_update_ana_state+0xcd/0xe0 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_read_ana_log+0x76/0x100 [nvme_core]
 nvme_mpath_init+0x122/0x180 [nvme_core]
 nvme_init_identify+0x80e/0xe20 [nvme_core]
 nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
 nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]

Fix this by checking the newly introduced nvme_ctrl_is_mpath and
unquiescing/unfreezing the namespaces request queues accordingly (in the
teardown for mpath, and after a successful reconnect for non-mpath). Also, we
no longer need the explicit nvme_start_queues in the error recovery work.

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index be905d4fdb47..43e8608ad5b7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -989,19 +989,22 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 		goto out_cleanup_connect_q;
 
 	if (!new) {
-		nvme_start_queues(&ctrl->ctrl);
-		if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
-			/*
-			 * If we timed out waiting for freeze we are likely to
-			 * be stuck.  Fail the controller initialization just
-			 * to be safe.
-			 */
-			ret = -ENODEV;
-			goto out_wait_freeze_timed_out;
+		if (!nvme_ctrl_is_mpath(&ctrl->ctrl)) {
+			nvme_start_queues(&ctrl->ctrl);
+			if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
+				/*
+				 * If we timed out waiting for freeze we are likely to
+				 * be stuck.  Fail the controller initialization just
+				 * to be safe.
+				 */
+				ret = -ENODEV;
+				goto out_wait_freeze_timed_out;
+			}
 		}
 		blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
 			ctrl->ctrl.queue_count - 1);
-		nvme_unfreeze(&ctrl->ctrl);
+		if (!nvme_ctrl_is_mpath(&ctrl->ctrl))
+			nvme_unfreeze(&ctrl->ctrl);
 	}
 
 	return 0;
@@ -1043,8 +1046,11 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
 		nvme_sync_io_queues(&ctrl->ctrl);
 		nvme_rdma_stop_io_queues(ctrl);
 		nvme_cancel_tagset(&ctrl->ctrl);
-		if (remove)
+		if (nvme_ctrl_is_mpath(&ctrl->ctrl)) {
 			nvme_start_queues(&ctrl->ctrl);
+			nvme_wait_freeze(&ctrl->ctrl);
+			nvme_unfreeze(&ctrl->ctrl);
+		}
 		nvme_rdma_destroy_io_queues(ctrl, remove);
 	}
 }
@@ -1191,7 +1197,6 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 
 	nvme_stop_keep_alive(&ctrl->ctrl);
 	nvme_rdma_teardown_io_queues(ctrl, false);
-	nvme_start_queues(&ctrl->ctrl);
 	nvme_rdma_teardown_admin_queue(ctrl, false);
 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
-- 
2.27.0
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
@ 2021-03-16  3:24 Chao Leng
To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

Does the problem exist on the latest version?

We also found similar deadlocks in the older version. However, with the
latest code, it does not block grabbing the nshead srcu when the ctrl is
frozen.

Related patches:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead

I am not sure they are the same problem.

On 2021/3/16 6:27, Sagi Grimberg wrote:
> The below patches caused a regression in a multipath setup:
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
>
> These patches on their own are correct because they fixed a controller
> reset regression.
>
> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions. Freeze is mandatory because if our hctx map changed between
> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
> the queue, and if it still has pending submissions (that are still
> quiesced) it will hang. This is what the above patches fixed.
>
> However, by freezing the namespaces request queues, and only unfreezing
> them when we successfully reconnect, inflight submissions that are running
> concurrently can now block grabbing the nshead srcu until either we
> successfully reconnect or ctrl_loss_tmo expired (or the user explicitly
> disconnected).
>
> This caused a deadlock [1] when a different controller (different path on
> the same subsystem) became live (i.e. optimized/non-optimized). This is
> because nvme_mpath_set_live needs to synchronize the nshead srcu before
> requeueing I/O in order to make sure that current_path is visible to
> future (re)submissions. However the srcu lock is taken by a blocked
> submission on a frozen request queue, and we have a deadlock.
>
> For multipath, we obviously cannot allow that as we want to failover I/O
> asap. However for non-mpath, we do not want to fail I/O.
>
> [...]
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
@ 2021-03-16  5:04 Sagi Grimberg
To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

> Does the problem exist on the latest version?

This was seen on 5.4 stable, not upstream, but nothing prevents this from
happening in upstream code.

> We also found similar deadlocks in the older version.
> However, with the latest code, it does not block grabbing the nshead srcu
> when the ctrl is frozen.
> related patches:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> I am not sure they are the same problem.

It's not the same problem.

When we teardown the io queues, we freeze the namespaces request queues.
This means that concurrent mpath submit_bio calls can now block with the
srcu lock taken.

When another path calls nvme_mpath_set_live, it needs to wait for the srcu
to sync before kicking the requeue work (to make sure the updated
current_path is visible).

And this is where the hang is; the only thing that will free it is if the
offending controller reconnects (and unfreezes the queue) or disconnects
(automatically or manually). Both can take a very long time, or even
forever in some cases.
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
@ 2021-03-16  6:18 Chao Leng
To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

On 2021/3/16 13:04, Sagi Grimberg wrote:
>> Does the problem exist on the latest version?
>
> This was seen on 5.4 stable, not upstream but nothing prevents
> this from happening in upstream code.
>
>> We also found similar deadlocks in the older version.
>> However, with the latest code, it does not block grabbing the nshead
>> srcu when the ctrl is frozen.
>> [links elided]
>> I am not sure they are the same problem.
>
> Its not the same problem.
>
> When we teardown the io queues, we freeze the namespaces request queues.
> This means that concurrent mpath submit_bio calls can now block with
> the srcu lock taken.

What is the call trace of ->submit_bio()? The requeue work or normal
submit bio?

> When another path calls nvme_mpath_set_live, it needs to wait for
> the srcu to sync before kicking the requeue work (to make sure
> the updated current_path is visible).
>
> And this is where the hang is, the only thing that will free it
> is if the offending controller reconnects (and unfreeze the queue)
> or it will disconnect (automatically or manually). Both can take
> a very long time or even forever in some cases.
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
@ 2021-03-16  6:25 Sagi Grimberg
To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

>> When we teardown the io queues, we freeze the namespaces request queues.
>> This means that concurrent mpath submit_bio calls can now block with
>> the srcu lock taken.
>
> What is the call trace of ->submit_bio()? The requeue work or normal
> submit bio?

Both. submit_bio_noacct will try to get the queue's q_usage_counter
(blk_queue_enter), will fail because the queue is frozen, and will then
block until a queue unfreeze wakes it up to try again...
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
@ 2021-03-16 20:07 Sagi Grimberg
To: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

> The below patches caused a regression in a multipath setup:
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
>
> [...]
>
> In order to fix this, we recognize the different behavior a driver needs
> to take in error recovery scenarios for mpath and non-mpath scenarios and
> expose this awareness with a new helper nvme_ctrl_is_mpath and use that
> to know what needs to be done.

Christoph, Keith,

Any thoughts on this? The RFC part is getting the transport driver to
behave differently for mpath vs. non-mpath.
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-16 20:07 ` Sagi Grimberg @ 2021-03-16 20:42 ` Keith Busch 2021-03-16 23:51 ` Sagi Grimberg 0 siblings, 1 reply; 30+ messages in thread From: Keith Busch @ 2021-03-16 20:42 UTC (permalink / raw) To: Sagi Grimberg; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni On Tue, Mar 16, 2021 at 01:07:12PM -0700, Sagi Grimberg wrote: > > The below patches caused a regression in a multipath setup: > > Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic") > > Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic") > > > > These patches on their own are correct because they fixed a controller reset > > regression. > > > > When we reset/teardown a controller, we must freeze and quiesce the namespaces > > request queues to make sure that we safely stop inflight I/O submissions. > > Freeze is mandatory because if our hctx map changed between reconnects, > > blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and > > if it still has pending submissions (that are still quiesced) it will hang. > > This is what the above patches fixed. > > > > However, by freezing the namespaces request queues, and only unfreezing them > > when we successfully reconnect, inflight submissions that are running > > concurrently can now block grabbing the nshead srcu until either we successfully > > reconnect or ctrl_loss_tmo expired (or the user explicitly disconnected). > > > > This caused a deadlock [1] when a different controller (different path on the > > same subsystem) became live (i.e. optimized/non-optimized). This is because > > nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O > > in order to make sure that current_path is visible to future (re)submisions. > > However the srcu lock is taken by a blocked submission on a frozen request > > queue, and we have a deadlock. 
> > > > For multipath, we obviously cannot allow that as we want to failover I/O asap. > > However for non-mpath, we do not want to fail I/O (at least until controller > > FASTFAIL expires, and that is disabled by default btw). > > > > This creates a non-symmetric behavior of how the driver should behave in the > > presence or absence of multipath. > > > > [1]: > > Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp] > > Call Trace: > > __schedule+0x293/0x730 > > schedule+0x33/0xa0 > > schedule_timeout+0x1d3/0x2f0 > > wait_for_completion+0xba/0x140 > > __synchronize_srcu.part.21+0x91/0xc0 > > synchronize_srcu_expedited+0x27/0x30 > > synchronize_srcu+0xce/0xe0 > > nvme_mpath_set_live+0x64/0x130 [nvme_core] > > nvme_update_ns_ana_state+0x2c/0x30 [nvme_core] > > nvme_update_ana_state+0xcd/0xe0 [nvme_core] > > nvme_parse_ana_log+0xa1/0x180 [nvme_core] > > nvme_read_ana_log+0x76/0x100 [nvme_core] > > nvme_mpath_init+0x122/0x180 [nvme_core] > > nvme_init_identify+0x80e/0xe20 [nvme_core] > > nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp] > > nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp] > > > > > > In order to fix this, we recognize the different behavior a driver needs to take > > in error recovery scenarios for mpath and non-mpath scenarios and expose this > > awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs > > to be done. > > Christoph, Keith, > > Any thoughts on this? The RFC part is getting the transport driver to > behave differently for mpath vs. non-mpath. Will it work if nvme mpath used request NOWAIT flag for its submit_bio() call, and add the bio to the requeue_list if blk_queue_enter() fails? I think that looks like another way to resolve the deadlock, but we need the block layer to return a failed status to the original caller. 
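The semantics being proposed here can be sketched with a minimal user-space model (hypothetical `toy_*` names, not the block layer API): with a NOWAIT-style flag, entering a frozen queue fails immediately instead of sleeping, so the caller gets an error back and can park the bio for later:

```c
#include <errno.h>
#include <stdbool.h>

/* Minimal model of queue entry with a NOWAIT-style flag. */
struct toy_queue {
	bool frozen;  /* freeze started; new entries must not proceed */
	int usage;    /* in-flight submissions counted against the freeze */
};

static int toy_queue_enter(struct toy_queue *q, bool nowait)
{
	if (q->frozen) {
		if (nowait)
			return -EAGAIN; /* fail fast: caller can requeue the bio */
		return -EBUSY;          /* stand-in for "would have slept here" */
	}
	q->usage++;
	return 0;
}
```

The point of the flag is the first branch: the submitter keeps running (and keeps holding nothing), rather than sleeping inside the frozen queue while holding the srcu read lock.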
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-16 20:42 ` Keith Busch @ 2021-03-16 23:51 ` Sagi Grimberg 2021-03-17 2:55 ` Chao Leng 0 siblings, 1 reply; 30+ messages in thread From: Sagi Grimberg @ 2021-03-16 23:51 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni >>> These patches on their own are correct because they fixed a controller reset >>> regression. >>> >>> When we reset/teardown a controller, we must freeze and quiesce the namespaces >>> request queues to make sure that we safely stop inflight I/O submissions. >>> Freeze is mandatory because if our hctx map changed between reconnects, >>> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and >>> if it still has pending submissions (that are still quiesced) it will hang. >>> This is what the above patches fixed. >>> >>> However, by freezing the namespaces request queues, and only unfreezing them >>> when we successfully reconnect, inflight submissions that are running >>> concurrently can now block grabbing the nshead srcu until either we successfully >>> reconnect or ctrl_loss_tmo expired (or the user explicitly disconnected). >>> >>> This caused a deadlock [1] when a different controller (different path on the >>> same subsystem) became live (i.e. optimized/non-optimized). This is because >>> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O >>> in order to make sure that current_path is visible to future (re)submisions. >>> However the srcu lock is taken by a blocked submission on a frozen request >>> queue, and we have a deadlock. >>> >>> For multipath, we obviously cannot allow that as we want to failover I/O asap. >>> However for non-mpath, we do not want to fail I/O (at least until controller >>> FASTFAIL expires, and that is disabled by default btw). 
>>> >>> This creates a non-symmetric behavior of how the driver should behave in the >>> presence or absence of multipath. >>> >>> [1]: >>> Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp] >>> Call Trace: >>> __schedule+0x293/0x730 >>> schedule+0x33/0xa0 >>> schedule_timeout+0x1d3/0x2f0 >>> wait_for_completion+0xba/0x140 >>> __synchronize_srcu.part.21+0x91/0xc0 >>> synchronize_srcu_expedited+0x27/0x30 >>> synchronize_srcu+0xce/0xe0 >>> nvme_mpath_set_live+0x64/0x130 [nvme_core] >>> nvme_update_ns_ana_state+0x2c/0x30 [nvme_core] >>> nvme_update_ana_state+0xcd/0xe0 [nvme_core] >>> nvme_parse_ana_log+0xa1/0x180 [nvme_core] >>> nvme_read_ana_log+0x76/0x100 [nvme_core] >>> nvme_mpath_init+0x122/0x180 [nvme_core] >>> nvme_init_identify+0x80e/0xe20 [nvme_core] >>> nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp] >>> nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp] >>> >>> >>> In order to fix this, we recognize the different behavior a driver needs to take >>> in error recovery scenarios for mpath and non-mpath scenarios and expose this >>> awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs >>> to be done. >> >> Christoph, Keith, >> >> Any thoughts on this? The RFC part is getting the transport driver to >> behave differently for mpath vs. non-mpath. > > Will it work if nvme mpath used request NOWAIT flag for its submit_bio() > call, and add the bio to the requeue_list if blk_queue_enter() fails? I > think that looks like another way to resolve the deadlock, but we need > the block layer to return a failed status to the original caller. But who would kick the requeue list? and that would make near-tag-exhaust performance stink... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-16 23:51 ` Sagi Grimberg @ 2021-03-17 2:55 ` Chao Leng 2021-03-17 6:59 ` Christoph Hellwig 0 siblings, 1 reply; 30+ messages in thread From: Chao Leng @ 2021-03-17 2:55 UTC (permalink / raw) To: Sagi Grimberg, Keith Busch Cc: linux-nvme, Christoph Hellwig, Chaitanya Kulkarni On 2021/3/17 7:51, Sagi Grimberg wrote: > >>>> These patches on their own are correct because they fixed a controller reset >>>> regression. >>>> >>>> When we reset/teardown a controller, we must freeze and quiesce the namespaces >>>> request queues to make sure that we safely stop inflight I/O submissions. >>>> Freeze is mandatory because if our hctx map changed between reconnects, >>>> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and >>>> if it still has pending submissions (that are still quiesced) it will hang. >>>> This is what the above patches fixed. >>>> >>>> However, by freezing the namespaces request queues, and only unfreezing them >>>> when we successfully reconnect, inflight submissions that are running >>>> concurrently can now block grabbing the nshead srcu until either we successfully >>>> reconnect or ctrl_loss_tmo expired (or the user explicitly disconnected). >>>> >>>> This caused a deadlock [1] when a different controller (different path on the >>>> same subsystem) became live (i.e. optimized/non-optimized). This is because >>>> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O >>>> in order to make sure that current_path is visible to future (re)submisions. >>>> However the srcu lock is taken by a blocked submission on a frozen request >>>> queue, and we have a deadlock. >>>> >>>> For multipath, we obviously cannot allow that as we want to failover I/O asap. >>>> However for non-mpath, we do not want to fail I/O (at least until controller >>>> FASTFAIL expires, and that is disabled by default btw). 
>>>> >>>> This creates a non-symmetric behavior of how the driver should behave in the >>>> presence or absence of multipath. >>>> >>>> [1]: >>>> Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp] >>>> Call Trace: >>>> __schedule+0x293/0x730 >>>> schedule+0x33/0xa0 >>>> schedule_timeout+0x1d3/0x2f0 >>>> wait_for_completion+0xba/0x140 >>>> __synchronize_srcu.part.21+0x91/0xc0 >>>> synchronize_srcu_expedited+0x27/0x30 >>>> synchronize_srcu+0xce/0xe0 >>>> nvme_mpath_set_live+0x64/0x130 [nvme_core] >>>> nvme_update_ns_ana_state+0x2c/0x30 [nvme_core] >>>> nvme_update_ana_state+0xcd/0xe0 [nvme_core] >>>> nvme_parse_ana_log+0xa1/0x180 [nvme_core] >>>> nvme_read_ana_log+0x76/0x100 [nvme_core] >>>> nvme_mpath_init+0x122/0x180 [nvme_core] >>>> nvme_init_identify+0x80e/0xe20 [nvme_core] >>>> nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp] >>>> nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp] >>>> >>>> >>>> In order to fix this, we recognize the different behavior a driver needs to take >>>> in error recovery scenarios for mpath and non-mpath scenarios and expose this >>>> awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs >>>> to be done. >>> >>> Christoph, Keith, >>> >>> Any thoughts on this? The RFC part is getting the transport driver to >>> behave differently for mpath vs. non-mpath. >> >> Will it work if nvme mpath used request NOWAIT flag for its submit_bio() >> call, and add the bio to the requeue_list if blk_queue_enter() fails? I >> think that looks like another way to resolve the deadlock, but we need >> the block layer to return a failed status to the original caller. > > But who would kick the requeue list? and that would make near-tag-exhaust performance stink... moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it. It can also avoid I/O hang long time if reconnection failed. 
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-17 2:55 ` Chao Leng @ 2021-03-17 6:59 ` Christoph Hellwig 2021-03-17 7:59 ` Chao Leng 2021-03-17 8:16 ` Sagi Grimberg 0 siblings, 2 replies; 30+ messages in thread From: Christoph Hellwig @ 2021-03-17 6:59 UTC (permalink / raw) To: Chao Leng Cc: Sagi Grimberg, Keith Busch, linux-nvme, Christoph Hellwig, Chaitanya Kulkarni On Wed, Mar 17, 2021 at 10:55:57AM +0800, Chao Leng wrote: >>> Will it work if nvme mpath used request NOWAIT flag for its submit_bio() >>> call, and add the bio to the requeue_list if blk_queue_enter() fails? I >>> think that looks like another way to resolve the deadlock, but we need >>> the block layer to return a failed status to the original caller. Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here. dm-mpath also uses it for its request allocation for similar reasons. >> >> But who would kick the requeue list? and that would make near-tag-exhaust performance stink... The multipath code would have to kick the list. We could also try to split into two flags, one that affects blk_queue_enter and one that affects the tag allocation. > moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it. > It can also avoid I/O hang long time if reconnection failed. Can you explain how we'd still ensure that no new commands get queued during teardown using that scheme? _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-17 6:59 ` Christoph Hellwig @ 2021-03-17 7:59 ` Chao Leng 2021-03-17 18:43 ` Sagi Grimberg 2021-03-17 8:16 ` Sagi Grimberg 1 sibling, 1 reply; 30+ messages in thread From: Chao Leng @ 2021-03-17 7:59 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chaitanya Kulkarni On 2021/3/17 14:59, Christoph Hellwig wrote: > On Wed, Mar 17, 2021 at 10:55:57AM +0800, Chao Leng wrote: >>>> Will it work if nvme mpath used request NOWAIT flag for its submit_bio() >>>> call, and add the bio to the requeue_list if blk_queue_enter() fails? I >>>> think that looks like another way to resolve the deadlock, but we need >>>> the block layer to return a failed status to the original caller. > > Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here. dm-mpath also > uses it for its request allocation for similar reasons. > >>> >>> But who would kick the requeue list? and that would make near-tag-exhaust performance stink... > > The multipath code would have to kick the list. We could also try to > split into two flags, one that affects blk_queue_enter and one that > affects the tag allocation. > >> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it. >> It can also avoid I/O hang long time if reconnection failed. > > Can you explain how we'd still ensure that no new commands get queued > during teardown using that scheme? 1. tear down will cancel all inflight requests, and then multipath will clear the path. 2. and then we may freeze the controler. 3. nvme_ns_head_submit_bio can not find the reconnection controller as valid path, so it is safe. > . > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-17 7:59 ` Chao Leng @ 2021-03-17 18:43 ` Sagi Grimberg 2021-03-18 1:51 ` Chao Leng 0 siblings, 1 reply; 30+ messages in thread From: Sagi Grimberg @ 2021-03-17 18:43 UTC (permalink / raw) To: Chao Leng, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Chaitanya Kulkarni >>>>> Will it work if nvme mpath used request NOWAIT flag for its >>>>> submit_bio() >>>>> call, and add the bio to the requeue_list if blk_queue_enter() >>>>> fails? I >>>>> think that looks like another way to resolve the deadlock, but we need >>>>> the block layer to return a failed status to the original caller. >> >> Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here. dm-mpath also >> uses it for its request allocation for similar reasons. >> >>>> >>>> But who would kick the requeue list? and that would make >>>> near-tag-exhaust performance stink... >> >> The multipath code would have to kick the list. We could also try to >> split into two flags, one that affects blk_queue_enter and one that >> affects the tag allocation. >> >>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to >>> nvme_rdma_configure_io_queues can fix it. >>> It can also avoid I/O hang long time if reconnection failed. >> >> Can you explain how we'd still ensure that no new commands get queued >> during teardown using that scheme? > 1. tear down will cancel all inflight requests, and then multipath will > clear the path. > 2. and then we may freeze the controler. > 3. nvme_ns_head_submit_bio can not find the reconnection controller as > valid path, so it is safe. 
In non-mpath (which unfortunately is a valid use-case), there is no failover, and we cannot freeze the queue after we stopped (and/or started) the queues because then fail_non_ready_command() constantly return BLK_STS_RESOURCE (just causing a re-submission over and over again) and the freeze will never complete (the commands are still inflight from the queue->g_usage_counter perspective). So I think we should still start queue freeze before we quiesce the queues. I still don't see how the mpath NOWAIT suggestion works either... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-17 18:43 ` Sagi Grimberg @ 2021-03-18 1:51 ` Chao Leng 2021-03-18 4:45 ` Christoph Hellwig 2021-03-18 18:46 ` Sagi Grimberg 0 siblings, 2 replies; 30+ messages in thread From: Chao Leng @ 2021-03-18 1:51 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig Cc: Keith Busch, linux-nvme, Chaitanya Kulkarni On 2021/3/18 2:43, Sagi Grimberg wrote: > >>>>>> Will it work if nvme mpath used request NOWAIT flag for its submit_bio() >>>>>> call, and add the bio to the requeue_list if blk_queue_enter() fails? I >>>>>> think that looks like another way to resolve the deadlock, but we need >>>>>> the block layer to return a failed status to the original caller. >>> >>> Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here. dm-mpath also >>> uses it for its request allocation for similar reasons. >>> >>>>> >>>>> But who would kick the requeue list? and that would make near-tag-exhaust performance stink... >>> >>> The multipath code would have to kick the list. We could also try to >>> split into two flags, one that affects blk_queue_enter and one that >>> affects the tag allocation. >>> >>>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it. >>>> It can also avoid I/O hang long time if reconnection failed. >>> >>> Can you explain how we'd still ensure that no new commands get queued >>> during teardown using that scheme? >> 1. tear down will cancel all inflight requests, and then multipath will clear the path. >> 2. and then we may freeze the controler. >> 3. nvme_ns_head_submit_bio can not find the reconnection controller as valid path, so it is safe. 
> > In non-mpath (which unfortunately is a valid use-case), there is no > failover, and we cannot freeze the queue after we stopped (and/or > started) the queues because then fail_non_ready_command() constantly return BLK_STS_RESOURCE (just causing a re-submission over and over > again) and the freeze will never complete (the commands are still > inflight from the queue->g_usage_counter perspective). If the request set the flags to REQ_FAILFAST_xxx, will hang long time if reconnection failed. This is not expected. Another, If the controller is not live and the controller is freezed ,fast_io_fail_tmo will not work. This is also not expected. So I think freezing the controller when reconnecting is not good idea. It's really not good behavior to try again and again. This is at least better than request hang long time. > > So I think we should still start queue freeze before we quiesce > the queues. We should unquiesce and unfreeze the queues when reconnecting, otherwise fast_io_fail_tmo will not work. > > I still don't see how the mpath NOWAIT suggestion works either... mpath will queuue request to other live path or requeue the request(if no used path), so it will not wait. > . _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-18 1:51 ` Chao Leng @ 2021-03-18 4:45 ` Christoph Hellwig 2021-03-18 18:46 ` Sagi Grimberg 1 sibling, 0 replies; 30+ messages in thread From: Christoph Hellwig @ 2021-03-18 4:45 UTC (permalink / raw) To: Chao Leng Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme, Chaitanya Kulkarni On Thu, Mar 18, 2021 at 09:51:14AM +0800, Chao Leng wrote: >>>> The multipath code would have to kick the list. We could also try to >>>> split into two flags, one that affects blk_queue_enter and one that >>>> affects the tag allocation. >>>> >>>>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it. >>>>> It can also avoid I/O hang long time if reconnection failed. >>>> >>>> Can you explain how we'd still ensure that no new commands get queued >>>> during teardown using that scheme? >>> 1. tear down will cancel all inflight requests, and then multipath will clear the path. >>> 2. and then we may freeze the controler. >>> 3. nvme_ns_head_submit_bio can not find the reconnection controller as valid path, so it is safe. >> >> In non-mpath (which unfortunately is a valid use-case), there is no >> failover, and we cannot freeze the queue after we stopped (and/or >> started) the queues because then fail_non_ready_command() constantly return BLK_STS_RESOURCE (just causing a re-submission over and over >> again) and the freeze will never complete (the commands are still >> inflight from the queue->g_usage_counter perspective). > If the request set the flags to REQ_FAILFAST_xxx, will hang long time if reconnection failed. > This is not expected. > Another, If the controller is not live and the controller is freezed ,fast_io_fail_tmo will not work. > This is also not expected. > So I think freezing the controller when reconnecting is not good idea. > It's really not good behavior to try again and again. 
This is at least better than request hang long time. Well, it is pretty clear that REQ_FAILFAST_* (and I'm still confused about the three variants of that) should not block in blk_queue_enter, and we should make sure nvme-multipath triggers that. Let me thing of a good way to refactor blk_queue_enter first to make that least painful. >> So I think we should still start queue freeze before we quiesce >> the queues. > We should unquiesce and unfreeze the queues when reconnecting, otherwise fast_io_fail_tmo will not work. >> >> I still don't see how the mpath NOWAIT suggestion works either... > mpath will queuue request to other live path or requeue the request(if no used path), so it will not wait. >> . Yes. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-18 1:51 ` Chao Leng 2021-03-18 4:45 ` Christoph Hellwig @ 2021-03-18 18:46 ` Sagi Grimberg 2021-03-18 19:16 ` Keith Busch 1 sibling, 1 reply; 30+ messages in thread From: Sagi Grimberg @ 2021-03-18 18:46 UTC (permalink / raw) To: Chao Leng, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Chaitanya Kulkarni >>>>>>> Will it work if nvme mpath used request NOWAIT flag for its >>>>>>> submit_bio() >>>>>>> call, and add the bio to the requeue_list if blk_queue_enter() >>>>>>> fails? I >>>>>>> think that looks like another way to resolve the deadlock, but we >>>>>>> need >>>>>>> the block layer to return a failed status to the original caller. >>>> >>>> Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here. dm-mpath also >>>> uses it for its request allocation for similar reasons. >>>> >>>>>> >>>>>> But who would kick the requeue list? and that would make >>>>>> near-tag-exhaust performance stink... >>>> >>>> The multipath code would have to kick the list. We could also try to >>>> split into two flags, one that affects blk_queue_enter and one that >>>> affects the tag allocation. >>>> >>>>> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to >>>>> nvme_rdma_configure_io_queues can fix it. >>>>> It can also avoid I/O hang long time if reconnection failed. >>>> >>>> Can you explain how we'd still ensure that no new commands get queued >>>> during teardown using that scheme? >>> 1. tear down will cancel all inflight requests, and then multipath >>> will clear the path. >>> 2. and then we may freeze the controler. >>> 3. nvme_ns_head_submit_bio can not find the reconnection controller >>> as valid path, so it is safe. 
>> >> In non-mpath (which unfortunately is a valid use-case), there is no >> failover, and we cannot freeze the queue after we stopped (and/or >> started) the queues because then fail_non_ready_command() constantly >> return BLK_STS_RESOURCE (just causing a re-submission over and over >> again) and the freeze will never complete (the commands are still >> inflight from the queue->g_usage_counter perspective). > If the request set the flags to REQ_FAILFAST_xxx, will hang long time if > reconnection failed. > This is not expected. > Another, If the controller is not live and the controller is freezed > ,fast_io_fail_tmo will not work. > This is also not expected. No arguments that the queue needs to unfreeze asap for mpath, that is exactly what the patch does. The only unnatural part is the non-mpath case where if we unfreeze the queue before we reconnect I/Os will fail, which is we should also respect fast_fail_tmo. The main issue here is that there are two behaviors that we should maintain based if its mpath or non-mpath... > So I think freezing the controller when reconnecting is not good idea. As said, for mpath its for sure not, but for non-mpath that matches the expected behavior. > It's really not good behavior to try again and again. This is at least > better than request hang long time. I am not sure I understand how that even supposed to work TBH. >> So I think we should still start queue freeze before we quiesce >> the queues. > We should unquiesce and unfreeze the queues when reconnecting, otherwise > fast_io_fail_tmo will not work. >> >> I still don't see how the mpath NOWAIT suggestion works either... > mpath will queuue request to other live path or requeue the request(if > no used path), so it will not wait. Placing the request on the requeue_list is fine, but the question is when to kick the requeue_work, nothing guarantees that an alternate path exist or will in a sane period. So constantly requeue+kick sounds like a really bad practice to me. 
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-18 18:46 ` Sagi Grimberg @ 2021-03-18 19:16 ` Keith Busch 2021-03-18 19:31 ` Sagi Grimberg 0 siblings, 1 reply; 30+ messages in thread From: Keith Busch @ 2021-03-18 19:16 UTC (permalink / raw) To: Sagi Grimberg Cc: Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni On Thu, Mar 18, 2021 at 11:46:08AM -0700, Sagi Grimberg wrote: > > Placing the request on the requeue_list is fine, but the question is > when to kick the requeue_work, nothing guarantees that an alternate path > exist or will in a sane period. So constantly requeue+kick sounds like > a really bad practice to me. nvme_mpath_set_live(), where you reported the deadlock, kicks the requeue_list. The difference that NOWAIT provides is that nvme_mpath_set_live's schronize_srcu() is no longer blocked forever because the .submit_bio() isn't waiting for entery on a frozen queue, so now it's free to schedule the dispatch. There's probably an optimization to kick it sooner if there's a viable alternate path, but that could be a follow on. If there's no immediate viable path, then the requests would remain on the requeue list. That currently happens as long as there's a potential controller in a reset or connecting state. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
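The interaction described above can be sketched as a tiny queue model (hypothetical `toy_*` helpers; the real mechanism is the requeue_list/requeue_work machinery in drivers/nvme/host/multipath.c): failed NOWAIT submissions are parked on the head's requeue list, and whatever makes a path live kicks that list:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy bio and per-head requeue list. */
struct toy_bio { int id; struct toy_bio *next; };

static struct toy_bio *requeue_list; /* bios parked while no path is usable */
static bool path_live;               /* a controller path is LIVE */
static int dispatched;               /* bios that reached a live path */

static void toy_submit(struct toy_bio *bio)
{
	if (!path_live) {            /* models NOWAIT queue entry failing */
		bio->next = requeue_list;
		requeue_list = bio;  /* park on the head's requeue list */
		return;
	}
	dispatched++;                /* submitted down the live path */
}

/* Models a path becoming live: mark it usable, then kick the requeue list. */
static void toy_set_live(void)
{
	path_live = true;
	while (requeue_list) {
		struct toy_bio *bio = requeue_list;
		requeue_list = bio->next;
		toy_submit(bio);     /* replays each parked bio */
	}
}
```

This is why the NOWAIT approach avoids the deadlock: the parked bios do not pin the srcu read lock, so the synchronize in the set-live path can complete and then replay them.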
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-18 19:16 ` Keith Busch @ 2021-03-18 19:31 ` Sagi Grimberg 2021-03-18 21:52 ` Keith Busch 0 siblings, 1 reply; 30+ messages in thread From: Sagi Grimberg @ 2021-03-18 19:31 UTC (permalink / raw) To: Keith Busch; +Cc: Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni >> Placing the request on the requeue_list is fine, but the question is >> when to kick the requeue_work, nothing guarantees that an alternate path >> exist or will in a sane period. So constantly requeue+kick sounds like >> a really bad practice to me. > > nvme_mpath_set_live(), where you reported the deadlock, kicks the > requeue_list. The difference that NOWAIT provides is that > nvme_mpath_set_live's schronize_srcu() is no longer blocked forever > because the .submit_bio() isn't waiting for entery on a frozen queue, so > now it's free to schedule the dispatch. > > There's probably an optimization to kick it sooner if there's a viable > alternate path, but that could be a follow on. That would be mandatory I think, otherwise this would introduce a regression... > If there's no immediate viable path, then the requests would remain on > the requeue list. That currently happens as long as there's a potential > controller in a reset or connecting state. Well, also worth to keep in mind that now we'll need to clone the bio because we need to override bi_end_io which adds us some overhead in the data path. Unless we make submit_bio return a status which is a much bigger scope of a change I would expect... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-18 19:31 ` Sagi Grimberg @ 2021-03-18 21:52 ` Keith Busch 2021-03-18 22:45 ` Sagi Grimberg 2021-03-19 14:05 ` Christoph Hellwig 0 siblings, 2 replies; 30+ messages in thread From: Keith Busch @ 2021-03-18 21:52 UTC (permalink / raw) To: Sagi Grimberg Cc: Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni On Thu, Mar 18, 2021 at 12:31:35PM -0700, Sagi Grimberg wrote: > > > > Placing the request on the requeue_list is fine, but the question is > > > when to kick the requeue_work, nothing guarantees that an alternate path > > > exist or will in a sane period. So constantly requeue+kick sounds like > > > a really bad practice to me. > > > > nvme_mpath_set_live(), where you reported the deadlock, kicks the > > requeue_list. The difference that NOWAIT provides is that > > nvme_mpath_set_live's schronize_srcu() is no longer blocked forever > > because the .submit_bio() isn't waiting for entery on a frozen queue, so > > now it's free to schedule the dispatch. > > > > There's probably an optimization to kick it sooner if there's a viable > > alternate path, but that could be a follow on. > > That would be mandatory I think, otherwise this would introduce > a regression... > > > If there's no immediate viable path, then the requests would remain on > > the requeue list. That currently happens as long as there's a potential > > controller in a reset or connecting state. > > Well, also worth to keep in mind that now we'll need to clone the bio > because we need to override bi_end_io which adds us some overhead > in the data path. Unless we make submit_bio return a status which > is a much bigger scope of a change I would expect... Having submit_bio() return the enter status was where I was going with this, but the recursive handling makes this more complicated than I initially thought. 
If you use the NOWAIT flag today with a freezing queue, the IO will end with BLK_STS_AGAIN and punt retry handling to the application. I'm guessing you don't want that to happen, so a little more is required for this idea. Since it's an error path, perhaps a block operations callback is okay? Something like this compile tested patch?

---
diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..423b89005a28 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -475,6 +475,16 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 	}
 }
 
+static inline void bio_enter_error(struct bio *bio)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+	if (disk->fops->enter_err)
+		disk->fops->enter_err(bio);
+	else
+		bio_wouldblock_error(bio);
+}
+
 static inline int bio_queue_enter(struct bio *bio)
 {
 	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
@@ -484,7 +494,7 @@ static inline int bio_queue_enter(struct bio *bio)
 	ret = blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0);
 	if (unlikely(ret)) {
 		if (nowait && !blk_queue_dying(q))
-			bio_wouldblock_error(bio);
+			bio_enter_error(bio);
 		else
 			bio_io_error(bio);
 	}
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index edf19bbb904f..2c27eeaa83b0 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2366,9 +2366,24 @@ static void nvme_ns_head_release(struct gendisk *disk, fmode_t mode)
 	nvme_put_ns_head(disk->private_data);
 }
 
+void nvme_ns_head_enter_err(struct bio *bio)
+{
+	struct nvme_ns_head *head = bio->bi_bdev->bd_disk->private_data;
+
+	if (nvme_available_path(head)) {
+		spin_lock_irq(&head->requeue_lock);
+		bio_list_add(&head->requeue_list, bio);
+		spin_unlock_irq(&head->requeue_lock);
+	} else {
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+	}
+}
+
 const struct block_device_operations nvme_ns_head_ops = {
 	.owner		= THIS_MODULE,
 	.submit_bio	= nvme_ns_head_submit_bio,
+	.enter_err	= nvme_ns_head_enter_err,
 	.open		= nvme_ns_head_open,
 	.release	= nvme_ns_head_release,
 	.ioctl		= nvme_ioctl,
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..47595bb09032 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -274,7 +274,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 	return ns;
 }
 
-static bool nvme_available_path(struct nvme_ns_head *head)
+bool nvme_available_path(struct nvme_ns_head *head)
 {
 	struct nvme_ns *ns;
 
@@ -313,7 +313,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 	ns = nvme_find_path(head);
 	if (likely(ns)) {
 		bio_set_dev(bio, ns->disk->part0);
-		bio->bi_opf |= REQ_NVME_MPATH;
+		bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
 		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
 		ret = submit_bio_noacct(bio);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 815c032a190e..5dbd6baebd70 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -677,6 +677,7 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
 struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
 blk_qc_t nvme_ns_head_submit_bio(struct bio *bio);
+bool nvme_available_path(struct nvme_ns_head *head);
 
 static inline void nvme_mpath_check_last_path(struct nvme_ns *ns)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc6bc8383b43..b5ae1aa292c1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1862,6 +1862,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
 
 struct block_device_operations {
 	blk_qc_t (*submit_bio) (struct bio *bio);
+	void (*enter_err) (struct bio *bio);
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
 	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
--
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-18 21:52 ` Keith Busch @ 2021-03-18 22:45 ` Sagi Grimberg 2021-03-19 14:05 ` Christoph Hellwig 1 sibling, 0 replies; 30+ messages in thread From: Sagi Grimberg @ 2021-03-18 22:45 UTC (permalink / raw) To: Keith Busch; +Cc: Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni >>>> Placing the request on the requeue_list is fine, but the question is >>>> when to kick the requeue_work, nothing guarantees that an alternate path >>>> exist or will in a sane period. So constantly requeue+kick sounds like >>>> a really bad practice to me. >>> >>> nvme_mpath_set_live(), where you reported the deadlock, kicks the >>> requeue_list. The difference that NOWAIT provides is that >>> nvme_mpath_set_live's schronize_srcu() is no longer blocked forever >>> because the .submit_bio() isn't waiting for entery on a frozen queue, so >>> now it's free to schedule the dispatch. >>> >>> There's probably an optimization to kick it sooner if there's a viable >>> alternate path, but that could be a follow on. >> >> That would be mandatory I think, otherwise this would introduce >> a regression... >> >>> If there's no immediate viable path, then the requests would remain on >>> the requeue list. That currently happens as long as there's a potential >>> controller in a reset or connecting state. >> >> Well, also worth to keep in mind that now we'll need to clone the bio >> because we need to override bi_end_io which adds us some overhead >> in the data path. Unless we make submit_bio return a status which >> is a much bigger scope of a change I would expect... > > Having submit_bio() return the enter status was where I was going with > this, but the recursive handling makes this more complicated than I > initially thought. > > If you use the NOWAIT flag today with a freezing queue, the IO will end > with BLK_STS_AGAIN and punt retry handling to the application. 
I'm > guessing you don't want that to happen, so a little more is required for > this idea. > > Since it's an error path, perhaps a block operations callback is okay? > Something like this compile tested patch? Maybe... I don't see any apparent reason why it would not work... > --- > diff --git a/block/blk-core.c b/block/blk-core.c > index fc60ff208497..423b89005a28 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -475,6 +475,16 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) > } > } > > +static inline void bio_enter_error(struct bio *bio) > +{ > + struct gendisk *disk = bio->bi_bdev->bd_disk; > + > + if (disk->fops->enter_err) > + disk->fops->enter_err(bio); > + else > + bio_wouldblock_error(bio); > +} > + > static inline int bio_queue_enter(struct bio *bio) > { > struct request_queue *q = bio->bi_bdev->bd_disk->queue; > @@ -484,7 +494,7 @@ static inline int bio_queue_enter(struct bio *bio) > ret = blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0); > if (unlikely(ret)) { > if (nowait && !blk_queue_dying(q)) > - bio_wouldblock_error(bio); > + bio_enter_error(bio); > else > bio_io_error(bio); > } > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index edf19bbb904f..2c27eeaa83b0 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -2366,9 +2366,24 @@ static void nvme_ns_head_release(struct gendisk *disk, fmode_t mode) > nvme_put_ns_head(disk->private_data); > } > > +void nvme_ns_head_enter_err(struct bio *bio) > +{ > + struct nvme_ns_head *head = bio->bi_bdev->bd_disk->private_data; > + > + if (nvme_available_path(head)) { > + spin_lock_irq(&head->requeue_lock); > + bio_list_add(&head->requeue_list, bio); > + spin_unlock_irq(&head->requeue_lock); > + } else { > + bio->bi_status = BLK_STS_IOERR; > + bio_endio(bio); > + } > +} Nice, you can take the error path in nvme_ns_head_submit_bio and use it there too: -- diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c 
index 5c67a5e96738..8d0ef83f545c 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -318,17 +318,8 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) trace_block_bio_remap(bio, disk_devt(ns->head->disk), bio->bi_iter.bi_sector); ret = submit_bio_noacct(bio); - } else if (nvme_available_path(head)) { - dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); - - spin_lock_irq(&head->requeue_lock); - bio_list_add(&head->requeue_list, bio); - spin_unlock_irq(&head->requeue_lock); } else { - dev_warn_ratelimited(dev, "no available path - failing I/O\n"); - - bio->bi_status = BLK_STS_IOERR; - bio_endio(bio); + nvme_ns_head_enter_err(bio); } srcu_read_unlock(&head->srcu, srcu_idx); -- And move the prints as well for some logging.. > + > const struct block_device_operations nvme_ns_head_ops = { > .owner = THIS_MODULE, > .submit_bio = nvme_ns_head_submit_bio, > + .enter_err = nvme_ns_head_enter_err, > .open = nvme_ns_head_open, > .release = nvme_ns_head_release, > .ioctl = nvme_ioctl, > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index a1d476e1ac02..47595bb09032 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -274,7 +274,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) > return ns; > } > > -static bool nvme_available_path(struct nvme_ns_head *head) > +bool nvme_available_path(struct nvme_ns_head *head) > { > struct nvme_ns *ns; > > @@ -313,7 +313,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) > ns = nvme_find_path(head); > if (likely(ns)) { > bio_set_dev(bio, ns->disk->part0); > - bio->bi_opf |= REQ_NVME_MPATH; > + bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT; > trace_block_bio_remap(bio, disk_devt(ns->head->disk), > bio->bi_iter.bi_sector); > ret = submit_bio_noacct(bio); > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 815c032a190e..5dbd6baebd70 100644 > --- a/drivers/nvme/host/nvme.h > +++ 
b/drivers/nvme/host/nvme.h > @@ -677,6 +677,7 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns); > void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl); > struct nvme_ns *nvme_find_path(struct nvme_ns_head *head); > blk_qc_t nvme_ns_head_submit_bio(struct bio *bio); > +bool nvme_available_path(struct nvme_ns_head *head); > > static inline void nvme_mpath_check_last_path(struct nvme_ns *ns) > { > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index bc6bc8383b43..b5ae1aa292c1 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1862,6 +1862,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { } > > struct block_device_operations { > blk_qc_t (*submit_bio) (struct bio *bio); > + void (*enter_err) (struct bio *bio); > int (*open) (struct block_device *, fmode_t); > void (*release) (struct gendisk *, fmode_t); > int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int); > -- > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-18 21:52 ` Keith Busch 2021-03-18 22:45 ` Sagi Grimberg @ 2021-03-19 14:05 ` Christoph Hellwig 2021-03-19 17:28 ` Christoph Hellwig 1 sibling, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2021-03-19 14:05 UTC (permalink / raw) To: Keith Busch Cc: Sagi Grimberg, Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni On Fri, Mar 19, 2021 at 06:52:56AM +0900, Keith Busch wrote: > Having submit_bio() return the enter status was where I was going with > this, but the recursive handling makes this more complicated than I > initially thought. Note that the recursion handling is not really required for nvme-multipath. I have some plans to actually kill it off entirely for blk-mq submissions, which needs work on the bounce buffering and bio splitting code, but should not be too hard. > If you use the NOWAIT flag today with a freezing queue, the IO will end > with BLK_STS_AGAIN and punt retry handling to the application. I'm > guessing you don't want that to happen, so a little more is required for > this idea. We really should not use NOWAIT but a flag that only escapes the freeze protection. I think REQ_FAILFAST_DRIVER should probably be changed to trigger that, but even if not we could add a new flag. > Since it's an error path, perhaps a block operations callback is okay? > Something like this compile tested patch? We really should not need an indirection. And more importantly I don't think the consuming driver cares, it really is the submitting one. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-19 14:05 ` Christoph Hellwig @ 2021-03-19 17:28 ` Christoph Hellwig 2021-03-19 19:07 ` Keith Busch 2021-03-19 19:34 ` Sagi Grimberg 0 siblings, 2 replies; 30+ messages in thread From: Christoph Hellwig @ 2021-03-19 17:28 UTC (permalink / raw) To: Keith Busch Cc: Sagi Grimberg, Chao Leng, Christoph Hellwig, linux-nvme, Chaitanya Kulkarni What about something like this? diff --git a/block/blk-core.c b/block/blk-core.c index fc60ff20849738..4344f3c9058282 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -792,7 +792,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, return BLK_STS_OK; } -static noinline_for_stack bool submit_bio_checks(struct bio *bio) +noinline_for_stack bool submit_bio_checks(struct bio *bio) { struct block_device *bdev = bio->bi_bdev; struct request_queue *q = bdev->bd_disk->queue; diff --git a/block/blk-mq.c b/block/blk-mq.c index d4d7c1caa43966..4ff85692843b49 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2286,6 +2286,43 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio) return BLK_QC_T_NONE; } +/** + * blk_mq_submit_bio_direct - hand a bio directly to the driver for I/O + * @bio: The bio describing the location in memory and on the device. + * + * This function behaves similarly to submit_bio_noacct(), but never waits + * for the queue to be unfrozen; instead it returns false and lets the caller + * deal with the fallout. It also does not protect against recursion and thus + * must only be used if the called driver is known to be blk-mq based.
+ */ +bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc) +{ + struct gendisk *disk = bio->bi_bdev->bd_disk; + struct request_queue *q = disk->queue; + + if (WARN_ON_ONCE(!current->bio_list) || + WARN_ON_ONCE(disk->fops->submit_bio)) { + bio_io_error(bio); + goto fail; + } + if (!submit_bio_checks(bio)) + goto fail; + + if (unlikely(blk_queue_enter(q, BLK_MQ_REQ_NOWAIT))) + return false; + if (!blk_crypto_bio_prep(&bio)) + goto fail_queue_exit; + *qc = blk_mq_submit_bio(bio); + return true; + +fail_queue_exit: + blk_queue_exit(disk->queue); +fail: + *qc = BLK_QC_T_NONE; + return true; +} +EXPORT_SYMBOL_GPL(blk_mq_submit_bio_direct); + void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, unsigned int hctx_idx) { diff --git a/block/blk.h b/block/blk.h index 3b53e44b967e4e..c4c66b2a9ffb19 100644 --- a/block/blk.h +++ b/block/blk.h @@ -221,6 +221,7 @@ ssize_t part_timeout_show(struct device *, struct device_attribute *, char *); ssize_t part_timeout_store(struct device *, struct device_attribute *, const char *, size_t); +bool submit_bio_checks(struct bio *bio); void __blk_queue_split(struct bio **bio, unsigned int *nr_segs); int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs); diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index a1d476e1ac020f..92adebfaf86fd1 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) */ blk_queue_split(&bio); +retry: srcu_idx = srcu_read_lock(&head->srcu); ns = nvme_find_path(head); if (likely(ns)) { @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) bio->bi_opf |= REQ_NVME_MPATH; trace_block_bio_remap(bio, disk_devt(ns->head->disk), bio->bi_iter.bi_sector); - ret = submit_bio_noacct(bio); + + if (!blk_mq_submit_bio_direct(bio, &ret)) { + nvme_mpath_clear_current_path(ns); + srcu_read_unlock(&head->srcu, srcu_idx); + goto retry; + } } else 
if (nvme_available_path(head)) { dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 2c473c9b899089..6804f397106ada 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -615,6 +615,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio, } blk_qc_t blk_mq_submit_bio(struct bio *bio); +bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc); void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx, struct lock_class_key *key); _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-19 17:28 ` Christoph Hellwig @ 2021-03-19 19:07 ` Keith Busch 2021-03-19 19:34 ` Sagi Grimberg 1 sibling, 0 replies; 30+ messages in thread From: Keith Busch @ 2021-03-19 19:07 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Chao Leng, linux-nvme, Chaitanya Kulkarni On Fri, Mar 19, 2021 at 06:28:17PM +0100, Christoph Hellwig wrote: > What about something like this? Yes, this looks good. > diff --git a/block/blk-core.c b/block/blk-core.c > index fc60ff20849738..4344f3c9058282 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -792,7 +792,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, > return BLK_STS_OK; > } > > -static noinline_for_stack bool submit_bio_checks(struct bio *bio) > +noinline_for_stack bool submit_bio_checks(struct bio *bio) > { > struct block_device *bdev = bio->bi_bdev; > struct request_queue *q = bdev->bd_disk->queue; > diff --git a/block/blk-mq.c b/block/blk-mq.c > index d4d7c1caa43966..4ff85692843b49 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -2286,6 +2286,43 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio) > return BLK_QC_T_NONE; > } > > +/** > + * blk_mq_submit_bio_direct - hand a bio directly to the driver for I/O > + * @bio: The bio describing the location in memory and on the device. > + * > + * This function behaves similar to submit_bio_noacct(), but does never waits > + * for the queue to be unfreozen, instead it return false and lets the caller > + * deal with the fallout. It also does not protect against recursion and thus > + * must only be used if the called driver is known to be blk-mq based. 
> + */ > +bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc) > +{ > + struct gendisk *disk = bio->bi_bdev->bd_disk; > + struct request_queue *q = disk->queue; > + > + if (WARN_ON_ONCE(!current->bio_list) || > + WARN_ON_ONCE(disk->fops->submit_bio)) { > + bio_io_error(bio); > + goto fail; > + } > + if (!submit_bio_checks(bio)) > + goto fail; > + > + if (unlikely(blk_queue_enter(q, BLK_MQ_REQ_NOWAIT))) > + return false; > + if (!blk_crypto_bio_prep(&bio)) > + goto fail_queue_exit; > + *qc = blk_mq_submit_bio(bio); > + return true; > + > +fail_queue_exit: > + blk_queue_exit(disk->queue); > +fail: > + *qc = BLK_QC_T_NONE; > + return true; > +} > +EXPORT_SYMBOL_GPL(blk_mq_submit_bio_direct); > + > void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, > unsigned int hctx_idx) > { > diff --git a/block/blk.h b/block/blk.h > index 3b53e44b967e4e..c4c66b2a9ffb19 100644 > --- a/block/blk.h > +++ b/block/blk.h > @@ -221,6 +221,7 @@ ssize_t part_timeout_show(struct device *, struct device_attribute *, char *); > ssize_t part_timeout_store(struct device *, struct device_attribute *, > const char *, size_t); > > +bool submit_bio_checks(struct bio *bio); > void __blk_queue_split(struct bio **bio, unsigned int *nr_segs); > int ll_back_merge_fn(struct request *req, struct bio *bio, > unsigned int nr_segs); > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index a1d476e1ac020f..92adebfaf86fd1 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) > */ > blk_queue_split(&bio); > > +retry: > srcu_idx = srcu_read_lock(&head->srcu); > ns = nvme_find_path(head); > if (likely(ns)) { > @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) > bio->bi_opf |= REQ_NVME_MPATH; > trace_block_bio_remap(bio, disk_devt(ns->head->disk), > bio->bi_iter.bi_sector); > - ret = submit_bio_noacct(bio); > + > + if 
(!blk_mq_submit_bio_direct(bio, &ret)) { > + nvme_mpath_clear_current_path(ns); > + srcu_read_unlock(&head->srcu, srcu_idx); > + goto retry; > + } > } else if (nvme_available_path(head)) { > dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > index 2c473c9b899089..6804f397106ada 100644 > --- a/include/linux/blk-mq.h > +++ b/include/linux/blk-mq.h > @@ -615,6 +615,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio, > } > > blk_qc_t blk_mq_submit_bio(struct bio *bio); > +bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc); > void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx, > struct lock_class_key *key); > _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-19 17:28 ` Christoph Hellwig 2021-03-19 19:07 ` Keith Busch @ 2021-03-19 19:34 ` Sagi Grimberg 2021-03-20 6:11 ` Christoph Hellwig 1 sibling, 1 reply; 30+ messages in thread From: Sagi Grimberg @ 2021-03-19 19:34 UTC (permalink / raw) To: Christoph Hellwig, Keith Busch; +Cc: Chao Leng, linux-nvme, Chaitanya Kulkarni > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index a1d476e1ac020f..92adebfaf86fd1 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) > */ > blk_queue_split(&bio); > > +retry: > srcu_idx = srcu_read_lock(&head->srcu); > ns = nvme_find_path(head); > if (likely(ns)) { > @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) > bio->bi_opf |= REQ_NVME_MPATH; > trace_block_bio_remap(bio, disk_devt(ns->head->disk), > bio->bi_iter.bi_sector); > - ret = submit_bio_noacct(bio); > + > + if (!blk_mq_submit_bio_direct(bio, &ret)) { > + nvme_mpath_clear_current_path(ns); > + srcu_read_unlock(&head->srcu, srcu_idx); Its a bit unusual to see mutation of a data protected by srcu while under the srcu_read_lock, can that be problematic somehow? _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-19 19:34 ` Sagi Grimberg @ 2021-03-20 6:11 ` Christoph Hellwig 2021-03-21 6:49 ` Sagi Grimberg 0 siblings, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2021-03-20 6:11 UTC (permalink / raw) To: Sagi Grimberg Cc: Christoph Hellwig, Keith Busch, Chao Leng, linux-nvme, Chaitanya Kulkarni On Fri, Mar 19, 2021 at 12:34:25PM -0700, Sagi Grimberg wrote: > >> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c >> index a1d476e1ac020f..92adebfaf86fd1 100644 >> --- a/drivers/nvme/host/multipath.c >> +++ b/drivers/nvme/host/multipath.c >> @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) >> */ >> blk_queue_split(&bio); >> +retry: >> srcu_idx = srcu_read_lock(&head->srcu); >> ns = nvme_find_path(head); >> if (likely(ns)) { >> @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) >> bio->bi_opf |= REQ_NVME_MPATH; >> trace_block_bio_remap(bio, disk_devt(ns->head->disk), >> bio->bi_iter.bi_sector); >> - ret = submit_bio_noacct(bio); >> + >> + if (!blk_mq_submit_bio_direct(bio, &ret)) { >> + nvme_mpath_clear_current_path(ns); >> + srcu_read_unlock(&head->srcu, srcu_idx); > > Its a bit unusual to see mutation of a data protected by srcu while > under the srcu_read_lock, can that be problematic somehow? Hmm. I don't think head->srcu is intended to protect the current path. We also call nvme_mpath_clear_current_path from nvme_complete_rq or nvme_ns_remove, which has no locking at all. The srcu protection is for head->list, but leaks into the current path due to the __rcu annotations. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-20 6:11 ` Christoph Hellwig @ 2021-03-21 6:49 ` Sagi Grimberg 2021-03-22 6:34 ` Christoph Hellwig 0 siblings, 1 reply; 30+ messages in thread From: Sagi Grimberg @ 2021-03-21 6:49 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Keith Busch, Chao Leng, linux-nvme, Chaitanya Kulkarni >>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c >>> index a1d476e1ac020f..92adebfaf86fd1 100644 >>> --- a/drivers/nvme/host/multipath.c >>> +++ b/drivers/nvme/host/multipath.c >>> @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) >>> */ >>> blk_queue_split(&bio); >>> +retry: >>> srcu_idx = srcu_read_lock(&head->srcu); >>> ns = nvme_find_path(head); >>> if (likely(ns)) { >>> @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio) >>> bio->bi_opf |= REQ_NVME_MPATH; >>> trace_block_bio_remap(bio, disk_devt(ns->head->disk), >>> bio->bi_iter.bi_sector); >>> - ret = submit_bio_noacct(bio); >>> + >>> + if (!blk_mq_submit_bio_direct(bio, &ret)) { >>> + nvme_mpath_clear_current_path(ns); >>> + srcu_read_unlock(&head->srcu, srcu_idx); >> >> Its a bit unusual to see mutation of a data protected by srcu while >> under the srcu_read_lock, can that be problematic somehow? > > Hmm. I don't think head->srcu is intended to protect the current path. > We also call nvme_mpath_clear_current_path from nvme_complete_rq or > nvme_ns_remove, which has no locking at all. The srcu protection is > for head->list, but leaks into the current path due to the __rcu > annotations. OK, care to send a formal patch that I can give a test drive? Also, given that this issue has gone back to stable 5.4 and 5.10 we will need to take care of those too. We should make sure to annotate the fixes tags in this patch and probably also understand how we can create a version of this to apply cleanly (for sure there are some extra dependencies). 
_______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-21 6:49 ` Sagi Grimberg @ 2021-03-22 6:34 ` Christoph Hellwig 0 siblings, 0 replies; 30+ messages in thread From: Christoph Hellwig @ 2021-03-22 6:34 UTC (permalink / raw) To: Sagi Grimberg Cc: Christoph Hellwig, Keith Busch, Chao Leng, linux-nvme, Chaitanya Kulkarni On Sat, Mar 20, 2021 at 11:49:35PM -0700, Sagi Grimberg wrote: >> Hmm. I don't think head->srcu is intended to protect the current path. >> We also call nvme_mpath_clear_current_path from nvme_complete_rq or >> nvme_ns_remove, which has no locking at all. The srcu protection is >> for head->list, but leaks into the current path due to the __rcu >> annotations. > > OK, care to send a formal patch that I can give a test drive? I'll write a changelog. But as-is this should be testable. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs 2021-03-17 6:59 ` Christoph Hellwig 2021-03-17 7:59 ` Chao Leng @ 2021-03-17 8:16 ` Sagi Grimberg 1 sibling, 0 replies; 30+ messages in thread From: Sagi Grimberg @ 2021-03-17 8:16 UTC (permalink / raw) To: Christoph Hellwig, Chao Leng; +Cc: Keith Busch, linux-nvme, Chaitanya Kulkarni >>>> Will it work if nvme mpath used request NOWAIT flag for its submit_bio() >>>> call, and add the bio to the requeue_list if blk_queue_enter() fails? I >>>> think that looks like another way to resolve the deadlock, but we need >>>> the block layer to return a failed status to the original caller. > > Yes, I think BLK_MQ_REQ_NOWAIT makes total sense here. BTW, the specific hang reported is not blocking on tag allocation, but rather than on blk_queue_enter blocking on a frozen queue. > dm-mpath also uses it for its request allocation for similar reasons. That is the rq based dm, and I think it is because dm_mq_queue_rq is non-blocking. Care to explain what is similar to nvme-mpath? I don't see how bio based dm cares about any of this... >>> But who would kick the requeue list? and that would make near-tag-exhaust performance stink... > > The multipath code would have to kick the list. When? Not following your thoughts... You are suggesting that we call submit_bio that will fail, put it on the requeue_list and then what? blindly kick the requeue list? try to see if there is an alternate path and then kick the list? for every bio that comes in? > We could also try to > split into two flags, one that affects blk_queue_enter and one that > affects the tag allocation. If this is something that can work reliably then its better off, plus we can probably kill the srcu as well. But I don't see how this would work unfortunately. >> moving nvme_start_freeze from nvme_rdma_teardown_io_queues to nvme_rdma_configure_io_queues can fix it. >> It can also avoid I/O hang long time if reconnection failed. 
> > Can you explain how we'd still ensure that no new commands get queued > during teardown using that scheme? quiescing the queue would prevent new submissions from coming down to the driver, but I don't see how this move can help here... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 30+ messages in thread