* fix nvme-tcp and nvme-rdma controller reset hangs when using multipath
From: Christoph Hellwig @ 2021-03-22  7:37 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch, Jens Axboe; +Cc: Chao Leng, linux-block, linux-nvme

Hi all,

this series attempts to fix the nvme queue freeze deadlock Sagi reported.
It needs a little help from the block layer so that we can avoid blocking
on the queue freeze percpu_ref without forcing the full NOWAIT flag on the
bio, which would also mean not waiting for tag allocations.
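
As background, the block layer distinction this series builds on, shown as
a two-line sketch (assuming the v5.12-era blk_queue_enter() interface):

--
	/* both calls take a reference on q->q_usage_counter */
	blk_queue_enter(q, 0);			/* sleeps while the queue is frozen */
	blk_queue_enter(q, BLK_MQ_REQ_NOWAIT);	/* fails with -EBUSY while frozen */
--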


* [PATCH 1/2] blk-mq: add a blk_mq_submit_bio_direct API
From: Christoph Hellwig @ 2021-03-22  7:37 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch, Jens Axboe; +Cc: Chao Leng, linux-block, linux-nvme

This adds (back) and API for simple stacking drivers to submit a bio to
blk-mq queue.  The prime aim is to avoid blocking on the queue freeze
percpu ref, as a multipath driver really does not want to get blocked
on that when an underlying device is undergoing error recovery.  It also
happens to optimize away the small overhead of the curren->bio_list based
recursion avoidance.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c       |  2 +-
 block/blk-mq.c         | 37 +++++++++++++++++++++++++++++++++++++
 block/blk.h            |  1 +
 include/linux/blk-mq.h |  1 +
 4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff20849738..4344f3c9058282 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -792,7 +792,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
-static noinline_for_stack bool submit_bio_checks(struct bio *bio)
+noinline_for_stack bool submit_bio_checks(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
 	struct request_queue *q = bdev->bd_disk->queue;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa43966..4ff85692843b49 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2286,6 +2286,43 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
+/**
+ * blk_mq_submit_bio_direct - hand a bio directly to the driver for I/O
+ * @bio:  The bio describing the location in memory and on the device.
+ *
+ * This function behaves similarly to submit_bio_noacct(), but never waits
+ * for the queue to be unfrozen; instead it returns false and lets the caller
+ * deal with the fallout.  It also does not protect against recursion and thus
+ * must only be used if the called driver is known to be blk-mq based.
+ */
+bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct request_queue *q = disk->queue;
+
+	if (WARN_ON_ONCE(!current->bio_list) ||
+	    WARN_ON_ONCE(disk->fops->submit_bio)) {
+		bio_io_error(bio);
+		goto fail;
+	}
+	if (!submit_bio_checks(bio))
+		goto fail;
+
+	if (unlikely(blk_queue_enter(q, BLK_MQ_REQ_NOWAIT)))
+		return false;
+	if (!blk_crypto_bio_prep(&bio))
+		goto fail_queue_exit;
+	*qc = blk_mq_submit_bio(bio);
+	return true;
+
+fail_queue_exit:
+	blk_queue_exit(disk->queue);
+fail:
+	*qc = BLK_QC_T_NONE;
+	return true;
+}
+EXPORT_SYMBOL_GPL(blk_mq_submit_bio_direct);
+
 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx)
 {
diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e4e..c4c66b2a9ffb19 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -221,6 +221,7 @@ ssize_t part_timeout_show(struct device *, struct device_attribute *, char *);
 ssize_t part_timeout_store(struct device *, struct device_attribute *,
 				const char *, size_t);
 
+bool submit_bio_checks(struct bio *bio);
 void __blk_queue_split(struct bio **bio, unsigned int *nr_segs);
 int ll_back_merge_fn(struct request *req, struct bio *bio,
 		unsigned int nr_segs);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b899089..6804f397106ada 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -615,6 +615,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
 }
 
 blk_qc_t blk_mq_submit_bio(struct bio *bio);
+bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc);
 void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
 		struct lock_class_key *key);
 
-- 
2.30.1
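
As a reading aid (not part of the patch), a minimal sketch of a
stacking-driver caller, spelling out the three possible outcomes of the new
API; the function and helper names here are made up:

--
static blk_qc_t example_submit(struct bio *bio)
{
	blk_qc_t qc;

	if (!blk_mq_submit_bio_direct(bio, &qc)) {
		/*
		 * blk_queue_enter() failed: the queue is frozen.  The bio
		 * was neither submitted nor completed, so the caller may
		 * requeue it or fail over to another path.
		 */
		example_requeue_or_failover(bio);	/* hypothetical */
		return BLK_QC_T_NONE;
	}
	/*
	 * true means the bio was handled: either submitted (qc is a valid
	 * cookie) or already completed with an error (qc == BLK_QC_T_NONE).
	 */
	return qc;
}
--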



* [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
From: Christoph Hellwig @ 2021-03-22  7:37 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch, Jens Axboe; +Cc: Chao Leng, linux-block, linux-nvme

When we reset/teardown a controller, we must freeze and quiesce the
namespaces' request queues to make sure that we safely stop inflight I/O
submissions. Freeze is mandatory because if our hctx map changed between
reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
the queue, and if it still has pending submissions (that are still
quiesced) it will hang.

However, by freezing the namespaces request queues, and only unfreezing
them when we successfully reconnect, inflight submissions that are
running concurrently can now block grabbing the nshead srcu until either
we successfully reconnect or ctrl_loss_tmo expired (or the user
explicitly disconnected).

This caused a deadlock when a different controller (different path on the
same subsystem) became live (i.e. optimized/non-optimized). This is
because nvme_mpath_set_live needs to synchronize the nshead srcu before
requeueing I/O in order to make sure that current_path is visible to
future (re-)submissions. However, the srcu lock is taken by a blocked
submission on a frozen request queue, and we have a deadlock.

In order to fix this, use the blk_mq_submit_bio_direct API to submit the
bio to the low-level driver, which does not block on the queue freeze
but instead allows nvme-multipath to pick another path or queue up the
bio.

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/multipath.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac020f..92adebfaf86fd1 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 	 */
 	blk_queue_split(&bio);
 
+retry:
 	srcu_idx = srcu_read_lock(&head->srcu);
 	ns = nvme_find_path(head);
 	if (likely(ns)) {
@@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 		bio->bi_opf |= REQ_NVME_MPATH;
 		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
-		ret = submit_bio_noacct(bio);
+
+		if (!blk_mq_submit_bio_direct(bio, &ret)) {
+			nvme_mpath_clear_current_path(ns);
+			srcu_read_unlock(&head->srcu, srcu_idx);
+			goto retry;
+		}
 	} else if (nvme_available_path(head)) {
 		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
 
-- 
2.30.1
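
To make the deadlock described in the commit message concrete, a schematic
of the two blocked contexts (paraphrased, not an actual trace):

--
  submission context                    ANA update work
  ------------------                    ---------------
  srcu_read_lock(&head->srcu)
  nvme_find_path()
  submit_bio_noacct()
    blk_queue_enter()                   nvme_mpath_set_live()
      sleeps on the frozen queue          synchronize_srcu(&head->srcu)
      until unfreeze at reconnect           waits for the blocked srcu
                                            reader above: deadlock
--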



* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
From: Hannes Reinecke @ 2021-03-22 11:22 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, Keith Busch, Jens Axboe
  Cc: Chao Leng, linux-block, linux-nvme

On 3/22/21 8:37 AM, Christoph Hellwig wrote:
> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions. Freeze is mandatory because if our hctx map changed between
> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
> the queue, and if it still has pending submissions (that are still
> quiesced) it will hang.
> 
> However, by freezing the namespaces request queues, and only unfreezing
> them when we successfully reconnect, inflight submissions that are
> running concurrently can now block grabbing the nshead srcu until either
> we successfully reconnect or ctrl_loss_tmo expired (or the user
> explicitly disconnected).
> 
> This caused a deadlock when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is
> because nvme_mpath_set_live needs to synchronize the nshead srcu before
> requeueing I/O in order to make sure that current_path is visible to
> future (re-)submisions. However the srcu lock is taken by a blocked
> submission on a frozen request queue, and we have a deadlock.
> 
> In order to fix this use the blk_mq_submit_bio_direct API to submit the
> bio to the low-level driver, which does not block on the queue free
> but instead allows nvme-multipath to pick another path or queue up the
> bio.
> 
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
> 
> Reported-by Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/multipath.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac020f..92adebfaf86fd1 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>  	 */
>  	blk_queue_split(&bio);
>  
> +retry:
>  	srcu_idx = srcu_read_lock(&head->srcu);
>  	ns = nvme_find_path(head);
>  	if (likely(ns)) {
> @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>  		bio->bi_opf |= REQ_NVME_MPATH;
>  		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>  				      bio->bi_iter.bi_sector);
> -		ret = submit_bio_noacct(bio);
> +
> +		if (!blk_mq_submit_bio_direct(bio, &ret)) {
> +			nvme_mpath_clear_current_path(ns);
> +			srcu_read_unlock(&head->srcu, srcu_idx);
> +			goto retry;
> +		}
>  	} else if (nvme_available_path(head)) {
>  		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
>  
> 
Ah. We've run into the same issue, and I've come up with basically the
same patch to have it fixed.
Tests are still outstanding, so I haven't been able to validate it properly.
Thanks for fixing it up.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [PATCH 1/2] blk-mq: add a blk_mq_submit_bio_direct API
From: Hannes Reinecke @ 2021-03-22 11:23 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg, Keith Busch, Jens Axboe
  Cc: Chao Leng, linux-block, linux-nvme

On 3/22/21 8:37 AM, Christoph Hellwig wrote:
> This adds (back) and API for simple stacking drivers to submit a bio to
> blk-mq queue.  The prime aim is to avoid blocking on the queue freeze
> percpu ref, as a multipath driver really does not want to get blocked
> on that when an underlying device is undergoing error recovery.  It also
> happens to optimize away the small overhead of the curren->bio_list based
> recursion avoidance.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-core.c       |  2 +-
>  block/blk-mq.c         | 37 +++++++++++++++++++++++++++++++++++++
>  block/blk.h            |  1 +
>  include/linux/blk-mq.h |  1 +
>  4 files changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fc60ff20849738..4344f3c9058282 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -792,7 +792,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>  	return BLK_STS_OK;
>  }
>  
> -static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> +noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
>  	struct request_queue *q = bdev->bd_disk->queue;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa43966..4ff85692843b49 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2286,6 +2286,43 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio)
>  	return BLK_QC_T_NONE;
>  }
>  
> +/**
> + * blk_mq_submit_bio_direct - hand a bio directly to the driver for I/O
> + * @bio:  The bio describing the location in memory and on the device.
> + *
> + * This function behaves similar to submit_bio_noacct(), but does never waits
> + * for the queue to be unfreozen, instead it return false and lets the caller
> + * deal with the fallout.  It also does not protect against recursion and thus
> + * must only be used if the called driver is known to be blk-mq based.
> + */
> +bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc)
> +{
> +	struct gendisk *disk = bio->bi_bdev->bd_disk;
> +	struct request_queue *q = disk->queue;
> +
> +	if (WARN_ON_ONCE(!current->bio_list) ||
> +	    WARN_ON_ONCE(disk->fops->submit_bio)) {
> +		bio_io_error(bio);
> +		goto fail;
> +	}
> +	if (!submit_bio_checks(bio))
> +		goto fail;
> +
> +	if (unlikely(blk_queue_enter(q, BLK_MQ_REQ_NOWAIT)))
> +		return false;
> +	if (!blk_crypto_bio_prep(&bio))
> +		goto fail_queue_exit;
> +	*qc = blk_mq_submit_bio(bio);
> +	return true;
> +
> +fail_queue_exit:
> +	blk_queue_exit(disk->queue);
> +fail:
> +	*qc = BLK_QC_T_NONE;
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_submit_bio_direct);
> +
>  void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  		     unsigned int hctx_idx)
>  {
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e4e..c4c66b2a9ffb19 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -221,6 +221,7 @@ ssize_t part_timeout_show(struct device *, struct device_attribute *, char *);
>  ssize_t part_timeout_store(struct device *, struct device_attribute *,
>  				const char *, size_t);
>  
> +bool submit_bio_checks(struct bio *bio);
>  void __blk_queue_split(struct bio **bio, unsigned int *nr_segs);
>  int ll_back_merge_fn(struct request *req, struct bio *bio,
>  		unsigned int nr_segs);
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2c473c9b899089..6804f397106ada 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -615,6 +615,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
>  }
>  
>  blk_qc_t blk_mq_submit_bio(struct bio *bio);
> +bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc);
>  void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
>  		struct lock_class_key *key);
>  
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [PATCH 1/2] blk-mq: add a blk_mq_submit_bio_direct API
From: Keith Busch @ 2021-03-22 15:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Jens Axboe, Chao Leng, linux-block, linux-nvme

Looks good. Just a couple minor typos in the commit message.

Reviewed-by: Keith Busch <kbusch@kernel.org>

On Mon, Mar 22, 2021 at 08:37:25AM +0100, Christoph Hellwig wrote:
> This adds (back) and API for simple stacking drivers to submit a bio to

                   an API

> blk-mq queue.  The prime aim is to avoid blocking on the queue freeze
> percpu ref, as a multipath driver really does not want to get blocked
> on that when an underlying device is undergoing error recovery.  It also
> happens to optimize away the small overhead of the curren->bio_list based

                                                     current->bio_list


* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
From: Keith Busch @ 2021-03-22 15:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Jens Axboe, Chao Leng, linux-block, linux-nvme

On Mon, Mar 22, 2021 at 08:37:26AM +0100, Christoph Hellwig wrote:
> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions. Freeze is mandatory because if our hctx map changed between
> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
> the queue, and if it still has pending submissions (that are still
> quiesced) it will hang.
> 
> However, by freezing the namespaces request queues, and only unfreezing
> them when we successfully reconnect, inflight submissions that are
> running concurrently can now block grabbing the nshead srcu until either
> we successfully reconnect or ctrl_loss_tmo expired (or the user
> explicitly disconnected).
> 
> This caused a deadlock when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is
> because nvme_mpath_set_live needs to synchronize the nshead srcu before
> requeueing I/O in order to make sure that current_path is visible to
> future (re-)submisions. However the srcu lock is taken by a blocked
> submission on a frozen request queue, and we have a deadlock.
> 
> In order to fix this use the blk_mq_submit_bio_direct API to submit the
> bio to the low-level driver, which does not block on the queue free
> but instead allows nvme-multipath to pick another path or queue up the
> bio.

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>


* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
From: Sagi Grimberg @ 2021-03-23  2:57 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: Chao Leng, linux-block, linux-nvme


> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions. Freeze is mandatory because if our hctx map changed between
> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
> the queue, and if it still has pending submissions (that are still
> quiesced) it will hang.
> 
> However, by freezing the namespaces request queues, and only unfreezing
> them when we successfully reconnect, inflight submissions that are
> running concurrently can now block grabbing the nshead srcu until either
> we successfully reconnect or ctrl_loss_tmo expired (or the user
> explicitly disconnected).
> 
> This caused a deadlock when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is
> because nvme_mpath_set_live needs to synchronize the nshead srcu before
> requeueing I/O in order to make sure that current_path is visible to
> future (re-)submisions. However the srcu lock is taken by a blocked
> submission on a frozen request queue, and we have a deadlock.
> 
> In order to fix this use the blk_mq_submit_bio_direct API to submit the
> bio to the low-level driver, which does not block on the queue free
> but instead allows nvme-multipath to pick another path or queue up the
> bio.

Almost...

This still has the same issue, but instead of blocking on
blk_queue_enter() it now blocks on blk_mq_get_tag():
--
  __schedule+0x22b/0x6e0
  schedule+0x46/0xb0
  io_schedule+0x42/0x70
  blk_mq_get_tag+0x11d/0x270
  ? blk_bio_segment_split+0x235/0x2a0
  ? finish_wait+0x80/0x80
  __blk_mq_alloc_request+0x65/0xe0
  blk_mq_submit_bio+0x144/0x500
  blk_mq_submit_bio_direct+0x78/0xa0
  nvme_ns_head_submit_bio+0xc3/0x2f0 [nvme_core]
  __submit_bio_noacct+0xcf/0x2e0
  __blkdev_direct_IO+0x413/0x440
  ? __io_complete_rw.constprop.0+0x150/0x150
  generic_file_read_iter+0x92/0x160
  io_iter_do_read+0x1a/0x40
  io_read+0xc5/0x350
  ? common_interrupt+0x14/0xa0
  ? update_load_avg+0x7a/0x5e0
  io_issue_sqe+0xa28/0x1020
  ? lock_timer_base+0x61/0x80
  io_wq_submit_work+0xaa/0x120
  io_worker_handle_work+0x121/0x330
  io_wqe_worker+0xb6/0x190
  ? io_worker_handle_work+0x330/0x330
  ret_from_fork+0x22/0x30
--

--
  ? usleep_range+0x80/0x80
  __schedule+0x22b/0x6e0
  ? usleep_range+0x80/0x80
  schedule+0x46/0xb0
  schedule_timeout+0xff/0x140
  ? del_timer_sync+0x67/0xb0
  ? __prepare_to_swait+0x4b/0x70
  __wait_for_common+0xb3/0x160
  __synchronize_srcu.part.0+0x75/0xe0
  ? __bpf_trace_rcu_utilization+0x10/0x10
  nvme_mpath_set_live+0x61/0x130 [nvme_core]
  nvme_update_ana_state+0xd7/0x100 [nvme_core]
  nvme_parse_ana_log+0xa5/0x160 [nvme_core]
  ? nvme_mpath_set_live+0x130/0x130 [nvme_core]
  nvme_read_ana_log+0x7b/0xe0 [nvme_core]
  process_one_work+0x1e6/0x380
  worker_thread+0x49/0x300
--



If I were to always start the queues in nvme_tcp_teardown_ctrl
right after I cancel the tagset inflights like:
--
@@ -1934,8 +1934,7 @@ static void nvme_tcp_teardown_io_queues(struct 
nvme_ctrl *ctrl,
         nvme_sync_io_queues(ctrl);
         nvme_tcp_stop_io_queues(ctrl);
         nvme_cancel_tagset(ctrl);
-       if (remove)
-               nvme_start_queues(ctrl);
+       nvme_start_queues(ctrl);
         nvme_tcp_destroy_io_queues(ctrl, remove);
--

then a simple reset during traffic bricks the host in an infinite loop,
because the setup sequence freezes the queue in nvme_update_ns_info:
the queue is frozen, but we still have an available path (because the
controller is live again!), so nvme-mpath keeps calling
blk_mq_submit_bio_direct and failing, and nvme_update_ns_info can never
complete the queue freeze
-> deadlock.

So this is obviously incorrect.

Also, if we make nvme-mpath submit bios with REQ_NOWAIT we will
basically fail as soon as we run out of tags, even in the normal path...

So I'm not exactly sure what we should do to fix this...
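
To illustrate the last point, a sketch of that rejected alternative (tagging
the bio with REQ_NOWAIT in nvme_ns_head_submit_bio() before handing it down;
hypothetical, not something anyone posted):

--
	/* hypothetical, NOT proposed */
	bio->bi_opf |= REQ_NOWAIT;
	ret = submit_bio_noacct(bio);
	/*
	 * REQ_NOWAIT cannot distinguish "queue frozen" from "no tags
	 * free", so a saturated tag set would bounce I/O to another path
	 * (or fail it) even on a perfectly healthy controller.
	 */
--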


* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
From: Sagi Grimberg @ 2021-03-23  3:23 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: Chao Leng, linux-block, linux-nvme



On 3/22/21 7:57 PM, Sagi Grimberg wrote:
> 
>> When we reset/teardown a controller, we must freeze and quiesce the
>> namespaces request queues to make sure that we safely stop inflight I/O
>> submissions. Freeze is mandatory because if our hctx map changed between
>> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
>> the queue, and if it still has pending submissions (that are still
>> quiesced) it will hang.
>>
>> However, by freezing the namespaces request queues, and only unfreezing
>> them when we successfully reconnect, inflight submissions that are
>> running concurrently can now block grabbing the nshead srcu until either
>> we successfully reconnect or ctrl_loss_tmo expired (or the user
>> explicitly disconnected).
>>
>> This caused a deadlock when a different controller (different path on the
>> same subsystem) became live (i.e. optimized/non-optimized). This is
>> because nvme_mpath_set_live needs to synchronize the nshead srcu before
>> requeueing I/O in order to make sure that current_path is visible to
>> future (re-)submisions. However the srcu lock is taken by a blocked
>> submission on a frozen request queue, and we have a deadlock.
>>
>> In order to fix this use the blk_mq_submit_bio_direct API to submit the
>> bio to the low-level driver, which does not block on the queue free
>> but instead allows nvme-multipath to pick another path or queue up the
>> bio.
> 
> Almost...
> 
> This still has the same issue but instead of blocking on
> blk_queue_enter() it is blocked on blk_mq_get_tag():
> -- 
>   __schedule+0x22b/0x6e0
>   schedule+0x46/0xb0
>   io_schedule+0x42/0x70
>   blk_mq_get_tag+0x11d/0x270
>   ? blk_bio_segment_split+0x235/0x2a0
>   ? finish_wait+0x80/0x80
>   __blk_mq_alloc_request+0x65/0xe0
>   blk_mq_submit_bio+0x144/0x500
>   blk_mq_submit_bio_direct+0x78/0xa0
>   nvme_ns_head_submit_bio+0xc3/0x2f0 [nvme_core]
>   __submit_bio_noacct+0xcf/0x2e0
>   __blkdev_direct_IO+0x413/0x440
>   ? __io_complete_rw.constprop.0+0x150/0x150
>   generic_file_read_iter+0x92/0x160
>   io_iter_do_read+0x1a/0x40
>   io_read+0xc5/0x350
>   ? common_interrupt+0x14/0xa0
>   ? update_load_avg+0x7a/0x5e0
>   io_issue_sqe+0xa28/0x1020
>   ? lock_timer_base+0x61/0x80
>   io_wq_submit_work+0xaa/0x120
>   io_worker_handle_work+0x121/0x330
>   io_wqe_worker+0xb6/0x190
>   ? io_worker_handle_work+0x330/0x330
>   ret_from_fork+0x22/0x30
> -- 
> 
> -- 
>   ? usleep_range+0x80/0x80
>   __schedule+0x22b/0x6e0
>   ? usleep_range+0x80/0x80
>   schedule+0x46/0xb0
>   schedule_timeout+0xff/0x140
>   ? del_timer_sync+0x67/0xb0
>   ? __prepare_to_swait+0x4b/0x70
>   __wait_for_common+0xb3/0x160
>   __synchronize_srcu.part.0+0x75/0xe0
>   ? __bpf_trace_rcu_utilization+0x10/0x10
>   nvme_mpath_set_live+0x61/0x130 [nvme_core]
>   nvme_update_ana_state+0xd7/0x100 [nvme_core]
>   nvme_parse_ana_log+0xa5/0x160 [nvme_core]
>   ? nvme_mpath_set_live+0x130/0x130 [nvme_core]
>   nvme_read_ana_log+0x7b/0xe0 [nvme_core]
>   process_one_work+0x1e6/0x380
>   worker_thread+0x49/0x300
> -- 
> 
> 
> 
> If I were to always start the queues in nvme_tcp_teardown_ctrl
> right after I cancel the tagset inflights like:
> -- 
> @@ -1934,8 +1934,7 @@ static void nvme_tcp_teardown_io_queues(struct 
> nvme_ctrl *ctrl,
>          nvme_sync_io_queues(ctrl);
>          nvme_tcp_stop_io_queues(ctrl);
>          nvme_cancel_tagset(ctrl);
> -       if (remove)
> -               nvme_start_queues(ctrl);
> +       nvme_start_queues(ctrl);
>          nvme_tcp_destroy_io_queues(ctrl, remove);
> -- 
> 
> then a simple reset during traffic bricks the host on infinite loop
> because in the setup sequence we freeze the queue in
> nvme_update_ns_info, so the queue is frozen but we still have an
> available path (because the controller is back to live!) so nvme-mpath
> keeps calling blk_mq_submit_bio_direct and fails, and
> nvme_update_ns_info cannot properly freeze the queue..
> -> deadlock.
> 
> So this is obviously incorrect.
> 
> Also, if we make nvme-mpath submit a REQ_NOWAIT we basically
> will fail as soon as we run out of tags, even in the normal path...
> 
> So I'm not exactly sure what we should do to fix this...

It's still not too late to go with my original approach... ;)


* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
From: Chao Leng @ 2021-03-23  7:04 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: linux-block, linux-nvme



On 2021/3/23 11:23, Sagi Grimberg wrote:
> 
> 
> On 3/22/21 7:57 PM, Sagi Grimberg wrote:
>>
>>> When we reset/teardown a controller, we must freeze and quiesce the
>>> namespaces request queues to make sure that we safely stop inflight I/O
>>> submissions. Freeze is mandatory because if our hctx map changed between
>>> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
>>> the queue, and if it still has pending submissions (that are still
>>> quiesced) it will hang.
>>>
>>> However, by freezing the namespaces request queues, and only unfreezing
>>> them when we successfully reconnect, inflight submissions that are
>>> running concurrently can now block grabbing the nshead srcu until either
>>> we successfully reconnect or ctrl_loss_tmo expired (or the user
>>> explicitly disconnected).
>>>
>>> This caused a deadlock when a different controller (different path on the
>>> same subsystem) became live (i.e. optimized/non-optimized). This is
>>> because nvme_mpath_set_live needs to synchronize the nshead srcu before
>>> requeueing I/O in order to make sure that current_path is visible to
>>> future (re-)submissions. However the srcu lock is taken by a blocked
>>> submission on a frozen request queue, and we have a deadlock.
>>>
>>> In order to fix this use the blk_mq_submit_bio_direct API to submit the
>>> bio to the low-level driver, which does not block on the queue freeze
>>> but instead allows nvme-multipath to pick another path or queue up the
>>> bio.
>>
>> Almost...
>>
>> This still has the same issue but instead of blocking on
>> blk_queue_enter() it is blocked on blk_mq_get_tag():
>> -- 
>>   __schedule+0x22b/0x6e0
>>   schedule+0x46/0xb0
>>   io_schedule+0x42/0x70
>>   blk_mq_get_tag+0x11d/0x270
>>   ? blk_bio_segment_split+0x235/0x2a0
>>   ? finish_wait+0x80/0x80
>>   __blk_mq_alloc_request+0x65/0xe0
>>   blk_mq_submit_bio+0x144/0x500
>>   blk_mq_submit_bio_direct+0x78/0xa0
>>   nvme_ns_head_submit_bio+0xc3/0x2f0 [nvme_core]
>>   __submit_bio_noacct+0xcf/0x2e0
>>   __blkdev_direct_IO+0x413/0x440
>>   ? __io_complete_rw.constprop.0+0x150/0x150
>>   generic_file_read_iter+0x92/0x160
>>   io_iter_do_read+0x1a/0x40
>>   io_read+0xc5/0x350
>>   ? common_interrupt+0x14/0xa0
>>   ? update_load_avg+0x7a/0x5e0
>>   io_issue_sqe+0xa28/0x1020
>>   ? lock_timer_base+0x61/0x80
>>   io_wq_submit_work+0xaa/0x120
>>   io_worker_handle_work+0x121/0x330
>>   io_wqe_worker+0xb6/0x190
>>   ? io_worker_handle_work+0x330/0x330
>>   ret_from_fork+0x22/0x30
>> -- 
>>
>> -- 
>>   ? usleep_range+0x80/0x80
>>   __schedule+0x22b/0x6e0
>>   ? usleep_range+0x80/0x80
>>   schedule+0x46/0xb0
>>   schedule_timeout+0xff/0x140
>>   ? del_timer_sync+0x67/0xb0
>>   ? __prepare_to_swait+0x4b/0x70
>>   __wait_for_common+0xb3/0x160
>>   __synchronize_srcu.part.0+0x75/0xe0
>>   ? __bpf_trace_rcu_utilization+0x10/0x10
>>   nvme_mpath_set_live+0x61/0x130 [nvme_core]
>>   nvme_update_ana_state+0xd7/0x100 [nvme_core]
>>   nvme_parse_ana_log+0xa5/0x160 [nvme_core]
>>   ? nvme_mpath_set_live+0x130/0x130 [nvme_core]
>>   nvme_read_ana_log+0x7b/0xe0 [nvme_core]
>>   process_one_work+0x1e6/0x380
>>   worker_thread+0x49/0x300
>> -- 
>>
>>
>>
>> If I were to always start the queues in nvme_tcp_teardown_io_queues
>> right after I cancel the tagset inflights like:
>> -- 
>> @@ -1934,8 +1934,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>          nvme_sync_io_queues(ctrl);
>>          nvme_tcp_stop_io_queues(ctrl);
>>          nvme_cancel_tagset(ctrl);
>> -       if (remove)
>> -               nvme_start_queues(ctrl);
>> +       nvme_start_queues(ctrl);
>>          nvme_tcp_destroy_io_queues(ctrl, remove);
>> -- 
>>
>> then a simple reset during traffic bricks the host in an infinite loop
>> because in the setup sequence we freeze the queue in
>> nvme_update_ns_info, so the queue is frozen but we still have an
>> available path (because the controller is back to live!) so nvme-mpath
>> keeps calling blk_mq_submit_bio_direct and fails, and
>> nvme_update_ns_info cannot properly freeze the queue..
>> -> deadlock.
>>
>> So this is obviously incorrect.
>>
>> Also, if we make nvme-mpath submit a REQ_NOWAIT we basically
>> will fail as soon as we run out of tags, even in the normal path...
>>
>> So I'm not exactly sure what we should do to fix this...
> 
> It's still not too late to go with my original approach... ;)
I checked it again. I still think the patch below can avoid the bug.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
The process (see the sketch below):
1. nvme_ns_head_submit_bio calls srcu_read_lock(&head->srcu).
2. nvme_ns_head_submit_bio adds the bio to current->bio_list instead of waiting for the frozen queue.
3. nvme_ns_head_submit_bio calls srcu_read_unlock(&head->srcu, srcu_idx).
So nvme_ns_head_submit_bio does not hold head->srcu for long when the queue is frozen, which avoids the deadlock.
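
For reference, after that commit the relevant logic looks roughly like
this (heavily simplified, field names as of 5.9, not the verbatim
kernel source):
-- 
blk_qc_t submit_bio_noacct(struct bio *bio)
{
	/*
	 * When nvme_ns_head_submit_bio resubmits the remapped bio, it
	 * arrives here with current->bio_list already set up by
	 * __submit_bio_noacct, so the bio is only queued; head->srcu is
	 * released before the bio is drained into blk_mq_submit_bio.
	 */
	if (current->bio_list) {
		bio_list_add(&current->bio_list[0], bio);
		return BLK_QC_T_NONE;
	}

	if (!bio->bi_disk->fops->submit_bio)
		return __submit_bio_noacct_mq(bio);
	return __submit_bio_noacct(bio);
}
-- 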

Sagi, I suggest trying this patch.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23  2:57     ` Sagi Grimberg
@ 2021-03-23  7:28       ` Hannes Reinecke
  -1 siblings, 0 replies; 48+ messages in thread
From: Hannes Reinecke @ 2021-03-23  7:28 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: Chao Leng, linux-block, linux-nvme

On 3/23/21 3:57 AM, Sagi Grimberg wrote:
> 
>> When we reset/teardown a controller, we must freeze and quiesce the
>> namespaces request queues to make sure that we safely stop inflight I/O
>> submissions. Freeze is mandatory because if our hctx map changed between
>> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
>> the queue, and if it still has pending submissions (that are still
>> quiesced) it will hang.
>>
>> However, by freezing the namespaces request queues, and only unfreezing
>> them when we successfully reconnect, inflight submissions that are
>> running concurrently can now block grabbing the nshead srcu until either
>> we successfully reconnect or ctrl_loss_tmo expired (or the user
>> explicitly disconnected).
>>
>> This caused a deadlock when a different controller (different path on the
>> same subsystem) became live (i.e. optimized/non-optimized). This is
>> because nvme_mpath_set_live needs to synchronize the nshead srcu before
>> requeueing I/O in order to make sure that current_path is visible to
>> future (re-)submissions. However the srcu lock is taken by a blocked
>> submission on a frozen request queue, and we have a deadlock.
>>
>> In order to fix this use the blk_mq_submit_bio_direct API to submit the
>> bio to the low-level driver, which does not block on the queue freeze
>> but instead allows nvme-multipath to pick another path or queue up the
>> bio.
> 
> Almost...
> 
> This still has the same issue but instead of blocking on
> blk_queue_enter() it is blocked on blk_mq_get_tag():
> -- 
>   __schedule+0x22b/0x6e0
>   schedule+0x46/0xb0
>   io_schedule+0x42/0x70
>   blk_mq_get_tag+0x11d/0x270
>   ? blk_bio_segment_split+0x235/0x2a0
>   ? finish_wait+0x80/0x80
>   __blk_mq_alloc_request+0x65/0xe0
>   blk_mq_submit_bio+0x144/0x500
>   blk_mq_submit_bio_direct+0x78/0xa0
>   nvme_ns_head_submit_bio+0xc3/0x2f0 [nvme_core]
>   __submit_bio_noacct+0xcf/0x2e0
>   __blkdev_direct_IO+0x413/0x440
>   ? __io_complete_rw.constprop.0+0x150/0x150
>   generic_file_read_iter+0x92/0x160
>   io_iter_do_read+0x1a/0x40
>   io_read+0xc5/0x350
>   ? common_interrupt+0x14/0xa0
>   ? update_load_avg+0x7a/0x5e0
>   io_issue_sqe+0xa28/0x1020
>   ? lock_timer_base+0x61/0x80
>   io_wq_submit_work+0xaa/0x120
>   io_worker_handle_work+0x121/0x330
>   io_wqe_worker+0xb6/0x190
>   ? io_worker_handle_work+0x330/0x330
>   ret_from_fork+0x22/0x30
> -- 
> 
> -- 
>   ? usleep_range+0x80/0x80
>   __schedule+0x22b/0x6e0
>   ? usleep_range+0x80/0x80
>   schedule+0x46/0xb0
>   schedule_timeout+0xff/0x140
>   ? del_timer_sync+0x67/0xb0
>   ? __prepare_to_swait+0x4b/0x70
>   __wait_for_common+0xb3/0x160
>   __synchronize_srcu.part.0+0x75/0xe0
>   ? __bpf_trace_rcu_utilization+0x10/0x10
>   nvme_mpath_set_live+0x61/0x130 [nvme_core]
>   nvme_update_ana_state+0xd7/0x100 [nvme_core]
>   nvme_parse_ana_log+0xa5/0x160 [nvme_core]
>   ? nvme_mpath_set_live+0x130/0x130 [nvme_core]
>   nvme_read_ana_log+0x7b/0xe0 [nvme_core]
>   process_one_work+0x1e6/0x380
>   worker_thread+0x49/0x300
> -- 
> 
Actually, I had been playing around with marking the entire bio as 
'NOWAIT'; that would avoid the tag stall, too:

@@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
         ns = nvme_find_path(head);
         if (likely(ns)) {
                 bio_set_dev(bio, ns->disk->part0);
-               bio->bi_opf |= REQ_NVME_MPATH;
+               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
                 trace_block_bio_remap(bio, disk_devt(ns->head->disk),
                                       bio->bi_iter.bi_sector);
                 ret = submit_bio_noacct(bio);


My only worry here is that we might incur spurious failures under high 
load; but then this is not necessarily a bad thing.
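
For reference, on a failed tag allocation blk_mq_submit_bio handles
REQ_NOWAIT roughly like this (simplified):
-- 
	rq = __blk_mq_alloc_request(&data);
	if (unlikely(!rq)) {
		rq_qos_cleanup(q, bio);
		/* end the bio with BLK_STS_AGAIN instead of sleeping */
		if (bio->bi_opf & REQ_NOWAIT)
			bio_wouldblock_error(bio);
		goto queue_exit;
	}
-- 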

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23  7:28       ` Hannes Reinecke
@ 2021-03-23  7:31         ` Sagi Grimberg
  -1 siblings, 0 replies; 48+ messages in thread
From: Sagi Grimberg @ 2021-03-23  7:31 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: Chao Leng, linux-block, linux-nvme


> Actually, I had been playing around with marking the entire bio as 
> 'NOWAIT'; that would avoid the tag stall, too:
> 
> @@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>          ns = nvme_find_path(head);
>          if (likely(ns)) {
>                  bio_set_dev(bio, ns->disk->part0);
> -               bio->bi_opf |= REQ_NVME_MPATH;
> +               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
>                  trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>                                        bio->bi_iter.bi_sector);
>                  ret = submit_bio_noacct(bio);
> 
> 
> My only worry here is that we might incur spurious failures under high 
> load; but then this is not necessarily a bad thing.

What? Making spurious failures is not OK under any load. What fs will
take into account that you may have run out of tags?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23  7:04         ` Chao Leng
@ 2021-03-23  7:36           ` Sagi Grimberg
  -1 siblings, 0 replies; 48+ messages in thread
From: Sagi Grimberg @ 2021-03-23  7:36 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: linux-block, linux-nvme


> I checked it again. I still think the patch below can avoid the bug.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41 

I don't understand what you are saying...

> 
> The process:
> 1. nvme_ns_head_submit_bio calls srcu_read_lock(&head->srcu).
> 2. nvme_ns_head_submit_bio adds the bio to current->bio_list instead
> of waiting for the frozen queue.

Nothing guarantees that you have a bio_list active at any point in time,
in fact for a workload that submits one by one you will always drain
that list directly in the submission...

> 3. nvme_ns_head_submit_bio calls srcu_read_unlock(&head->srcu, srcu_idx).
> So nvme_ns_head_submit_bio does not hold head->srcu for long when the queue is
> frozen, which can avoid the deadlock.
> 
> Sagi, I suggest trying this patch.

The above reproduces with the patch applied on upstream nvme code.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23  7:36           ` Sagi Grimberg
@ 2021-03-23  8:13             ` Chao Leng
  -1 siblings, 0 replies; 48+ messages in thread
From: Chao Leng @ 2021-03-23  8:13 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: linux-block, linux-nvme



On 2021/3/23 15:36, Sagi Grimberg wrote:
> 
>> I checked it again. I still think the patch below can avoid the bug.
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41 
> 
> I don't understand what you are saying...
> 
>>
>> The process:
>> 1. nvme_ns_head_submit_bio calls srcu_read_lock(&head->srcu).
>> 2. nvme_ns_head_submit_bio adds the bio to current->bio_list instead of waiting for the frozen queue.
> 
> Nothing guarantees that you have a bio_list active at any point in time,
> in fact for a workload that submits one by one you will always drain
> that list directly in the submission...
submit_bio and nvme_requeue_work both guarantee an active current->bio_list.
The process (see the sketch below):
1. submit_bio and nvme_requeue_work call submit_bio_noacct.
2. submit_bio_noacct calls __submit_bio_noacct because bio->bi_disk->fops->submit_bio = nvme_ns_head_submit_bio.
3. __submit_bio_noacct sets current->bio_list, and then __submit_bio calls bio->bi_disk->fops->submit_bio (nvme_ns_head_submit_bio).
4. nvme_ns_head_submit_bio adds the bio to current->bio_list.
5. __submit_bio_noacct drains current->bio_list.
When draining current->bio_list, it may wait for the frozen queue, but it does not hold head->srcu,
because it calls blk_mq_submit_bio directly instead of going through ->submit_bio (nvme_ns_head_submit_bio).
So it is safe.
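
A heavily simplified sketch of that drain (the real __submit_bio_noacct
also sorts bios into per-level lists, omitted here):
-- 
static blk_qc_t __submit_bio_noacct(struct bio *bio)
{
	struct bio_list bio_list_on_stack[2];
	blk_qc_t ret = BLK_QC_T_NONE;

	current->bio_list = bio_list_on_stack;
	do {
		/*
		 * For the ns_head disk this calls nvme_ns_head_submit_bio,
		 * which remaps the bio to the bottom namespace and queues
		 * it on current->bio_list.  The requeued bio targets a
		 * blk-mq disk, so the next iteration enters
		 * blk_mq_submit_bio directly -- any wait on a frozen queue
		 * happens here, after head->srcu has been released.
		 */
		ret = __submit_bio(bio);
	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
	current->bio_list = NULL;

	return ret;
}
-- 
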
> 
>> 3. nvme_ns_head_submit_bio calls srcu_read_unlock(&head->srcu, srcu_idx).
>> So nvme_ns_head_submit_bio does not hold head->srcu for long when the queue is frozen, which can avoid the deadlock.
>>
>> Sagi, I suggest trying this patch.
> 
> The above reproduces with the patch applied on upstream nvme code.
The new patch (blk_mq_submit_bio_direct) will cause the bug again,
because it reverts adding the bio to current->bio_list.
Just try the upstream nvme code, and do not apply the new patch (blk_mq_submit_bio_direct).

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23  7:31         ` Sagi Grimberg
@ 2021-03-23  8:36           ` Hannes Reinecke
  -1 siblings, 0 replies; 48+ messages in thread
From: Hannes Reinecke @ 2021-03-23  8:36 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: Chao Leng, linux-block, linux-nvme

On 3/23/21 8:31 AM, Sagi Grimberg wrote:
> 
>> Actually, I had been playing around with marking the entire bio as 
>> 'NOWAIT'; that would avoid the tag stall, too:
>>
>> @@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>>          ns = nvme_find_path(head);
>>          if (likely(ns)) {
>>                  bio_set_dev(bio, ns->disk->part0);
>> -               bio->bi_opf |= REQ_NVME_MPATH;
>> +               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
>>                  trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>>                                        bio->bi_iter.bi_sector);
>>                  ret = submit_bio_noacct(bio);
>>
>>
>> My only worry here is that we might incur spurious failures under high 
>> load; but then this is not necessarily a bad thing.
> 
> What? Making spurious failures is not OK under any load. What fs will
> take into account that you may have run out of tags?

Well, it's not actually a spurious failure but rather a spurious
failover, as we're still in a multipath scenario, and bios will still be
re-routed to other paths, or queued if all paths are out of tags.
Hence the OS would not see any difference in behaviour.
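
For context, the routing in nvme_ns_head_submit_bio is roughly this
(simplified from drivers/nvme/host/multipath.c):
-- 
	srcu_idx = srcu_read_lock(&head->srcu);
	ns = nvme_find_path(head);
	if (likely(ns)) {
		/* usable path found: remap and submit to the bottom device */
		bio_set_dev(bio, ns->disk->part0);
		bio->bi_opf |= REQ_NVME_MPATH;
		ret = submit_bio_noacct(bio);
	} else if (nvme_available_path(head)) {
		/* no usable path right now, but one may come back: requeue */
		spin_lock_irq(&head->requeue_lock);
		bio_list_add(&head->requeue_list, bio);
		spin_unlock_irq(&head->requeue_lock);
	} else {
		/* no path will ever come back: fail the bio */
		bio_io_error(bio);
	}
	srcu_read_unlock(&head->srcu, srcu_idx);
-- 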

But in the end, we abandoned this attempt, as the crash we've been 
seeing was in bio_endio (due to bi_bdev still pointing to the removed 
path device):

[ 6552.155251]  bio_endio+0x74/0x120
[ 6552.155260]  nvme_ns_head_submit_bio+0x36f/0x3e0 [nvme_core]
[ 6552.155271]  submit_bio_noacct+0x175/0x490
[ 6552.155284]  ? nvme_requeue_work+0x5a/0x70 [nvme_core]
[ 6552.155290]  nvme_requeue_work+0x5a/0x70 [nvme_core]
[ 6552.155296]  process_one_work+0x1f4/0x3e0
[ 6552.155299]  worker_thread+0x2d/0x3e0
[ 6552.155302]  ? process_one_work+0x3e0/0x3e0
[ 6552.155305]  kthread+0x10d/0x130
[ 6552.155307]  ? kthread_park+0xa0/0xa0
[ 6552.155311]  ret_from_fork+0x35/0x40

So we're not blocked on blk_queue_enter(), and it's a crash, not a 
deadlock. Blocking on blk_queue_enter() certainly plays a part here,
but it seems not to be the full picture.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23  8:36           ` Hannes Reinecke
@ 2021-03-23 14:53             ` Keith Busch
  -1 siblings, 0 replies; 48+ messages in thread
From: Keith Busch @ 2021-03-23 14:53 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Christoph Hellwig, Jens Axboe, Chao Leng,
	linux-block, linux-nvme

On Tue, Mar 23, 2021 at 09:36:47AM +0100, Hannes Reinecke wrote:
> On 3/23/21 8:31 AM, Sagi Grimberg wrote:
> > 
> > > Actually, I had been playing around with marking the entire bio as
> > > 'NOWAIT'; that would avoid the tag stall, too:
> > > 
> > > @@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
> > >          ns = nvme_find_path(head);
> > >          if (likely(ns)) {
> > >                  bio_set_dev(bio, ns->disk->part0);
> > > -               bio->bi_opf |= REQ_NVME_MPATH;
> > > +               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
> > >                  trace_block_bio_remap(bio, disk_devt(ns->head->disk),
> > >                                        bio->bi_iter.bi_sector);
> > >                  ret = submit_bio_noacct(bio);
> > > 
> > > 
> > > My only worry here is that we might incur spurious failures under
> > > high load; but then this is not necessarily a bad thing.
> > 
> > What? Making spurious failures is not OK under any load. What fs will
> > take into account that you may have run out of tags?
> 
> Well, it's not actually a spurious failure but rather a spurious failover,
> as we're still in a multipath scenario, and bios will still be re-routed to
> other paths, or queued if all paths are out of tags.
> Hence the OS would not see any difference in behaviour.

Failover might be overkill. We can run out of tags in a perfectly normal
situation, and simply waiting may be the best option, or even scheduling
on a different CPU may be sufficient to get a viable tag rather than
selecting a different path.

Does it make sense to just abort all allocated tags during a reset and
let the original bio requeue for multipath IO?
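
nvme_cancel_tagset() already gets us most of the way there; the
per-request cancel is roughly this (simplified from
drivers/nvme/host/core.c):
-- 
static bool nvme_cancel_request(struct request *req, void *data, bool reserved)
{
	if (blk_mq_request_completed(req))
		return true;
	nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
	nvme_req(req)->flags |= NVME_REQ_CANCELLED;
	/*
	 * Completing a REQ_NVME_MPATH request with a path error lets
	 * nvme_failover_req() requeue the bio onto the head disk.
	 */
	blk_mq_complete_request(req);
	return true;
}
-- 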

 
> But in the end, we abandoned this attempt, as the crash we've been seeing
> was in bio_endio (due to bi_bdev still pointing to the removed path device):
> 
> [ 6552.155251]  bio_endio+0x74/0x120
> [ 6552.155260]  nvme_ns_head_submit_bio+0x36f/0x3e0 [nvme_core]
> [ 6552.155271]  submit_bio_noacct+0x175/0x490
> [ 6552.155284]  ? nvme_requeue_work+0x5a/0x70 [nvme_core]
> [ 6552.155290]  nvme_requeue_work+0x5a/0x70 [nvme_core]
> [ 6552.155296]  process_one_work+0x1f4/0x3e0
> [ 6552.155299]  worker_thread+0x2d/0x3e0
> [ 6552.155302]  ? process_one_work+0x3e0/0x3e0
> [ 6552.155305]  kthread+0x10d/0x130
> [ 6552.155307]  ? kthread_park+0xa0/0xa0
> [ 6552.155311]  ret_from_fork+0x35/0x40
> 
> So we're not blocked on blk_queue_enter(), and it's a crash, not a deadlock.
> Blocking on blk_queue_enter() certainly plays a part here,
> but it seems not to be the full picture.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23  7:36           ` Sagi Grimberg
@ 2021-03-23 16:15             ` Christoph Hellwig
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-23 16:15 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chao Leng, Christoph Hellwig, Keith Busch, Jens Axboe,
	linux-block, linux-nvme

On Tue, Mar 23, 2021 at 12:36:40AM -0700, Sagi Grimberg wrote:
>> The process:
>> 1. nvme_ns_head_submit_bio calls srcu_read_lock(&head->srcu).
>> 2. nvme_ns_head_submit_bio adds the bio to current->bio_list instead of
>> waiting for the frozen queue.
>
> Nothing guarantees that you have a bio_list active at any point in time,
> in fact for a workload that submits one by one you will always drain
> that list directly in the submission...

It should always be active when ->submit_bio is called.

>
>> 3. nvme_ns_head_submit_bio calls srcu_read_unlock(&head->srcu, srcu_idx).
>> So nvme_ns_head_submit_bio does not hold head->srcu for long when the queue is
>> frozen, which can avoid the deadlock.
>>
>> Sagi, I suggest trying this patch.
>
> The above reproduces with the patch applied on upstream nvme code.

Weird.  I don't think the deadlock in your original report should
happen due to this.  Can you take a look at the callstacks in the
reproduced deadlock?  Either we're missing something obvious or it is
a somewhat different deadlock.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23  8:13             ` Chao Leng
@ 2021-03-23 16:17               ` Christoph Hellwig
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-23 16:17 UTC (permalink / raw)
  To: Chao Leng
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, Jens Axboe,
	linux-block, linux-nvme

On Tue, Mar 23, 2021 at 04:13:09PM +0800, Chao Leng wrote:
>> The above reproduces with the patch applied on upstream nvme code.
> The new patch (blk_mq_submit_bio_direct) will cause the bug again,
> because it reverts adding the bio to current->bio_list.
> Just try the upstream nvme code, and do not apply the new patch (blk_mq_submit_bio_direct).

Indeed, the patch removes the deferred submission, so the original
deadlock should not happen.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23 14:53             ` Keith Busch
@ 2021-03-23 16:19               ` Christoph Hellwig
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-23 16:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Hannes Reinecke, Sagi Grimberg, Christoph Hellwig, Jens Axboe,
	Chao Leng, linux-block, linux-nvme

On Tue, Mar 23, 2021 at 11:53:30PM +0900, Keith Busch wrote:
> Failover might be overkill. We can run out of tags in a perfectly normal
> situation, and simply waiting may be the best option, or even scheduling
> on a different CPU may be sufficient to get a viable tag  rather than
> selecting a different path.

Indeed.  Then again, IFF there are multiple optimized paths, picking one
could also be a way to make progress.

> Does it make sense to just abort all allocated tags during a reset and
> let the original bio requeue for multipath IO?

Well, this would again hard-code multipath information into the lower
levels.  But then again, we could at least do it so that we check for
REQ_NVME_MPATH to make it clear what is going on.
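
E.g. something like this in the cancel path -- purely illustrative, the
exact placement would need more thought:
-- 
	/* hypothetical: only fast-fail requests that multipath can fail over */
	if (req->cmd_flags & REQ_NVME_MPATH) {
		nvme_req(req)->status = NVME_SC_HOST_PATH_ERROR;
		blk_mq_complete_request(req);
	}
-- 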

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23 16:15             ` Christoph Hellwig
@ 2021-03-23 18:13               ` Sagi Grimberg
  -1 siblings, 0 replies; 48+ messages in thread
From: Sagi Grimberg @ 2021-03-23 18:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chao Leng, Keith Busch, Jens Axboe, linux-block, linux-nvme


>>> Sagi, I suggest trying this patch.
>>
>> The above reproduces with the patch applied on upstream nvme code.
> 
> Weird.  I don't think the deadlock in your original report should
> happen due to this.  Can you take a look at the callstacks in the
> reproduced deadlock?  Either we're missing something obvious or it is
> a somewhat different deadlock.

The deadlock in this patchset reproduces upstream. It is not possible to
update the kernel in the env in the original report.

So IFF we assume that this does not reproduce in upstream (pending
proof), is there something that we can do with stable fixes? This will
probably go back to everything that is before 5.8...

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23 18:13               ` Sagi Grimberg
@ 2021-03-23 18:22                 ` Christoph Hellwig
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-23 18:22 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Chao Leng, Keith Busch, Jens Axboe,
	linux-block, linux-nvme

On Tue, Mar 23, 2021 at 11:13:13AM -0700, Sagi Grimberg wrote:
> The deadlock in this patchset reproduces upstream. It is not possible to
> update the kernel in the env in the original report.
>
> So IFF we assume that this does not reproduce in upstream (pending
> proof), is there something that we can do with stable fixes? This will
> probably go back to everything that is before 5.8...

The direct_make_request removal should be pretty easily backportable.
In old kernels without the streamlined normal path it might cause minor
performance degradations, but the actual change should be trivial.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23 18:22                 ` Christoph Hellwig
@ 2021-03-23 19:00                   ` Sagi Grimberg
  -1 siblings, 0 replies; 48+ messages in thread
From: Sagi Grimberg @ 2021-03-23 19:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chao Leng, Keith Busch, Jens Axboe, linux-block, linux-nvme


>> The deadlock in this patchset reproduces upstream. It is not possible to
>> update the kernel in the env in the original report.
>>
>> So IFF we assume that this does not reproduce in upstream (pending
>> proof), is there something that we can do with stable fixes? This will
>> probably go back to everything that is before 5.8...
> 
> The direct_make_request removal should be pretty easily backportable.

Umm, the direct_make_request removal is based on the submit_bio_noacct
rework you've done. Are you suggesting that we replace it with
generic_make_request for these kernels?

> In old kernels without the streamlined normal path it might cause minor
> performance degradations, but the actual change should be trivial.

That is better than getting stuck in I/O failover...
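
For those kernels the change would presumably be as small as this
(untested sketch against the pre-5.9 nvme_ns_head_make_request):
-- 
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ blk_qc_t nvme_ns_head_make_request(struct request_queue *q, struct bio *bio)
-		ret = direct_make_request(bio);
+		ret = generic_make_request(bio);
-- 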

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23 19:00                   ` Sagi Grimberg
@ 2021-03-23 19:01                     ` Christoph Hellwig
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-03-23 19:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Chao Leng, Keith Busch, Jens Axboe,
	linux-block, linux-nvme

On Tue, Mar 23, 2021 at 12:00:34PM -0700, Sagi Grimberg wrote:
>
>>> The deadlock in this patchset reproduces upstream. It is not possible to
>>> update the kernel in the env in the original report.
>>>
>>> So IFF we assume that this does not reproduce in upstream (pending
>>> proof), is there something that we can do with stable fixes? This will
>>> probably go back to everything that is before 5.8...
>>
>> The direct_make_request removal should be pretty easily backportable.
>
> Umm, the direct_make_request removal is based on the submit_bio_noacct
> rework you've done. Are you suggesting that we replace it with
> generic_make_request for these kernels?

Yes.

^ permalink raw reply	[flat|nested] 48+ messages in thread
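
Concretely, the stable change being agreed on here would look something like the following in drivers/nvme/host/multipath.c (a sketch against a 5.4-era nvme_ns_head_make_request(); the surrounding context lines may not apply verbatim to every affected stable tree):

--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
 		bio->bi_opf |= REQ_NVME_MPATH;
 		trace_block_bio_remap(bio->bi_disk->queue, bio,
 				      disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
-		ret = direct_make_request(bio);
+		ret = generic_make_request(bio);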

* Re: [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device
  2021-03-23 19:01                     ` Christoph Hellwig
@ 2021-03-23 19:10                       ` Sagi Grimberg
  -1 siblings, 0 replies; 48+ messages in thread
From: Sagi Grimberg @ 2021-03-23 19:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chao Leng, Keith Busch, Jens Axboe, linux-block, linux-nvme


>>>> The deadlock in this patchset reproduces upstream. It is not possible to
>>>> update the kernel in the env in the original report.
>>>>
>>>> So IFF we assume that this does not reproduce in upstream (pending
>>>> proof), is there something that we can do with stable fixes? This will
>>>> probably go back to everything that is before 5.8...
>>>
>>> The direct_make_request removal should be pretty easily backportable.
>>
>> Umm, the direct_make_request removal is based on the submit_bio_noacct
>> rework you've done. Are you suggesting that we replace it with
>> generic_make_request for these kernels?
> 
> Yes.

OK, that should be testable...

^ permalink raw reply	[flat|nested] 48+ messages in thread

Thread overview:
2021-03-22  7:37 fix nvme-tcp and nvme-rdma controller reset hangs when using multipath Christoph Hellwig
2021-03-22  7:37 ` [PATCH 1/2] blk-mq: add a blk_mq_submit_bio_direct API Christoph Hellwig
2021-03-22 11:23   ` Hannes Reinecke
2021-03-22 15:30   ` Keith Busch
2021-03-22  7:37 ` [PATCH 2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device Christoph Hellwig
2021-03-22 11:22   ` Hannes Reinecke
2021-03-22 15:31   ` Keith Busch
2021-03-23  2:57   ` Sagi Grimberg
2021-03-23  3:23     ` Sagi Grimberg
2021-03-23  7:04       ` Chao Leng
2021-03-23  7:36         ` Sagi Grimberg
2021-03-23  8:13           ` Chao Leng
2021-03-23 16:17             ` Christoph Hellwig
2021-03-23 16:15           ` Christoph Hellwig
2021-03-23 18:13             ` Sagi Grimberg
2021-03-23 18:22               ` Christoph Hellwig
2021-03-23 19:00                 ` Sagi Grimberg
2021-03-23 19:01                   ` Christoph Hellwig
2021-03-23 19:10                     ` Sagi Grimberg
2021-03-23  7:28     ` Hannes Reinecke
2021-03-23  7:31       ` Sagi Grimberg
2021-03-23  8:36         ` Hannes Reinecke
2021-03-23 14:53           ` Keith Busch
2021-03-23 16:19             ` Christoph Hellwig
