linux-nvme.lists.infradead.org archive mirror
* [PATCH stable/5.4..5.8] nvme-mpath: replace direct_make_request with generic_make_request
@ 2021-04-02 20:08 Sagi Grimberg
  2021-04-03  6:10 ` Greg KH
  2021-04-09  9:33 ` Patch "nvme-mpath: replace direct_make_request with generic_make_request" has been added to the 5.4-stable tree gregkh
  0 siblings, 2 replies; 7+ messages in thread
From: Sagi Grimberg @ 2021-04-02 20:08 UTC (permalink / raw)
  To: stable; +Cc: Christoph Hellwig, Keith Busch, linux-nvme

The below patches caused a regression in a multipath setup:
Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")

These patches on their own are correct because they fixed a controller reset
regression.

When we reset/teardown a controller, we must freeze and quiesce the namespaces'
request queues to make sure that we safely stop inflight I/O submissions.
Freeze is mandatory because if our hctx map changed between reconnects,
blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
if it still has pending submissions (that are still quiesced) it will hang.
This is what the above patches fixed.

However, by freezing the namespaces' request queues, and only unfreezing them
when we successfully reconnect, inflight submissions that are running
concurrently can now block while holding the nshead srcu read lock until either
we successfully reconnect or ctrl_loss_tmo expires (or the user explicitly
disconnects).

This caused a deadlock [1] when a different controller (different path on the
same subsystem) became live (i.e. optimized/non-optimized). This is because
nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
in order to make sure that current_path is visible to future (re)submissions.
However, the srcu read lock is held by a submission blocked on a frozen request
queue, so synchronize_srcu cannot complete and we have a deadlock.

In recent kernels (v5.9+) direct_make_request was replaced by submit_bio_noacct,
which does not have this issue because the bio_list will be active when
nvme-mpath calls submit_bio_noacct on the bottom device (because it was
populated when submit_bio was triggered on it).

Hence, we need to fix all the kernels that were before submit_bio_noacct was
introduced.

[1]:
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
 __schedule+0x293/0x730
 schedule+0x33/0xa0
 schedule_timeout+0x1d3/0x2f0
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.21+0x91/0xc0
 synchronize_srcu_expedited+0x27/0x30
 synchronize_srcu+0xce/0xe0
 nvme_mpath_set_live+0x64/0x130 [nvme_core]
 nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
 nvme_update_ana_state+0xcd/0xe0 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_read_ana_log+0x76/0x100 [nvme_core]
 nvme_mpath_init+0x122/0x180 [nvme_core]
 nvme_init_identify+0x80e/0xe20 [nvme_core]
 nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
 nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Note: This patch does not exist upstream; it is a pure
backport fix that was just now found. The reason for that is
that this specific issue exists on kernels 5.4-5.8, as it
was fixed in 5.9, and the patches that caused it were only
backported to linux-5.4.y (they are correct, as mentioned
in the patch description).

 drivers/nvme/host/multipath.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 041a755f936a..0d9d0bebe645 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -339,7 +339,7 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
 		trace_block_bio_remap(bio->bi_disk->queue, bio,
 				      disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
-		ret = direct_make_request(bio);
+		ret = generic_make_request(bio);
 	} else if (nvme_available_path(head)) {
 		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
 
-- 
2.27.0


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [PATCH stable/5.4..5.8] nvme-mpath: replace direct_make_request with generic_make_request
  2021-04-02 20:08 [PATCH stable/5.4..5.8] nvme-mpath: replace direct_make_request with generic_make_request Sagi Grimberg
@ 2021-04-03  6:10 ` Greg KH
  2021-04-07  1:04   ` Sagi Grimberg
  2021-04-09  9:33 ` Patch "nvme-mpath: replace direct_make_request with generic_make_request" has been added to the 5.4-stable tree gregkh
  1 sibling, 1 reply; 7+ messages in thread
From: Greg KH @ 2021-04-03  6:10 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: stable, Christoph Hellwig, Keith Busch, linux-nvme

On Fri, Apr 02, 2021 at 01:08:41PM -0700, Sagi Grimberg wrote:
> The below patches caused a regression in a multipath setup:
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
> 
> These patches on their own are correct because they fixed a controller reset
> regression.
> 
> When we reset/teardown a controller, we must freeze and quiesce the namespaces'
> request queues to make sure that we safely stop inflight I/O submissions.
> Freeze is mandatory because if our hctx map changed between reconnects,
> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
> if it still has pending submissions (that are still quiesced) it will hang.
> This is what the above patches fixed.
> 
> However, by freezing the namespaces' request queues, and only unfreezing them
> when we successfully reconnect, inflight submissions that are running
> concurrently can now block while holding the nshead srcu read lock until either
> we successfully reconnect or ctrl_loss_tmo expires (or the user explicitly
> disconnects).
> 
> This caused a deadlock [1] when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is because
> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
> in order to make sure that current_path is visible to future (re)submissions.
> However, the srcu read lock is held by a submission blocked on a frozen request
> queue, so synchronize_srcu cannot complete and we have a deadlock.
> 
> In recent kernels (v5.9+) direct_make_request was replaced by submit_bio_noacct,
> which does not have this issue because the bio_list will be active when
> nvme-mpath calls submit_bio_noacct on the bottom device (because it was
> populated when submit_bio was triggered on it).
> 
> Hence, we need to fix all the kernels that were before submit_bio_noacct was
> introduced.

Why can we not just add submit_bio_noacct to the 5.4 kernel to correct
this?  What commit id is that?

thanks,

greg k-h


* Re: [PATCH stable/5.4..5.8] nvme-mpath: replace direct_make_request with generic_make_request
  2021-04-03  6:10 ` Greg KH
@ 2021-04-07  1:04   ` Sagi Grimberg
  2021-04-07  5:28     ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2021-04-07  1:04 UTC (permalink / raw)
  To: Greg KH; +Cc: stable, Christoph Hellwig, Keith Busch, linux-nvme


>> Hence, we need to fix all the kernels that were before submit_bio_noacct was
>> introduced.
> 
> Why can we not just add submit_bio_noacct to the 5.4 kernel to correct
> this?  What commit id is that?

Hey Greg,

submit_bio_noacct was applied as part of a rework by Christoph that I
didn't feel was suitable as a stable candidate. The commit-id is:
ed00aabd5eb9fb44d6aff1173234a2e911b9fead


* Re: [PATCH stable/5.4..5.8] nvme-mpath: replace direct_make_request with generic_make_request
  2021-04-07  1:04   ` Sagi Grimberg
@ 2021-04-07  5:28     ` Christoph Hellwig
  2021-04-07 23:18       ` Sagi Grimberg
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2021-04-07  5:28 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Greg KH, stable, Christoph Hellwig, Keith Busch, linux-nvme

On Tue, Apr 06, 2021 at 06:04:09PM -0700, Sagi Grimberg wrote:
>
>>> Hence, we need to fix all the kernels that were before submit_bio_noacct was
>>> introduced.
>>
>> Why can we not just add submit_bio_noacct to the 5.4 kernel to correct
>> this?  What commit id is that?
>
> Hey Greg,
>
> submit_bio_noacct was applied as part of a rework by Christoph that I
> didn't feel was suitable as a stable candidate. The commit-id is:
> ed00aabd5eb9fb44d6aff1173234a2e911b9fead

submit_bio_noacct really is just a new name for generic_make_request,
as the old one was horribly misleading.  So this does use
submit_bio_noacct, just with its old name.


* Re: [PATCH stable/5.4..5.8] nvme-mpath: replace direct_make_request with generic_make_request
  2021-04-07  5:28     ` Christoph Hellwig
@ 2021-04-07 23:18       ` Sagi Grimberg
  2021-04-09  9:43         ` Greg KH
  0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2021-04-07 23:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Greg KH, stable, Keith Busch, linux-nvme


>>>> Hence, we need to fix all the kernels that were before submit_bio_noacct was
>>>> introduced.
>>>
>>> Why can we not just add submit_bio_noacct to the 5.4 kernel to correct
>>> this?  What commit id is that?
>>
>> Hey Greg,
>>
>> submit_bio_noacct was applied as part of a rework by Christoph that I
>> didn't feel was suitable as a stable candidate. The commit-id is:
>> ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> 
> submit_bio_noacct really is just a new name for generic_make_request,
> as the old one was horribly misleading.  So this does use
> submit_bio_noacct, just with its old name.

commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead does not apply
cleanly on any of these kernels, so I think it's better to take this
small one-liner instead of trying to fit a commit that changes the
name treewide.

Greg, what is your preference? backporting
ed00aabd5eb9fb44d6aff1173234a2e911b9fead to the various kernels
or to take this isolated one?


* Patch "nvme-mpath: replace direct_make_request with generic_make_request" has been added to the 5.4-stable tree
  2021-04-02 20:08 [PATCH stable/5.4..5.8] nvme-mpath: replace direct_make_request with generic_make_request Sagi Grimberg
  2021-04-03  6:10 ` Greg KH
@ 2021-04-09  9:33 ` gregkh
  1 sibling, 0 replies; 7+ messages in thread
From: gregkh @ 2021-04-09  9:33 UTC (permalink / raw)
  To: gregkh, hch, kbusch, linux-nvme, sagi; +Cc: stable-commits


This is a note to let you know that I've just added the patch titled

    nvme-mpath: replace direct_make_request with generic_make_request

to the 5.4-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     nvme-mpath-replace-direct_make_request-with-generic_make_request.patch
and it can be found in the queue-5.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.


From sagi@grimberg.me  Fri Apr  9 11:33:14 2021
From: Sagi Grimberg <sagi@grimberg.me>
Date: Fri,  2 Apr 2021 13:08:41 -0700
Subject: nvme-mpath: replace direct_make_request with generic_make_request
To: <stable@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>, Keith Busch <kbusch@kernel.org>, linux-nvme@lists.infradead.org
Message-ID: <20210402200841.347696-1-sagi@grimberg.me>

From: Sagi Grimberg <sagi@grimberg.me>

The below patches caused a regression in a multipath setup:
Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")

These patches on their own are correct because they fixed a controller reset
regression.

When we reset/teardown a controller, we must freeze and quiesce the namespaces'
request queues to make sure that we safely stop inflight I/O submissions.
Freeze is mandatory because if our hctx map changed between reconnects,
blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
if it still has pending submissions (that are still quiesced) it will hang.
This is what the above patches fixed.

However, by freezing the namespaces' request queues, and only unfreezing them
when we successfully reconnect, inflight submissions that are running
concurrently can now block while holding the nshead srcu read lock until either
we successfully reconnect or ctrl_loss_tmo expires (or the user explicitly
disconnects).

This caused a deadlock [1] when a different controller (different path on the
same subsystem) became live (i.e. optimized/non-optimized). This is because
nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
in order to make sure that current_path is visible to future (re)submissions.
However, the srcu read lock is held by a submission blocked on a frozen request
queue, so synchronize_srcu cannot complete and we have a deadlock.

In recent kernels (v5.9+) direct_make_request was replaced by submit_bio_noacct,
which does not have this issue because the bio_list will be active when
nvme-mpath calls submit_bio_noacct on the bottom device (because it was
populated when submit_bio was triggered on it).

Hence, we need to fix all the kernels that were before submit_bio_noacct was
introduced.

[1]:
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
 __schedule+0x293/0x730
 schedule+0x33/0xa0
 schedule_timeout+0x1d3/0x2f0
 wait_for_completion+0xba/0x140
 __synchronize_srcu.part.21+0x91/0xc0
 synchronize_srcu_expedited+0x27/0x30
 synchronize_srcu+0xce/0xe0
 nvme_mpath_set_live+0x64/0x130 [nvme_core]
 nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
 nvme_update_ana_state+0xcd/0xe0 [nvme_core]
 nvme_parse_ana_log+0xa1/0x180 [nvme_core]
 nvme_read_ana_log+0x76/0x100 [nvme_core]
 nvme_mpath_init+0x122/0x180 [nvme_core]
 nvme_init_identify+0x80e/0xe20 [nvme_core]
 nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
 nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/nvme/host/multipath.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -330,7 +330,7 @@ static blk_qc_t nvme_ns_head_make_reques
 		trace_block_bio_remap(bio->bi_disk->queue, bio,
 				      disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
-		ret = direct_make_request(bio);
+		ret = generic_make_request(bio);
 	} else if (nvme_available_path(head)) {
 		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
 


Patches currently in stable-queue which might be from sagi@grimberg.me are

queue-5.4/nvme-mpath-replace-direct_make_request-with-generic_make_request.patch


* Re: [PATCH stable/5.4..5.8] nvme-mpath: replace direct_make_request with generic_make_request
  2021-04-07 23:18       ` Sagi Grimberg
@ 2021-04-09  9:43         ` Greg KH
  0 siblings, 0 replies; 7+ messages in thread
From: Greg KH @ 2021-04-09  9:43 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, stable, Keith Busch, linux-nvme

On Wed, Apr 07, 2021 at 04:18:30PM -0700, Sagi Grimberg wrote:
> 
> > > > > Hence, we need to fix all the kernels that were before submit_bio_noacct was
> > > > > introduced.
> > > > 
> > > > Why can we not just add submit_bio_noacct to the 5.4 kernel to correct
> > > > this?  What commit id is that?
> > > 
> > > Hey Greg,
> > > 
> > > submit_bio_noacct was applied as part of a rework by Christoph that I
> > > didn't feel was suitable as a stable candidate. The commit-id is:
> > > ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> > 
> > submit_bio_noacct really is just a new name for generic_make_request,
> > as the old one was horribly misleading.  So this does use
> > submit_bio_noacct, just with its old name.
> 
> commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead does not apply
> cleanly on any of these kernels, so I think it's better to take this
> small one-liner instead of trying to fit a commit that changes the
> name treewide.
> 
> Greg, what is your preference? backporting
> ed00aabd5eb9fb44d6aff1173234a2e911b9fead to the various kernels
> or to take this isolated one?

I took just this change now, thanks.

greg k-h

