* [PATCH v2] nvme-fabrics: allow to queue requests for live queues
@ 2020-07-29  7:58 Sagi Grimberg
  2020-07-29 19:34 ` James Smart
  0 siblings, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2020-07-29  7:58 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch, James Smart

Right now we fail requests based on the controller
state (which is checked inline in nvmf_check_ready); however,
we should accept requests whenever the queue is live.

When entering controller reset, we transition the controller
into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
non-mpath requests (which have blk_noretry_request set).

This is also the case for NVME_REQ_USERCMD requests, and for the
wrong reason. There shouldn't be any reason for us to reject this
I/O in a controller reset. We do want to prevent passthru commands
on the admin queue because we need the controller to fully initialize
before we let user passthru admin commands be issued.

In a non-mpath setup, this means that the requests are simply
requeued over and over forever, never allowing the q_usage_counter
to drop its final reference, which causes controller reset to hang
when running concurrently with heavy I/O.

While we are at it, remove the redundant NVME_CTRL_NEW case, which
should never see any I/O as it must first transition to
NVME_CTRL_CONNECTING.

Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Changes from v1:
- fail passthru commands on the admin queue if the controller is not LIVE
  because we want to make sure they don't precede controller init.

 drivers/nvme/host/fabrics.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 4ec4829d6233..8575724734e0 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -565,10 +565,14 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	struct nvme_request *req = nvme_req(rq);
 
 	/*
-	 * If we are in some state of setup or teardown only allow
-	 * internally generated commands.
+	 * currently we have a problem sending passthru commands
+	 * on the admin_q if the controller is not LIVE because we can't
+	 * make sure that they are going out after the admin connect,
+	 * controller enable and/or other commands in the initialization
+	 * sequence. until the controller will be LIVE, fail with
+	 * BLK_STS_RESOURCE so that they will be rescheduled.
 	 */
-	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
+	if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
 		return false;
 
 	/*
@@ -576,9 +580,8 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	 * which is require to set the queue live in the appropinquate states.
 	 */
 	switch (ctrl->state) {
-	case NVME_CTRL_NEW:
 	case NVME_CTRL_CONNECTING:
-		if (nvme_is_fabrics(req->cmd) &&
+		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
 		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
 			return true;
 		break;
-- 
2.25.1



* Re: [PATCH v2] nvme-fabrics: allow to queue requests for live queues
  2020-07-29  7:58 [PATCH v2] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
@ 2020-07-29 19:34 ` James Smart
  2020-07-29 22:09   ` Sagi Grimberg
  0 siblings, 1 reply; 5+ messages in thread
From: James Smart @ 2020-07-29 19:34 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 7/29/2020 12:58 AM, Sagi Grimberg wrote:
> Right now we are failing requests based on the controller
> state (which is checked inline in nvmf_check_ready) however
> we should definitely accept requests if the queue is live.
>
> When entering controller reset, we transition the controller
> into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
> non-mpath requests (have blk_noretry_request set).
>
> This is also the case for NVME_REQ_USER for the wrong reason.
> There shouldn't be any reason for us to reject this I/O in a
> controller reset. We do want to prevent passthru commands on
> the admin queue because we need the controller to fully initialize
> first before we let user passthru admin commands to be issued.
>
> In a non-mpath setup, this means that the requests will simply
> be requeued over and over forever not allowing the q_usage_counter
> to drop its final reference, causing controller reset to hang
> if running concurrently with heavy I/O.

I've been looking at this trying to understand the real issues.

First issue: even if the check_ready checks pass, there's nothing that 
says the transport won't return BLK_STS_RESOURCE for its own reasons.  
Maybe tcp/rdma don't, but FC does today for cases of lost connectivity - 
and a loss of connectivity will cause a Resetting state transition.
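
Roughly, the gate in a transport's queue_rq looks like the sketch below
(made-up transport and connectivity flag; nvmf_check_ready() and
nvmf_fail_nonready_command() are the real fabrics helpers):

/* Sketch only - not FC's actual code path. */
static blk_status_t nvme_foo_queue_rq(struct blk_mq_hw_ctx *hctx,
                const struct blk_mq_queue_data *bd)
{
        struct nvme_foo_queue *queue = hctx->driver_data;
        struct request *rq = bd->rq;
        bool queue_ready = test_bit(NVME_FOO_Q_LIVE, &queue->flags);

        /* generic fabrics gate: controller state + queue liveness */
        if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
                return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);

        /*
         * transport-private gate: e.g. an FC-style loss of connectivity
         * can still force a requeue even though check_ready said go.
         */
        if (!queue->ctrl->link_up)      /* hypothetical flag */
                return BLK_STS_RESOURCE;

        /* ... map data and post the command on the wire ... */
        return BLK_STS_OK;
}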

I agree with the
+    if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
change, as we don't yet have an accepted way to serialize/prioritize 
Connect and initialization commands. We need to do this.

Which then leaves:
The patch isn't changing any behavior when !q->live - which occurs in 
the latter steps of Resetting as well as most of the Connecting state.  
So any I/O received while !q->live is still getting queued/retried. So 
what was the problem? It has to be limited in scope to the start of the 
Resetting state, while q->live has yet to transition to !live. But I 
don't see any freezing in this path.

Looking at the original problem, the lockups are:
1) the sysfs write to do a reset - which is stalled waiting on a flush 
of reset_work
2) reset_work is stalled on a call to blk_mq_update_nr_hw_queues(), which 
then does a blk_mq_freeze_queue_wait.

The 2nd one is odd in my mind, as update_nr_hw_queues is 
something done while connecting and I wouldn't think the reset flush 
must wait for a reconnect as well.  Although not the fix, I'd recommend 
not calling nvme_tcp_setup_ctrl() from reset_work and instead having it 
schedule nvme_tcp_reconnect_ctrl_work().

As tcp did change state to Connecting before making the 
update_nr_hw_queues call, yet we didn't change the behavior of 
rescheduling if Connecting and !q->live, what did the patch actually do 
to help?  The patch only changed the queuing behavior for Connecting 
and q->live.  So whether or not we hit it depends on the timing of 
when a new I/O is received vs. the call to update_nr_hw_queues.  The 
patch didn't help this.

-- james






* Re: [PATCH v2] nvme-fabrics: allow to queue requests for live queues
  2020-07-29 19:34 ` James Smart
@ 2020-07-29 22:09   ` Sagi Grimberg
  2020-07-30  0:18     ` James Smart
  0 siblings, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2020-07-29 22:09 UTC (permalink / raw)
  To: James Smart, linux-nvme, Christoph Hellwig, Keith Busch


>> Right now we are failing requests based on the controller
>> state (which is checked inline in nvmf_check_ready) however
>> we should definitely accept requests if the queue is live.
>>
>> When entering controller reset, we transition the controller
>> into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
>> non-mpath requests (have blk_noretry_request set).
>>
>> This is also the case for NVME_REQ_USER for the wrong reason.
>> There shouldn't be any reason for us to reject this I/O in a
>> controller reset. We do want to prevent passthru commands on
>> the admin queue because we need the controller to fully initialize
>> first before we let user passthru admin commands to be issued.
>>
>> In a non-mpath setup, this means that the requests will simply
>> be requeued over and over forever not allowing the q_usage_counter
>> to drop its final reference, causing controller reset to hang
>> if running concurrently with heavy I/O.
> 
> I've been looking at this trying to understand the real issues.
> 
> First issue: even if the check_ready checks pass, there's nothing that 
> says the transport won't return BLK_STS_RESOURCE for it's own reasons. 
> Maybe Tcp/rdma don't, but FC does today for cases of lost connectivity - 
> and a loss of connectivity will cause a Resetting state transition.

That's fine. As I said, the controller state should only advise us to
look further (i.e. at queue_live), not blindly fail these commands.

> I agree with
> +    if (rq->q == ctrl->admin_q && (req->flags & NVME_REQ_USERCMD))
> change as until we do have acceptance of a way to serialize/prioritize 
> Connect and initialization commands. We need to do this.
> 
> Which then leaves:
> The patch isn't changing any behavior when ! q->live - which occurs in 
> the latter steps of Resetting as well as most of the Connecting state. 
> So any I/O received while !q->live is still getting queued/retried.

Correct.

> So 
> what was the problem ?  It has to be limited in scope to either the 
> start of the Resetting state while q->live had yet to transition to 
> !live. But I don't see any freezing in this path.

When the queue is frozen, the q_usage_counter is killed and waited on
until it reaches 0, and that will not happen until all the requests
that took a q_usage_counter reference have completed. So we cannot
get into a state where requests are requeued forever due to
BLK_STS_RESOURCE, as they still hold the reference that needs to be
dropped.
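
Schematically (the block-layer calls are the real ones; the surrounding
function is just a sketch of what a reset path does):

static void nvme_foo_reset_freeze_sketch(struct request_queue *q)
{
        blk_freeze_queue_start(q);      /* kill q->q_usage_counter */

        /* ... tear down and re-establish the transport queues ... */

        /*
         * Completes only once every request that already took a
         * q_usage_counter reference has finished. A request that is
         * endlessly requeued with BLK_STS_RESOURCE still holds its
         * reference, so this wait (and blk_mq_update_nr_hw_queues(),
         * which freezes internally) hangs forever.
         */
        blk_mq_freeze_queue_wait(q);

        blk_mq_unfreeze_queue(q);       /* pairs with the freeze start */
}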

> Looking at the original problem that lock ups are:
> 1) the sysfs write to do a reset - which is stalled waiting on a flush 
> of reset_work
> 2) reset_work is stalled on a call to blk_mq_update_nr_hw_queues() which 
> then does a blk_mq_freeze_queue_wait.

Right, and it blocks because we have commands that are requeued
indefinitely due to the fact that we refused to accept them when
we should have.

> The 2nd one is odd in my mind as the the update_nr_hw_queues is 
> something done while connecting and I wouldn't think the reset flush 
> must wait for a reconnect as well.

We cannot transition to LIVE and restart traffic before we update
the mappings to the new number of queues (the controller suddenly
allowed fewer queues, or we got a cpu hotplug, or whatever).
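
i.e. something along the lines of the sketch below has to run before we
go LIVE (blk_mq_update_nr_hw_queues() is the real call; the wrapper is
schematic):

/*
 * Sketch: after reconnect the controller may have granted a different
 * number of I/O queues, so remap before letting I/O back in.
 * queue_count here counts the admin queue + the I/O queues.
 */
static void nvme_foo_remap_queues_sketch(struct nvme_ctrl *ctrl)
{
        /* freezes every ns request_queue and waits for q_usage_counter */
        blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
}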

> Although not the fix, I'd recommend
> to not call nvme_tcp_setup_ctrl() from reset_work and instead have it 
> schedule nvme_tcp_reconnect_ctrl_work().

What is the difference? We cannot resubmit traffic, hence we cannot
complete it, hence the freeze cannot complete.

As I see it, what the driver needs to do in reset is (rough sketch
after the list):
1. start freeze - prevent new requests from taking the q_usage_counter
2. quiesce - prevent new commands from entering queue_rq
3. stop and drain all transport queues
4. cancel all inflight commands (tagset_iter or other)
5. start transport queues again - allow only connect to pass
6. unquiesce the queues - to allow any requests that are stuck
    between (1) and (2) (blocked on the quiesce but already holding
    the q_usage_counter) to run and complete.

--> at this point anyone waiting for queue freeze should now be
     able to complete (e.g. update_nr_hw_queues).

7. wait for started freeze (paired to balance the freeze_counter)
8. unfreeze the queue - paired with (1)
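
Rough sketch of that ordering (nvme_start_freeze, nvme_stop_queues,
nvme_start_queues, nvme_wait_freeze, nvme_unfreeze and
blk_mq_tagset_busy_iter are the real core helpers; the nvme_foo_*
transport steps are placeholders):

static void nvme_foo_reset_ctrl_sketch(struct nvme_ctrl *ctrl)
{
        nvme_start_freeze(ctrl);   /* (1) new I/O can't take q_usage_counter */
        nvme_stop_queues(ctrl);    /* (2) quiesce queue_rq */

        nvme_foo_stop_and_drain_queues(ctrl);        /* (3) transport specific */
        blk_mq_tagset_busy_iter(ctrl->tagset,
                        nvme_cancel_request, ctrl);  /* (4) fail inflight */

        nvme_foo_start_transport_queues(ctrl); /* (5) only connect passes (!q->live) */
        nvme_start_queues(ctrl);   /* (6) unquiesce, stuck requests complete */

        /* freeze waiters (e.g. blk_mq_update_nr_hw_queues) can finish now */

        nvme_wait_freeze(ctrl);    /* (7) wait for the freeze started in (1) */
        nvme_unfreeze(ctrl);       /* (8) pairs with (1) */
}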

> As tcp did change state to Connecting before making the 
> update_nr_hw_queues call, yet we didn't change the behavior of 
> rescheduling if Connecting and !q->live, what did the patch actually do 
> to help ?

As I said, we shouldn't have I/O commands that are requeued forever,
preventing the freeze from completing. If we are CONNECTING, then
the queues are not started, so we shouldn't see any I/O commands
coming yet.

> The patch only changed the queuing behavior for Connecting
> and q->live.  So whether or not we hit it is dependent on the timing of 
> when a new io is received vs the call to hr_hw_queues.   The patch 
> didn't help this.

There are some supporting patches I sent for tcp and rdma (to operate
in the order I explained above).

I'll resend them as a set soon.


* Re: [PATCH v2] nvme-fabrics: allow to queue requests for live queues
  2020-07-29 22:09   ` Sagi Grimberg
@ 2020-07-30  0:18     ` James Smart
  2020-07-30  1:23       ` Sagi Grimberg
  0 siblings, 1 reply; 5+ messages in thread
From: James Smart @ 2020-07-30  0:18 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch



On 7/29/2020 3:09 PM, Sagi Grimberg wrote:
>
>> So what was the problem ?  It has to be limited in scope to either 
>> the start of the Resetting state while q->live had yet to transition 
>> to !live. But I don't see any freezing in this path.
>
> When the queue is frozen, the q_usage_counter is killed, and waited
> until it reaches to 0, and that will not happen if all the requests
> that got the q_usage_counter reference are completed. So we cannot
> get into a state where requests are requeued due to BLK_STS_RESOURCE,
> as they still have the reference that need to be dropped.
>
>> Looking at the original problem that lock ups are:
>> 1) the sysfs write to do a reset - which is stalled waiting on a 
>> flush of reset_work
>> 2) reset_work is stalled on a call to blk_mq_update_nr_hw_queues() 
>> which then does a blk_mq_freeze_queue_wait.
>
> Right, and it blocks because we have commands that are requeued
> indefinitely due to the fact that we refused to accept them while
> we should have.

I understand what you are saying, but.

Invariably, the queues are unquiesced to keep connect going through (all 
queues) and more (admin queue), and as new I/O will come in during that 
time, we have no recourse but to bounce the I/O, as the queue isn't live. 
And it has to be a resource status, so it is requeued, because we can't 
fail the I/O - the failure, in a non-mpath scenario, will go all the way 
back to the caller, which is bad news.  So we're in a catch-22.

That was why I suggested perhaps we need to make mpath always be 
involved, even for a single-path (non-mpath) device, so that it can catch 
and queue at its level (as if all paths are gone) for a short time 
period, rather than letting the error go back to the issuer.

>
>> The 2nd one is odd in my mind as the the update_nr_hw_queues is 
>> something done while connecting and I wouldn't think the reset flush 
>> must wait for a reconnect as well.
>
> We cannot transition to LIVE and restart traffic before we update
> the mappings to the new number of queues (controller suddenly allowed
> less queues, or we got cpu hotplug or whatever).

agree

>
>> Although not the fix, I'd recommend
>> to not call nvme_tcp_setup_ctrl() from reset_work and instead have it 
>> schedule nvme_tcp_reconnect_ctrl_work().
>
> What is the difference?

It didn't matter for the issue we're really trying to solve. Just 
something that looked odd.

>
> As I see it what the driver needs to do in reset is:
> 1. start freeze - prevent from new requests from taking the 
> q_usage_counter
> 2. quiesce - prevent new commands from entering queue_rq
> 3. stop and drain all transport queues
> 4. cancel all inflight commands (tagset_iter or other)
> 5. start transport queues again - allow only connect to pass
Augmented for the admin-q to allow other controller init commands as well.

> 6. unquiesce the queues - to allow any requests that are stuck
>    between (1) and (2) (blocked on quiesced but already have
>    the q_usage_counter) to run and complete.
>
> --> at this point anyone waiting for queue freeze should now be
>     able to complete (e.g. update_nr_hw_queues).
>
> 7. wait for started freeze (paired to balance the freeze_counter)
> 8. unfreeze the queue - paired with (1)

ok. But you never stated where in these steps the transport queue is no 
longer live.  I think you're saying (6) will process the I/O, not abort 
it like the "drain" step, so it can run to completion on the link.  
That's ok for a user-initiated reset, but if the reset is due to a 
link-side error, such as the FC case, the queues aren't live and the only 
way to complete the I/O is to fail it with some error. And for non-mpath, 
that error may surface back to the issuer.

>
>> As tcp did change state to Connecting before making the 
>> update_nr_hw_queues call, yet we didn't change the behavior of 
>> rescheduling if Connecting and !q->live, what did the patch actually 
>> do to help ?
>
> As I said, we shouldn't have I/O commands that are requeued forever
> preventing the freeze from completing. if we are CONNECTING, then
> the queues are not started, so we shouldn't see any I/O commands
> coming yet.

well - ok, and we're back to why they are started. Let's say we're not 
worried about connect ordering - another reason for the continued running 
is so that, during an extended reconnect period that may take a couple of 
retries, we can have a fast_io_fail timeout to start failing those 
rescheduled normal I/Os. Even then - fast_io_fail should have been with 
mpath I/Os, so they already should have been fine. Any fast_io_fail would 
still run the risk of returning an error to the issuer.


>
>> The patch only changed the queuing behavior for Connecting
>> and q->live.  So whether or not we hit it is dependent on the timing 
>> of when a new io is received vs the call to hr_hw_queues.   The patch 
>> didn't help this.
>
> There are some supporting patches I sent for tcp and rdma (to operate
> in the order I explained above).
>
> I'll resend them as a set soon.

ok - but unless I'm missing something we have some holes.

-- james



* Re: [PATCH v2] nvme-fabrics: allow to queue requests for live queues
  2020-07-30  0:18     ` James Smart
@ 2020-07-30  1:23       ` Sagi Grimberg
  0 siblings, 0 replies; 5+ messages in thread
From: Sagi Grimberg @ 2020-07-30  1:23 UTC (permalink / raw)
  To: James Smart, linux-nvme, Christoph Hellwig, Keith Busch


>>> So what was the problem ?  It has to be limited in scope to either 
>>> the start of the Resetting state while q->live had yet to transition 
>>> to !live. But I don't see any freezing in this path.
>>
>> When the queue is frozen, the q_usage_counter is killed, and waited
>> until it reaches to 0, and that will not happen if all the requests
>> that got the q_usage_counter reference are completed. So we cannot
>> get into a state where requests are requeued due to BLK_STS_RESOURCE,
>> as they still have the reference that need to be dropped.
>>
>>> Looking at the original problem that lock ups are:
>>> 1) the sysfs write to do a reset - which is stalled waiting on a 
>>> flush of reset_work
>>> 2) reset_work is stalled on a call to blk_mq_update_nr_hw_queues() 
>>> which then does a blk_mq_freeze_queue_wait.
>>
>> Right, and it blocks because we have commands that are requeued
>> indefinitely due to the fact that we refused to accept them while
>> we should have.
> 
> I understand what you are saying, but.
> 
> Invariably, the queues are unquiesced to keep connect going through (all 
> queues)

No, the connect has its own connect_q, which is not quiesced, so the
connect passes through while I/O commands are kept quiesced.
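
Sketch of what I mean (nvmf_connect_io_queue() is the real fabrics
helper; the loop around it is schematic):

static int nvme_foo_connect_io_queues_sketch(struct nvme_ctrl *ctrl)
{
        int i, ret;

        for (i = 1; i < ctrl->queue_count; i++) {
                /* issued on ctrl->connect_q, which we never quiesce */
                ret = nvmf_connect_io_queue(ctrl, i, false);
                if (ret)
                        return ret;
        }
        /* namespace queues (ns->queue) stay quiesced until step (6) */
        return 0;
}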

> and more (admin queue),

For the admin queue we have the temporary workaround, which still needs
to be fixed.

> and as new io will come in that time, we 
> have no recourse but to bounce the io as the q isn't live. And it has to 
> be a resource status so it is requeued as we can't fail the io as the 
> failure, in a non-mpath scenario, will go all the way back to the caller 
> which is bad news.  So we're in a catch-22.

See above: for I/O the issue is solved; for admin we need a similar
solution.

> That was why I suggested perhaps we needed to make mpath always be 
> involved, even for a single-path device (non-mpath) so that it can catch 
> and queue at it's level (as if all paths are gone), for a short time 
> period, rather than letting the error go back to the issuer.

This needs to work also for non-mpath devices. I don't think that we
should force mpath.

>>
>>> The 2nd one is odd in my mind as the the update_nr_hw_queues is 
>>> something done while connecting and I wouldn't think the reset flush 
>>> must wait for a reconnect as well.
>>
>> We cannot transition to LIVE and restart traffic before we update
>> the mappings to the new number of queues (controller suddenly allowed
>> less queues, or we got cpu hotplug or whatever).
> 
> agree
> 
>>
>>> Although not the fix, I'd recommend
>>> to not call nvme_tcp_setup_ctrl() from reset_work and instead have it 
>>> schedule nvme_tcp_reconnect_ctrl_work().
>>
>> What is the difference?
> 
> it didn't matter for the issue we're really trying to solve. Just 
> something that looked odd.

I don't necessarily think so; this update_nr_hw_queues still needs
to happen before we allow new I/O in, regardless of the context.

>> As I see it what the driver needs to do in reset is:
>> 1. start freeze - prevent from new requests from taking the 
>> q_usage_counter
>> 2. quiesce - prevent new commands from entering queue_rq
>> 3. stop and drain all transport queues
>> 4. cancel all inflight commands (tagset_iter or other)
>> 5. start transport queues again - allow only connect to pass
> augmented for admin-q to allow other controller init commands as well.

Yes.

>> 6. unquiesce the queues - to allow any requests that are stuck
>>    between (1) and (2) (blocked on quiesced but already have
>>    the q_usage_counter) to run and complete.
>>
>> --> at this point anyone waiting for queue freeze should now be
>>     able to complete (e.g. update_nr_hw_queues).
>>
>> 7. wait for started freeze (paired to balance the freeze_counter)
>> 8. unfreeze the queue - paired with (1)
> 
> ok. But you never stated where in these steps the transport queue is no 
> longer live.

The transport queues are no longer live after the queues have been
frozen and quiesced.

>  I think you're saying (6) will process the io, not abort 
> it like the "drain" step, so it can run to completion on the link. 
> That's ok for a user-initiated reset, but if the reset is due to a 
> link-side error such as the FC case, the queues aren't live and the only 
> way to complete the io is to fail it with some error. And for non-mpath, 
> that error may surface back to the issuer.

Error recovery due to link-side errors also exists in tcp and rdma. Step
(6) happens only after we have successfully recreated the I/O queues.
For link-side errors, the queues are unquiesced immediately.

But I think we do have an issue here. If the link side failed exactly
when we connected, we allowed commands to pass through (step 6) and be
sent to a controller that is now unresponsive (in the middle of a
reset/reconnect), and we don't have any way to fail them, because even
if we do fail them in timeout, the core will retry and the freeze
will still not complete.

Perhaps Keith can help,
Keith,

How does that work for pci?
Assume that you are in the middle of a reset, with I/O inflight; you
got the controller established and then, all of a sudden, it stopped
responding. I see that the timeout handler kicks in and you delete
the controller (forcing requests to fail). That is not a good
option for fabrics, because we want to stick around for when
the controller comes back online (sometime in the future).

>>> As tcp did change state to Connecting before making the 
>>> update_nr_hw_queues call, yet we didn't change the behavior of 
>>> rescheduling if Connecting and !q->live, what did the patch actually 
>>> do to help ?
>>
>> As I said, we shouldn't have I/O commands that are requeued forever
>> preventing the freeze from completing. if we are CONNECTING, then
>> the queues are not started, so we shouldn't see any I/O commands
>> coming yet.
> 
> well - ok, and we're back to why they are started. Let's say we're not 
> worred about connect ordering

Connect is ordered for I/O; we should fix it for admin as well.

> - another reason for the continued running
> is so that during an extended reconnect period that may take a couple of 
> retries, we can have a fast_io_fail timeout to start failing those 
> rescheduled normal ios. even then - fast_io_fail should have been with 
> mpath ios so they already should have been fine. Any fast_io_fail would 
> still run the risk of returning an error to the issuer.

Not sure I follow your point...

