Linux-NVME Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] nvme-fabrics: allow to queue requests for live queues
@ 2020-07-28  5:35 Sagi Grimberg
  2020-07-28  6:44 ` Chao Leng
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Sagi Grimberg @ 2020-07-28  5:35 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: James Smart

Right now we are failing requests based on the controller
state (which is checked inline in nvmf_check_ready) however
we should definitely accept requests if the queue is live.

When entering controller reset, we transition the controller
into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
non-mpath requests (have blk_noretry_request set).

This is also the case for NVME_REQ_USER for some reason, but
there shouldn't be any reason for us to reject this I/O in a
controller reset.

In a non-mpath setup, this means that the requests will simply
be requeued over and over forever not allowing the q_usage_counter
to drop its final reference, causing controller reset to hang
if running concurrently with heavy I/O.

While we are at it, remove the redundant NVME_CTRL_NEW case, which
should never see any I/O as it must first transition to
NVME_CTRL_CONNECTING.

Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/fabrics.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 4ec4829d6233..2e7838f42e36 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 {
 	struct nvme_request *req = nvme_req(rq);
 
-	/*
-	 * If we are in some state of setup or teardown only allow
-	 * internally generated commands.
-	 */
-	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
-		return false;
-
 	/*
 	 * Only allow commands on a live queue, except for the connect command,
 	 * which is require to set the queue live in the appropinquate states.
 	 */
 	switch (ctrl->state) {
-	case NVME_CTRL_NEW:
 	case NVME_CTRL_CONNECTING:
-		if (nvme_is_fabrics(req->cmd) &&
+		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
 		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
 			return true;
 		break;
-- 
2.25.1


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28  5:35 [PATCH] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
@ 2020-07-28  6:44 ` Chao Leng
  2020-07-28  6:49   ` Sagi Grimberg
  2020-07-28 10:50 ` Christoph Hellwig
  2020-07-29  5:45 ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: Chao Leng @ 2020-07-28  6:44 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch; +Cc: James Smart



On 2020/7/28 13:35, Sagi Grimberg wrote:
> Right now we are failing requests based on the controller
> state (which is checked inline in nvmf_check_ready) however
> we should definitely accept requests if the queue is live.
> 
> When entering controller reset, we transition the controller
> into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
> non-mpath requests (have blk_noretry_request set).
> 
> This is also the case for NVME_REQ_USER for some reason, but
> there shouldn't be any reason for us to reject this I/O in a
> controller reset.
> 
> In a non-mpath setup, this means that the requests will simply
> be requeued over and over forever not allowing the q_usage_counter
> to drop its final reference, causing controller reset to hang
> if running concurrently with heavy I/O.
> 
> While we are at it, remove the redundant NVME_CTRL_NEW case, which
> should never see any I/O as it must first transition to
> NVME_CTRL_CONNECTING.
> 
> Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   drivers/nvme/host/fabrics.c | 10 +---------
>   1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 4ec4829d6233..2e7838f42e36 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>   {
>   	struct nvme_request *req = nvme_req(rq);
>   
> -	/*
> -	 * If we are in some state of setup or teardown only allow
> -	 * internally generated commands.
> -	 */
> -	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
"if (!blk_rq_is_passthrough(rq))" should not delete. Because if we delete,
the normal io will be send to target, the target can not treat the io
if the queue is not NVME_CTRL_LIVE.
> -		return false;
> -
>   	/*
>   	 * Only allow commands on a live queue, except for the connect command,
>   	 * which is require to set the queue live in the appropinquate states.
>   	 */
>   	switch (ctrl->state) {
> -	case NVME_CTRL_NEW:
>   	case NVME_CTRL_CONNECTING:
> -		if (nvme_is_fabrics(req->cmd) &&
> +		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
If nvme_is_fabrics(req->cmd) is true, blk_rq_is_passthrough(rq) must
be ture. why need add check blk_rq_is_passthrough(rq)?
>   		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
>   			return true;
>   		break;
> 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28  6:44 ` Chao Leng
@ 2020-07-28  6:49   ` Sagi Grimberg
  2020-07-28 17:11     ` James Smart
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2020-07-28  6:49 UTC (permalink / raw)
  To: Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch; +Cc: James Smart


>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>> index 4ec4829d6233..2e7838f42e36 100644
>> --- a/drivers/nvme/host/fabrics.c
>> +++ b/drivers/nvme/host/fabrics.c
>> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, 
>> struct request *rq,
>>   {
>>       struct nvme_request *req = nvme_req(rq);
>> -    /*
>> -     * If we are in some state of setup or teardown only allow
>> -     * internally generated commands.
>> -     */
>> -    if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
> "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we delete,
> the normal io will be send to target, the target can not treat the io
> if the queue is not NVME_CTRL_LIVE.

Sure it does, the only reason for us to deny this I/O, is if the queue
is not live. The controller state should only _advise_ us if we need to
look at the queue state.

>> -        return false;
>> -
>>       /*
>>        * Only allow commands on a live queue, except for the connect 
>> command,
>>        * which is require to set the queue live in the appropinquate 
>> states.
>>        */
>>       switch (ctrl->state) {
>> -    case NVME_CTRL_NEW:
>>       case NVME_CTRL_CONNECTING:
>> -        if (nvme_is_fabrics(req->cmd) &&
>> +        if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
> If nvme_is_fabrics(req->cmd) is true, blk_rq_is_passthrough(rq) must
> be ture. why need add check blk_rq_is_passthrough(rq)?

req->cmd is only valid for passthru commands, hence we need to check for
that before we reference req->cmd.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28  5:35 [PATCH] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
  2020-07-28  6:44 ` Chao Leng
@ 2020-07-28 10:50 ` Christoph Hellwig
  2020-07-28 16:50   ` James Smart
  2020-07-29  5:45 ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2020-07-28 10:50 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

Applied to nvme-5.9, thanks.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28 10:50 ` Christoph Hellwig
@ 2020-07-28 16:50   ` James Smart
  0 siblings, 0 replies; 14+ messages in thread
From: James Smart @ 2020-07-28 16:50 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg; +Cc: Keith Busch, linux-nvme


On 7/28/2020 3:50 AM, Christoph Hellwig wrote:
> Applied to nvme-5.9, thanks.

No - I disagree with this patch. I'll reply in the other thread. Where 
is the delay I see with other patches before being applied ?

-- james


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28  6:49   ` Sagi Grimberg
@ 2020-07-28 17:11     ` James Smart
  2020-07-28 17:50       ` Sagi Grimberg
  0 siblings, 1 reply; 14+ messages in thread
From: James Smart @ 2020-07-28 17:11 UTC (permalink / raw)
  To: Sagi Grimberg, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch



On 7/27/2020 11:49 PM, Sagi Grimberg wrote:
>
>>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>>> index 4ec4829d6233..2e7838f42e36 100644
>>> --- a/drivers/nvme/host/fabrics.c
>>> +++ b/drivers/nvme/host/fabrics.c
>>> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl 
>>> *ctrl, struct request *rq,
>>>   {
>>>       struct nvme_request *req = nvme_req(rq);
>>> -    /*
>>> -     * If we are in some state of setup or teardown only allow
>>> -     * internally generated commands.
>>> -     */
>>> -    if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
>> "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we 
>> delete,
>> the normal io will be send to target, the target can not treat the io
>> if the queue is not NVME_CTRL_LIVE.
>
> Sure it does, the only reason for us to deny this I/O, is if the queue
> is not live. The controller state should only _advise_ us if we need to
> look at the queue state.

I disagree strongly with removing the check on NVME_REQ_USERCMD. We've 
seen cli ioctls going to the admin queue while we're in the middle of 
doing controller initialization and it's has hosed the controller state 
in some cases. We've seen commands issued before the controller is in 
the proper state.  The admin queue may be live - but we don't 
necessarily want other io sneaking in.

As for the blk_rq_is_passthrough check - I guess I can see it being 
based on the queue state, and the check looks ok  (we should never see 
!blk_rq_is_passthrough on the admin q).
But...
- I don't know why it was that important to change it. On the connecting 
path, all you're doing is letting io start flowing before all the queues 
have been created.  Did you really need to start that much sooner ?

- But on the resetting path, or deleting cases, you've added a condition 
now where the controller state was changed, but there was a delay before 
the transport marked the queue live. It's common practice in the 
transports to change state then schedule a work element to perform the 
actual state change.  Why would you want io to continue to flow during 
that window ?   This may bring out other problems we've avoided in the past.

-- james



_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28 17:11     ` James Smart
@ 2020-07-28 17:50       ` Sagi Grimberg
  2020-07-28 20:11         ` James Smart
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2020-07-28 17:50 UTC (permalink / raw)
  To: James Smart, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch


>>>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>>>> index 4ec4829d6233..2e7838f42e36 100644
>>>> --- a/drivers/nvme/host/fabrics.c
>>>> +++ b/drivers/nvme/host/fabrics.c
>>>> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl 
>>>> *ctrl, struct request *rq,
>>>>   {
>>>>       struct nvme_request *req = nvme_req(rq);
>>>> -    /*
>>>> -     * If we are in some state of setup or teardown only allow
>>>> -     * internally generated commands.
>>>> -     */
>>>> -    if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
>>> "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we 
>>> delete,
>>> the normal io will be send to target, the target can not treat the io
>>> if the queue is not NVME_CTRL_LIVE.
>>
>> Sure it does, the only reason for us to deny this I/O, is if the queue
>> is not live. The controller state should only _advise_ us if we need to
>> look at the queue state.
> 
> I disagree strongly with removing the check on NVME_REQ_USERCMD. We've 
> seen cli ioctls going to the admin queue while we're in the middle of 
> doing controller initialization and it's has hosed the controller state 
> in some cases. We've seen commands issued before the controller is in 
> the proper state.  The admin queue may be live - but we don't 
> necessarily want other io sneaking in.

Can you please give an example? NVME_REQ_USERCMD should not be any
different from any other type of I/O.

Also, do note that pci does allow to queue any type of command based
on the queue state only. fabrics should be slightly different because
we have the CONNECTING state where we want to let the connect command
only to be issued.


> As for the blk_rq_is_passthrough check - I guess I can see it being 
> based on the queue state, and the check looks ok  (we should never see 
> !blk_rq_is_passthrough on the admin q).
> But...
> - I don't know why it was that important to change it. On the connecting 
> path, all you're doing is letting io start flowing before all the queues 
> have been created.  Did you really need to start that much sooner ?

The issue is that controller in RESETTING state will have requests that
are being issued, and if we don't let it pass through, it will hang
around forever being requeued preventing queue freeze to complete.

See bug report from Krishnamraju Eraparaju.

> 
> - But on the resetting path, or deleting cases, you've added a condition 
> now where the controller state was changed, but there was a delay before 
> the transport marked the queue live. It's common practice in the 
> transports to change state then schedule a work element to perform the 
> actual state change.  Why would you want io to continue to flow during 
> that window ?   This may bring out other problems we've avoided in the 
> past.

What are you referring to? the change here? the controller reset must
allow requests that came in before we started queue freeze to pass,
otherwise freeze will never complete.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28 17:50       ` Sagi Grimberg
@ 2020-07-28 20:11         ` James Smart
  2020-07-28 20:38           ` Sagi Grimberg
  0 siblings, 1 reply; 14+ messages in thread
From: James Smart @ 2020-07-28 20:11 UTC (permalink / raw)
  To: Sagi Grimberg, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch



On 7/28/2020 10:50 AM, Sagi Grimberg wrote:
>
>>>>> diff --git a/drivers/nvme/host/fabrics.c 
>>>>> b/drivers/nvme/host/fabrics.c
>>>>> index 4ec4829d6233..2e7838f42e36 100644
>>>>> --- a/drivers/nvme/host/fabrics.c
>>>>> +++ b/drivers/nvme/host/fabrics.c
>>>>> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl 
>>>>> *ctrl, struct request *rq,
>>>>>   {
>>>>>       struct nvme_request *req = nvme_req(rq);
>>>>> -    /*
>>>>> -     * If we are in some state of setup or teardown only allow
>>>>> -     * internally generated commands.
>>>>> -     */
>>>>> -    if (!blk_rq_is_passthrough(rq) || (req->flags & 
>>>>> NVME_REQ_USERCMD))
>>>> "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we 
>>>> delete,
>>>> the normal io will be send to target, the target can not treat the io
>>>> if the queue is not NVME_CTRL_LIVE.
>>>
>>> Sure it does, the only reason for us to deny this I/O, is if the queue
>>> is not live. The controller state should only _advise_ us if we need to
>>> look at the queue state.
>>
>> I disagree strongly with removing the check on NVME_REQ_USERCMD. 
>> We've seen cli ioctls going to the admin queue while we're in the 
>> middle of doing controller initialization and it's has hosed the 
>> controller state in some cases. We've seen commands issued before the 
>> controller is in the proper state.  The admin queue may be live - but 
>> we don't necessarily want other io sneaking in.
>
> Can you please give an example? NVME_REQ_USERCMD should not be any
> different from any other type of I/O.

Keep in mind  - in order to send any command, Fabric connect included, 
the admin queue must be unquiesced.

Here's what we have today:
At the time of the reset or immediately during the reset, an ioctl is 
done - it can (and has been seen) to be queued on the request queue as 
there are not state checks. When this io gets scheduled, as of the 
resetting or even the connecting state - it's currently failing the 
NVME_REQ_USERCMD check, which fails back into 
nvmf_fail_nonready_command, which fails back with BLK_STS_RESOURCE, 
which reschedules it.  This rescheduling reoccurs, until the its tried 
again with the controller state changed (usually LIVE) which then passes 
the admin-queue live check. It's good to go.

With your patch:
The ioctl will at least be requeued with BLK_STS_RESOURCE until after 
the Fabric Connect is completed (Connect is explicitly allowed by the 
state-specific check) as hopefully queue->live is false. It may actually 
be issued if we're in this resetting state only to be terminated 
afterward. All of the transports set the queue live flag as of Fabric 
Connect complete. If the ioctl is retried as of that point - it may be 
immediately sent on the admin queue - before or in-between any of the 
commands issued by nvme_enable_ctrl() - including before controller 
enabled, nvme_init_identify(), and before nvme_start_ctrl() has been 
called.    Hopefully none of these commands try to set/change 
configuration.   They simply shouldn't be allowed until after the 
controller has transitioned to LIVE. There's no need to allow them 
in-between.


>
> Also, do note that pci does allow to queue any type of command based
> on the queue state only. fabrics should be slightly different because
> we have the CONNECTING state where we want to let the connect command
> only to be issued.
>

That doesn't make me feel any better.  PCI bit-bangs register writes, 
the odds they can fail/be discarded on the "fabric" is near zero. Admin 
Commands - both sending/processing/responding is much much faster than 
on a fabric, where the cmd can be dropped/delayed by fabric routing. 
Odds of having core-generated commands fail while 
reconnecting/reinitializing are much much lower.   All says - fabrics 
have a much higher command and inter-command latency than pcie does - so 
pcie is going to have a harder time seeing these cases.

>
>> As for the blk_rq_is_passthrough check - I guess I can see it being 
>> based on the queue state, and the check looks ok  (we should never 
>> see !blk_rq_is_passthrough on the admin q).
>> But...
>> - I don't know why it was that important to change it. On the 
>> connecting path, all you're doing is letting io start flowing before 
>> all the queues have been created.  Did you really need to start that 
>> much sooner ?
>
> The issue is that controller in RESETTING state will have requests that
> are being issued, and if we don't let it pass through, it will hang
> around forever being requeued preventing queue freeze to complete.

is it really "forever" or is it until controller loss timeout ? granted 
10minutes does seem like forever


>
> See bug report from Krishnamraju Eraparaju.

I'll go look.


>
>>
>> - But on the resetting path, or deleting cases, you've added a 
>> condition now where the controller state was changed, but there was a 
>> delay before the transport marked the queue live. It's common 
>> practice in the transports to change state then schedule a work 
>> element to perform the actual state change.  Why would you want io to 
>> continue to flow during that window ?   This may bring out other 
>> problems we've avoided in the past.
>
> What are you referring to? the change here? the controller reset must
> allow requests that came in before we started queue freeze to pass,
> otherwise freeze will never complete.

I'm referring to the io's that now pass through with the patch as the 
controller state is no longer blocking them (e.g. the removal of the 
!blk_rq_is_passthrough and NVME_REQ_USERCMD checks). They will pass the 
ready checks and go out on the wire, creating more active io that has to 
be deleted when the reset finally quiesces and terminates the active io 
list. For FC, we have to real link traffic for any io that gets started, 
so it raises the demands.   Should be fine, but it is creating more load 
at a point where it could have been blocked.

-- james




_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28 20:11         ` James Smart
@ 2020-07-28 20:38           ` Sagi Grimberg
  2020-07-28 22:47             ` James Smart
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2020-07-28 20:38 UTC (permalink / raw)
  To: James Smart, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch


>> Can you please give an example? NVME_REQ_USERCMD should not be any
>> different from any other type of I/O.
> 
> Keep in mind  - in order to send any command, Fabric connect included, 
> the admin queue must be unquiesced.
> 
> Here's what we have today:
> At the time of the reset or immediately during the reset, an ioctl is 
> done - it can (and has been seen) to be queued on the request queue as 
> there are not state checks. When this io gets scheduled, as of the 
> resetting or even the connecting state - it's currently failing the 
> NVME_REQ_USERCMD check, which fails back into 
> nvmf_fail_nonready_command, which fails back with BLK_STS_RESOURCE, 
> which reschedules it.  This rescheduling reoccurs, until the its tried 
> again with the controller state changed (usually LIVE) which then passes 
> the admin-queue live check. It's good to go.
> 
> With your patch:
> The ioctl will at least be requeued with BLK_STS_RESOURCE until after 
> the Fabric Connect is completed (Connect is explicitly allowed by the 
> state-specific check) as hopefully queue->live is false. It may actually 
> be issued if we're in this resetting state only to be terminated 
> afterward. All of the transports set the queue live flag as of Fabric 
> Connect complete. If the ioctl is retried as of that point - it may be 
> immediately sent on the admin queue - before or in-between any of the 
> commands issued by nvme_enable_ctrl() - including before controller 
> enabled, nvme_init_identify(), and before nvme_start_ctrl() has been 
> called.    Hopefully none of these commands try to set/change 
> configuration.   They simply shouldn't be allowed until after the 
> controller has transitioned to LIVE. There's no need to allow them 
> in-between.

What you are describing is correct, that is why we have connect_q to
save us from it for I/O queues.

The problem I have with this code is that checking for NVME_REQ_USERCMD
is a big hammer that you land on what can be normal I/O
(e.g. nvme read/write/write-zeros), and we must not rely on this
indication to prevent what you are describing.

It maybe (just maybe) OK to check NVME_REQ_USERCMD only for the admin
queue, but we are really circling over the fact that we cannot reliably
send admin connect before to go off first, because we have to unquiesce
it before we issue the admin connect, and there might be a stray command
going into execution before we submit the admin connect.

>> Also, do note that pci does allow to queue any type of command based
>> on the queue state only. fabrics should be slightly different because
>> we have the CONNECTING state where we want to let the connect command
>> only to be issued.
>>
> 
> That doesn't make me feel any better.  PCI bit-bangs register writes, 
> the odds they can fail/be discarded on the "fabric" is near zero. Admin 
> Commands - both sending/processing/responding is much much faster than 
> on a fabric, where the cmd can be dropped/delayed by fabric routing. 
> Odds of having core-generated commands fail while 
> reconnecting/reinitializing are much much lower.   All says - fabrics 
> have a much higher command and inter-command latency than pcie does - so 
> pcie is going to have a harder time seeing these cases.

Timing here is not the issue, we should take a step back and understand
that we fail any user command for a very specific reason. We should
solve the root cause and not paper around it.

>>> As for the blk_rq_is_passthrough check - I guess I can see it being 
>>> based on the queue state, and the check looks ok  (we should never 
>>> see !blk_rq_is_passthrough on the admin q).
>>> But...
>>> - I don't know why it was that important to change it. On the 
>>> connecting path, all you're doing is letting io start flowing before 
>>> all the queues have been created.  Did you really need to start that 
>>> much sooner ?
>>
>> The issue is that controller in RESETTING state will have requests that
>> are being issued, and if we don't let it pass through, it will hang
>> around forever being requeued preventing queue freeze to complete.
> 
> is it really "forever" or is it until controller loss timeout ? granted 
> 10minutes does seem like forever

ctrl_loss_tmo can be a 1000000 years, it cannot rescue us.

>>> - But on the resetting path, or deleting cases, you've added a 
>>> condition now where the controller state was changed, but there was a 
>>> delay before the transport marked the queue live. It's common 
>>> practice in the transports to change state then schedule a work 
>>> element to perform the actual state change.  Why would you want io to 
>>> continue to flow during that window ?   This may bring out other 
>>> problems we've avoided in the past.
>>
>> What are you referring to? the change here? the controller reset must
>> allow requests that came in before we started queue freeze to pass,
>> otherwise freeze will never complete.
> 
> I'm referring to the io's that now pass through with the patch as the 
> controller state is no longer blocking them (e.g. the removal of the 
> !blk_rq_is_passthrough and NVME_REQ_USERCMD checks).

First of all, you have a bug hiding there if blk_mq_update_nr_hw_queues
actually ever does anything, because it won't be able to reliably freeze
the queue (because the !blk_rq_is_passthrough keeps the commands
endlessly rescheduling itself, not ever completing hence the queue
cannot be frozen). This is what started this, because today we support
multiple queue maps, it always ends up freezing the queue, so it can
happen much more frequently.

  They will pass the
> ready checks and go out on the wire, creating more active io that has to 
> be deleted when the reset finally quiesces and terminates the active io 
> list. For FC, we have to real link traffic for any io that gets started, 
> so it raises the demands.   Should be fine, but it is creating more load 
> at a point where it could have been blocked.

As I said, freezing the queue is impossible when we let these commands
hang in limbo state, and queue freeze may occur in a lot of places, so
we must either let them pass, or fence them when starting the freeze.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28 20:38           ` Sagi Grimberg
@ 2020-07-28 22:47             ` James Smart
  2020-07-28 23:39               ` Sagi Grimberg
  0 siblings, 1 reply; 14+ messages in thread
From: James Smart @ 2020-07-28 22:47 UTC (permalink / raw)
  To: Sagi Grimberg, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch



On 7/28/2020 1:38 PM, Sagi Grimberg wrote:
>
>>> Can you please give an example? NVME_REQ_USERCMD should not be any
>>> different from any other type of I/O.
>> ...
>
> The problem I have with this code is that checking for NVME_REQ_USERCMD
> is a big hammer that you land on what can be normal I/O
> (e.g. nvme read/write/write-zeros), and we must not rely on this
> indication to prevent what you are describing.
>
> It maybe (just maybe) OK to check NVME_REQ_USERCMD only for the admin
> queue, but we are really circling over the fact that we cannot reliably
> send admin connect before to go off first, because we have to unquiesce
> it before we issue the admin connect, and there might be a stray command
> going into execution before we submit the admin connect.

agree.

Didn't you suggest something separate to isolate the connect 
commands/init commands ?   Can we use a different tag set for 
"initialization" commands?  We can use different queue_rq routines with 
different check ready checks.  I don't like the overhead, but sure makes 
the logic more straightforward.


>>>> As for the blk_rq_is_passthrough check - I guess I can see it being 
>>>> based on the queue state, and the check looks ok  (we should never 
>>>> see !blk_rq_is_passthrough on the admin q).
>>>> But...
>>>> - I don't know why it was that important to change it. On the 
>>>> connecting path, all you're doing is letting io start flowing 
>>>> before all the queues have been created.  Did you really need to 
>>>> start that much sooner ?
>>>
>>> The issue is that controller in RESETTING state will have requests that
>>> are being issued, and if we don't let it pass through, it will hang
>>> around forever being requeued preventing queue freeze to complete.
>>

Isn't this the real problem:  we require the request to complete in 
order to get off the request queue in order to freeze. But to complete - 
it has to finish with some status. This is ok if running multipath, as 
mpath queues and retries - but what mpath is not running ? The issuer of 
the io will get the error ?   am I understanding it ?   If so I don't 
think a simple check reorg such as proposed will work as ultimately we 
can't return an error to something that doesn't know how to requeue/retry.

Should we always be putting even non-mp devices under mp, thus mp 
catches the errors rather than the blk queue, and can have a short delay 
while the controller reconnects ?  At least normal io can be handled 
this way and we'd have to figure out the ioctl path.   But it should 
still allow the core io, such as scanning, to get the error and fail 
out, to be picked up after the reconnect.

-- james



_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28 22:47             ` James Smart
@ 2020-07-28 23:39               ` Sagi Grimberg
  0 siblings, 0 replies; 14+ messages in thread
From: Sagi Grimberg @ 2020-07-28 23:39 UTC (permalink / raw)
  To: James Smart, Chao Leng, linux-nvme, Christoph Hellwig, Keith Busch

>>>> Can you please give an example? NVME_REQ_USERCMD should not be any
>>>> different from any other type of I/O.
>>> ...
>>
>> The problem I have with this code is that checking for NVME_REQ_USERCMD
>> is a big hammer that you land on what can be normal I/O
>> (e.g. nvme read/write/write-zeros), and we must not rely on this
>> indication to prevent what you are describing.
>>
>> It maybe (just maybe) OK to check NVME_REQ_USERCMD only for the admin
>> queue, but we are really circling over the fact that we cannot reliably
>> send admin connect before to go off first, because we have to unquiesce
>> it before we issue the admin connect, and there might be a stray command
>> going into execution before we submit the admin connect.
> 
> agree.
> 
> Didn't you suggest something separate to isolate the connect 
> commands/init commands ?   Can we use a different tag set for 
> "initialization" commands?  We can use different queue_rq routines with 
> different check ready checks.  I don't like the overhead, but sure makes 
> the logic more straightforward.

I think I've added this now that I'm looking back:
e7832cb48a65 ("nvme: make fabrics command run on a separate request queue")

But it still doesn't solve the user commands you are referring to.

>>>>> As for the blk_rq_is_passthrough check - I guess I can see it being 
>>>>> based on the queue state, and the check looks ok  (we should never 
>>>>> see !blk_rq_is_passthrough on the admin q).
>>>>> But...
>>>>> - I don't know why it was that important to change it. On the 
>>>>> connecting path, all you're doing is letting io start flowing 
>>>>> before all the queues have been created.  Did you really need to 
>>>>> start that much sooner ?
>>>>
>>>> The issue is that controller in RESETTING state will have requests that
>>>> are being issued, and if we don't let it pass through, it will hang
>>>> around forever being requeued preventing queue freeze to complete.
>>>
> 
> Isn't this the real problem:  we require the request to complete in 
> order to get off the request queue in order to freeze. But to complete - 
> it has to finish with some status. This is ok if running multipath, as 
> mpath queues and retries - but what mpath is not running ? The issuer of 
> the io will get the error ?

We already fail noretry errors when we cancel (or abort) anything that
was submitted before, so there is nothing new here. non-mpath
does not set FAILFAST so it gets requeued, user commands do, so they
are failed, this is not different than before.

> Should we always be putting even non-mp devices under mp, thus mp 
> catches the errors rather than the blk queue, and can have a short delay 
> while the controller reconnects ?  At least normal io can be handled 
> this way and we'd have to figure out the ioctl path.   But it should 
> still allow the core io, such as scanning, to get the error and fail 
> out, to be picked up after the reconnect.

non-mpath still gets requeued because it doesn't set any sort of
FAILFAST flag.
or

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-28  5:35 [PATCH] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
  2020-07-28  6:44 ` Chao Leng
  2020-07-28 10:50 ` Christoph Hellwig
@ 2020-07-29  5:45 ` Christoph Hellwig
  2020-07-29  5:53   ` Sagi Grimberg
  2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2020-07-29  5:45 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

Dropped from nvme-5.9 to wait to the discussion to finish as I want
to send a PR to Jens today..

On Mon, Jul 27, 2020 at 10:35:23PM -0700, Sagi Grimberg wrote:
> Right now we are failing requests based on the controller
> state (which is checked inline in nvmf_check_ready) however
> we should definitely accept requests if the queue is live.
> 
> When entering controller reset, we transition the controller
> into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for
> non-mpath requests (have blk_noretry_request set).
> 
> This is also the case for NVME_REQ_USER for some reason, but
> there shouldn't be any reason for us to reject this I/O in a
> controller reset.
> 
> In a non-mpath setup, this means that the requests will simply
> be requeued over and over forever not allowing the q_usage_counter
> to drop its final reference, causing controller reset to hang
> if running concurrently with heavy I/O.
> 
> While we are at it, remove the redundant NVME_CTRL_NEW case, which
> should never see any I/O as it must first transition to
> NVME_CTRL_CONNECTING.
> 
> Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  drivers/nvme/host/fabrics.c | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 4ec4829d6233..2e7838f42e36 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>  {
>  	struct nvme_request *req = nvme_req(rq);
>  
> -	/*
> -	 * If we are in some state of setup or teardown only allow
> -	 * internally generated commands.
> -	 */
> -	if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
> -		return false;
> -
>  	/*
>  	 * Only allow commands on a live queue, except for the connect command,
>  	 * which is require to set the queue live in the appropinquate states.
>  	 */
>  	switch (ctrl->state) {
> -	case NVME_CTRL_NEW:
>  	case NVME_CTRL_CONNECTING:
> -		if (nvme_is_fabrics(req->cmd) &&
> +		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
>  		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
>  			return true;
>  		break;
> -- 
> 2.25.1
---end quoted text---

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-29  5:45 ` Christoph Hellwig
@ 2020-07-29  5:53   ` Sagi Grimberg
  2020-07-29  6:05     ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2020-07-29  5:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, James Smart, linux-nvme


> Dropped from nvme-5.9 to wait to the discussion to finish as I want
> to send a PR to Jens today..

OK, this is a bug fix, so we can send it later for sure.

I can also send a patch that doesn't touch the USERCMDs that
James is concerned about and we will fix that later...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] nvme-fabrics: allow to queue requests for live queues
  2020-07-29  5:53   ` Sagi Grimberg
@ 2020-07-29  6:05     ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2020-07-29  6:05 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, James Smart

On Tue, Jul 28, 2020 at 10:53:41PM -0700, Sagi Grimberg wrote:
>
>> Dropped from nvme-5.9 to wait to the discussion to finish as I want
>> to send a PR to Jens today..
>
> OK, this is a bug fix, so we can send it later for sure.
>
> I can also send a patch that doesn't touch the USERCMDs that
> James is concerned about and we will fix that later...

I need to catch up on the actual discussion.  I'm heading off to a
vacation today and just need to get everything in order first.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, back to index

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-28  5:35 [PATCH] nvme-fabrics: allow to queue requests for live queues Sagi Grimberg
2020-07-28  6:44 ` Chao Leng
2020-07-28  6:49   ` Sagi Grimberg
2020-07-28 17:11     ` James Smart
2020-07-28 17:50       ` Sagi Grimberg
2020-07-28 20:11         ` James Smart
2020-07-28 20:38           ` Sagi Grimberg
2020-07-28 22:47             ` James Smart
2020-07-28 23:39               ` Sagi Grimberg
2020-07-28 10:50 ` Christoph Hellwig
2020-07-28 16:50   ` James Smart
2020-07-29  5:45 ` Christoph Hellwig
2020-07-29  5:53   ` Sagi Grimberg
2020-07-29  6:05     ` Christoph Hellwig

Linux-NVME Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-nvme/0 linux-nvme/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-nvme linux-nvme/ https://lore.kernel.org/linux-nvme \
		linux-nvme@lists.infradead.org
	public-inbox-index linux-nvme

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-nvme


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git