linux-nvme.lists.infradead.org archive mirror
From: James Smart <james.smart@broadcom.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	Victor Gladkov <Victor.Gladkov@kioxia.com>,
	Hannes Reinecke <hare@suse.de>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: Re: [PATCH] nvme-fabrics: reject I/O to offline device
Date: Wed, 18 Dec 2019 14:20:38 -0800	[thread overview]
Message-ID: <bef8f5a3-5dee-ba7d-7423-8ab130b1aa65@broadcom.com> (raw)
In-Reply-To: <73006c25-b6a8-fc36-0789-772e3ea59a02@grimberg.me>



On 12/17/2019 1:46 PM, Sagi Grimberg wrote:
>
>
> On 12/9/19 7:30 AM, Victor Gladkov wrote:
>>> On 12/8/19 14:18, Hannes Reinecke wrote:
>>>
>>> On 12/6/19 11:18 PM, Sagi Grimberg wrote:
>>>>
>>>>>> ---
>>>>>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>>>>>> index 74b8818..b58abc1 100644
>>>>>> --- a/drivers/nvme/host/fabrics.c
>>>>>> +++ b/drivers/nvme/host/fabrics.c
>>>>>> @@ -549,6 +549,8 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
>>>>>>    {
>>>>>>           if (ctrl->state != NVME_CTRL_DELETING &&
>>>>>>               ctrl->state != NVME_CTRL_DEAD &&
>>>>>> +           !(ctrl->state == NVME_CTRL_CONNECTING &&
>>>>>> +            ((ktime_get_ns() - rq->start_time_ns) > jiffies_to_nsecs(rq->timeout))) &&
>>>>>>               !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>>>>>>                   return BLK_STS_RESOURCE;
>>>>>>
>>>>>
>>>>> Did you test this to ensure it's doing what you expect. I'm not sure
>>>>> that all the timers are set right at this point. Most I/O's timeout
>>>>> from a deadline time stamped at blk_mq_start_request(). But that
>>>>> routine is actually called by the transports post the
>>>>> nvmf_check_ready/fail_nonready calls.  E.g. the io is not yet in
>>>>> flight, thus queued, and the blk-mq internal queuing doesn't count
>>>>> against the io timeout.  I can't see anything that guarantees
>>>>> start_time_ns is set.
>>>>
>>>> I'm not sure this behavior for failing I/O is always desired. Some
>>>> consumers would actually not want the I/O to fail prematurely if we
>>>> are not multipathing...
>>>>
>>>> I think we need a fail_fast_tmo set in when establishing the
>>>> controller to get it right.
>>>>
>>> Agreed. This whole patch looks like someone is trying to reimplement
>>> fast_io_fail_tmo / dev_loss_tmo.
>>> As we're moving into unreliable fabrics I guess we'll need a similar 
>>> mechanism.
>>>
>>> Cheers,
>>>
>>> Hannes
>>
>>
>> Following your suggestions, I added a new session parameter called 
>> "fast_fail_tmo".
>> The timeout is measured in seconds from the start of the controller
>> reconnect; any command issued beyond that timeout is rejected.
>> The new parameter value may be passed during ‘connect’, and its 
>> default value is 30 seconds.
>
> The default should be consistent with the existing behavior.
>
>> A value of -1 means no timeout (similar to the current behavior).
>>
>> ---
>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>> index 74b8818..ed6b911 100644
>> --- a/drivers/nvme/host/fabrics.c
>> +++ b/drivers/nvme/host/fabrics.c
>> @@ -406,6 +406,7 @@
>>       }
>>
>>       ctrl->cntlid = le16_to_cpu(res.u16);
>> +    ctrl->start_reconnect_ns = ktime_get_ns();
>>
>>   out_free_data:
>>       kfree(data);
>> @@ -474,8 +475,12 @@
>>   bool nvmf_should_reconnect(struct nvme_ctrl *ctrl)
>>   {
>>       if (ctrl->opts->max_reconnects == -1 ||
>> -        ctrl->nr_reconnects < ctrl->opts->max_reconnects)
>> +        ctrl->nr_reconnects < ctrl->opts->max_reconnects) {
>> +        if (ctrl->nr_reconnects == 0)
>> +            ctrl->start_reconnect_ns = ktime_get_ns();
>> +
>>           return true;
>> +    }
>>
>>       return false;
>>   }
>> @@ -549,6 +554,8 @@
>>   {
>>       if (ctrl->state != NVME_CTRL_DELETING &&
>>           ctrl->state != NVME_CTRL_DEAD &&
>> +            !(ctrl->state == NVME_CTRL_CONNECTING && ctrl->opts->fail_fast_tmo_ns >= 0 &&
>> +            ((ktime_get_ns() - ctrl->start_reconnect_ns) > ctrl->opts->fail_fast_tmo_ns)) &&
>>           !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>>           return BLK_STS_RESOURCE;
>
> I cannot comprehend what is going on here...
>
> We should have a dedicated delayed_work that transitions the controller
> to a FAIL_FAST state and cancels the inflight requests again. This
> work should be triggered when the error is detected.

I hope you're not suggesting a FAILFAST state.  No new controller state 
is needed.

I do agree that managing the time since transitioning to CONNECTING can 
be handled better and can address "abort all now" rather than waiting 
for retries to kick in.

In other words:
- Add a controller flag, "failfast_expired".
- When entering CONNECTING, schedule a delayed work item based on the
  failfast timeout value.
- On any transition out of CONNECTING, cancel the delayed work item and
  ensure failfast_expired is false.
- If the delayed work item expires: set the "failfast_expired" flag to
  true, then run through all in-flight I/Os and cancel them.
- Update nvmf_fail_nonready_command() (above) accordingly, but with a
  check on "!ctrl->failfast_expired".

-- james

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

Thread overview: 14+ messages
2019-12-01  7:59 [PATCH] nvme-fabrics: reject I/O to offline device Victor Gladkov
2019-12-02 22:26 ` Chaitanya Kulkarni
2019-12-02 22:47 ` James Smart
2019-12-03 10:04   ` Victor Gladkov
2019-12-03 16:19     ` James Smart
2019-12-04  8:28       ` Victor Gladkov
2019-12-06  0:38         ` James Smart
2019-12-06 22:18           ` Sagi Grimberg
2019-12-08 12:31             ` Hannes Reinecke
2019-12-09 15:30               ` Victor Gladkov
2019-12-17 18:03                 ` James Smart
2019-12-17 21:46                 ` Sagi Grimberg
2019-12-18 22:20                   ` James Smart [this message]
2019-12-15 12:33               ` Victor Gladkov
