Linux-NVME Archive on lore.kernel.org
* [PATCH] nvme-fabrics: reject I/O to offline device
@ 2019-12-01  7:59 Victor Gladkov
  2019-12-02 22:26 ` Chaitanya Kulkarni
  2019-12-02 22:47 ` James Smart
  0 siblings, 2 replies; 9+ messages in thread
From: Victor Gladkov @ 2019-12-01  7:59 UTC (permalink / raw)
  To: linux-nvme

Issue Description:
Commands get stuck while the host NVMe controller (TCP or RDMA) is in the reconnect state.
The NVMe controller enters the reconnect state when it loses the connection with the target. It then tries to reconnect every 10 seconds (default) until reconnection succeeds or the reconnect timeout is reached. The default reconnect timeout is 10 minutes.
This behavior differs from iSCSI, where commands issued during the reconnect state complete with the error "rejecting I/O to offline device".

Fix Description:
Added a kernel module parameter "nvmef_reconnect_failfast" to the nvme-fabrics module (default: true).
The parameter affects the decision whether to queue or retry an I/O command: it takes the controller's reconnect state into account, so that while the controller is reconnecting, I/O commands fail immediately (default) or when the I/O command timeout expires (depending on the parameter value), and I/O retries are prevented. As a result, commands no longer get stuck in the reconnect state.
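The parameter is read-only at runtime (S_IRUGO), so it is meant to be set at module load time, e.g. as nvme_fabrics.nvmef_reconnect_failfast=0 on the kernel command line or via a modprobe options line.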

branch nvme-5.5

---
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 74b8818..ef89aff 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -13,6 +13,10 @@
 #include "nvme.h"
 #include "fabrics.h"

+static bool nvmef_reconnect_failfast = true;
+module_param_named(nvmef_reconnect_failfast, nvmef_reconnect_failfast, bool, S_IRUGO);
+MODULE_PARM_DESC(nvmef_reconnect_failfast, "failfast flag for I/O when controller is reconnecting, else use I/O command timeout (default true).");
+
 static LIST_HEAD(nvmf_transports);
 static DECLARE_RWSEM(nvmf_transports_rwsem);

@@ -549,6 +553,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
 {
        if (ctrl->state != NVME_CTRL_DELETING &&
            ctrl->state != NVME_CTRL_DEAD &&
+           !(ctrl->state == NVME_CTRL_CONNECTING && (((ktime_get_ns() - rq->start_time_ns) > jiffies_to_nsecs(rq->timeout)) || nvmef_reconnect_failfast)) &&
            !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
                return BLK_STS_RESOURCE;


Regards,
Victor



* Re: [PATCH] nvme-fabrics: reject I/O to offline device
  2019-12-01  7:59 [PATCH] nvme-fabrics: reject I/O to offline device Victor Gladkov
@ 2019-12-02 22:26 ` Chaitanya Kulkarni
  2019-12-02 22:47 ` James Smart
  1 sibling, 0 replies; 9+ messages in thread
From: Chaitanya Kulkarni @ 2019-12-02 22:26 UTC (permalink / raw)
  To: Victor Gladkov, linux-nvme

On 12/01/2019 12:14 AM, Victor Gladkov wrote:
> Issue Description:
> Commands get stuck while the host NVMe controller (TCP or RDMA) is in the reconnect state.
> The NVMe controller enters the reconnect state when it loses the connection with the target. It then tries to reconnect every 10 seconds (default) until reconnection succeeds or the reconnect timeout is reached. The default reconnect timeout is 10 minutes.
> This behavior differs from iSCSI, where commands issued during the reconnect state complete with the error "rejecting I/O to offline device".


The reconnect delay and the controller loss timeout are configurable via the
connect options:

	{ NVMF_OPT_RECONNECT_DELAY,     "reconnect_delay=%d"    },
	{ NVMF_OPT_CTRL_LOSS_TMO,       "ctrl_loss_tmo=%d"      },


Is it possible to get the same behavior with a proper combination of the
above two values?
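
For context, roughly how the two combine in nvmf_parse_options() -- a
paraphrased sketch, not quoted verbatim from the tree:

	if (opts->ctrl_loss_tmo < 0)
		opts->max_reconnects = -1;	/* keep reconnecting forever */
	else
		opts->max_reconnects = DIV_ROUND_UP(opts->ctrl_loss_tmo,
						opts->reconnect_delay);

so a small ctrl_loss_tmo already bounds how long the controller can sit in
the reconnect state before it is torn down.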





* Re: [PATCH] nvme-fabrics: reject I/O to offline device
  2019-12-01  7:59 [PATCH] nvme-fabrics: reject I/O to offline device Victor Gladkov
  2019-12-02 22:26 ` Chaitanya Kulkarni
@ 2019-12-02 22:47 ` James Smart
  2019-12-03 10:04   ` Victor Gladkov
  1 sibling, 1 reply; 9+ messages in thread
From: James Smart @ 2019-12-02 22:47 UTC (permalink / raw)
  To: linux-nvme

On 11/30/2019 11:59 PM, Victor Gladkov wrote:
> Issue Description:
> Commands get stuck while the host NVMe controller (TCP or RDMA) is in the reconnect state.
> The NVMe controller enters the reconnect state when it loses the connection with the target. It then tries to reconnect every 10 seconds (default) until reconnection succeeds or the reconnect timeout is reached. The default reconnect timeout is 10 minutes.
> This behavior differs from iSCSI, where commands issued during the reconnect state complete with the error "rejecting I/O to offline device".
>
> Fix Description:
> Added a kernel module parameter "nvmef_reconnect_failfast" to the nvme-fabrics module (default: true).
> The parameter affects the decision whether to queue or retry an I/O command: it takes the controller's reconnect state into account, so that while the controller is reconnecting, I/O commands fail immediately (default) or when the I/O command timeout expires (depending on the parameter value), and I/O retries are prevented. As a result, commands no longer get stuck in the reconnect state.

This patch seems incorrect, at least as described. Multipathing 
inherently will "fastfail" and send to other paths. So the only way 
something is "stuck" is if it's the last path. If it is the last path, we 
definitely don't want to prematurely release I/O before we've given the 
subsystem every opportunity to reconnect.
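
(For reference, REQ_NVME_MPATH requests already skip the requeue branch in
nvmf_fail_nonready_command() and fall through to a fast failure -- roughly,
paraphrasing the current fabrics.c, details may differ:

	nvme_req(rq)->status = NVME_SC_HOST_PATH_ERROR;
	blk_mq_start_request(rq);
	nvme_complete_rq(rq);
	return BLK_STS_IOERR;

which is what lets multipath retry the command on another path right away.)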

What I hear you saying is you don't like the kernel default 
controller-loss-timeout of 600s. What was designed, if you didn't like 
the kernel default, was to use the per-connection "--ctrl-loss-tmo" 
option for "nvme connect".  The auto-connect scripts or the admin script 
that specifies the connection can set whatever value it likes.

If that seems hard to do, perhaps it's time to implement the options 
that allow for a config file to specify new values to be used on all 
connections, or on connections to specific subsystems, and so on. But I 
don't think the kernel needs to change.

-- james



>
> branch nvme-5.5
>
> ---
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 74b8818..ef89aff 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -13,6 +13,10 @@
>   #include "nvme.h"
>   #include "fabrics.h"
>
> +static bool nvmef_reconnect_failfast = true;
> +module_param_named(nvmef_reconnect_failfast, nvmef_reconnect_failfast, bool, S_IRUGO);
> +MODULE_PARM_DESC(nvmef_reconnect_failfast, "failfast flag for I/O when controller is reconnecting, else use I/O command timeout (default true).");
> +
>   static LIST_HEAD(nvmf_transports);
>   static DECLARE_RWSEM(nvmf_transports_rwsem);
>
> @@ -549,6 +553,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
>   {
>          if (ctrl->state != NVME_CTRL_DELETING &&
>              ctrl->state != NVME_CTRL_DEAD &&
> +           !(ctrl->state == NVME_CTRL_CONNECTING && (((ktime_get_ns() - rq->start_time_ns) > jiffies_to_nsecs(rq->timeout)) || nvmef_reconnect_failfast)) &&
>              !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>                  return BLK_STS_RESOURCE;
>
>
> Regards,
> Victor
>
>



* RE: [PATCH] nvme-fabrics: reject I/O to offline device
  2019-12-02 22:47 ` James Smart
@ 2019-12-03 10:04   ` Victor Gladkov
  2019-12-03 16:19     ` James Smart
  0 siblings, 1 reply; 9+ messages in thread
From: Victor Gladkov @ 2019-12-03 10:04 UTC (permalink / raw)
  To: James Smart, linux-nvme


On 12/03/2019 00:47 AM, James Smart wrote:
> On 11/30/2019 11:59 PM, Victor Gladkov wrote:
> > Issue Description:
> > Commands get stuck while the host NVMe controller (TCP or RDMA) is in the
> > reconnect state.
> > The NVMe controller enters the reconnect state when it loses the connection
> > with the target. It then tries to reconnect every 10 seconds (default) until
> > reconnection succeeds or the reconnect timeout is reached. The default
> > reconnect timeout is 10 minutes.
> > This behavior differs from iSCSI, where commands issued during the reconnect
> > state complete with the error "rejecting I/O to offline device".
> >
> > Fix Description:
> > Added a kernel module parameter "nvmef_reconnect_failfast" to the
> > nvme-fabrics module (default: true).
> > The parameter affects the decision whether to queue or retry an I/O command:
> > it takes the controller's reconnect state into account, so that while the
> > controller is reconnecting, I/O commands fail immediately (default) or when
> > the I/O command timeout expires (depending on the parameter value), and I/O
> > retries are prevented. As a result, commands no longer get stuck in the
> > reconnect state.
> 
> This patch seems incorrect, at least as described. Multipathing inherently
> will "fastfail" and send to other paths. So the only way something is "stuck"
> is if it's the last path. If it is the last path, we definitely don't want to
> prematurely release I/O before we've given the subsystem every opportunity to
> reconnect.
> 
> What I hear you saying is you don't like the kernel default controller-loss-
> timeout of 600s. What was designed, if you didn't like the kernel default, was
> to use the per-connection "--ctrl-loss-tmo"
> option for "nvme connect".  The auto-connect scripts or the admin script that
> specifies the connection can set whatever value it likes.
> 
> If that seems hard to do, perhaps it's time to implement the options that
> allow for a config file to specify new values to be used on all connections, or
> on connections to specific subsystems, and so on. But I don't think the kernel
> needs to change.
> 
> -- james

James, thank you for the suggestion.
But let me explain it differently.

Applications expect commands to complete with success or error
within a certain timeout (30 seconds by default).
The NVMe host enforces that timeout while it is connected; nevertheless,
during reconnection the timeout is not enforced, and commands may get stuck for a long period or even forever.

The controller-loss-timeout should not affect the I/O timeout policy; these are two different policies.
Regards,
Victor

* Re: [PATCH] nvme-fabrics: reject I/O to offline device
  2019-12-03 10:04   ` Victor Gladkov
@ 2019-12-03 16:19     ` James Smart
  2019-12-04  8:28       ` Victor Gladkov
  0 siblings, 1 reply; 9+ messages in thread
From: James Smart @ 2019-12-03 16:19 UTC (permalink / raw)
  To: Victor Gladkov, linux-nvme


On 12/3/2019 2:04 AM, Victor Gladkov wrote:
> James, thank you for the suggestion.
> But let me explain it differently.
>
> Applications expect commands to complete with success or error
> within a certain timeout (30 seconds by default).
> The NVMe host enforces that timeout while it is connected; nevertheless,
> during reconnection the timeout is not enforced, and commands may get stuck for a long period or even forever.
>
> The controller-loss-timeout should not affect the I/O timeout policy; these are two different policies.
>
> Regards,
> Victor

Ok - which says what does make sense to add is the portion:

   !(ctrl->state == NVME_CTRL_CONNECTING && ((ktime_get_ns() - rq->start_time_ns) > jiffies_to_nsecs(rq->timeout)))


But I don't think we need the failfast flag.

-- james



* RE: [PATCH] nvme-fabrics: reject I/O to offline device
  2019-12-03 16:19     ` James Smart
@ 2019-12-04  8:28       ` Victor Gladkov
  2019-12-06  0:38         ` James Smart
  0 siblings, 1 reply; 9+ messages in thread
From: Victor Gladkov @ 2019-12-04  8:28 UTC (permalink / raw)
  To: James Smart, linux-nvme

On 12/03/2019 06:19 PM, James Smart wrote:
> On 12/3/2019 2:04 AM, Victor Gladkov wrote:
> > The controller-loss-timeout should not affect the I/O timeout policy; these are
> > two different policies.

> 
> Ok - which says what does make sense to add is the portion:
> 
>    !(ctrl->state == NVME_CTRL_CONNECTING && ((ktime_get_ns() - rq->start_time_ns) > jiffies_to_nsecs(rq->timeout)))
> 
> 
> But I don't think we need the failfast flag.
> 
> -- james

OK. I think it's good enough.

This is updated patch:

---
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 74b8818..b58abc1 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -549,6 +549,8 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
 {
        if (ctrl->state != NVME_CTRL_DELETING &&
            ctrl->state != NVME_CTRL_DEAD &&
+           !(ctrl->state == NVME_CTRL_CONNECTING &&
+            ((ktime_get_ns() - rq->start_time_ns) > jiffies_to_nsecs(rq->timeout))) &&
            !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
                return BLK_STS_RESOURCE;

---

Regards,
Victor

* Re: [PATCH] nvme-fabrics: reject I/O to offline device
  2019-12-04  8:28       ` Victor Gladkov
@ 2019-12-06  0:38         ` James Smart
  2019-12-06 22:18           ` Sagi Grimberg
  0 siblings, 1 reply; 9+ messages in thread
From: James Smart @ 2019-12-06  0:38 UTC (permalink / raw)
  To: Victor Gladkov, linux-nvme



On 12/4/2019 12:28 AM, Victor Gladkov wrote:
> On 12/03/2019 06:19 PM, James Smart wrote:
>> On 12/3/2019 2:04 AM, Victor Gladkov wrote:
>>> The controller-loss-timeout should not affect the I/O timeout policy; these are
>>> two different policies.
>> Ok - which says what does make sense to add is the portion:
>>
>>     !(ctrl->state == NVME_CTRL_CONNECTING && ((ktime_get_ns() - rq->start_time_ns) > jiffies_to_nsecs(rq->timeout)))
>>
>>
>> But I don't think we need the failfast flag.
>>
>> -- james
> OK. I think, it's good enough.
>
> This is updated patch:
>
> ---
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 74b8818..b58abc1 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -549,6 +549,8 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
>   {
>          if (ctrl->state != NVME_CTRL_DELETING &&
>              ctrl->state != NVME_CTRL_DEAD &&
> +           !(ctrl->state == NVME_CTRL_CONNECTING &&
> +            ((ktime_get_ns() - rq->start_time_ns) > jiffies_to_nsecs(rq->timeout))) &&
>              !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>                  return BLK_STS_RESOURCE;
>

Did you test this to ensure it's doing what you expect? I'm not sure 
that all the timers are set right at this point. Most I/Os time out from 
a deadline time-stamped at blk_mq_start_request(). But that routine is 
actually called by the transports after the 
nvmf_check_ready/fail_nonready calls.  I.e. the I/O is not yet in flight, 
thus queued, and the blk-mq internal queuing doesn't count against the 
I/O timeout.  I can't see anything that guarantees start_time_ns is set.
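
(Roughly what I'm looking at, paraphrased from block/blk-mq.c of this era --
it may differ by version -- the stamp is taken at request allocation and only
conditionally:

	if (blk_mq_need_time_stamp(rq))
		rq->start_time_ns = ktime_get_ns();
	else
		rq->start_time_ns = 0;

so unless I/O stats or an I/O scheduler are in play, start_time_ns can simply
be 0.)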

-- james



* Re: [PATCH] nvme-fabrics: reject I/O to offline device
  2019-12-06  0:38         ` James Smart
@ 2019-12-06 22:18           ` Sagi Grimberg
  2019-12-08 12:31             ` Hannes Reinecke
  0 siblings, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2019-12-06 22:18 UTC (permalink / raw)
  To: James Smart, Victor Gladkov, linux-nvme


>> ---
>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>> index 74b8818..b58abc1 100644
>> --- a/drivers/nvme/host/fabrics.c
>> +++ b/drivers/nvme/host/fabrics.c
>> @@ -549,6 +549,8 @@ blk_status_t nvmf_fail_nonready_command(struct 
>> nvme_ctrl *ctrl,
>>   {
>>          if (ctrl->state != NVME_CTRL_DELETING &&
>>              ctrl->state != NVME_CTRL_DEAD &&
>> +           !(ctrl->state == NVME_CTRL_CONNECTING &&
>> +            ((ktime_get_ns() - rq->start_time_ns) > 
>> jiffies_to_nsecs(rq->timeout))) &&
>>              !blk_noretry_request(rq) && !(rq->cmd_flags & 
>> REQ_NVME_MPATH))
>>                  return BLK_STS_RESOURCE;
>>
> 
> Did you test this to ensure it's doing what you expect? I'm not sure 
> that all the timers are set right at this point. Most I/Os time out from 
> a deadline time-stamped at blk_mq_start_request(). But that routine is 
> actually called by the transports after the 
> nvmf_check_ready/fail_nonready calls.  I.e. the I/O is not yet in flight, 
> thus queued, and the blk-mq internal queuing doesn't count against the 
> I/O timeout.  I can't see anything that guarantees start_time_ns is set.

I'm not sure this behavior of failing I/O is always desired. Some
consumers would actually not want the I/O to fail prematurely if we
are not multipathing...

I think we need a fail_fast_tmo, set when establishing the controller,
to get it right.
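
Something along these lines, just a sketch -- the option name and plumbing
are invented here, nothing like it exists in the tree today:

	/* hypothetical new connect option, parsed next to ctrl_loss_tmo */
	{ NVMF_OPT_FAIL_FAST_TMO,	"fast_io_fail_tmo=%d"	},
	...
	case NVMF_OPT_FAIL_FAST_TMO:
		if (match_int(args, &token)) {
			ret = -EINVAL;
			goto out;
		}
		opts->fast_io_fail_tmo = token;	/* seconds, negative = disabled */
		break;

and nvmf_fail_nonready_command() could then key off how long the controller
has been CONNECTING versus opts->fast_io_fail_tmo, instead of looking at the
per-request timeout.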


* Re: [PATCH] nvme-fabrics: reject I/O to offline device
  2019-12-06 22:18           ` Sagi Grimberg
@ 2019-12-08 12:31             ` Hannes Reinecke
  0 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2019-12-08 12:31 UTC (permalink / raw)
  To: Sagi Grimberg, James Smart, Victor Gladkov, linux-nvme

On 12/6/19 11:18 PM, Sagi Grimberg wrote:
> 
>>> ---
>>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>>> index 74b8818..b58abc1 100644
>>> --- a/drivers/nvme/host/fabrics.c
>>> +++ b/drivers/nvme/host/fabrics.c
>>> @@ -549,6 +549,8 @@ blk_status_t nvmf_fail_nonready_command(struct 
>>> nvme_ctrl *ctrl,
>>>   {
>>>          if (ctrl->state != NVME_CTRL_DELETING &&
>>>              ctrl->state != NVME_CTRL_DEAD &&
>>> +           !(ctrl->state == NVME_CTRL_CONNECTING &&
>>> +            ((ktime_get_ns() - rq->start_time_ns) > 
>>> jiffies_to_nsecs(rq->timeout))) &&
>>>              !blk_noretry_request(rq) && !(rq->cmd_flags & 
>>> REQ_NVME_MPATH))
>>>                  return BLK_STS_RESOURCE;
>>>
>>
>> Did you test this to ensure it's doing what you expect? I'm not sure 
>> that all the timers are set right at this point. Most I/Os time out 
>> from a deadline time-stamped at blk_mq_start_request(). But that 
>> routine is actually called by the transports after the 
>> nvmf_check_ready/fail_nonready calls.  I.e. the I/O is not yet in 
>> flight, thus queued, and the blk-mq internal queuing doesn't count 
>> against the I/O timeout.  I can't see anything that guarantees 
>> start_time_ns is set.
> 
> I'm not sure this behavior of failing I/O is always desired. Some
> consumers would actually not want the I/O to fail prematurely if we
> are not multipathing...
> 
> I think we need a fail_fast_tmo, set when establishing the controller,
> to get it right.
> 
Agreed. This whole patch looks like someone is trying to reimplement 
fast_io_fail_tmo / dev_loss_tmo.
As we're moving into unreliable fabrics I guess we'll need a similar 
mechanism.
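
(In the SCSI transport classes, fast_io_fail_tmo bounds how long I/O may
stay queued against a blocked remote port before being failed upward, while
dev_loss_tmo decides when the port itself is finally torn down -- two
separate policies, which is roughly the split being asked for here.)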

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

