* [PATCH v9] nvme-fabrics: reject I/O to offline device
@ 2020-09-29 15:27 Victor Gladkov
  2020-09-29 18:19 ` Sagi Grimberg
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Victor Gladkov @ 2020-09-29 15:27 UTC (permalink / raw)
  To: linux-nvme
  Cc: Sagi Grimberg, James Smart, Ewan D. Milne, Hannes Reinecke,
	Keith Busch, Christoph Hellwig

Commands get stuck while the host NVMe-oF controller is in the reconnect
state. The NVMe controller enters the reconnect state when it loses its
connection to the target. It then tries to reconnect every 10 seconds
(by default) until it reconnects successfully or the reconnect timeout is
reached. The default reconnect timeout is 10 minutes.
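For reference, the number of reconnect attempts follows from ctrl_loss_tmo and
reconnect_delay, as in the nvmf_parse_options() hunk further down. A minimal
standalone C sketch of that arithmetic (user-space, not kernel code), using the
kernel's DIV_ROUND_UP idiom:

```c
/* Standalone sketch (not kernel code) of how the host derives the
 * number of reconnect attempts from ctrl_loss_tmo and reconnect_delay. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Returns -1 to mean "reconnect forever", matching max_reconnects = -1. */
static int max_reconnects(int ctrl_loss_tmo, int reconnect_delay)
{
	if (ctrl_loss_tmo < 0)
		return -1;
	return DIV_ROUND_UP(ctrl_loss_tmo, reconnect_delay);
}
```

With the defaults (600-second loss timeout, 10-second delay) this yields 60
reconnect attempts.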

Applications expect commands to complete with success or error within a
certain timeout (30 seconds by default). The NVMe host enforces that timeout
while it is connected; nevertheless, during reconnection the timeout is not
enforced, and commands may get stuck for a long period or even forever.

To avoid this long delay due to the default timeout, we introduce a new
session parameter, "fast_io_fail_tmo". The timeout is measured in seconds
from the start of the controller reconnect; any command outstanding beyond
that timeout is rejected. The new parameter value may be passed during
'connect'. The default value of -1 means no timeout (matching the current
behavior).

We add a new controller flag, NVME_CTRL_FAILFAST_EXPIRED, and a respective
delayed work item that sets it.

When the controller enters the CONNECTING state, we schedule the
delayed_work based on the failfast timeout value. On a transition out of
CONNECTING, we cancel the delayed work item and ensure
NVME_CTRL_FAILFAST_EXPIRED is cleared. If the delayed work item expires,
the NVME_CTRL_FAILFAST_EXPIRED flag is set.
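As a sanity check of that lifecycle, here is a small standalone C model
(user-space, with hypothetical names that merely mirror the patch, not the
kernel implementation): entering CONNECTING arms the timer unless
fast_io_fail_tmo is -1, leaving CONNECTING cancels it and clears the flag,
and expiry sets the flag.

```c
#include <stdbool.h>

enum ctrl_state { CTRL_NEW, CTRL_LIVE, CTRL_RESETTING, CTRL_CONNECTING };

struct ctrl {
	enum ctrl_state state;
	bool failfast_work_pending; /* stands in for the delayed_work item */
	bool failfast_expired;      /* stands in for NVME_CTRL_FAILFAST_EXPIRED */
	int fast_io_fail_tmo;       /* seconds; -1 disables the mechanism */
};

/* Entering CONNECTING: arm the timer unless the feature is disabled. */
static void enter_connecting(struct ctrl *c)
{
	c->state = CTRL_CONNECTING;
	if (c->fast_io_fail_tmo != -1)
		c->failfast_work_pending = true;
}

/* Leaving CONNECTING (e.g. back to LIVE): cancel work, clear the flag. */
static void enter_live(struct ctrl *c)
{
	c->state = CTRL_LIVE;
	c->failfast_work_pending = false;
	c->failfast_expired = false;
}

/* The delayed work firing after fast_io_fail_tmo seconds. */
static void failfast_work_fires(struct ctrl *c)
{
	if (!c->failfast_work_pending || c->state != CTRL_CONNECTING)
		return;
	c->failfast_work_pending = false;
	c->failfast_expired = true;
}
```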

We also update the nvmf_fail_nonready_command() and
nvme_available_path() functions to check the
NVME_CTRL_FAILFAST_EXPIRED controller flag.

For multipath (nvme_available_path()):
a path is not considered available if the
NVME_CTRL_FAILFAST_EXPIRED controller flag is set
and the controller is in the NVME_CTRL_CONNECTING state.
This prevents commands from getting stuck
when the available paths have been trying to reconnect for too long.
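The same rule can be stated as a tiny standalone C predicate (user-space, with
hypothetical names; the real check is the nvme_available_path() hunk in the
diff): a CONNECTING path counts as available only while its failfast timer has
not expired.

```c
#include <stdbool.h>

enum path_state { PATH_LIVE, PATH_RESETTING, PATH_CONNECTING, PATH_DEAD };

struct path {
	enum path_state state;
	bool failfast_expired;
};

/* A path is usable while LIVE, RESETTING, or still CONNECTING --
 * unless its failfast timer has already expired. */
static bool path_is_available(const struct path *p)
{
	switch (p->state) {
	case PATH_CONNECTING:
		return !p->failfast_expired;
	case PATH_LIVE:
	case PATH_RESETTING:
		return true;
	default:
		return false;
	}
}

/* I/O can proceed as long as at least one path is available. */
static bool any_available_path(const struct path *paths, int n)
{
	for (int i = 0; i < n; i++)
		if (path_is_available(&paths[i]))
			return true;
	return false;
}
```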

Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>

---
Changes from V8:

1. Added multipath behavior description as requested by Sagi Grimberg
    (Fri Sep 18 16:38:58 EDT 2020)

Changes from V7:

1. Expanded the patch description as requested by James Smart
    (Thu Aug 13 11:00:25 EDT 2020)

Changes from V6:

1. Changed according to Hannes Reinecke review:
    in nvme_start_failfast_work() and nvme_stop_failfast_work() procedures.

Changes from V5:

1. Drop the "off" string option for fast_io_fail_tmo.

Changes from V4:

1. Remove subsysnqn from dev_info; just keep "failfast expired"
    in nvme_failfast_work().
2. Remove excess lock in nvme_failfast_work().
3. Change "timeout disabled" value to -1, '0' is to fail I/O right away now.
4. Add "off" string for fast_io_fail_tmo option as a -1 equivalent.

Changes from V3:

1. BUG_FIX: Fix a bug in nvme_start_failfast_work() and
    nvme_stop_failfast_work() when accessing ctrl->opts: it will fail
    for the PCIe transport when nvme_change_ctrl_state() is called
    from nvme_reset_work(), since we don't set ctrl->opts for the
    PCIe transport.
2. Line wrap in nvme_start_failfast_work(), nvme_parse_option() and
    for macro NVMF_ALLOWED_OPTS definition.
3. Just like all the state change code add a switch for newly
    added state handling outside of the state machine in
    nvme_state_change().
4. nvme_available_path() add /* fallthru */ after if..break inside
    switch which is under list_for_each_entry_rcu().
5. Align the newly added nvmf_ctrl_options member fast_io_fail_tmo.
6. Fix the tabs before if in nvme_available_path() and line wrap
    for the same.
7. In nvme_failfast_work() use state != NVME_CTRL_CONNECTING
    instead of == to get rid of the parentheses and avoid char > 80.
8. Get rid of the ";" at the end of the comment for @fast_io_fail_tmo.
9. Change the commit log style to match the one we have in the NVMe
    repo.

Changes from V2:

1. Several coding style and small fixes.
2. Fix the comment for NVMF_DEF_FAIL_FAST_TMO.
3. Don't call functionality from the state machine.
4. Rename fast_fail_tmo -> fast_io_fail_tmo to match SCSI
    semantics.

Changes from V1:

1. Add a new session parameter called "fast_fail_tmo". The timeout
    is measured in seconds from the controller reconnect, any command
    beyond that timeout is rejected. The new parameter value may be
    passed during 'connect', and its default value is 30 seconds.
    A value of 0 means no timeout (similar to current behavior).
2. Add a controller flag of "failfast_expired".
3. Add a dedicated delayed_work that updates the "failfast_expired"
    controller flag.
4. When entering CONNECTING, schedule the delayed_work based on
    failfast timeout value. If transition out of CONNECTING, terminate
    delayed work item and ensure failfast_expired is false.
    If delayed work item expires: set "failfast_expired" flag to true.
5. Update nvmf_fail_nonready_command() to check the
    "failfast_expired" controller flag.

---
 drivers/nvme/host/core.c      | 49 ++++++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/fabrics.c   | 25 +++++++++++++++++++---
 drivers/nvme/host/fabrics.h   |  5 +++++
 drivers/nvme/host/multipath.c |  5 ++++-
 drivers/nvme/host/nvme.h      |  3 +++
 5 files changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f3c037f..ca990bb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -137,6 +137,37 @@ int nvme_try_sched_reset(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_try_sched_reset);

+static void nvme_failfast_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
+			struct nvme_ctrl, failfast_work);
+
+	if (ctrl->state != NVME_CTRL_CONNECTING)
+		return;
+
+	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+	dev_info(ctrl->device, "failfast expired\n");
+	nvme_kick_requeue_lists(ctrl);
+}
+
+static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl)
+{
+	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == -1)
+		return;
+
+	schedule_delayed_work(&ctrl->failfast_work,
+			      ctrl->opts->fast_io_fail_tmo * HZ);
+}
+
+static inline void nvme_stop_failfast_work(struct nvme_ctrl *ctrl)
+{
+	if (!ctrl->opts)
+		return;
+
+	cancel_delayed_work_sync(&ctrl->failfast_work);
+	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+}
+
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
 {
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
@@ -387,8 +418,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	}

 	spin_unlock_irqrestore(&ctrl->lock, flags);
-	if (changed && ctrl->state == NVME_CTRL_LIVE)
+	if (changed) {
+		switch (ctrl->state) {
+			case NVME_CTRL_LIVE:
+				if (old_state == NVME_CTRL_CONNECTING)
+					nvme_stop_failfast_work(ctrl);
  				nvme_kick_requeue_lists(ctrl);
+				break;
+			case NVME_CTRL_CONNECTING:
+				if (old_state == NVME_CTRL_RESETTING)
+					nvme_start_failfast_work(ctrl);
+				break;
+			default:
+				break;
+		}
+	}
 	return changed;
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -4045,6 +4089,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 {
 	nvme_mpath_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
+	nvme_stop_failfast_work(ctrl);
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fw_act_work);
 }
@@ -4111,6 +4156,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	int ret;

 	ctrl->state = NVME_CTRL_NEW;
+	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
 	spin_lock_init(&ctrl->lock);
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
@@ -4125,6 +4171,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	init_waitqueue_head(&ctrl->state_wq);

 	INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
+	INIT_DELAYED_WORK(&ctrl->failfast_work, nvme_failfast_work);
 	memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
 	ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 2a6c819..4afe173 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -549,6 +549,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
 {
 	if (ctrl->state != NVME_CTRL_DELETING &&
 	    ctrl->state != NVME_CTRL_DEAD &&
+	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
 	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
 		return BLK_STS_RESOURCE;

@@ -612,6 +613,7 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	{ NVMF_OPT_NR_WRITE_QUEUES,	"nr_write_queues=%d"	},
 	{ NVMF_OPT_NR_POLL_QUEUES,	"nr_poll_queues=%d"	},
 	{ NVMF_OPT_TOS,			"tos=%d"		},
+	{ NVMF_OPT_FAIL_FAST_TMO,	"fast_io_fail_tmo=%d"	},
 	{ NVMF_OPT_ERR,			NULL			}
 };

@@ -631,6 +633,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
 	opts->kato = NVME_DEFAULT_KATO;
 	opts->duplicate_connect = false;
+	opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO;
 	opts->hdr_digest = false;
 	opts->data_digest = false;
 	opts->tos = -1; /* < 0 == use transport default */
@@ -751,6 +754,17 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 				pr_warn("ctrl_loss_tmo < 0 will reconnect forever\n");
 			ctrl_loss_tmo = token;
 			break;
+		case NVMF_OPT_FAIL_FAST_TMO:
+			if (match_int(args, &token)) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			if (token >= 0)
+				pr_warn("I/O will fail on reconnecting controller after %d sec\n",
+					token);
+			opts->fast_io_fail_tmo = token;
+			break;
 		case NVMF_OPT_HOSTNQN:
 			if (opts->host) {
 				pr_err("hostnqn already user-assigned: %s\n",
@@ -881,11 +895,16 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		opts->nr_poll_queues = 0;
 		opts->duplicate_connect = true;
 	}
-	if (ctrl_loss_tmo < 0)
+	if (ctrl_loss_tmo < 0) {
 		opts->max_reconnects = -1;
-	else
+	} else {
 		opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
 						opts->reconnect_delay);
+		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
+		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
+			pr_warn("failfast tmo (%d) larger than controller loss tmo (%d)\n",
+				opts->fast_io_fail_tmo, ctrl_loss_tmo);
+	}

 	if (!opts->host) {
 		kref_get(&nvmf_default_host->ref);
@@ -985,7 +1004,7 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
 #define NVMF_ALLOWED_OPTS	(NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
 				 NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
 				 NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
-				 NVMF_OPT_DISABLE_SQFLOW)
+				 NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_FAIL_FAST_TMO)

 static struct nvme_ctrl *
 nvmf_create_ctrl(struct device *dev, const char *buf)
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index a0ec40a..05a1158 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -15,6 +15,8 @@
 #define NVMF_DEF_RECONNECT_DELAY	10
 /* default to 600 seconds of reconnect attempts before giving up */
 #define NVMF_DEF_CTRL_LOSS_TMO		600
+/* default is -1: the fail fast mechanism is disabled */
+#define NVMF_DEF_FAIL_FAST_TMO		-1

 /*
  * Define a host as seen by the target.  We allocate one at boot, but also
@@ -56,6 +58,7 @@ enum {
 	NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
 	NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
 	NVMF_OPT_TOS		= 1 << 19,
+	NVMF_OPT_FAIL_FAST_TMO	= 1 << 20,
 };

 /**
@@ -89,6 +92,7 @@ enum {
  * @nr_write_queues: number of queues for write I/O
  * @nr_poll_queues: number of queues for polling I/O
  * @tos: type of service
+ * @fast_io_fail_tmo: Fast I/O fail timeout in seconds
  */
 struct nvmf_ctrl_options {
 	unsigned		mask;
@@ -111,6 +115,7 @@ struct nvmf_ctrl_options {
 	unsigned int		nr_write_queues;
 	unsigned int		nr_poll_queues;
 	int			tos;
+	int			fast_io_fail_tmo;
 };

 /*
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 54603bd..d8b7f45 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -278,9 +278,12 @@ static bool nvme_available_path(struct nvme_ns_head *head)

 	list_for_each_entry_rcu(ns, &head->list, siblings) {
 		switch (ns->ctrl->state) {
+		case NVME_CTRL_CONNECTING:
+			if (test_bit(NVME_CTRL_FAILFAST_EXPIRED,
+				     &ns->ctrl->flags))
+				break;
 		case NVME_CTRL_LIVE:
 		case NVME_CTRL_RESETTING:
-		case NVME_CTRL_CONNECTING:
 			/* fallthru */
 			return true;
 		default:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2e04a36..9112886 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -256,6 +256,7 @@ struct nvme_ctrl {
 	struct work_struct scan_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
+	struct delayed_work failfast_work;
 	struct nvme_command ka_cmd;
 	struct work_struct fw_act_work;
 	unsigned long events;
@@ -290,6 +291,8 @@ struct nvme_ctrl {
 	u16 icdoff;
 	u16 maxcmd;
 	int nr_reconnects;
+	unsigned long flags;
+#define NVME_CTRL_FAILFAST_EXPIRED	0
 	struct nvmf_ctrl_options *opts;

 	struct page *discard_page;
--
1.8.3.1

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [PATCH v9] nvme-fabrics: reject I/O to offline device
  2020-09-29 15:27 [PATCH v9] nvme-fabrics: reject I/O to offline device Victor Gladkov
@ 2020-09-29 18:19 ` Sagi Grimberg
  2020-09-30  5:46 ` Hannes Reinecke
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Sagi Grimberg @ 2020-09-29 18:19 UTC (permalink / raw)
  To: Victor Gladkov, linux-nvme
  Cc: Keith Busch, Christoph Hellwig, James Smart, Ewan D. Milne,
	Hannes Reinecke

Overall I am fine with this version.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>




* Re: [PATCH v9] nvme-fabrics: reject I/O to offline device
  2020-09-29 15:27 [PATCH v9] nvme-fabrics: reject I/O to offline device Victor Gladkov
  2020-09-29 18:19 ` Sagi Grimberg
@ 2020-09-30  5:46 ` Hannes Reinecke
  2020-09-30  6:39 ` Christoph Hellwig
  2020-10-01  8:55 ` Hannes Reinecke
  3 siblings, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2020-09-30  5:46 UTC (permalink / raw)
  To: Victor Gladkov, linux-nvme
  Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Ewan D. Milne,
	James Smart

On 9/29/20 5:27 PM, Victor Gladkov wrote:
> Commands get stuck while Host NVMe-oF controller is in reconnect state.
> NVMe controller enters into reconnect state when it loses connection with the
> target. It tries to reconnect every 10 seconds
> (default) until successful reconnection or until reconnect time-out is reached.
> The default reconnect time out is 10 minutes.
> 
> Applications are expecting commands to complete with success or error within
> a certain timeout (30 seconds by default).  The NVMe host is enforcing that
> timeout while it is connected, never the less, during reconnection, the timeout
> is not enforced and commands may get stuck for a long period or even forever.
> 
> To fix this long delay due to the default timeout we introduce new session
> parameter "fast_io_fail_tmo". The timeout is measured in seconds from the
> controller reconnect, any command beyond that timeout is rejected. The new
> parameter value may be passed during 'connect'.
> The default value of -1 means no timeout (similar to current behavior).
> 
> We add a new controller NVME_CTRL_FAILFAST_EXPIRED and respective
> delayed work that updates the NVME_CTRL_FAILFAST_EXPIRED flag.
> 
> When the controller is entering the CONNECTING state, we schedule the
> delayed_work based on failfast timeout value. If the transition is out of
> CONNECTING, terminate delayed work item and ensure failfast_expired is
> false. If delayed work item expires then set "NVME_CTRL_FAILFAST_EXPIRED"
> flag to true.
> 
> We also update nvmf_fail_nonready_command() and
> nvme_available_path() functions with check the
> "NVME_CTRL_FAILFAST_EXPIRED" controller flag.
> 
> For multipath (function nvme_available_path()):
> The path will not be considered available if
> "NVME_CTRL_FAILFAST_EXPIRED" controller flag  is set
> and the controller in NVME_CTRL_CONNECTING state.
> This prevents commands from getting stuck
> when available paths have tried to reconnect for too long.
> 
> Signed-off-by: Victor Gladkov <victor.gladkov at kioxia.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni at wdc.com>
> Reviewed-by: Hannes Reinecke <hare at suse.de>
> 
> ---
> Changes from V8:
> 
> 1. Added multipath behavior description as requested by Sagi Grimberg
>      (Fri Sep 18 16:38:58 EDT 2020)
> 
> Changes from V7:
> 
> 1. Expanded the patch description as requested by James Smart
>      (Thu Aug 13 11:00:25 EDT 2020)
> 
> Changes from V6:
> 
> 1. Changed according to Hannes Reinecke review:
>      in nvme_start_failfast_work() and nvme_stop_failfast_work() procedures.
> 
> Changes from V5:
> 
> 1. Drop the "off" string option for fast_io_fail_tmo.
> 
> Changes from V4:
> 
> 1. Remove subsysnqn from dev_info just keep "failfast expired"
>      in nvme_failfast_work().
> 2. Remove excess lock in nvme_failfast_work().
> 3. Change "timeout disabled" value to -1, '0' is to fail I/O right away now.
> 4. Add "off" string for fast_io_fail_tmo option as a -1 equivalent.
> 
> Changes from V3:
> 
> 1. BUG_FIX: Fix a bug in nvme_start_failfast_work() and
>      nvme_stop_failfast_work() when accessing ctrl->opts, as it will fail
>      for the PCIe transport when nvme_change_ctrl_state() is called
>      from nvme_reset_work(), since we don't set ctrl->opts for the
>      PCIe transport.
> 2. Line wrap in nvme_start_failfast_work(), nvme_parse_option() and
>      for macro NVMF_ALLOWED_OPTS definition.
> 3. Just like all the state change code add a switch for newly
>      added state handling outside of the state machine in
>      nvme_state_change().
> 4. nvme_available_path() add /* fallthru */ after if..break inside
>      switch which is under list_for_each_entry_rcu().
> 5. Align newly added member nvmf_ctrl_options fast_io_fail_tmp.
> 6. Fix the tabs before if in nvme_available_path() and line wrap
>      for the same.
> 7. In nvme_failfast_work() use state != NVME_CTRL_CONNECTING
>      instead of == to get rid of the parentheses and avoid char > 80.
> 8. Get rid of the ";" at the end of the comment for @fast_io_fail_tmp.
> 9. Change the commit log style to match the one we have in the NVMe
>      repo.
> 
> Changes from V2:
> 
> 1. Several coding style and small fixes.
> 2. Fix the comment for NVMF_DEF_FAIL_FAST_TMO.
> 3. Don't call functionality from the state machine.
> 4. Rename fast_fail_tmo -> fast_io_fail_tmo to match SCSI
>      semantics.
> 
> Changes from V1:
> 
> 1. Add a new session parameter called "fast_fail_tmo". The timeout
>      is measured in seconds from the controller reconnect, any command
>      beyond that timeout is rejected. The new parameter value may be
>      passed during 'connect', and its default value is 30 seconds.
>      A value of 0 means no timeout (similar to current behavior).
> 2. Add a controller flag of "failfast_expired".
> 3. Add dedicated delayed_work that update the "failfast_expired"
>      controller flag.
> 4. When entering CONNECTING, schedule the delayed_work based on
>      failfast timeout value. If transition out of CONNECTING, terminate
>      delayed work item and ensure failfast_expired is false.
>      If delayed work item expires: set "failfast_expired" flag to true.
> 5. Update nvmf_fail_nonready_command() to check the
>      "failfast_expired" controller flag.
> 
> ---
>   drivers/nvme/host/core.c      | 49
> ++++++++++++++++++++++++++++++++++++++++++-
>   drivers/nvme/host/fabrics.c   | 25 +++++++++++++++++++---
>   drivers/nvme/host/fabrics.h   |  5 +++++
>   drivers/nvme/host/multipath.c |  5 ++++-
>   drivers/nvme/host/nvme.h      |  3 +++
>   5 files changed, 82 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index
> f3c037f..ca990bb 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -137,6 +137,37 @@ int nvme_try_sched_reset(struct nvme_ctrl *ctrl)  }
> EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
> 
> +static void nvme_failfast_work(struct work_struct *work) {
> +	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
> +			struct nvme_ctrl, failfast_work);
> +
> +	if (ctrl->state != NVME_CTRL_CONNECTING)
> +		return;
> +
> +	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> +	dev_info(ctrl->device, "failfast expired\n");
> +	nvme_kick_requeue_lists(ctrl);
> +}
> +
> +static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl) {
> +	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == -1)
> +		return;
> +
> +	schedule_delayed_work(&ctrl->failfast_work,
> +			      ctrl->opts->fast_io_fail_tmo * HZ); }
> +
> +static inline void nvme_stop_failfast_work(struct nvme_ctrl *ctrl) {
> +	if (!ctrl->opts)
> +		return;
> +
> +	cancel_delayed_work_sync(&ctrl->failfast_work);
> +	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags); }
> +

Formatting; closing brace needs to go on a separate line.

>   int nvme_reset_ctrl(struct nvme_ctrl *ctrl)  {
>   	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)) @@ -387,8
> +418,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>   	}
> 
>   	spin_unlock_irqrestore(&ctrl->lock, flags);
> -	if (changed && ctrl->state == NVME_CTRL_LIVE)
> +	if (changed) {
> +		switch (ctrl->state) {
> +			case NVME_CTRL_LIVE:
> +				if (old_state == NVME_CTRL_CONNECTING)
> +					nvme_stop_failfast_work(ctrl);
>    				nvme_kick_requeue_lists(ctrl);
> +				break;
> +			case NVME_CTRL_CONNECTING:
> +				if (old_state == NVME_CTRL_RESETTING)
> +					nvme_start_failfast_work(ctrl);
> +				break;
> +			default:
> +				break;
> +		}
> +	}
>   	return changed;
>   }
>   EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
> @@ -4045,6 +4089,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)  {
>   	nvme_mpath_stop(ctrl);
>   	nvme_stop_keep_alive(ctrl);
> +	nvme_stop_failfast_work(ctrl);
>   	flush_work(&ctrl->async_event_work);
>   	cancel_work_sync(&ctrl->fw_act_work);
>   }
> @@ -4111,6 +4156,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct
> device *dev,
>   	int ret;
> 
>   	ctrl->state = NVME_CTRL_NEW;
> +	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
>   	spin_lock_init(&ctrl->lock);
>   	mutex_init(&ctrl->scan_lock);
>   	INIT_LIST_HEAD(&ctrl->namespaces);
> @@ -4125,6 +4171,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct
> device *dev,
>   	init_waitqueue_head(&ctrl->state_wq);
> 
>   	INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
> +	INIT_DELAYED_WORK(&ctrl->failfast_work, nvme_failfast_work);
>   	memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
>   	ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
> 
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index
> 2a6c819..4afe173 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -549,6 +549,7 @@ blk_status_t nvmf_fail_nonready_command(struct
> nvme_ctrl *ctrl,  {
>   	if (ctrl->state != NVME_CTRL_DELETING &&
>   	    ctrl->state != NVME_CTRL_DEAD &&
> +	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
>   	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>   		return BLK_STS_RESOURCE;
> 
> @@ -612,6 +613,7 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl,
> struct request *rq,
>   	{ NVMF_OPT_NR_WRITE_QUEUES,	"nr_write_queues=%d"	},
>   	{ NVMF_OPT_NR_POLL_QUEUES,	"nr_poll_queues=%d"	},
>   	{ NVMF_OPT_TOS,			"tos=%d"		},
> +	{ NVMF_OPT_FAIL_FAST_TMO,	"fast_io_fail_tmo=%d"	},
>   	{ NVMF_OPT_ERR,			NULL			}
>   };
> 
> @@ -631,6 +633,7 @@ static int nvmf_parse_options(struct
> nvmf_ctrl_options *opts,
>   	opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
>   	opts->kato = NVME_DEFAULT_KATO;
>   	opts->duplicate_connect = false;
> +	opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO;
>   	opts->hdr_digest = false;
>   	opts->data_digest = false;
>   	opts->tos = -1; /* < 0 == use transport default */ @@ -751,6 +754,17
> @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>   				pr_warn("ctrl_loss_tmo < 0 will reconnect
> forever\n");
>   			ctrl_loss_tmo = token;
>   			break;
> +		case NVMF_OPT_FAIL_FAST_TMO:
> +			if (match_int(args, &token)) {
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +
> +			if (token >= 0)
> +				pr_warn("I/O will fail on reconnect controller after"
> +					" %d sec\n", token);
> +			opts->fast_io_fail_tmo = token;
> +			break;
>   		case NVMF_OPT_HOSTNQN:
>   			if (opts->host) {
>   				pr_err("hostnqn already user-assigned: %s\n", @@ -
> 881,11 +895,16 @@ static int nvmf_parse_options(struct nvmf_ctrl_options
> *opts,
>   		opts->nr_poll_queues = 0;
>   		opts->duplicate_connect = true;
>   	}
> -	if (ctrl_loss_tmo < 0)
> +	if (ctrl_loss_tmo < 0) {
>   		opts->max_reconnects = -1;
> -	else
> +	} else {

Formatting: no need for braces around single-line statements.

>   		opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
>   						opts->reconnect_delay);
> +		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
> +			pr_warn("failfast tmo (%d) larger than controller "
> +				"loss tmo (%d)\n",
> +				opts->fast_io_fail_tmo, ctrl_loss_tmo);
> +	}

If we already check for that condition, shouldn't we disable 
fast_io_fail_tmo in that situation to clarify things?

> 
>   	if (!opts->host) {
>   		kref_get(&nvmf_default_host->ref);
> @@ -985,7 +1004,7 @@ void nvmf_free_options(struct nvmf_ctrl_options
> *opts)
>   #define NVMF_ALLOWED_OPTS	(NVMF_OPT_QUEUE_SIZE |
> NVMF_OPT_NR_IO_QUEUES | \
>   				 NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
>   				 NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT
> |\
> -				 NVMF_OPT_DISABLE_SQFLOW)
> +				 NVMF_OPT_DISABLE_SQFLOW |
> NVMF_OPT_FAIL_FAST_TMO)
> 
>   static struct nvme_ctrl *
>   nvmf_create_ctrl(struct device *dev, const char *buf) diff --git
> a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index
> a0ec40a..05a1158 100644
> --- a/drivers/nvme/host/fabrics.h
> +++ b/drivers/nvme/host/fabrics.h
> @@ -15,6 +15,8 @@
>   #define NVMF_DEF_RECONNECT_DELAY	10
>   /* default to 600 seconds of reconnect attempts before giving up */
>   #define NVMF_DEF_CTRL_LOSS_TMO		600
> +/* default is -1: the fail fast mechanism is disable  */
> +#define NVMF_DEF_FAIL_FAST_TMO		-1
> 

disabled

>   /*
>    * Define a host as seen by the target.  We allocate one at boot, but also @@ -
> 56,6 +58,7 @@ enum {
>   	NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
>   	NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
>   	NVMF_OPT_TOS		= 1 << 19,
> +	NVMF_OPT_FAIL_FAST_TMO	= 1 << 20,
>   };
> 
>   /**
> @@ -89,6 +92,7 @@ enum {
>    * @nr_write_queues: number of queues for write I/O
>    * @nr_poll_queues: number of queues for polling I/O
>    * @tos: type of service
> + * @fast_io_fail_tmo: Fast I/O fail timeout in seconds
>    */
>   struct nvmf_ctrl_options {
>   	unsigned		mask;
> @@ -111,6 +115,7 @@ struct nvmf_ctrl_options {
>   	unsigned int		nr_write_queues;
>   	unsigned int		nr_poll_queues;
>   	int			tos;
> +	int			fast_io_fail_tmo;
>   };
> 
>   /*
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 54603bd..d8b7f45 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -278,9 +278,12 @@ static bool nvme_available_path(struct
> nvme_ns_head *head)
> 
>   	list_for_each_entry_rcu(ns, &head->list, siblings) {
>   		switch (ns->ctrl->state) {
> +		case NVME_CTRL_CONNECTING:
> +			if (test_bit(NVME_CTRL_FAILFAST_EXPIRED,
> +				     &ns->ctrl->flags))
> +				break;

No. We shouldn't select this path, but that doesn't mean that all other 
paths in this list can't be selected, either; they might be coming from
different controllers.
Please use 'continue' here.

>   		case NVME_CTRL_LIVE:
>   		case NVME_CTRL_RESETTING:
> -		case NVME_CTRL_CONNECTING:
>   			/* fallthru */
>   			return true;
>   		default:
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index
> 2e04a36..9112886 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -256,6 +256,7 @@ struct nvme_ctrl {
>   	struct work_struct scan_work;
>   	struct work_struct async_event_work;
>   	struct delayed_work ka_work;
> +	struct delayed_work failfast_work;
>   	struct nvme_command ka_cmd;
>   	struct work_struct fw_act_work;
>   	unsigned long events;
> @@ -290,6 +291,8 @@ struct nvme_ctrl {
>   	u16 icdoff;
>   	u16 maxcmd;
>   	int nr_reconnects;
> +	unsigned long flags;
> +#define NVME_CTRL_FAILFAST_EXPIRED	0
>   	struct nvmf_ctrl_options *opts;
> 
>   	struct page *discard_page;
> --
> 1.8.3.1
> 
Otherwise looks good.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v9] nvme-fabrics: reject I/O to offline device
  2020-09-29 15:27 [PATCH v9] nvme-fabrics: reject I/O to offline device Victor Gladkov
  2020-09-29 18:19 ` Sagi Grimberg
  2020-09-30  5:46 ` Hannes Reinecke
@ 2020-09-30  6:39 ` Christoph Hellwig
  2020-10-01  8:55 ` Hannes Reinecke
  3 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2020-09-30  6:39 UTC (permalink / raw)
  To: Victor Gladkov
  Cc: Sagi Grimberg, James Smart, linux-nvme, Hannes Reinecke,
	Keith Busch, Christoph Hellwig, Ewan D. Milne

The diff seems malformed, probably mangled by your mailer.  Can you
resend a working diff?  Feel free to fix up the obvious style problems,
otherwise I can do that as well.


* Re: [PATCH v9] nvme-fabrics: reject I/O to offline device
  2020-09-29 15:27 [PATCH v9] nvme-fabrics: reject I/O to offline device Victor Gladkov
                   ` (2 preceding siblings ...)
  2020-09-30  6:39 ` Christoph Hellwig
@ 2020-10-01  8:55 ` Hannes Reinecke
  2020-11-15 15:45   ` Victor Gladkov
  3 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2020-10-01  8:55 UTC (permalink / raw)
  To: Victor Gladkov, linux-nvme
  Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Ewan D. Milne,
	James Smart

[-- Attachment #1: Type: text/plain, Size: 7349 bytes --]

On 9/29/20 5:27 PM, Victor Gladkov wrote:
> Commands get stuck while Host NVMe-oF controller is in reconnect state.
> The NVMe controller enters the reconnect state when it loses connection with the
> target. It tries to reconnect every 10 seconds
> (default) until reconnection succeeds or the reconnect timeout is reached.
> The default reconnect timeout is 10 minutes.
> 
> Applications expect commands to complete with success or error within
> a certain timeout (30 seconds by default).  The NVMe host enforces that
> timeout while it is connected; nevertheless, during reconnection, the timeout
> is not enforced and commands may get stuck for a long period or even forever.
> 
> To fix this long delay due to the default timeout, we introduce a new session
> parameter "fast_io_fail_tmo". The timeout is measured in seconds from the
> controller reconnect; any command beyond that timeout is rejected. The new
> parameter value may be passed during 'connect'.
> The default value of -1 means no timeout (similar to current behavior).
> 
> We add a new controller flag NVME_CTRL_FAILFAST_EXPIRED and a
> respective delayed work item that updates this flag.
> 
> When the controller enters the CONNECTING state, we schedule the
> delayed_work based on the failfast timeout value. If the controller
> transitions out of CONNECTING, we terminate the delayed work item and
> ensure failfast_expired is false. If the delayed work item expires, we
> set the "NVME_CTRL_FAILFAST_EXPIRED" flag to true.
> 
> We also update the nvmf_fail_nonready_command() and
> nvme_available_path() functions to check the
> "NVME_CTRL_FAILFAST_EXPIRED" controller flag.
> 
> For multipath (function nvme_available_path()):
> The path will not be considered available if the
> "NVME_CTRL_FAILFAST_EXPIRED" controller flag is set
> and the controller is in the NVME_CTRL_CONNECTING state.
> This prevents commands from getting stuck
> when available paths have tried to reconnect for too long.
> 
> Signed-off-by: Victor Gladkov <victor.gladkov at kioxia.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni at wdc.com>
> Reviewed-by: Hannes Reinecke <hare at suse.de>
> 
> ---
>   drivers/nvme/host/core.c      | 49
> ++++++++++++++++++++++++++++++++++++++++++-
>   drivers/nvme/host/fabrics.c   | 25 +++++++++++++++++++---
>   drivers/nvme/host/fabrics.h   |  5 +++++
>   drivers/nvme/host/multipath.c |  5 ++++-
>   drivers/nvme/host/nvme.h      |  3 +++
>   5 files changed, 82 insertions(+), 5 deletions(-)
> 
I did some more experiments with this, and found that there are some 
issues with ANA handling.
If reconnect works, but the ANA state indicates that we still can't send 
I/O (e.g. by still being in transitioning), we hit the 'requeueing I/O' 
state despite fast_io_fail_tmo being set. Not sure if that's the 
expected outcome.
For that it might be better to move the FAILFAST_EXPIRED bit into the 
namespace, as then we could selectively clear the bit in 
nvme_failfast_work():

@@ -151,12 +151,16 @@ EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
  static void nvme_failfast_work(struct work_struct *work) {
         struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
                         struct nvme_ctrl, failfast_work);
+       struct nvme_ns *ns;

-       if (ctrl->state != NVME_CTRL_CONNECTING)
-               return;
-
-
-       set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+       down_read(&ctrl->namespaces_rwsem);
+       list_for_each_entry(ns, &ctrl->namespaces, list) {
+               if (ctrl->state != NVME_CTRL_LIVE ||
+                   (ns->ana_state != NVME_ANA_OPTIMIZED &&
+                    ns->ana_state != NVME_ANA_NONOPTIMIZED))
+                       set_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags);
+       }
+       up_read(&ctrl->namespaces_rwsem);
         dev_info(ctrl->device, "failfast expired\n");

...and we could leave the failfast worker running even after the 
controller transitioned to LIVE.
Cf the attached patch for details.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

[-- Attachment #2: 0001-nvme-move-FAILFAST_EXPIRED-bit-into-namespace.patch --]
[-- Type: text/x-patch, Size: 5552 bytes --]

From a9bfa1edc7566bdaac8d330cd813fe2c0d0728cb Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Thu, 1 Oct 2020 10:52:27 +0200
Subject: [PATCH] nvme: move FAILFAST_EXPIRED bit into namespace

Rather than having the FAILFAST_EXPIRED bit being a flag on the
controller move it into the namespace. This allows the failfast
mechanism to cover ANA states, too, which might indicate that
I/O isn't possible despite the controller being LIVE.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/core.c      | 26 ++++++++++++++++----------
 drivers/nvme/host/fabrics.c   |  4 +++-
 drivers/nvme/host/multipath.c | 11 ++++++-----
 drivers/nvme/host/nvme.h      |  3 +--
 4 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0d571aa3cc97..b458508b0c91 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -151,12 +151,16 @@ EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
 static void nvme_failfast_work(struct work_struct *work) {
 	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
 			struct nvme_ctrl, failfast_work);
+	struct nvme_ns *ns;
 
-	if (ctrl->state != NVME_CTRL_CONNECTING)
-		return;
-
-
-	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (ctrl->state != NVME_CTRL_LIVE ||
+		    (ns->ana_state != NVME_ANA_OPTIMIZED &&
+		     ns->ana_state != NVME_ANA_NONOPTIMIZED))
+			set_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags);
+	}
+	up_read(&ctrl->namespaces_rwsem);
 	dev_info(ctrl->device, "failfast expired\n");
 	nvme_kick_requeue_lists(ctrl);
 }
@@ -172,11 +176,16 @@ static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl)
 
 static inline void nvme_stop_failfast_work(struct nvme_ctrl *ctrl)
 {
+	struct nvme_ns *ns;
+
 	if (!ctrl->opts)
 		return;
 
 	cancel_delayed_work_sync(&ctrl->failfast_work);
-	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		clear_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags);
+	up_read(&ctrl->namespaces_rwsem);
 }
 
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
@@ -391,11 +400,9 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	switch (new_state) {
 	case NVME_CTRL_LIVE:
 		switch (old_state) {
-		case NVME_CTRL_CONNECTING:
-			nvme_stop_failfast_work(ctrl);
-			fallthrough;
 		case NVME_CTRL_NEW:
 		case NVME_CTRL_RESETTING:
+		case NVME_CTRL_CONNECTING:
 			changed = true;
 			fallthrough;
 		default:
@@ -4482,7 +4489,6 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	int ret;
 
 	ctrl->state = NVME_CTRL_NEW;
-	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
 	spin_lock_init(&ctrl->lock);
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 6404cbf41c96..d706fb5c3b19 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -547,9 +547,11 @@ static struct nvmf_transport_ops *nvmf_lookup_transport(
 blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
 		struct request *rq)
 {
+	struct nvme_ns *ns = rq->rq_disk->private_data;
+
 	if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
 	    ctrl->state != NVME_CTRL_DEAD &&
-	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
+	    !test_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags) &&
 	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
 		return BLK_STS_RESOURCE;
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0e08ca6e0264..2738f08b2a07 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -155,9 +155,8 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns)
 	    ns->ctrl->state != NVME_CTRL_DELETING)
 		return true;
 	if (test_bit(NVME_NS_ANA_PENDING, &ns->flags) ||
-	    test_bit(NVME_NS_REMOVING, &ns->flags))
-		return true;
-	if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
+	    test_bit(NVME_NS_REMOVING, &ns->flags) ||
+	    test_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags))
 		return true;
 	return false;
 }
@@ -281,7 +280,7 @@ static bool nvme_available_path(struct nvme_ns_head *head)
 	struct nvme_ns *ns;
 
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
-		if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
+		if (test_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags))
 			continue;
 		switch (ns->ctrl->state) {
 		case NVME_CTRL_LIVE:
@@ -490,8 +489,10 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
 	ns->ana_state = desc->state;
 	clear_bit(NVME_NS_ANA_PENDING, &ns->flags);
 
-	if (nvme_state_is_live(ns->ana_state))
+	if (nvme_state_is_live(ns->ana_state)) {
+		clear_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags);
 		nvme_mpath_set_live(ns);
+	}
 }
 
 static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 37be7d16ab15..3163aa9fcf6d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -338,8 +338,6 @@ struct nvme_ctrl {
 	u16 icdoff;
 	u16 maxcmd;
 	int nr_reconnects;
-	unsigned long flags;
-#define NVME_CTRL_FAILFAST_EXPIRED	0
 	struct nvmf_ctrl_options *opts;
 
 	struct page *discard_page;
@@ -451,6 +449,7 @@ struct nvme_ns {
 #define NVME_NS_REMOVING	0
 #define NVME_NS_DEAD     	1
 #define NVME_NS_ANA_PENDING	2
+#define NVME_NS_FAILFAST_EXPIRED 3
 
 	struct nvme_fault_inject fault_inject;
 
-- 
2.16.4



* RE: [PATCH v9] nvme-fabrics: reject I/O to offline device
  2020-10-01  8:55 ` Hannes Reinecke
@ 2020-11-15 15:45   ` Victor Gladkov
  2020-11-16  9:54     ` Hannes Reinecke
  0 siblings, 1 reply; 10+ messages in thread
From: Victor Gladkov @ 2020-11-15 15:45 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme
  Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Ewan D. Milne,
	James Smart

> -----Original Message-----
> From: Hannes Reinecke [mailto:hare@suse.de]
> Sent: Thursday, 01 October, 2020 11:55

> > ---
> >   drivers/nvme/host/core.c      | 49
> > ++++++++++++++++++++++++++++++++++++++++++-
> >   drivers/nvme/host/fabrics.c   | 25 +++++++++++++++++++---
> >   drivers/nvme/host/fabrics.h   |  5 +++++
> >   drivers/nvme/host/multipath.c |  5 ++++-
> >   drivers/nvme/host/nvme.h      |  3 +++
> >   5 files changed, 82 insertions(+), 5 deletions(-)
> >
> I did some more experiments with this, and found that there are some issues
> with ANA handling.
> If reconnect works, but the ANA state indicates that we still can't send I/O (e.g.
> by still being in transitioning), we hit the 'requeueing I/O'
> state despite fast_io_fail_tmo being set. Not sure if that's the expected
> outcome.
> For that it might be better to move the FAILFAST_EXPIRED bit into the
> namespace, as then we could selectively clear the bit in
> nvme_failfast_work():
> 
> @@ -151,12 +151,16 @@ EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
>   static void nvme_failfast_work(struct work_struct *work) {
>          struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
>                          struct nvme_ctrl, failfast_work);
> +       struct nvme_ns *ns;
> 
> -       if (ctrl->state != NVME_CTRL_CONNECTING)
> -               return;
> -
> -
> -       set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> +       down_read(&ctrl->namespaces_rwsem);
> +       list_for_each_entry(ns, &ctrl->namespaces, list) {
> +               if (ctrl->state != NVME_CTRL_LIVE ||
> +                   (ns->ana_state != NVME_ANA_OPTIMIZED &&
> +                    ns->ana_state != NVME_ANA_NONOPTIMIZED))
> +                       set_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags);
> +       }
> +       up_read(&ctrl->namespaces_rwsem);
>          dev_info(ctrl->device, "failfast expired\n");
> 
> ...and we could leave the failfast worker running even after the controller
> transitioned to LIVE.
> Cf the attached patch for details.
> 
> Cheers,
> 
> Hannes
> --

I'm not sure it makes sense to move the FAILFAST_EXPIRED bit into the namespace,
because the failfast mechanism characterizes the controller as a whole.

Any comments on this?

Regards,
Victor



* Re: [PATCH v9] nvme-fabrics: reject I/O to offline device
  2020-11-15 15:45   ` Victor Gladkov
@ 2020-11-16  9:54     ` Hannes Reinecke
  2020-11-17  8:39       ` Sagi Grimberg
  0 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2020-11-16  9:54 UTC (permalink / raw)
  To: Victor Gladkov, linux-nvme
  Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, Ewan D. Milne,
	James Smart

On 11/15/20 4:45 PM, Victor Gladkov wrote:
>> -----Original Message-----
>> From: Hannes Reinecke [mailto:hare@suse.de]
>> Sent: Thursday, 01 October, 2020 11:55
> 
>>> ---
>>>    drivers/nvme/host/core.c      | 49
>>> ++++++++++++++++++++++++++++++++++++++++++-
>>>    drivers/nvme/host/fabrics.c   | 25 +++++++++++++++++++---
>>>    drivers/nvme/host/fabrics.h   |  5 +++++
>>>    drivers/nvme/host/multipath.c |  5 ++++-
>>>    drivers/nvme/host/nvme.h      |  3 +++
>>>    5 files changed, 82 insertions(+), 5 deletions(-)
>>>
>> I did some more experiments with this, and found that there are some issues
>> with ANA handling.
>> If reconnect works, but the ANA state indicates that we still can't send I/O (e.g.
>> by still being in transitioning), we hit the 'requeueing I/O'
>> state despite fast_io_fail_tmo being set. Not sure if that's the expected
>> outcome.
>> For that it might be better to move the FAILFAST_EXPIRED bit into the
>> namespace, as then we could selectively clear the bit in
>> nvme_failfast_work():
>>
>> @@ -151,12 +151,16 @@ EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
>>    static void nvme_failfast_work(struct work_struct *work) {
>>           struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
>>                           struct nvme_ctrl, failfast_work);
>> +       struct nvme_ns *ns;
>>
>> -       if (ctrl->state != NVME_CTRL_CONNECTING)
>> -               return;
>> -
>> -
>> -       set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
>> +       down_read(&ctrl->namespaces_rwsem);
>> +       list_for_each_entry(ns, &ctrl->namespaces, list) {
>> +               if (ctrl->state != NVME_CTRL_LIVE ||
>> +                   (ns->ana_state != NVME_ANA_OPTIMIZED &&
>> +                    ns->ana_state != NVME_ANA_NONOPTIMIZED))
>> +                       set_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags);
>> +       }
>> +       up_read(&ctrl->namespaces_rwsem);
>>           dev_info(ctrl->device, "failfast expired\n");
>>
>> ...and we could leave the failfast worker running even after the controller
>> transitioned to LIVE.
>> Cf the attached patch for details.
>>
>> Cheers,
>>
>> Hannes
>> --
> 
> I'm not sure it makes sense to move the FAILFAST_EXPIRED bit into the namespace,
> because the failfast mechanism characterizes the controller as a whole.
> 
Oh, yes, I'm aware of that. But the problem here is with multipath; how 
do we handle the situation where all controllers have the 
'failfast_expired' bit set?
Should I/O be terminated (which I think it should, given that failfast 
is supposed to terminate the I/O)?
Or should I/O continue to run (as it does with your original patch)?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

* Re: [PATCH v9] nvme-fabrics: reject I/O to offline device
  2020-11-16  9:54     ` Hannes Reinecke
@ 2020-11-17  8:39       ` Sagi Grimberg
  2020-11-20 13:09         ` Hannes Reinecke
  0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2020-11-17  8:39 UTC (permalink / raw)
  To: Hannes Reinecke, Victor Gladkov, linux-nvme
  Cc: Keith Busch, James Smart, Ewan D. Milne, Christoph Hellwig


>>> @@ -151,12 +151,16 @@ EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
>>>    static void nvme_failfast_work(struct work_struct *work) {
>>>           struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
>>>                           struct nvme_ctrl, failfast_work);
>>> +       struct nvme_ns *ns;
>>>
>>> -       if (ctrl->state != NVME_CTRL_CONNECTING)
>>> -               return;
>>> -
>>> -
>>> -       set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
>>> +       down_read(&ctrl->namespaces_rwsem);
>>> +       list_for_each_entry(ns, &ctrl->namespaces, list) {
>>> +               if (ctrl->state != NVME_CTRL_LIVE ||
>>> +                   (ns->ana_state != NVME_ANA_OPTIMIZED &&
>>> +                    ns->ana_state != NVME_ANA_NONOPTIMIZED))
>>> +                       set_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags);
>>> +       }
>>> +       up_read(&ctrl->namespaces_rwsem);
>>>           dev_info(ctrl->device, "failfast expired\n");
>>>
>>> ...and we could leave the failfast worker running even after the 
>>> controller
>>> transitioned to LIVE.
>>> Cf the attached patch for details.
>>>
>>> Cheers,
>>>
>>> Hannes
>>> -- 
>>
>> I'm not sure it makes sense to move the FAILFAST_EXPIRED bit into
>> the namespace,
>> because the failfast mechanism characterizes the controller as a whole.
>>
> Oh, yes, I'm aware of that. But the problem here is with multipath; how 
> do we handle the situation where all controllers have the 
> 'failfast_expired' bit set?
> Should I/O be terminated (which I think it should, given that failfast 
> is supposed to terminate the I/O)?
> Or should I/O continue to run (as it does with your original patch)?

I do agree that fast_io_fail_tmo _is_ a controller attribute and should
remain as such.

I do see your point, Hannes; however, I also think it's
problematic that the host may fail arbitrary I/O if the controller
happens to enter the ANA inaccessible state (or hit the ANA state
transition timeout) for a period longer than what the user happened
to set (without any of this being communicated to the controller).

IFF we want to address this (I'm still not sure), we probably want
to activate failfast timeout in ANA state transition (and clear it
when we exit it). Then we can modify nvme_available_path() to take
NVME_CTRL_FAILFAST_EXPIRED into account.

Anyway, I think this can be an incremental patch because it
doesn't change today's behavior with respect to ANA states (or
transitions between them), i.e. outstanding I/O is still queued.

* Re: [PATCH v9] nvme-fabrics: reject I/O to offline device
  2020-11-17  8:39       ` Sagi Grimberg
@ 2020-11-20 13:09         ` Hannes Reinecke
  0 siblings, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2020-11-20 13:09 UTC (permalink / raw)
  To: Sagi Grimberg, Victor Gladkov, linux-nvme
  Cc: Keith Busch, James Smart, Ewan D. Milne, Christoph Hellwig

On 11/17/20 9:39 AM, Sagi Grimberg wrote:
> 
>>>> @@ -151,12 +151,16 @@ EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
>>>>    static void nvme_failfast_work(struct work_struct *work) {
>>>>           struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
>>>>                           struct nvme_ctrl, failfast_work);
>>>> +       struct nvme_ns *ns;
>>>>
>>>> -       if (ctrl->state != NVME_CTRL_CONNECTING)
>>>> -               return;
>>>> -
>>>> -
>>>> -       set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
>>>> +       down_read(&ctrl->namespaces_rwsem);
>>>> +       list_for_each_entry(ns, &ctrl->namespaces, list) {
>>>> +               if (ctrl->state != NVME_CTRL_LIVE ||
>>>> +                   (ns->ana_state != NVME_ANA_OPTIMIZED &&
>>>> +                    ns->ana_state != NVME_ANA_NONOPTIMIZED))
>>>> +                       set_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags);
>>>> +       }
>>>> +       up_read(&ctrl->namespaces_rwsem);
>>>>           dev_info(ctrl->device, "failfast expired\n");
>>>>
>>>> ...and we could leave the failfast worker running even after the 
>>>> controller
>>>> transitioned to LIVE.
>>>> Cf the attached patch for details.
>>>>
>>>> Cheers,
>>>>
>>>> Hannes
>>>> -- 
>>>
>>> I'm not sure it makes sense to move the FAILFAST_EXPIRED bit into
>>> the namespace,
>>> because the failfast mechanism characterizes the controller as a whole.
>>>
>> Oh, yes, I'm aware of that. But the problem here is with multipath; 
>> how do we handle the situation where all controllers have the 
>> 'failfast_expired' bit set?
>> Should I/O be terminated (which I think it should, given that failfast 
>> is supposed to terminate the I/O)?
>> Or should I/O continue to run (as it does with your original patch)?
> 
> I do agree that fast_io_fail_tmo _is_ a controller attribute and should
> remain as such.
> 
> I do see your point, Hannes; however, I also think it's
> problematic that the host may fail arbitrary I/O if the controller
> happens to enter the ANA inaccessible state (or hit the ANA state
> transition timeout) for a period longer than what the user happened
> to set (without any of this being communicated to the controller).
> 
> IFF we want to address this (I'm still not sure), we probably want
> to activate failfast timeout in ANA state transition (and clear it
> when we exit it). Then we can modify nvme_available_path() to take
> NVME_CTRL_FAILFAST_EXPIRED into account.
> 
> Anyway, I think this can be an incremental patch because it
> doesn't change today's behavior with respect to ANA states (or
> transitions between them), i.e. outstanding I/O is still queued.
> 
Okay then.
You can add:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

* Re: [PATCH v9] nvme-fabrics: reject I/O to offline device
       [not found] <3e9337bfbd7f410eb632e96a44b43924@kioxia.com>
@ 2020-09-30 13:14 ` Hannes Reinecke
  0 siblings, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2020-09-30 13:14 UTC (permalink / raw)
  To: Victor Gladkov
  Cc: Sagi Grimberg, James Smart, linux-nvme, Ewan D. Milne,
	Keith Busch, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 2476 bytes --]

On 9/30/20 2:31 PM, Victor Gladkov wrote:
> On 9/30/20 8:47 PM, Hannes Reinecke wrote:
>   
>>>    		opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
>>>    						opts->reconnect_delay);
>>> +		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
>>> +			pr_warn("failfast tmo (%d) larger than controller "
>>> +				"loss tmo (%d)\n",
>>> +				opts->fast_io_fail_tmo, ctrl_loss_tmo);
>>> +	}
>>
>> If we already check for that condition, shouldn't we disable fast_io_fail_tmo in
>> that situation to clarify things?
>>
> 
> OK for me. I don't mind
> 
>>>
>>>    	if (!opts->host) {
>>>    		kref_get(&nvmf_default_host->ref);
>>> @@ -985,7 +1004,7 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
>>>    #define NVMF_ALLOWED_OPTS	(NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
>>>    				 NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
>>>    				 NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
>>> -				 NVMF_OPT_DISABLE_SQFLOW)
>>> +				 NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_FAIL_FAST_TMO)
>>>
>>>    static struct nvme_ctrl *
>>>    nvmf_create_ctrl(struct device *dev, const char *buf)
>>> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
> 
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index 54603bd..d8b7f45 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -278,9 +278,12 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>>
>>>    	list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>    		switch (ns->ctrl->state) {
>>> +		case NVME_CTRL_CONNECTING:
>>> +			if (test_bit(NVME_CTRL_FAILFAST_EXPIRED,
>>> +				     &ns->ctrl->flags))
>>> +				break;
>>
>> No. We shouldn't select this path, but that doesn't mean that all other paths in
>> this list can't be selected, either; they might be coming from different
>> controllers.
>> Please use 'continue' here.
>>
> 
>   The 'break' doesn't interrupt the 'for_each' loop; it only exits the 'switch'.
> 
Ah. But still, that is not quite correct, as we'll need to intercept 
things at nvme_ns_head_submit_bio() to make the correct decision there.

I've attached the modified version I'm working with; please check if 
you're okay with it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

[-- Attachment #2: 0001-nvme-fabrics-reject-I-O-to-offline-device.patch --]
[-- Type: text/x-patch, Size: 10481 bytes --]

From fc2e2c5ceff5d0807623ccd83d735965ba5b7ec0 Mon Sep 17 00:00:00 2001
From: Victor Gladkov <Victor.Gladkov@kioxia.com>
Date: Tue, 29 Sep 2020 15:27:38 +0000
Subject: [PATCH] nvme-fabrics: reject I/O to offline device

Commands get stuck while the host NVMe-oF controller is in the reconnect
state. The NVMe controller enters the reconnect state when it loses its
connection with the target. It tries to reconnect every 10 seconds
(default) until reconnection succeeds or the reconnect timeout is
reached. The default reconnect timeout is 10 minutes.

Applications expect commands to complete with success or error within
a certain timeout (30 seconds by default). The NVMe host enforces that
timeout while it is connected; nevertheless, during reconnection the
timeout is not enforced, and commands may get stuck for a long period or
even forever.

To fix this long delay due to the default timeout, we introduce a new
session parameter "fast_io_fail_tmo". The timeout is measured in seconds
from the start of the controller reconnect; any command beyond that
timeout is rejected. The new parameter value may be passed during
'connect'. The default value of -1 means no timeout (matching the
current behavior).

We add a new controller flag NVME_CTRL_FAILFAST_EXPIRED and a
corresponding delayed work item that sets it.

When the controller enters the CONNECTING state, we schedule the
delayed work based on the failfast timeout value. If the transition is
out of CONNECTING, we cancel the delayed work item and ensure the
failfast_expired flag is cleared. If the delayed work item expires, the
NVME_CTRL_FAILFAST_EXPIRED flag is set.

We also update the nvmf_fail_nonready_command() and
nvme_available_path() functions to check the
NVME_CTRL_FAILFAST_EXPIRED controller flag.

For multipath (nvme_available_path()):
a path is not considered available if the
NVME_CTRL_FAILFAST_EXPIRED controller flag is set
and the controller is in the NVME_CTRL_CONNECTING state.
This prevents commands from getting stuck
when the available paths have been trying to reconnect for too long.

Signed-off-by: Victor Gladkov <victor.gladkov at kioxia.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni at wdc.com>
Reviewed-by: Hannes Reinecke <hare at suse.de>
---
 drivers/nvme/host/core.c      | 42 ++++++++++++++++++++++++++++++++++++++++--
 drivers/nvme/host/fabrics.c   | 24 ++++++++++++++++++++++--
 drivers/nvme/host/fabrics.h   |  5 +++++
 drivers/nvme/host/multipath.c |  4 ++++
 drivers/nvme/host/nvme.h      |  3 +++
 5 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 385b10317873..9610d81f2070 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -148,6 +148,37 @@ int nvme_try_sched_reset(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
 
+static void nvme_failfast_work(struct work_struct *work) {
+	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
+			struct nvme_ctrl, failfast_work);
+
+	if (ctrl->state != NVME_CTRL_CONNECTING)
+		return;
+
+
+	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+	dev_info(ctrl->device, "failfast expired\n");
+	nvme_kick_requeue_lists(ctrl);
+}
+
+static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl)
+{
+	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == -1)
+		return;
+
+	schedule_delayed_work(&ctrl->failfast_work,
+			      ctrl->opts->fast_io_fail_tmo * HZ);
+}
+
+static inline void nvme_stop_failfast_work(struct nvme_ctrl *ctrl)
+{
+	if (!ctrl->opts)
+		return;
+
+	cancel_delayed_work_sync(&ctrl->failfast_work);
+	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+}
+
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
 {
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
@@ -360,9 +391,11 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	switch (new_state) {
 	case NVME_CTRL_LIVE:
 		switch (old_state) {
+		case NVME_CTRL_CONNECTING:
+			nvme_stop_failfast_work(ctrl);
+			fallthrough;
 		case NVME_CTRL_NEW:
 		case NVME_CTRL_RESETTING:
-		case NVME_CTRL_CONNECTING:
 			changed = true;
 			fallthrough;
 		default:
@@ -381,8 +414,10 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		break;
 	case NVME_CTRL_CONNECTING:
 		switch (old_state) {
-		case NVME_CTRL_NEW:
 		case NVME_CTRL_RESETTING:
+			nvme_start_failfast_work(ctrl);
+			fallthrough;
+		case NVME_CTRL_NEW:
 			changed = true;
 			fallthrough;
 		default:
@@ -4343,6 +4378,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 {
 	nvme_mpath_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
+	nvme_stop_failfast_work(ctrl);
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fw_act_work);
 }
@@ -4408,6 +4444,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	int ret;
 
 	ctrl->state = NVME_CTRL_NEW;
+	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
 	spin_lock_init(&ctrl->lock);
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
@@ -4424,6 +4461,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	init_waitqueue_head(&ctrl->state_wq);
 
 	INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
+	INIT_DELAYED_WORK(&ctrl->failfast_work, nvme_failfast_work);
 	memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
 	ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
 
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 8575724734e0..6404cbf41c96 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -549,6 +549,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
 {
 	if (ctrl->state != NVME_CTRL_DELETING_NOIO &&
 	    ctrl->state != NVME_CTRL_DEAD &&
+	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
 	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
 		return BLK_STS_RESOURCE;
 
@@ -615,6 +616,7 @@ static const match_table_t opt_tokens = {
 	{ NVMF_OPT_NR_WRITE_QUEUES,	"nr_write_queues=%d"	},
 	{ NVMF_OPT_NR_POLL_QUEUES,	"nr_poll_queues=%d"	},
 	{ NVMF_OPT_TOS,			"tos=%d"		},
+	{ NVMF_OPT_FAIL_FAST_TMO,	"fast_io_fail_tmo=%d"	},
 	{ NVMF_OPT_ERR,			NULL			}
 };
 
@@ -634,6 +636,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
 	opts->kato = NVME_DEFAULT_KATO;
 	opts->duplicate_connect = false;
+	opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO;
 	opts->hdr_digest = false;
 	opts->data_digest = false;
 	opts->tos = -1; /* < 0 == use transport default */
@@ -754,6 +757,17 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 				pr_warn("ctrl_loss_tmo < 0 will reconnect forever\n");
 			ctrl_loss_tmo = token;
 			break;
+		case NVMF_OPT_FAIL_FAST_TMO:
+			if (match_int(args, &token)) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			if (token >= 0)
+				pr_warn("I/O will fail on reconnect "
+					"controller after %d sec\n", token);
+			opts->fast_io_fail_tmo = token;
+			break;
 		case NVMF_OPT_HOSTNQN:
 			if (opts->host) {
 				pr_err("hostnqn already user-assigned: %s\n",
@@ -886,9 +900,14 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	}
 	if (ctrl_loss_tmo < 0)
 		opts->max_reconnects = -1;
-	else
+	else {
 		opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
 						opts->reconnect_delay);
+		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
+			pr_warn("failfast tmo (%d) larger than controller "
+				"dev loss tmo (%d)\n",
+				opts->fast_io_fail_tmo, ctrl_loss_tmo);
+	}
 
 	if (!opts->host) {
 		kref_get(&nvmf_default_host->ref);
@@ -988,7 +1007,8 @@ EXPORT_SYMBOL_GPL(nvmf_free_options);
 #define NVMF_ALLOWED_OPTS	(NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
 				 NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
 				 NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
-				 NVMF_OPT_DISABLE_SQFLOW)
+				 NVMF_OPT_DISABLE_SQFLOW | \
+				 NVMF_OPT_FAIL_FAST_TMO)
 
 static struct nvme_ctrl *
 nvmf_create_ctrl(struct device *dev, const char *buf)
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index a9c1e3b4585e..2ce0451c00ff 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -15,6 +15,8 @@
 #define NVMF_DEF_RECONNECT_DELAY	10
 /* default to 600 seconds of reconnect attempts before giving up */
 #define NVMF_DEF_CTRL_LOSS_TMO		600
+/* default is -1: the fail fast mechanism is disabled */
+#define NVMF_DEF_FAIL_FAST_TMO		-1
 
 /*
  * Define a host as seen by the target.  We allocate one at boot, but also
@@ -56,6 +58,7 @@ enum {
 	NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
 	NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
 	NVMF_OPT_TOS		= 1 << 19,
+	NVMF_OPT_FAIL_FAST_TMO	= 1 << 20,
 };
 
 /**
@@ -89,6 +92,7 @@ enum {
  * @nr_write_queues: number of queues for write I/O
  * @nr_poll_queues: number of queues for polling I/O
  * @tos: type of service
+ * @fast_io_fail_tmo: Fast I/O fail timeout in seconds
  */
 struct nvmf_ctrl_options {
 	unsigned		mask;
@@ -111,6 +115,7 @@ struct nvmf_ctrl_options {
 	unsigned int		nr_write_queues;
 	unsigned int		nr_poll_queues;
 	int			tos;
+	int			fast_io_fail_tmo;
 };
 
 /*
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 74896be40c17..0e08ca6e0264 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -157,6 +157,8 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns)
 	if (test_bit(NVME_NS_ANA_PENDING, &ns->flags) ||
 	    test_bit(NVME_NS_REMOVING, &ns->flags))
 		return true;
+	if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
+		return true;
 	return false;
 }
 
@@ -279,6 +281,8 @@ static bool nvme_available_path(struct nvme_ns_head *head)
 	struct nvme_ns *ns;
 
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
+			continue;
 		switch (ns->ctrl->state) {
 		case NVME_CTRL_LIVE:
 		case NVME_CTRL_RESETTING:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 566776100126..37be7d16ab15 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -304,6 +304,7 @@ struct nvme_ctrl {
 	struct work_struct scan_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
+	struct delayed_work failfast_work;
 	struct nvme_command ka_cmd;
 	struct work_struct fw_act_work;
 	unsigned long events;
@@ -337,6 +338,8 @@ struct nvme_ctrl {
 	u16 icdoff;
 	u16 maxcmd;
 	int nr_reconnects;
+	unsigned long flags;
+#define NVME_CTRL_FAILFAST_EXPIRED	0
 	struct nvmf_ctrl_options *opts;
 
 	struct page *discard_page;
-- 
2.16.4


[-- Attachment #3: Type: text/plain, Size: 158 bytes --]

end of thread, other threads:[~2020-11-20 13:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-29 15:27 [PATCH v9] nvme-fabrics: reject I/O to offline device Victor Gladkov
2020-09-29 18:19 ` Sagi Grimberg
2020-09-30  5:46 ` Hannes Reinecke
2020-09-30  6:39 ` Christoph Hellwig
2020-10-01  8:55 ` Hannes Reinecke
2020-11-15 15:45   ` Victor Gladkov
2020-11-16  9:54     ` Hannes Reinecke
2020-11-17  8:39       ` Sagi Grimberg
2020-11-20 13:09         ` Hannes Reinecke
     [not found] <3e9337bfbd7f410eb632e96a44b43924@kioxia.com>
2020-09-30 13:14 ` Hannes Reinecke
