* [PATCH V4] nvme-fabrics: reject I/O to offline device
@ 2020-03-02  3:30 Chaitanya Kulkarni
  2020-03-07 12:15 ` Hannes Reinecke
  2020-03-31 18:21 ` Alex Lyakas
  0 siblings, 2 replies; 10+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-02  3:30 UTC (permalink / raw)
  To: linux-nvme
  Cc: sagi, snitzer, Chaitanya Kulkarni, james.smart, Victor.Gladkov,
	hare, hch

From: Victor Gladkov <victor.gladkov@kioxia.com>

Commands get stuck while the host NVMe controller (TCP or RDMA) is in
the reconnect state. The NVMe controller enters the reconnect state
when it loses the connection with the target. It tries to reconnect
every 10 seconds (default) until a successful reconnection or until
the reconnect time-out is reached. The default reconnect time-out is
10 minutes.
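
With the defaults in fabrics.h (NVMF_DEF_RECONNECT_DELAY = 10 and
NVMF_DEF_CTRL_LOSS_TMO = 600), the existing code computes
max_reconnects = DIV_ROUND_UP(600, 10) = 60 attempts, so a pending
command can stay queued for up to roughly those 10 minutes before the
controller is finally removed.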

To avoid this long delay caused by the default timeout, we introduce
a new session parameter "fast_io_fail_tmo". The timeout is measured in
seconds from the start of the controller reconnect; any command issued
beyond that timeout is rejected. The new parameter value may be passed
during 'connect'. The default value of 0 means no timeout (matching
the current behavior).
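
For illustration only (the transport, address and NQN below are
placeholders, not taken from this patch), the option is appended to
the rest of the options string written to /dev/nvme-fabrics:

  transport=rdma,traddr=192.168.1.10,trsvcid=4420,nqn=testnqn,fast_io_fail_tmo=5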

We add a new controller flag, NVME_CTRL_FAILFAST_EXPIRED, and a
respective delayed work item that sets that flag.

When the controller enters the CONNECTING state, we schedule the
delayed_work based on the failfast timeout value. On a transition out
of CONNECTING, we terminate the delayed work item and ensure the
failfast flag is cleared. If the delayed work item expires, we set the
"NVME_CTRL_FAILFAST_EXPIRED" flag.

We also update the nvmf_fail_nonready_command() and
nvme_available_path() functions to check the
"NVME_CTRL_FAILFAST_EXPIRED" controller flag.

Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
---
Changes From V3:

1. BUG_FIX: Fix a bug in nvme_start_failfast_work() and
   nvme_stop_failfast_work() when accessing ctrl->opts, as it will fail
   for the PCIe transport when nvme_change_ctrl_state() is called
   from nvme_reset_work(), since we don't set ctrl->opts for the
   PCIe transport.
2. Line wrap in nvme_start_failfast_work(), nvme_parse_option() and
   for macro NVMF_ALLOWED_OPTS definition.
3. Just like all the state change code add a switch for newly
   added state handling outside of the state machine in
   nvme_state_change().
4. nvme_available_path() add /* fallthru */ after if..break inside
   switch which is under list_for_each_entry_rcu().
5. Align the newly added nvmf_ctrl_options member fast_io_fail_tmo.
6. Fix the tabs before if in nvme_available_path() and line wrap
   for the same.
7. In nvme_failfast_work() use state != NVME_CTRL_CONNECTING
   instead of == to get rid of the parentheses and avoid char > 80.
8. Get rid of the ";" at the end of the comment for @fast_io_fail_tmo.
9. Change the commit log style to match the one we have in the NVMe
   repo.

Changes from V2:

1. Several coding style and small fixes.
2. Fix the comment for NVMF_DEF_FAIL_FAST_TMO.
3. Don't call functionality from the state machine.
4. Rename fast_fail_tmo -> fast_io_fail_tmo to match SCSI
   semantics.

Changes from V1:

1. Add a new session parameter called "fast_fail_tmo". The timeout
   is measured in seconds from the controller reconnect; any command
   beyond that timeout is rejected. The new parameter value may be
   passed during 'connect', and its default value is 30 seconds. 
   A value of 0 means no timeout (similar to current behavior).
2. Add a controller flag of "failfast_expired".
3. Add a dedicated delayed_work item that updates the "failfast_expired"
   controller flag.
4. When entering CONNECTING, schedule the delayed_work based on
   failfast timeout value. If transition out of CONNECTING, terminate
   delayed work item and ensure failfast_expired is false.
   If delayed work item expires: set "failfast_expired" flag to true.
5. Update nvmf_fail_nonready_command() to check the
   "failfast_expired" controller flag.
---
 drivers/nvme/host/core.c      | 53 +++++++++++++++++++++++++++++++++--
 drivers/nvme/host/fabrics.c   | 32 +++++++++++++++++----
 drivers/nvme/host/fabrics.h   |  5 ++++
 drivers/nvme/host/multipath.c |  6 +++-
 drivers/nvme/host/nvme.h      |  3 ++
 5 files changed, 90 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84914223c537..fb7b3ce2a271 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -136,6 +136,39 @@ int nvme_try_sched_reset(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
 
+static void nvme_failfast_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
+			struct nvme_ctrl, failfast_work);
+
+	spin_lock_irq(&ctrl->lock);
+	if (ctrl->state != NVME_CTRL_CONNECTING)
+		goto out;
+
+	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+	dev_info(ctrl->device, "failfast expired set for controller %s\n",
+		ctrl->opts->subsysnqn);
+	nvme_kick_requeue_lists(ctrl);
+out:
+	spin_unlock_irq(&ctrl->lock);
+}
+
+static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl)
+{
+	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
+		return;
+	schedule_delayed_work(&ctrl->failfast_work,
+			      ctrl->opts->fast_io_fail_tmo * HZ);
+}
+
+static inline void nvme_stop_failfast_work(struct nvme_ctrl *ctrl)
+{
+	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
+		return;
+	cancel_delayed_work_sync(&ctrl->failfast_work);
+	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+}
+
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
 {
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
@@ -395,8 +428,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	}
 
 	spin_unlock_irqrestore(&ctrl->lock, flags);
-	if (changed && ctrl->state == NVME_CTRL_LIVE)
-		nvme_kick_requeue_lists(ctrl);
+	if (changed) {
+		switch (ctrl->state) {
+		case NVME_CTRL_LIVE:
+			nvme_kick_requeue_lists(ctrl);
+			if (old_state == NVME_CTRL_CONNECTING)
+				nvme_stop_failfast_work(ctrl);
+			break;
+		case NVME_CTRL_CONNECTING:
+			if (old_state == NVME_CTRL_RESETTING)
+				nvme_start_failfast_work(ctrl);
+			break;
+		default:
+			break;
+		}
+	}
 	return changed;
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -3992,6 +4038,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 {
 	nvme_mpath_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
+	nvme_stop_failfast_work(ctrl);
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fw_act_work);
 }
@@ -4056,6 +4103,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	int ret;
 
 	ctrl->state = NVME_CTRL_NEW;
+	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
 	spin_lock_init(&ctrl->lock);
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
@@ -4070,6 +4118,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	init_waitqueue_head(&ctrl->state_wq);
 
 	INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
+	INIT_DELAYED_WORK(&ctrl->failfast_work, nvme_failfast_work);
 	memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
 	ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
 
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 74b8818ac9a1..7f54e7e4f528 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -549,6 +549,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
 {
 	if (ctrl->state != NVME_CTRL_DELETING &&
 	    ctrl->state != NVME_CTRL_DEAD &&
+	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
 	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
 		return BLK_STS_RESOURCE;
 
@@ -612,6 +613,7 @@ static const match_table_t opt_tokens = {
 	{ NVMF_OPT_NR_WRITE_QUEUES,	"nr_write_queues=%d"	},
 	{ NVMF_OPT_NR_POLL_QUEUES,	"nr_poll_queues=%d"	},
 	{ NVMF_OPT_TOS,			"tos=%d"		},
+	{ NVMF_OPT_FAIL_FAST_TMO,	"fast_io_fail_tmo=%d"	},
 	{ NVMF_OPT_ERR,			NULL			}
 };
 
@@ -631,6 +633,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
 	opts->kato = NVME_DEFAULT_KATO;
 	opts->duplicate_connect = false;
+	opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO;
 	opts->hdr_digest = false;
 	opts->data_digest = false;
 	opts->tos = -1; /* < 0 == use transport default */
@@ -751,6 +754,19 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 				pr_warn("ctrl_loss_tmo < 0 will reconnect forever\n");
 			ctrl_loss_tmo = token;
 			break;
+		case NVMF_OPT_FAIL_FAST_TMO:
+			if (match_int(args, &token)) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			if (token)
+				pr_warn("fail_fast_tmo != 0, I/O will fail on"
+					" reconnect controller after %d sec\n",
+					token);
+
+			opts->fast_io_fail_tmo = token;
+			break;
 		case NVMF_OPT_HOSTNQN:
 			if (opts->host) {
 				pr_err("hostnqn already user-assigned: %s\n",
@@ -881,11 +897,14 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		opts->nr_poll_queues = 0;
 		opts->duplicate_connect = true;
 	}
-	if (ctrl_loss_tmo < 0)
+	if (ctrl_loss_tmo < 0) {
 		opts->max_reconnects = -1;
-	else
-		opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
-						opts->reconnect_delay);
+	} else {
+		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
+			pr_warn("failfast tmo (%d) larger than controller "
+				"loss tmo (%d)\n",
+				opts->fast_io_fail_tmo, ctrl_loss_tmo);
+	}
 
 	if (!opts->host) {
 		kref_get(&nvmf_default_host->ref);
@@ -984,8 +1003,9 @@ EXPORT_SYMBOL_GPL(nvmf_free_options);
 #define NVMF_REQUIRED_OPTS	(NVMF_OPT_TRANSPORT | NVMF_OPT_NQN)
 #define NVMF_ALLOWED_OPTS	(NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
 				 NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
-				 NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
-				 NVMF_OPT_DISABLE_SQFLOW)
+				 NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT | \
+				 NVMF_OPT_DISABLE_SQFLOW | \
+				 NVMF_OPT_FAIL_FAST_TMO)
 
 static struct nvme_ctrl *
 nvmf_create_ctrl(struct device *dev, const char *buf)
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index a0ec40ab62ee..a078ddea9429 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -15,6 +15,8 @@
 #define NVMF_DEF_RECONNECT_DELAY	10
 /* default to 600 seconds of reconnect attempts before giving up */
 #define NVMF_DEF_CTRL_LOSS_TMO		600
+/* default is 0: the fail fast mechanism is disabled */
+#define NVMF_DEF_FAIL_FAST_TMO		0
 
 /*
  * Define a host as seen by the target.  We allocate one at boot, but also
@@ -56,6 +58,7 @@ enum {
 	NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
 	NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
 	NVMF_OPT_TOS		= 1 << 19,
+	NVMF_OPT_FAIL_FAST_TMO	= 1 << 20,
 };
 
 /**
@@ -89,6 +92,7 @@ enum {
  * @nr_write_queues: number of queues for write I/O
  * @nr_poll_queues: number of queues for polling I/O
  * @tos: type of service
+ * @fast_io_fail_tmo: Fast I/O fail timeout in seconds
  */
 struct nvmf_ctrl_options {
 	unsigned		mask;
@@ -111,6 +115,7 @@ struct nvmf_ctrl_options {
 	unsigned int		nr_write_queues;
 	unsigned int		nr_poll_queues;
 	int			tos;
+	unsigned int		fast_io_fail_tmo;
 };
 
 /*
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 797c18337d96..7bccdad52a78 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -281,9 +281,13 @@ static bool nvme_available_path(struct nvme_ns_head *head)
 
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
 		switch (ns->ctrl->state) {
+		case NVME_CTRL_CONNECTING:
+			if (test_bit(NVME_CTRL_FAILFAST_EXPIRED,
+				     &ns->ctrl->flags))
+				break;
+			/* fallthru */
 		case NVME_CTRL_LIVE:
 		case NVME_CTRL_RESETTING:
-		case NVME_CTRL_CONNECTING:
 			/* fallthru */
 			return true;
 		default:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1024fec7914c..b6a199eaac68 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -256,6 +256,7 @@ struct nvme_ctrl {
 	struct work_struct scan_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
+	struct delayed_work failfast_work;
 	struct nvme_command ka_cmd;
 	struct work_struct fw_act_work;
 	unsigned long events;
@@ -289,6 +290,8 @@ struct nvme_ctrl {
 	u16 icdoff;
 	u16 maxcmd;
 	int nr_reconnects;
+	unsigned long flags;
+#define NVME_CTRL_FAILFAST_EXPIRED	0
 	struct nvmf_ctrl_options *opts;
 
 	struct page *discard_page;
-- 
2.22.1



* Re: [PATCH V4] nvme-fabrics: reject I/O to offline device
  2020-03-02  3:30 [PATCH V4] nvme-fabrics: reject I/O to offline device Chaitanya Kulkarni
@ 2020-03-07 12:15 ` Hannes Reinecke
  2020-03-10  0:09   ` Chaitanya Kulkarni
  2020-03-31 18:21 ` Alex Lyakas
  1 sibling, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2020-03-07 12:15 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-nvme
  Cc: hch, Victor.Gladkov, sagi, snitzer, james.smart

On 3/2/20 4:30 AM, Chaitanya Kulkarni wrote:
> From: Victor Gladkov <victor.gladkov@kioxia.com>
> 
> Commands get stuck while Host NVMe controller (TCP or RDMA) is in
> reconnect state. NVMe controller enters into reconnect state when it
> loses connection with the target. It tries to reconnect every 10
> seconds (default) until successful reconnection or until reconnect
> time-out is reached. The default reconnect time out is 10 minutes.
> 
> To fix this long delay due to the default timeout we introduce new
> session parameter "fast_io_fail_tmo". The timeout is measured in
> seconds from the controller reconnect, any command beyond that timeout
> is rejected. The new parameter value may be passed during 'connect'.
> The default value of 0 means no timeout (similar to current behavior).
> 
> We add a new controller NVME_CTRL_FAILFAST_EXPIRED and respective
> delayed work that updates the NVME_CTRL_FAILFAST_EXPIRED flag.
> 
> When the controller is entering the CONNECTING state, we schedule the
> delayed_work based on failfast timeout value. If the transition is out
> of CONNECTING, terminate delayed work item and ensure failfast_expired
> is false. If delayed work item expires then set
> "NVME_CTRL_FAILFAST_EXPIRED" flag to true.
> 
> We also update nvmf_fail_nonready_command() and nvme_available_path()
> functions with check the "NVME_CTRL_FAILFAST_EXPIRED" controller flag.
> 
> Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> ---
> Changes From V3:
> 
> 1. BUG_FIX: Fix a bug in nvme_start_failfast_work() amd
>     nvme_stop_failfast_work() when accessing ctrl->opts as it will fail
>     for PCIe transport when is called nvme_change_ctrl_state()
>     from nvme_reset_work(), since we don't set the ctrl->opts for
>     PCIe transport.
> 2. Line wrap in nvme_start_failfast_work(), nvme_parse_option() and
>     for macro NVMF_ALLOWED_OPTS definition.
> 3. Just like all the state change code add a switch for newly
>     added state handling outside of the state machine in
>     nvme_state_change().
> 4. nvme_available_path() add /* fallthru */ after if..break inside
>     switch which is under list_for_each_entry_rcu().
> 5. Align newly added member nvmf_ctrl_options fast_io_fail_tmp.
> 6. Fix the tabs before if in nvme_available_path() and line wrap
>     for the same.
> 7. In nvme_failfast_work() use state != NVME_CTRL_CONNECTING
>     instead of == to get rid of the parentheses and avoid char > 80.
> 8. Get rid of the ";" at the end of the comment for @fast_io_fail_tmp.
> 9. Change the commit log style to match the one we have in the NVMe
>     repo.
> 
> Changes from V2:
> 
> 1. Several coding style and small fixes.
> 2. Fix the comment for NVMF_DEF_FAIL_FAST_TMO.
> 3. Don't call functionality from the state machine.
> 4. Rename fast_fail_tmo -> fast_io_fail_tmo to match SCSI
>     semantics.
> 
> Changes from V1:
> 
> 1. Add a new session parameter called "fast_fail_tmo". The timeout
>     is measured in seconds from the controller reconnect, any command
>     beyond that timeout is rejected. The new parameter value may be
>     passed during 'connect', and its default value is 30 seconds.
>     A value of 0 means no timeout (similar to current behavior).
> 2. Add a controller flag of "failfast_expired".
> 3. Add dedicated delayed_work that update the "failfast_expired"
>     controller flag.
> 4. When entering CONNECTING, schedule the delayed_work based on
>     failfast timeout value. If transition out of CONNECTING, terminate
>     delayed work item and ensure failfast_expired is false.
>     If delayed work item expires: set "failfast_expired" flag to true.
> 5. Update nvmf_fail_nonready_command() with check the
>     "failfast_expired" controller flag.
> ---
>   drivers/nvme/host/core.c      | 53 +++++++++++++++++++++++++++++++++--
>   drivers/nvme/host/fabrics.c   | 32 +++++++++++++++++----
>   drivers/nvme/host/fabrics.h   |  5 ++++
>   drivers/nvme/host/multipath.c |  6 +++-
>   drivers/nvme/host/nvme.h      |  3 ++
>   5 files changed, 90 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 84914223c537..fb7b3ce2a271 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -136,6 +136,39 @@ int nvme_try_sched_reset(struct nvme_ctrl *ctrl)
>   }
>   EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
>   
> +static void nvme_failfast_work(struct work_struct *work)
> +{
> +	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
> +			struct nvme_ctrl, failfast_work);
> +
> +	spin_lock_irq(&ctrl->lock);
> +	if (ctrl->state != NVME_CTRL_CONNECTING)
> +		goto out;
> +
> +	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> +	dev_info(ctrl->device, "failfast expired set for controller %s\n",
> +		ctrl->opts->subsysnqn);
> +	nvme_kick_requeue_lists(ctrl);
> +out:
> +	spin_unlock_irq(&ctrl->lock);
> +}
> +
> +static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl)
> +{
> +	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
> +		return;
> +	schedule_delayed_work(&ctrl->failfast_work,
> +			      ctrl->opts->fast_io_fail_tmo * HZ);
> +}
> +
> +static inline void nvme_stop_failfast_work(struct nvme_ctrl *ctrl)
> +{
> +	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
> +		return;
> +	cancel_delayed_work_sync(&ctrl->failfast_work);
> +	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> +}
> +
>   int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
>   {
>   	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> @@ -395,8 +428,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>   	}
>   
>   	spin_unlock_irqrestore(&ctrl->lock, flags);
> -	if (changed && ctrl->state == NVME_CTRL_LIVE)
> -		nvme_kick_requeue_lists(ctrl);
> +	if (changed) {
> +		switch (ctrl->state) {
> +		case NVME_CTRL_LIVE:
> +			nvme_kick_requeue_lists(ctrl);
> +			if (old_state == NVME_CTRL_CONNECTING)
> +				nvme_stop_failfast_work(ctrl);
> +			break;
This is a bit awkward; nvme_stop_failfast_work() will call
cancel_delayed_work_sync(), which in turn might need to wait for an 
already running workqueue function, which already has the ctrl lock.
Deadlock.
So you need to call 'nvme_stop_failfast_work()' without the ctrl lock held.
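
To spell that out (a hypothetical caller, not a call path that exists
in this patch):

	spin_lock_irq(&ctrl->lock);
	/*
	 * cancel_delayed_work_sync() inside nvme_stop_failfast_work()
	 * waits for a running nvme_failfast_work(), which is itself
	 * spinning on spin_lock_irq(&ctrl->lock) held here, so neither
	 * side can make progress.
	 */
	nvme_stop_failfast_work(ctrl);
	spin_unlock_irq(&ctrl->lock);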

Other than that I'm very much in favour of this approach.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH V4] nvme-fabrics: reject I/O to offline device
  2020-03-07 12:15 ` Hannes Reinecke
@ 2020-03-10  0:09   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 10+ messages in thread
From: Chaitanya Kulkarni @ 2020-03-10  0:09 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme
  Cc: hch, Victor.Gladkov, sagi, snitzer, james.smart

On 03/07/2020 04:15 AM, Hannes Reinecke wrote:
>>    int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
>> >   {
>> >   	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>> >@@ -395,8 +428,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>> >   	}
>> >
>> >   	spin_unlock_irqrestore(&ctrl->lock, flags);
>> >-	if (changed && ctrl->state == NVME_CTRL_LIVE)
>> >-		nvme_kick_requeue_lists(ctrl);
>> >+	if (changed) {
>> >+		switch (ctrl->state) {
>> >+		case NVME_CTRL_LIVE:
>> >+			nvme_kick_requeue_lists(ctrl);
>> >+			if (old_state == NVME_CTRL_CONNECTING)
>> >+				nvme_stop_failfast_work(ctrl);
>> >+			break;
> This is a bit awkward; nvme_stop_failfast_work() will call
> cancel_delayed_work_sync(), which in turn might need to wait for an
> already running workqueue function, which already has the ctrl lock.
> Deadlock.
> So you need to call 'nvme_stop_failfast_work()' without the ctrl lock held.
>

There are two callers of nvme_stop_failfast_work() in host/core.c:
1. nvme_change_ctrl_state():
    Here we unlock ctrl->lock first and only then call
    nvme_stop_failfast_work() (see the condensed view below).
2. nvme_stop_ctrl():
    This one does not take ctrl->lock at all.
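
A condensed view of call site (1), taken from the hunk quoted above:

	spin_unlock_irqrestore(&ctrl->lock, flags);
	if (changed) {
		switch (ctrl->state) {
		case NVME_CTRL_LIVE:
			nvme_kick_requeue_lists(ctrl);
			if (old_state == NVME_CTRL_CONNECTING)
				/* ctrl->lock is already dropped here */
				nvme_stop_failfast_work(ctrl);
			break;
		default:
			break;
		}
	}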

I'll read some more code, but please correct me if I did not
understand the scenario that you have mentioned.

> Other than that I'm very much in favour of this approach.
>
> Cheers,
>
> Hannes



* Re: [PATCH V4] nvme-fabrics: reject I/O to offline device
  2020-03-02  3:30 [PATCH V4] nvme-fabrics: reject I/O to offline device Chaitanya Kulkarni
  2020-03-07 12:15 ` Hannes Reinecke
@ 2020-03-31 18:21 ` Alex Lyakas
  2020-04-02  4:48   ` Chaitanya Kulkarni
       [not found]   ` <bb9948035bea461a8864a8cc88513e1b@kioxia.com>
  1 sibling, 2 replies; 10+ messages in thread
From: Alex Lyakas @ 2020-03-31 18:21 UTC (permalink / raw)
  To: Chaitanya Kulkarni, Sagi Grimberg, Victor.Gladkov
  Cc: hare, hch, snitzer, linux-nvme, james.smart

Hello Chaitanya, Victor,

On Mon, Mar 2, 2020 at 5:30 AM Chaitanya Kulkarni
<chaitanya.kulkarni@wdc.com> wrote:
>
> From: Victor Gladkov <victor.gladkov@kioxia.com>
>
> Commands get stuck while Host NVMe controller (TCP or RDMA) is in
> reconnect state. NVMe controller enters into reconnect state when it
> loses connection with the target. It tries to reconnect every 10
> seconds (default) until successful reconnection or until reconnect
> time-out is reached. The default reconnect time out is 10 minutes.
>
> To fix this long delay due to the default timeout we introduce new
> session parameter "fast_io_fail_tmo". The timeout is measured in
> seconds from the controller reconnect, any command beyond that timeout
> is rejected. The new parameter value may be passed during 'connect'.
> The default value of 0 means no timeout (similar to current behavior).
>
> We add a new controller NVME_CTRL_FAILFAST_EXPIRED and respective
> delayed work that updates the NVME_CTRL_FAILFAST_EXPIRED flag.
>
> When the controller is entering the CONNECTING state, we schedule the
> delayed_work based on failfast timeout value. If the transition is out
> of CONNECTING, terminate delayed work item and ensure failfast_expired
> is false. If delayed work item expires then set
> "NVME_CTRL_FAILFAST_EXPIRED" flag to true.
>
> We also update nvmf_fail_nonready_command() and nvme_available_path()
> functions with check the "NVME_CTRL_FAILFAST_EXPIRED" controller flag.
>
> Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

Thank you for addressing this issue. I asked about this scenario in
June 2019 in http://lists.infradead.org/pipermail/linux-nvme/2019-June/024766.html.

I tested the patch on top of Mellanox OFED 4.7 with kernel 4.14. From
my perspective, the direction is very good. But I think the problem is
only partially addressed.

When a controller enters the CONNECTING state and the fast_io_fail_tmo
expires, all new IOs to this controller are failed immediately. This
is great!

However, in-flight IOs, i.e., those that were issued before the
controller got disconnected, are still stuck until the controller
succeeds in reconnecting, or forever. I believe those IOs need to be
errored out as well after fast_io_fail_tmo expires.

I did some debugging to try to accomplish that. I thought that the
crux is that in-flight IOs are failed and then retried due to the
non-zero nvme_max_retries parameter, and that by this time the request
queue has been quiesced by blk_mq_quiesce_queue(), which is why the
in-flight IOs get stuck. I indeed see that nvme_retry_req() is called
for some IOs. But I also see that, after that, the request queue is
un-quiesced via:
nvme_rdma_error_recovery_work() =>
    nvme_rdma_teardown_io_queues() =>
        nvme_stop_queues()   /* this quiesces the queue */
    nvme_start_queues()      /* this un-quiesces the queue */

Can anybody perhaps give a hint on the approach to error out the
in-flight IOs? I can modify the patch and test.

Thanks,
Alex.


> ---
> Changes From V3:
>
> 1. BUG_FIX: Fix a bug in nvme_start_failfast_work() amd
>    nvme_stop_failfast_work() when accessing ctrl->opts as it will fail
>    for PCIe transport when is called nvme_change_ctrl_state()
>    from nvme_reset_work(), since we don't set the ctrl->opts for
>    PCIe transport.
> 2. Line wrap in nvme_start_failfast_work(), nvme_parse_option() and
>    for macro NVMF_ALLOWED_OPTS definition.
> 3. Just like all the state change code add a switch for newly
>    added state handling outside of the state machine in
>    nvme_state_change().
> 4. nvme_available_path() add /* fallthru */ after if..break inside
>    switch which is under list_for_each_entry_rcu().
> 5. Align newly added member nvmf_ctrl_options fast_io_fail_tmp.
> 6. Fix the tabs before if in nvme_available_path() and line wrap
>    for the same.
> 7. In nvme_failfast_work() use state != NVME_CTRL_CONNECTING
>    instead of == to get rid of the parentheses and avoid char > 80.
> 8. Get rid of the ";" at the end of the comment for @fast_io_fail_tmp.
> 9. Change the commit log style to match the one we have in the NVMe
>    repo.
>
> Changes from V2:
>
> 1. Several coding style and small fixes.
> 2. Fix the comment for NVMF_DEF_FAIL_FAST_TMO.
> 3. Don't call functionality from the state machine.
> 4. Rename fast_fail_tmo -> fast_io_fail_tmo to match SCSI
>    semantics.
>
> Changes from V1:
>
> 1. Add a new session parameter called "fast_fail_tmo". The timeout
>    is measured in seconds from the controller reconnect, any command
>    beyond that timeout is rejected. The new parameter value may be
>    passed during 'connect', and its default value is 30 seconds.
>    A value of 0 means no timeout (similar to current behavior).
> 2. Add a controller flag of "failfast_expired".
> 3. Add dedicated delayed_work that update the "failfast_expired"
>    controller flag.
> 4. When entering CONNECTING, schedule the delayed_work based on
>    failfast timeout value. If transition out of CONNECTING, terminate
>    delayed work item and ensure failfast_expired is false.
>    If delayed work item expires: set "failfast_expired" flag to true.
> 5. Update nvmf_fail_nonready_command() with check the
>    "failfast_expired" controller flag.
> ---
>  drivers/nvme/host/core.c      | 53 +++++++++++++++++++++++++++++++++--
>  drivers/nvme/host/fabrics.c   | 32 +++++++++++++++++----
>  drivers/nvme/host/fabrics.h   |  5 ++++
>  drivers/nvme/host/multipath.c |  6 +++-
>  drivers/nvme/host/nvme.h      |  3 ++
>  5 files changed, 90 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 84914223c537..fb7b3ce2a271 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -136,6 +136,39 @@ int nvme_try_sched_reset(struct nvme_ctrl *ctrl)
>  }
>  EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
>
> +static void nvme_failfast_work(struct work_struct *work)
> +{
> +       struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
> +                       struct nvme_ctrl, failfast_work);
> +
> +       spin_lock_irq(&ctrl->lock);
> +       if (ctrl->state != NVME_CTRL_CONNECTING)
> +               goto out;
> +
> +       set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> +       dev_info(ctrl->device, "failfast expired set for controller %s\n",
> +               ctrl->opts->subsysnqn);
> +       nvme_kick_requeue_lists(ctrl);
> +out:
> +       spin_unlock_irq(&ctrl->lock);
> +}
> +
> +static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl)
> +{
> +       if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
> +               return;
> +       schedule_delayed_work(&ctrl->failfast_work,
> +                             ctrl->opts->fast_io_fail_tmo * HZ);
> +}
> +
> +static inline void nvme_stop_failfast_work(struct nvme_ctrl *ctrl)
> +{
> +       if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
> +               return;
> +       cancel_delayed_work_sync(&ctrl->failfast_work);
> +       clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> +}
> +
>  int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
>  {
>         if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> @@ -395,8 +428,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>         }
>
>         spin_unlock_irqrestore(&ctrl->lock, flags);
> -       if (changed && ctrl->state == NVME_CTRL_LIVE)
> -               nvme_kick_requeue_lists(ctrl);
> +       if (changed) {
> +               switch (ctrl->state) {
> +               case NVME_CTRL_LIVE:
> +                       nvme_kick_requeue_lists(ctrl);
> +                       if (old_state == NVME_CTRL_CONNECTING)
> +                               nvme_stop_failfast_work(ctrl);
> +                       break;
> +               case NVME_CTRL_CONNECTING:
> +                       if (old_state == NVME_CTRL_RESETTING)
> +                               nvme_start_failfast_work(ctrl);
> +                       break;
> +               default:
> +                       break;
> +               }
> +       }
>         return changed;
>  }
>  EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
> @@ -3992,6 +4038,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>  {
>         nvme_mpath_stop(ctrl);
>         nvme_stop_keep_alive(ctrl);
> +       nvme_stop_failfast_work(ctrl);
>         flush_work(&ctrl->async_event_work);
>         cancel_work_sync(&ctrl->fw_act_work);
>  }
> @@ -4056,6 +4103,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
>         int ret;
>
>         ctrl->state = NVME_CTRL_NEW;
> +       clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
>         spin_lock_init(&ctrl->lock);
>         mutex_init(&ctrl->scan_lock);
>         INIT_LIST_HEAD(&ctrl->namespaces);
> @@ -4070,6 +4118,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
>         init_waitqueue_head(&ctrl->state_wq);
>
>         INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
> +       INIT_DELAYED_WORK(&ctrl->failfast_work, nvme_failfast_work);
>         memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
>         ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
>
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 74b8818ac9a1..7f54e7e4f528 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -549,6 +549,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
>  {
>         if (ctrl->state != NVME_CTRL_DELETING &&
>             ctrl->state != NVME_CTRL_DEAD &&
> +           !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
>             !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>                 return BLK_STS_RESOURCE;
>
> @@ -612,6 +613,7 @@ static const match_table_t opt_tokens = {
>         { NVMF_OPT_NR_WRITE_QUEUES,     "nr_write_queues=%d"    },
>         { NVMF_OPT_NR_POLL_QUEUES,      "nr_poll_queues=%d"     },
>         { NVMF_OPT_TOS,                 "tos=%d"                },
> +       { NVMF_OPT_FAIL_FAST_TMO,       "fast_io_fail_tmo=%d"   },
>         { NVMF_OPT_ERR,                 NULL                    }
>  };
>
> @@ -631,6 +633,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>         opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
>         opts->kato = NVME_DEFAULT_KATO;
>         opts->duplicate_connect = false;
> +       opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO;
>         opts->hdr_digest = false;
>         opts->data_digest = false;
>         opts->tos = -1; /* < 0 == use transport default */
> @@ -751,6 +754,19 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>                                 pr_warn("ctrl_loss_tmo < 0 will reconnect forever\n");
>                         ctrl_loss_tmo = token;
>                         break;
> +               case NVMF_OPT_FAIL_FAST_TMO:
> +                       if (match_int(args, &token)) {
> +                               ret = -EINVAL;
> +                               goto out;
> +                       }
> +
> +                       if (token)
> +                               pr_warn("fail_fast_tmo != 0, I/O will fail on"
> +                                       " reconnect controller after %d sec\n",
> +                                       token);
> +
> +                       opts->fast_io_fail_tmo = token;
> +                       break;
>                 case NVMF_OPT_HOSTNQN:
>                         if (opts->host) {
>                                 pr_err("hostnqn already user-assigned: %s\n",
> @@ -881,11 +897,14 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>                 opts->nr_poll_queues = 0;
>                 opts->duplicate_connect = true;
>         }
> -       if (ctrl_loss_tmo < 0)
> +       if (ctrl_loss_tmo < 0) {
>                 opts->max_reconnects = -1;
> -       else
> -               opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
> -                                               opts->reconnect_delay);
> +       } else {
> +               if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
> +                       pr_warn("failfast tmo (%d) larger than controller "
> +                               "loss tmo (%d)\n",
> +                               opts->fast_io_fail_tmo, ctrl_loss_tmo);
> +       }
>
>         if (!opts->host) {
>                 kref_get(&nvmf_default_host->ref);
> @@ -984,8 +1003,9 @@ EXPORT_SYMBOL_GPL(nvmf_free_options);
>  #define NVMF_REQUIRED_OPTS     (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN)
>  #define NVMF_ALLOWED_OPTS      (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
>                                  NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
> -                                NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
> -                                NVMF_OPT_DISABLE_SQFLOW)
> +                                NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT | \
> +                                NVMF_OPT_DISABLE_SQFLOW | \
> +                                NVMF_OPT_FAIL_FAST_TMO)
>
>  static struct nvme_ctrl *
>  nvmf_create_ctrl(struct device *dev, const char *buf)
> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
> index a0ec40ab62ee..a078ddea9429 100644
> --- a/drivers/nvme/host/fabrics.h
> +++ b/drivers/nvme/host/fabrics.h
> @@ -15,6 +15,8 @@
>  #define NVMF_DEF_RECONNECT_DELAY       10
>  /* default to 600 seconds of reconnect attempts before giving up */
>  #define NVMF_DEF_CTRL_LOSS_TMO         600
> +/* default is 0: the fail fast mechanism is disable  */
> +#define NVMF_DEF_FAIL_FAST_TMO         0
>
>  /*
>   * Define a host as seen by the target.  We allocate one at boot, but also
> @@ -56,6 +58,7 @@ enum {
>         NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
>         NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
>         NVMF_OPT_TOS            = 1 << 19,
> +       NVMF_OPT_FAIL_FAST_TMO  = 1 << 20,
>  };
>
>  /**
> @@ -89,6 +92,7 @@ enum {
>   * @nr_write_queues: number of queues for write I/O
>   * @nr_poll_queues: number of queues for polling I/O
>   * @tos: type of service
> + * @fast_io_fail_tmo: Fast I/O fail timeout in seconds
>   */
>  struct nvmf_ctrl_options {
>         unsigned                mask;
> @@ -111,6 +115,7 @@ struct nvmf_ctrl_options {
>         unsigned int            nr_write_queues;
>         unsigned int            nr_poll_queues;
>         int                     tos;
> +       unsigned int            fast_io_fail_tmo;
>  };
>
>  /*
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 797c18337d96..7bccdad52a78 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -281,9 +281,13 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>
>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>                 switch (ns->ctrl->state) {
> +               case NVME_CTRL_CONNECTING:
> +                       if (test_bit(NVME_CTRL_FAILFAST_EXPIRED,
> +                                    &ns->ctrl->flags))
> +                               break;
> +                       /* fallthru */
>                 case NVME_CTRL_LIVE:
>                 case NVME_CTRL_RESETTING:
> -               case NVME_CTRL_CONNECTING:
>                         /* fallthru */
>                         return true;
>                 default:
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 1024fec7914c..b6a199eaac68 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -256,6 +256,7 @@ struct nvme_ctrl {
>         struct work_struct scan_work;
>         struct work_struct async_event_work;
>         struct delayed_work ka_work;
> +       struct delayed_work failfast_work;
>         struct nvme_command ka_cmd;
>         struct work_struct fw_act_work;
>         unsigned long events;
> @@ -289,6 +290,8 @@ struct nvme_ctrl {
>         u16 icdoff;
>         u16 maxcmd;
>         int nr_reconnects;
> +       unsigned long flags;
> +#define NVME_CTRL_FAILFAST_EXPIRED     0
>         struct nvmf_ctrl_options *opts;
>
>         struct page *discard_page;
> --
> 2.22.1
>
>

* Re: [PATCH V4] nvme-fabrics: reject I/O to offline device
  2020-03-31 18:21 ` Alex Lyakas
@ 2020-04-02  4:48   ` Chaitanya Kulkarni
  2020-04-02 17:00     ` Alex Lyakas
       [not found]   ` <bb9948035bea461a8864a8cc88513e1b@kioxia.com>
  1 sibling, 1 reply; 10+ messages in thread
From: Chaitanya Kulkarni @ 2020-04-02  4:48 UTC (permalink / raw)
  To: Alex Lyakas, Sagi Grimberg, Victor.Gladkov
  Cc: hare, hch, snitzer, linux-nvme, james.smart

On 3/31/20 11:21 AM, Alex Lyakas wrote:
> Thank you for addressing this issue. I asked about this scenario in
> June 2019 in http://lists.infradead.org/pipermail/linux-nvme/2019-June/024766.html.
>
> I tested the patch on top of Mellanox OFED 4.7 with kernel 4.14. From
> my perspective, the direction is very good. But I think the problem is
> only partially addressed.
>
> When a controller enters a CONNECTING state and the fast_io_fail_tmo
> expires, all new IOs to this controller are failed immediately. This
> is great!
>
> However, in-flight IOs, i.e., those that were issued before the
> controller got disconnected, are still stuck until the controller
> succeeds to reconnect, or forever. I believe those IOs need to be
> errored out as well after fast_io_fail_tmo expires.
>
> I did some debugging to try to accomplish that. I thought that the
> crux is that in-flight IOs are failed, and retried due to non-zero
> nvme_max_retries parameter. And by this time the request queue is
> quiesced by blk_mq_quiesce_queue(), and that's why in-flight IOs get
> stuck. I indeed see that nvme_retry_req() is called for some IOs. But
> I also see that after that, the request queue is un-quiesced via:
> nvme_rdma_error_recovery_work() =>
>     nvme_rdma_teardown_io_queues() => nvme_stop_queues() // this
> quiesces the queue
>    nvme_start_queues() // this un-quiesces the queue
>
> Can anybody perhaps give a hint on the approach to error out the
> in-flight IOs? I can modify the patch and test.
>
> Thanks,
> Alex.


Thanks for the feedback. Let me see if I can generate the scenario and
have a test for this.

I'll send out the updated patch soon.



* Re: [PATCH V4] nvme-fabrics: reject I/O to offline device
  2020-04-02  4:48   ` Chaitanya Kulkarni
@ 2020-04-02 17:00     ` Alex Lyakas
  0 siblings, 0 replies; 10+ messages in thread
From: Alex Lyakas @ 2020-04-02 17:00 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Sagi Grimberg, snitzer, james.smart, linux-nvme, Victor.Gladkov,
	hare, hch

Hi Chaitanya, Victor,

I believe the following modification to nvme_failfast_work addresses the issue:

static void nvme_failfast_work(struct work_struct *work)
{
    struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
        struct nvme_ctrl, failfast_work);
    bool run_queues = false;

    spin_lock_irq(&ctrl->lock);
    if (ctrl->state != NVME_CTRL_CONNECTING)
        goto out;

    set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
    dev_info(ctrl->device, "failfast expired set for controller %s\n",
        ctrl->opts->subsysnqn);
    nvme_kick_requeue_lists(ctrl);
    run_queues = true;
out:
    spin_unlock_irq(&ctrl->lock);

    if (run_queues) {
        struct nvme_ns *ns = NULL;

        down_read(&ctrl->namespaces_rwsem);
        list_for_each_entry(ns, &ctrl->namespaces, list) {
            blk_mq_run_hw_queues(ns->queue, true/*async*/);
        }
        up_read(&ctrl->namespaces_rwsem);
    }
}

Basically, it calls blk_mq_run_hw_queues() on each namespace queue.
With this change I see that in-flight IO also fails after
fast_io_fail_tmo expires, as expected.

Can you please comment?

Thanks,
Alex.

On Thu, Apr 2, 2020 at 7:48 AM Chaitanya Kulkarni
<Chaitanya.Kulkarni@wdc.com> wrote:
>
> On 3/31/20 11:21 AM, Alex Lyakas wrote:
> > Thank you for addressing this issue. I asked about this scenario in
> > June 2019 in http://lists.infradead.org/pipermail/linux-nvme/2019-June/024766.html.
> >
> > I tested the patch on top of Mellanox OFED 4.7 with kernel 4.14. From
> > my perspective, the direction is very good. But I think the problem is
> > only partially addressed.
> >
> > When a controller enters a CONNECTING state and the fast_io_fail_tmo
> > expires, all new IOs to this controller are failed immediately. This
> > is great!
> >
> > However, in-flight IOs, i.e., those that were issued before the
> > controller got disconnected, are still stuck until the controller
> > succeeds to reconnect, or forever. I believe those IOs need to be
> > errored out as well after fast_io_fail_tmo expires.
> >
> > I did some debugging to try to accomplish that. I thought that the
> > crux is that in-flight IOs are failed, and retried due to non-zero
> > nvme_max_retries parameter. And by this time the request queue is
> > quiesced by blk_mq_quiesce_queue(), and that's why in-flight IOs get
> > stuck. I indeed see that nvme_retry_req() is called for some IOs. But
> > I also see that after that, the request queue is un-quiesced via:
> > nvme_rdma_error_recovery_work() =>
> >     nvme_rdma_teardown_io_queues() => nvme_stop_queues() // this
> > quiesces the queue
> >    nvme_start_queues() // this un-quiesces the queue
> >
> > Can anybody perhaps give a hint on the approach to error out the
> > in-flight IOs? I can modify the patch and test.
> >
> > Thanks,
> > Alex.
>
>
> Thanks for the feedback. Let me see if I can generate the scenario and
> have a test for this.
>
> I'll send out the updated patch soon.
>


* [PATCH V4] nvme-fabrics: reject I/O to offline device
       [not found]   ` <bb9948035bea461a8864a8cc88513e1b@kioxia.com>
@ 2020-06-29 15:09     ` Victor Gladkov
  2020-06-29 19:16       ` Sagi Grimberg
  2020-07-02 14:35       ` Hannes Reinecke
  0 siblings, 2 replies; 10+ messages in thread
From: Victor Gladkov @ 2020-06-29 15:09 UTC (permalink / raw)
  To: linux-nvme; +Cc: Sagi Grimberg

I'd like to remind you that this improvement is still waiting to be committed.

Commands get stuck while the host NVMe controller (TCP or RDMA) is in
the reconnect state. The NVMe controller enters the reconnect state
when it loses the connection with the target. It tries to reconnect
every 10 seconds (default) until a successful reconnection or until
the reconnect time-out is reached. The default reconnect time-out is
10 minutes.

To avoid this long delay caused by the default timeout, we introduce
a new session parameter "fast_io_fail_tmo". The timeout is measured
in seconds from the start of the controller reconnect; any command
issued beyond that timeout is rejected. The new parameter value may
be passed during 'connect'.
The default value of 0 means no timeout (matching the current
behavior).

We add a new controller flag, NVME_CTRL_FAILFAST_EXPIRED, and a
respective delayed work item that sets that flag.

When the controller enters the CONNECTING state, we schedule the
delayed_work based on the failfast timeout value. On a transition
out of CONNECTING, we terminate the delayed work item and ensure
the failfast flag is cleared. If the delayed work item expires, we
set the "NVME_CTRL_FAILFAST_EXPIRED" flag.

We also update the nvmf_fail_nonready_command() and
nvme_available_path() functions to check the
"NVME_CTRL_FAILFAST_EXPIRED" controller flag.

Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

---
Changes From V3:

1. BUG_FIX: Fix a bug in nvme_start_failfast_work() and
    nvme_stop_failfast_work() when accessing ctrl->opts, as it will fail
    for the PCIe transport when nvme_change_ctrl_state() is called
    from nvme_reset_work(), since we don't set ctrl->opts for the
    PCIe transport.
2. Line wrap in nvme_start_failfast_work(), nvme_parse_option() and
    for macro NVMF_ALLOWED_OPTS definition.
3. Just like all the state change code add a switch for newly
    added state handling outside of the state machine in
    nvme_state_change().
4. nvme_available_path() add /* fallthru */ after if..break inside
    switch which is under list_for_each_entry_rcu().
5. Align the newly added nvmf_ctrl_options member fast_io_fail_tmo.
6. Fix the tabs before if in nvme_available_path() and line wrap
    for the same.
7. In nvme_failfast_work() use state != NVME_CTRL_CONNECTING
    instead of == to get rid of the parentheses and avoid char > 80.
8. Get rid of the ";" at the end of the comment for @fast_io_fail_tmo.
9. Change the commit log style to match the one we have in the NVMe
    repo.

Changes from V2:

1. Several coding style and small fixes.
2. Fix the comment for NVMF_DEF_FAIL_FAST_TMO.
3. Don't call functionality from the state machine.
4. Rename fast_fail_tmo -> fast_io_fail_tmo to match SCSI
    semantics.

Changes from V1:

1. Add a new session parameter called "fast_fail_tmo". The timeout
    is measured in seconds from the controller reconnect; any command
    beyond that timeout is rejected. The new parameter value may be
    passed during 'connect', and its default value is 30 seconds.
    A value of 0 means no timeout (similar to current behavior).
2. Add a controller flag of "failfast_expired".
3. Add a dedicated delayed_work item that updates the "failfast_expired"
    controller flag.
4. When entering CONNECTING, schedule the delayed_work based on
    failfast timeout value. If transition out of CONNECTING, terminate
    delayed work item and ensure failfast_expired is false.
    If delayed work item expires: set "failfast_expired" flag to true.
5. Update nvmf_fail_nonready_command() to check the
    "failfast_expired" controller flag.


---
 drivers/nvme/host/core.c      | 51 ++++++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/fabrics.c   | 30 ++++++++++++++++++++-----
 drivers/nvme/host/fabrics.h   |  5 +++++
 drivers/nvme/host/multipath.c |  6 ++++-
 drivers/nvme/host/nvme.h      |  3 +++
 5 files changed, 88 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f3c037f..8010f83 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -137,6 +137,39 @@ int nvme_try_sched_reset(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_try_sched_reset);

+static void nvme_failfast_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
+			struct nvme_ctrl, failfast_work);
+
+	spin_lock_irq(&ctrl->lock);
+	if (ctrl->state != NVME_CTRL_CONNECTING)
+		goto out;
+
+	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+	dev_info(ctrl->device, "failfast expired set for controller %s\n",
+		ctrl->opts->subsysnqn);
+	nvme_kick_requeue_lists(ctrl);
+out:
+	spin_unlock_irq(&ctrl->lock);
+}
+
+static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl)
+{
+	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
+		return;
+	schedule_delayed_work(&ctrl->failfast_work,
+			      ctrl->opts->fast_io_fail_tmo * HZ);
+}
+
+static inline void nvme_stop_failfast_work(struct nvme_ctrl *ctrl)
+{
+	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
+		return;
+	cancel_delayed_work_sync(&ctrl->failfast_work);
+	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+}
+
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
 {
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
@@ -387,8 +420,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	}

 	spin_unlock_irqrestore(&ctrl->lock, flags);
-	if (changed && ctrl->state == NVME_CTRL_LIVE)
+	if (changed) {
+		switch (ctrl->state) {
+		case NVME_CTRL_LIVE:
 			nvme_kick_requeue_lists(ctrl);
+			if (old_state == NVME_CTRL_CONNECTING)
+				nvme_stop_failfast_work(ctrl);
+			break;
+		case NVME_CTRL_CONNECTING:
+			if (old_state == NVME_CTRL_RESETTING)
+				nvme_start_failfast_work(ctrl);
+			break;
+		default:
+			break;
+		}
+	}
 	return changed;
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -4045,6 +4091,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 {
 	nvme_mpath_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
+	nvme_stop_failfast_work(ctrl);
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fw_act_work);
 }
@@ -4111,6 +4158,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	int ret;

 	ctrl->state = NVME_CTRL_NEW;
+	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
 	spin_lock_init(&ctrl->lock);
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
@@ -4125,6 +4173,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	init_waitqueue_head(&ctrl->state_wq);

 	INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
+	INIT_DELAYED_WORK(&ctrl->failfast_work, nvme_failfast_work);
 	memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
 	ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 2a6c819..3a39714 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -549,6 +549,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
 {
 	if (ctrl->state != NVME_CTRL_DELETING &&
 	    ctrl->state != NVME_CTRL_DEAD &&
+	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
 	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
 		return BLK_STS_RESOURCE;

@@ -612,6 +613,7 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 	{ NVMF_OPT_NR_WRITE_QUEUES,	"nr_write_queues=%d"	},
 	{ NVMF_OPT_NR_POLL_QUEUES,	"nr_poll_queues=%d"	},
 	{ NVMF_OPT_TOS,			"tos=%d"		},
+	{ NVMF_OPT_FAIL_FAST_TMO,	"fast_io_fail_tmo=%d"	},
 	{ NVMF_OPT_ERR,			NULL			}
 };

@@ -631,6 +633,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
 	opts->kato = NVME_DEFAULT_KATO;
 	opts->duplicate_connect = false;
+	opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO;
 	opts->hdr_digest = false;
 	opts->data_digest = false;
 	opts->tos = -1; /* < 0 == use transport default */
@@ -751,6 +754,19 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 				pr_warn("ctrl_loss_tmo < 0 will reconnect forever\n");
 			ctrl_loss_tmo = token;
 			break;
+		case NVMF_OPT_FAIL_FAST_TMO:
+			if (match_int(args, &token)) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			if (token)
+				pr_warn("fail_fast_tmo != 0, I/O will fail on"
+					" reconnect controller after %d sec\n",
+					token);
+
+			opts->fast_io_fail_tmo = token;
+			break;
 		case NVMF_OPT_HOSTNQN:
 			if (opts->host) {
 				pr_err("hostnqn already user-assigned: %s\n",
@@ -881,11 +897,14 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		opts->nr_poll_queues = 0;
 		opts->duplicate_connect = true;
 	}
-	if (ctrl_loss_tmo < 0)
+	if (ctrl_loss_tmo < 0) {
 		opts->max_reconnects = -1;
-	else
-		opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
-						opts->reconnect_delay);
+	} else {
+		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
+			pr_warn("failfast tmo (%d) larger than controller "
+				"loss tmo (%d)\n",
+				opts->fast_io_fail_tmo, ctrl_loss_tmo);
+	}

 	if (!opts->host) {
 		kref_get(&nvmf_default_host->ref);
@@ -985,7 +1004,8 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
 #define NVMF_ALLOWED_OPTS	(NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
 				 NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
 				 NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT | \
-				 NVMF_OPT_DISABLE_SQFLOW)
+				 NVMF_OPT_DISABLE_SQFLOW | \
+				 NVMF_OPT_FAIL_FAST_TMO)

 static struct nvme_ctrl *
 nvmf_create_ctrl(struct device *dev, const char *buf)
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index a0ec40a..a078dde 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -15,6 +15,8 @@
 #define NVMF_DEF_RECONNECT_DELAY	10
 /* default to 600 seconds of reconnect attempts before giving up */
 #define NVMF_DEF_CTRL_LOSS_TMO		600
+/* default is 0: the fail fast mechanism is disabled */
+#define NVMF_DEF_FAIL_FAST_TMO		0

 /*
  * Define a host as seen by the target.  We allocate one at boot, but also
@@ -56,6 +58,7 @@ enum {
 	NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
 	NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
 	NVMF_OPT_TOS		= 1 << 19,
+	NVMF_OPT_FAIL_FAST_TMO	= 1 << 20,
 };

 /**
@@ -89,6 +92,7 @@ enum {
  * @nr_write_queues: number of queues for write I/O
  * @nr_poll_queues: number of queues for polling I/O
  * @tos: type of service
+ * @fast_io_fail_tmo: Fast I/O fail timeout in seconds
  */
 struct nvmf_ctrl_options {
 	unsigned		mask;
@@ -111,6 +115,7 @@ struct nvmf_ctrl_options {
 	unsigned int		nr_write_queues;
 	unsigned int		nr_poll_queues;
 	int			tos;
+	unsigned int		fast_io_fail_tmo;
 };

 /*
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 54603bd..4b66504 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -278,9 +278,13 @@ static bool nvme_available_path(struct nvme_ns_head *head)

 	list_for_each_entry_rcu(ns, &head->list, siblings) {
 		switch (ns->ctrl->state) {
+		case NVME_CTRL_CONNECTING:
+			if (test_bit(NVME_CTRL_FAILFAST_EXPIRED,
+				     &ns->ctrl->flags))
+				break;
+			/* fallthru */
 		case NVME_CTRL_LIVE:
 		case NVME_CTRL_RESETTING:
-		case NVME_CTRL_CONNECTING:
 			/* fallthru */
 			return true;
 		default:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2e04a36..9112886 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -256,6 +256,7 @@ struct nvme_ctrl {
 	struct work_struct scan_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
+	struct delayed_work failfast_work;
 	struct nvme_command ka_cmd;
 	struct work_struct fw_act_work;
 	unsigned long events;
@@ -290,6 +291,8 @@ struct nvme_ctrl {
 	u16 icdoff;
 	u16 maxcmd;
 	int nr_reconnects;
+	unsigned long flags;
+#define NVME_CTRL_FAILFAST_EXPIRED	0
 	struct nvmf_ctrl_options *opts;

 	struct page *discard_page;
--
1.8.3.1

* Re: [PATCH V4] nvme-fabrics: reject I/O to offline device
  2020-06-29 15:09     ` Victor Gladkov
@ 2020-06-29 19:16       ` Sagi Grimberg
  2020-06-30 13:20         ` Victor Gladkov
  2020-07-02 14:35       ` Hannes Reinecke
  1 sibling, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2020-06-29 19:16 UTC (permalink / raw)
  To: Victor Gladkov, linux-nvme


> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f3c037f..8010f83 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -137,6 +137,39 @@ int nvme_try_sched_reset(struct nvme_ctrl *ctrl)
>   }
>   EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
> 
> +static void nvme_failfast_work(struct work_struct *work)
> +{
> +	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
> +			struct nvme_ctrl, failfast_work);
> +
> +	spin_lock_irq(&ctrl->lock);
> +	if (ctrl->state != NVME_CTRL_CONNECTING)
> +		goto out;
> +
> +	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> +	dev_info(ctrl->device, "failfast expired set for controller %s\n",
> +		ctrl->opts->subsysnqn);

What does the subsysnqn have to do with anything? Just keep "failfast
expired".

> +	nvme_kick_requeue_lists(ctrl);
> +out:
> +	spin_unlock_irq(&ctrl->lock);

What is the lock protecting against? I don't see why this is
needed.
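
Something like this (untested, minimal sketch): since set_bit() is atomic
and a slightly stale read of ctrl->state is harmless here, the handler
could drop the lock entirely:

	static void nvme_failfast_work(struct work_struct *work)
	{
		struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
				struct nvme_ctrl, failfast_work);

		/* left CONNECTING already, nothing to do */
		if (ctrl->state != NVME_CTRL_CONNECTING)
			return;

		set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
		dev_info(ctrl->device, "failfast expired\n");
		nvme_kick_requeue_lists(ctrl);
	}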

> +}
> +
> +static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl)
> +{
> +	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)

Disabling the timeout should be -1; 0 should mean fail I/O right away.

Maybe also add "off" string as a -1 equivalent.
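
Rough sketch of the parsing with those semantics (this assumes
opts->fast_io_fail_tmo becomes a signed int with -1 as the default, and
that an "off" keyword, if added, is mapped to -1 before match_int()):

	case NVMF_OPT_FAIL_FAST_TMO:
		if (match_int(args, &token)) {
			ret = -EINVAL;
			goto out;
		}
		if (token >= 0)
			pr_warn("I/O will fail on reconnect controller after %d sec\n",
				token);
		/* any negative value means the fast-fail mechanism is disabled */
		opts->fast_io_fail_tmo = token;
		break;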

> +		return;
> +	schedule_delayed_work(&ctrl->failfast_work,
> +			      ctrl->opts->fast_io_fail_tmo * HZ);
> +}
> +
> +static inline void nvme_stop_failfast_work(struct nvme_ctrl *ctrl)
> +{
> +	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
> +		return;
> +	cancel_delayed_work_sync(&ctrl->failfast_work);
> +	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> +}
> +
>   int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
>   {
>   	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> @@ -387,8 +420,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>   	}
> 
>   	spin_unlock_irqrestore(&ctrl->lock, flags);
> -	if (changed && ctrl->state == NVME_CTRL_LIVE)
> +	if (changed) {
> +		switch (ctrl->state) {
> +		case NVME_CTRL_LIVE:
>   			nvme_kick_requeue_lists(ctrl);
> +			if (old_state == NVME_CTRL_CONNECTING)
> +				nvme_stop_failfast_work(ctrl);
> +			break;
> +		case NVME_CTRL_CONNECTING:
> +			if (old_state == NVME_CTRL_RESETTING)
> +				nvme_start_failfast_work(ctrl);
> +			break;
> +		default:
> +			break;
> +		}
> +	}
>   	return changed;
>   }
>   EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
> @@ -4045,6 +4091,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>   {
>   	nvme_mpath_stop(ctrl);
>   	nvme_stop_keep_alive(ctrl);
> +	nvme_stop_failfast_work(ctrl);
>   	flush_work(&ctrl->async_event_work);
>   	cancel_work_sync(&ctrl->fw_act_work);
>   }
> @@ -4111,6 +4158,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
>   	int ret;
> 
>   	ctrl->state = NVME_CTRL_NEW;
> +	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
>   	spin_lock_init(&ctrl->lock);
>   	mutex_init(&ctrl->scan_lock);
>   	INIT_LIST_HEAD(&ctrl->namespaces);
> @@ -4125,6 +4173,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
>   	init_waitqueue_head(&ctrl->state_wq);
> 
>   	INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
> +	INIT_DELAYED_WORK(&ctrl->failfast_work, nvme_failfast_work);
>   	memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
>   	ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
> 
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index 2a6c819..3a39714 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -549,6 +549,7 @@ blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
>   {
>   	if (ctrl->state != NVME_CTRL_DELETING &&
>   	    ctrl->state != NVME_CTRL_DEAD &&
> +	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
>   	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
>   		return BLK_STS_RESOURCE;
> 
> @@ -612,6 +613,7 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>   	{ NVMF_OPT_NR_WRITE_QUEUES,	"nr_write_queues=%d"	},
>   	{ NVMF_OPT_NR_POLL_QUEUES,	"nr_poll_queues=%d"	},
>   	{ NVMF_OPT_TOS,			"tos=%d"		},
> +	{ NVMF_OPT_FAIL_FAST_TMO,	"fast_io_fail_tmo=%d"	},
>   	{ NVMF_OPT_ERR,			NULL			}
>   };
> 
> @@ -631,6 +633,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>   	opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
>   	opts->kato = NVME_DEFAULT_KATO;
>   	opts->duplicate_connect = false;
> +	opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO;
>   	opts->hdr_digest = false;
>   	opts->data_digest = false;
>   	opts->tos = -1; /* < 0 == use transport default */
> @@ -751,6 +754,19 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>   				pr_warn("ctrl_loss_tmo < 0 will reconnect forever\n");
>   			ctrl_loss_tmo = token;
>   			break;
> +		case NVMF_OPT_FAIL_FAST_TMO:
> +			if (match_int(args, &token)) {
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +
> +			if (token)
> +				pr_warn("fail_fast_tmo != 0, I/O will fail on"
> +					" reconnect controller after %d sec\n",
> +					token);
> +
> +			opts->fast_io_fail_tmo = token;
> +			break;

Again, -1 should disable, not 0.

>   		case NVMF_OPT_HOSTNQN:
>   			if (opts->host) {
>   				pr_err("hostnqn already user-assigned: %s\n",
> @@ -881,11 +897,14 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
>   		opts->nr_poll_queues = 0;
>   		opts->duplicate_connect = true;
>   	}
> -	if (ctrl_loss_tmo < 0)
> +	if (ctrl_loss_tmo < 0) {
>   		opts->max_reconnects = -1;
> -	else
> -		opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
> -						opts->reconnect_delay);
> +	} else {
> +		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
> +			pr_warn("failfast tmo (%d) larger than controller "
> +				"loss tmo (%d)\n",
> +				opts->fast_io_fail_tmo, ctrl_loss_tmo);
> +	}

it can be that ctrl_loss_tmo=-1 which means reconnect forever.
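
Also note the else branch above no longer sets opts->max_reconnects.
Untested sketch of how it could look, keeping that computation and only
warning for a non-negative ctrl_loss_tmo (again assuming a signed
fast_io_fail_tmo):

	if (ctrl_loss_tmo < 0) {
		/* ctrl_loss_tmo = -1: reconnect forever, nothing to warn about */
		opts->max_reconnects = -1;
	} else {
		opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
						opts->reconnect_delay);
		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
			pr_warn("failfast tmo (%d) larger than controller loss tmo (%d)\n",
				opts->fast_io_fail_tmo, ctrl_loss_tmo);
	}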

> 
>   	if (!opts->host) {
>   		kref_get(&nvmf_default_host->ref);
> @@ -985,7 +1004,8 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
>   #define NVMF_ALLOWED_OPTS	(NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
>   				 NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
>   				 NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT | \
> -				 NVMF_OPT_DISABLE_SQFLOW)
> +				 NVMF_OPT_DISABLE_SQFLOW | \
> +				 NVMF_OPT_FAIL_FAST_TMO)
> 
>   static struct nvme_ctrl *
>   nvmf_create_ctrl(struct device *dev, const char *buf)
> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
> index a0ec40a..a078dde 100644
> --- a/drivers/nvme/host/fabrics.h
> +++ b/drivers/nvme/host/fabrics.h
> @@ -15,6 +15,8 @@
>   #define NVMF_DEF_RECONNECT_DELAY	10
>   /* default to 600 seconds of reconnect attempts before giving up */
>   #define NVMF_DEF_CTRL_LOSS_TMO		600
> +/* default is 0: the fail fast mechanism is disable  */
> +#define NVMF_DEF_FAIL_FAST_TMO		0
> 
>   /*
>    * Define a host as seen by the target.  We allocate one at boot, but also
> @@ -56,6 +58,7 @@ enum {
>   	NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
>   	NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
>   	NVMF_OPT_TOS		= 1 << 19,
> +	NVMF_OPT_FAIL_FAST_TMO	= 1 << 20,
>   };
> 
>   /**
> @@ -89,6 +92,7 @@ enum {
>    * @nr_write_queues: number of queues for write I/O
>    * @nr_poll_queues: number of queues for polling I/O
>    * @tos: type of service
> + * @fast_io_fail_tmo: Fast I/O fail timeout in seconds
>    */
>   struct nvmf_ctrl_options {
>   	unsigned		mask;
> @@ -111,6 +115,7 @@ struct nvmf_ctrl_options {
>   	unsigned int		nr_write_queues;
>   	unsigned int		nr_poll_queues;
>   	int			tos;
> +	unsigned int		fast_io_fail_tmo;
>   };
> 
>   /*
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 54603bd..4b66504 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -278,9 +278,13 @@ static bool nvme_available_path(struct nvme_ns_head *head)
> 
>   	list_for_each_entry_rcu(ns, &head->list, siblings) {
>   		switch (ns->ctrl->state) {
> +		case NVME_CTRL_CONNECTING:
> +			if (test_bit(NVME_CTRL_FAILFAST_EXPIRED,
> +				     &ns->ctrl->flags))
> +				break;
> +			/* fallthru */

I absolutely don't see why this needs to happen. This setting is
on a controller, so it absolutely should not affect the mpath
device node. So NAK on this.

>   		case NVME_CTRL_LIVE:
>   		case NVME_CTRL_RESETTING:
> -		case NVME_CTRL_CONNECTING:
>   			/* fallthru */
>   			return true;
>   		default:
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 2e04a36..9112886 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -256,6 +256,7 @@ struct nvme_ctrl {
>   	struct work_struct scan_work;
>   	struct work_struct async_event_work;
>   	struct delayed_work ka_work;
> +	struct delayed_work failfast_work;
>   	struct nvme_command ka_cmd;
>   	struct work_struct fw_act_work;
>   	unsigned long events;
> @@ -290,6 +291,8 @@ struct nvme_ctrl {
>   	u16 icdoff;
>   	u16 maxcmd;
>   	int nr_reconnects;
> +	unsigned long flags;
> +#define NVME_CTRL_FAILFAST_EXPIRED	0
>   	struct nvmf_ctrl_options *opts;
> 
>   	struct page *discard_page;
> --
> 1.8.3.1

* RE: [PATCH V4] nvme-fabrics: reject I/O to offline device
  2020-06-29 19:16       ` Sagi Grimberg
@ 2020-06-30 13:20         ` Victor Gladkov
  0 siblings, 0 replies; 10+ messages in thread
From: Victor Gladkov @ 2020-06-30 13:20 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme

Hi Sagi.
Thank you for reviewing.
I'm going to send out a new patch updated according to your comments.
In the meantime, see my comments below.

> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > +	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> > +	dev_info(ctrl->device, "failfast expired set for controller %s\n",
> > +		ctrl->opts->subsysnqn);
> 
> What does the subsysnqn have to do with anything? Just keep "failfast expired".


[VG] I agree with you

> 
> > +	nvme_kick_requeue_lists(ctrl);
> > +out:
> > +	spin_unlock_irq(&ctrl->lock);
> 
> What is the lock protecting against? I don't see why this is needed.
> 

[VG]  You're right.

> > +}
> > +
> > +static inline void nvme_start_failfast_work(struct nvme_ctrl *ctrl) {
> > +	if (!ctrl->opts || ctrl->opts->fast_io_fail_tmo == 0)
> 
> Disabling the timeout should be -1; 0 should mean fail I/O right away.
> 
> Maybe also add "off" string as a -1 equivalent.
> 

[VG] That works for me. I'll add an "off" string option.

> > diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> > index 2a6c819..3a39714 100644
> > --- a/drivers/nvme/host/fabrics.c
> > +++ b/drivers/nvme/host/fabrics.c
> > @@ -631,6 +633,7 @@ static int nvmf_parse_options(struct
> nvmf_ctrl_options *opts,
> >   	opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
> >   	opts->kato = NVME_DEFAULT_KATO;
> >   	opts->duplicate_connect = false;
> > +	opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO;
> >   	opts->hdr_digest = false;
> >   	opts->data_digest = false;
> >   	opts->tos = -1; /* < 0 == use transport default */ @@ -751,6
> > +754,19 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
> >   				pr_warn("ctrl_loss_tmo < 0 will reconnect
> forever\n");
> >   			ctrl_loss_tmo = token;
> >   			break;
> > +		case NVMF_OPT_FAIL_FAST_TMO:
> > +			if (match_int(args, &token)) {
> > +				ret = -EINVAL;
> > +				goto out;
> > +			}
> > +
> > +			if (token)
> > +				pr_warn("fail_fast_tmo != 0, I/O will fail on"
> > +					" reconnect controller after %d sec\n",
> > +					token);
> > +
> > +			opts->fast_io_fail_tmo = token;
> > +			break;
> 
> Again, -1 should disable, not 0.

[VG] You're right.

> 
> >   		case NVMF_OPT_HOSTNQN:
> >   			if (opts->host) {
> >   				pr_err("hostnqn already user-assigned: %s\n", @@ -
> 881,11 +897,14
> > @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
> >   		opts->nr_poll_queues = 0;
> >   		opts->duplicate_connect = true;
> >   	}
> > -	if (ctrl_loss_tmo < 0)
> > +	if (ctrl_loss_tmo < 0) {
> >   		opts->max_reconnects = -1;
> > -	else
> > -		opts->max_reconnects = DIV_ROUND_UP(ctrl_loss_tmo,
> > -						opts->reconnect_delay);
> > +	} else {
> > +		if (ctrl_loss_tmo < opts->fast_io_fail_tmo)
> > +			pr_warn("failfast tmo (%d) larger than controller "
> > +				"loss tmo (%d)\n",
> > +				opts->fast_io_fail_tmo, ctrl_loss_tmo);
> > +	}
> 
> it can be that ctrl_loss_tmo=-1 which means reconnect forever.

 [VG] OK

> > diff --git a/drivers/nvme/host/multipath.c
> > b/drivers/nvme/host/multipath.c index 54603bd..4b66504 100644
> > --- a/drivers/nvme/host/multipath.c
> > +++ b/drivers/nvme/host/multipath.c
> > @@ -278,9 +278,13 @@ static bool nvme_available_path(struct
> > nvme_ns_head *head)
> >
> >   	list_for_each_entry_rcu(ns, &head->list, siblings) {
> >   		switch (ns->ctrl->state) {
> > +		case NVME_CTRL_CONNECTING:
> > +			if (test_bit(NVME_CTRL_FAILFAST_EXPIRED,
> > +				     &ns->ctrl->flags))
> > +				break;
> > +			/* fallthru */
> 
> I absolutely don't see why this needs to happen. This setting is on a controller,
> so it absolutely should not affect the mpath device node. So NAK on this.

[VG] In case multipath is enabled, we need to announce the path as unavailable;
otherwise the command will be requeued but never aborted.
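
For reference, a rough sketch of the submission-side decision this feeds
into, loosely modeled on nvme_ns_head_make_request() (not the exact
kernel code): as long as nvme_available_path() keeps returning true, the
bio is parked on the requeue list instead of being completed with an
error, which is why the CONNECTING + FAILFAST_EXPIRED case has to drop
out of it:

	ns = nvme_find_path(head);
	if (likely(ns)) {
		/* usable path found, submit the bio to it */
		...
	} else if (nvme_available_path(head)) {
		/* a path may come back: park the bio on the requeue list */
		spin_lock_irq(&head->requeue_lock);
		bio_list_add(&head->requeue_list, bio);
		spin_unlock_irq(&head->requeue_lock);
	} else {
		/* no path will ever become available: fail the I/O now */
		bio->bi_status = BLK_STS_IOERR;
		bio_endio(bio);
	}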


* Re: [PATCH V4] nvme-fabrics: reject I/O to offline device
  2020-06-29 15:09     ` Victor Gladkov
  2020-06-29 19:16       ` Sagi Grimberg
@ 2020-07-02 14:35       ` Hannes Reinecke
  1 sibling, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2020-07-02 14:35 UTC (permalink / raw)
  To: Victor Gladkov, linux-nvme; +Cc: Sagi Grimberg

On 6/29/20 5:09 PM, Victor Gladkov wrote:
> I'd like to remind you that this improvement is still awaiting commit.
> 
> Commands get stuck while Host NVMe controller (TCP or RDMA) is in
> reconnect state. NVMe controller enters into reconnect state when it
> loses connection with the target. It tries to reconnect every 10
> seconds (default) until successful reconnection or until reconnect
> time-out is reached. The default reconnect time out is 10 minutes.
> 
> To fix this long delay due to the default timeout we introduce new
> session parameter "fast_io_fail_tmo". The timeout is measured in
> seconds from the controller reconnect, any command beyond that
> timeout is rejected. The new parameter value may be passed during
> 'connect'.
> The default value of 0 means no timeout (similar to current behavior).
> 
> We add a new controller NVME_CTRL_FAILFAST_EXPIRED and respective
> delayed work that updates the NVME_CTRL_FAILFAST_EXPIRED flag.
> 
> When the controller is entering the CONNECTING state, we schedule
> the delayed_work based on failfast timeout value. If the transition
> is out of CONNECTING, terminate delayed work item and ensure
> failfast_expired is false. If delayed work item expires then set
> "NVME_CTRL_FAILFAST_EXPIRED" flag to true.
> 
> We also update nvmf_fail_nonready_command() and
> nvme_available_path() functions with check the
> "NVME_CTRL_FAILFAST_EXPIRED" controller flag.
> 
> Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> 
> ---
> Changes From V3:
> 
> 1. BUG_FIX: Fix a bug in nvme_start_failfast_work() amd
>      nvme_stop_failfast_work() when accessing ctrl->opts as it will fail
>      for PCIe transport when is called nvme_change_ctrl_state()
>      from nvme_reset_work(), since we don't set the ctrl->opts for
>      PCIe transport.
> 2. Line wrap in nvme_start_failfast_work(), nvme_parse_option() and
>      for macro NVMF_ALLOWED_OPTS definition.
> 3. Just like all the state change code add a switch for newly
>      added state handling outside of the state machine in
>      nvme_state_change().
> 4. nvme_available_path() add /* fallthru */ after if..break inside
>      switch which is under list_for_each_entry_rcu().
> 5. Align newly added member nvmf_ctrl_options fast_io_fail_tmp.
> 6. Fix the tabs before if in nvme_available_path() and line wrap
>      for the same.
> 7. In nvme_failfast_work() use state != NVME_CTRL_CONNECTING
>      instead of == to get rid of the parentheses and avoid char > 80.
> 8. Get rid of the ";" at the end of the comment for @fast_io_fail_tmp.
> 9. Change the commit log style to match the one we have in the NVMe
>      repo.
> 
> Changes from V2:
> 
> 1. Several coding style and small fixes.
> 2. Fix the comment for NVMF_DEF_FAIL_FAST_TMO.
> 3. Don't call functionality from the state machine.
> 4. Rename fast_fail_tmo -> fast_io_fail_tmo to match SCSI
>      semantics.
> 
> Changes from V1:
> 
> 1. Add a new session parameter called "fast_fail_tmo". The timeout
>      is measured in seconds from the controller reconnect, any command
>      beyond that timeout is rejected. The new parameter value may be
>      passed during 'connect', and its default value is 30 seconds.
>      A value of 0 means no timeout (similar to current behavior).
> 2. Add a controller flag of "failfast_expired".
> 3. Add dedicated delayed_work that update the "failfast_expired"
>      controller flag.
> 4. When entering CONNECTING, schedule the delayed_work based on
>      failfast timeout value. If transition out of CONNECTING, terminate
>      delayed work item and ensure failfast_expired is false.
>      If delayed work item expires: set "failfast_expired" flag to true.
> 5. Update nvmf_fail_nonready_command() with check the
>      "failfast_expired" controller flag.
> 
> 
> ---
>   drivers/nvme/host/core.c      | 51 ++++++++++++++++++++++++++++++++++++++++++-
>   drivers/nvme/host/fabrics.c   | 30 ++++++++++++++++++++-----
>   drivers/nvme/host/fabrics.h   |  5 +++++
>   drivers/nvme/host/multipath.c |  6 ++++-
>   drivers/nvme/host/nvme.h      |  3 +++
>   5 files changed, 88 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


end of thread, other threads:[~2020-07-02 14:35 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-02  3:30 [PATCH V4] nvme-fabrics: reject I/O to offline device Chaitanya Kulkarni
2020-03-07 12:15 ` Hannes Reinecke
2020-03-10  0:09   ` Chaitanya Kulkarni
2020-03-31 18:21 ` Alex Lyakas
2020-04-02  4:48   ` Chaitanya Kulkarni
2020-04-02 17:00     ` Alex Lyakas
     [not found]   ` <bb9948035bea461a8864a8cc88513e1b@kioxia.com>
2020-06-29 15:09     ` Victor Gladkov
2020-06-29 19:16       ` Sagi Grimberg
2020-06-30 13:20         ` Victor Gladkov
2020-07-02 14:35       ` Hannes Reinecke
