All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] nvme: ANATT handling
@ 2018-06-07  7:35 Hannes Reinecke
  2018-06-07  7:35 ` [PATCH 1/4] nvmet: make ANATT configurable Hannes Reinecke
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07  7:35 UTC (permalink / raw)


Hi all,

this patchset implements ANA transition timeout handling for NVMe-oF.
On the target side it implements a delayed workqueue for updating the
ANA state, and on the host side a simple timer is started when either
a namespace is found to be in ANA state change state or if an I/O
command is returned with a status indicating a non-matching ANA state.
If the timer expires the controller is reset.

Patches are relative to hchs block.git nvme-ana branch.

As usual, comments and reviews are welcome.

Hannes Reinecke (4):
  nvmet: make ANATT configurable
  nvmet: ANA transition timeout handling
  nvme: ANA transition timeout handling
  nvme: start ANATT timer on out-of-order state changes

 drivers/nvme/host/core.c        |  3 ++
 drivers/nvme/host/multipath.c   | 62 +++++++++++++++++++++++++++-----
 drivers/nvme/host/nvme.h        |  6 ++++
 drivers/nvme/target/admin-cmd.c |  2 +-
 drivers/nvme/target/configfs.c  | 39 ++++++++++++++++++++-
 drivers/nvme/target/core.c      | 78 +++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h     | 14 ++++++++
 7 files changed, 193 insertions(+), 11 deletions(-)

-- 
2.12.3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] nvmet: make ANATT configurable
  2018-06-07  7:35 [PATCH 0/4] nvme: ANATT handling Hannes Reinecke
@ 2018-06-07  7:35 ` Hannes Reinecke
  2018-06-07 12:06   ` Christoph Hellwig
  2018-06-07 13:03   ` Sagi Grimberg
  2018-06-07  7:35 ` [PATCH 2/4] nvmet: ANA transition timeout handling Hannes Reinecke
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07  7:35 UTC (permalink / raw)


Signed-off-by: Hannes Reinecke <hare at suse.com>
Signed-off-by: Christoph Hellwig <hch at lst.de>
---
 drivers/nvme/target/admin-cmd.c |  2 +-
 drivers/nvme/target/configfs.c  | 34 ++++++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h     |  2 ++
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 8de264ece826..a49e1435fa54 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -315,7 +315,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	id->msdbd = ctrl->ops->msdbd;
 
 	id->anacap = (1 << 0) | (1 << 1) | (1 << 2) | (1 << 3) | (1 << 4);
-	id->anatt = 10; /* random value */
+	id->anatt = req->port->anatt;
 	id->anagrpmax = cpu_to_le32(NVMET_MAX_ANAGRPS);
 	id->nanagrpid = cpu_to_le32(NVMET_MAX_ANAGRPS);
 
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index c5a770ac2f34..cca4de356818 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -265,6 +265,38 @@ static ssize_t nvmet_addr_trtype_store(struct config_item *item,
 
 CONFIGFS_ATTR(nvmet_, addr_trtype);
 
+static ssize_t nvmet_anatt_show(struct config_item *item,
+		char *page)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+
+	return snprintf(page, PAGE_SIZE, "%d\n", port->anatt);
+}
+
+static ssize_t nvmet_anatt_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_port *port = to_nvmet_port(item);
+	int ret;
+	u8 ana_tt;
+
+	ret = kstrtou8(page, 0, &ana_tt);
+	if (ret || ana_tt == 0)
+		return -EINVAL;
+
+	if (port->enabled) {
+		pr_err("Cannot modify ANA transition timeout while enabled\n");
+		return -EACCES;
+	}
+	down_write(&nvmet_config_sem);
+	port->anatt = ana_tt;
+	up_write(&nvmet_config_sem);
+
+	return count;
+}
+
+CONFIGFS_ATTR(nvmet_, anatt);
+
 /*
  * Namespace structures & file operation functions below
  */
@@ -1033,6 +1065,7 @@ static struct configfs_attribute *nvmet_port_attrs[] = {
 	&nvmet_attr_addr_traddr,
 	&nvmet_attr_addr_trsvcid,
 	&nvmet_attr_addr_trtype,
+	&nvmet_attr_anatt,
 	NULL,
 };
 
@@ -1077,6 +1110,7 @@ static struct config_group *nvmet_ports_make(struct config_group *group,
 	INIT_LIST_HEAD(&port->entry);
 	INIT_LIST_HEAD(&port->subsystems);
 	INIT_LIST_HEAD(&port->referrals);
+	port->anatt = NVMET_DEFAULT_ANATT;
 
 	port->disc_addr.portid = cpu_to_le16(portid);
 	config_group_init_type_name(&port->group, name, &nvmet_port_type);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 1f562732baf3..948209d5f803 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -132,6 +132,7 @@ struct nvmet_port {
 	enum nvme_ana_state		*ana_state;
 	void				*priv;
 	bool				enabled;
+	u8				anatt;
 };
 
 static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
@@ -399,6 +400,7 @@ u32 nvmet_get_log_page_len(struct nvme_command *cmd);
  */
 #define NVMET_MAX_ANAGRPS	128
 #define NVMET_DEFAULT_ANA_GRPID	1
+#define NVMET_DEFAULT_ANATT	10
 
 #define NVMET_KAS		10
 #define NVMET_DISC_KATO		120
-- 
2.12.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/4] nvmet: ANA transition timeout handling
  2018-06-07  7:35 [PATCH 0/4] nvme: ANATT handling Hannes Reinecke
  2018-06-07  7:35 ` [PATCH 1/4] nvmet: make ANATT configurable Hannes Reinecke
@ 2018-06-07  7:35 ` Hannes Reinecke
  2018-06-07 12:07   ` Christoph Hellwig
  2018-06-07 13:12   ` Sagi Grimberg
  2018-06-07  7:35 ` [PATCH 3/4] nvme: " Hannes Reinecke
  2018-06-07  7:35 ` [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes Hannes Reinecke
  3 siblings, 2 replies; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07  7:35 UTC (permalink / raw)


Whenever an ANA state change is triggered the ANA state for that
group ID is set to 'state change' and a delayed work is started,
which will be setting the port to the actual state after anatt
has expired.

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/target/configfs.c |  5 ++-
 drivers/nvme/target/core.c     | 78 ++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h    | 12 +++++++
 3 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index cca4de356818..a39e35399dba 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -967,7 +967,9 @@ static ssize_t nvmet_ana_group_ana_state_store(struct config_item *item,
 	nvmet_ana_chgcnt++;
 	up_write(&nvmet_ana_sem);
 
-	nvmet_port_send_ana_event(grp->port);
+	nvmet_port_ana_state_change(grp->port, grp->grpid,
+				    nvmet_ana_state_names[i].state);
+
 	return count;
 }
 
@@ -1110,6 +1112,7 @@ static struct config_group *nvmet_ports_make(struct config_group *group,
 	INIT_LIST_HEAD(&port->entry);
 	INIT_LIST_HEAD(&port->subsystems);
 	INIT_LIST_HEAD(&port->referrals);
+	INIT_LIST_HEAD(&port->anatt_list);
 	port->anatt = NVMET_DEFAULT_ANATT;
 
 	port->disc_addr.portid = cpu_to_le16(portid);
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 72c573c0a8df..dfd163d98488 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -20,6 +20,7 @@
 
 static const struct nvmet_fabrics_ops *nvmet_transports[NVMF_TRTYPE_MAX];
 static DEFINE_IDA(cntlid_ida);
+static struct workqueue_struct *nvmet_ana_wq;
 
 /*
  * This read/write semaphore is used to synchronize access to configuration
@@ -217,6 +218,65 @@ void nvmet_port_send_ana_event(struct nvmet_port *port)
 	up_read(&nvmet_config_sem);
 }
 
+void nvmet_ana_state_change_work(struct work_struct *work)
+{
+	struct nvmet_ana_change_event *ev = container_of(work,
+		struct nvmet_ana_change_event, work.work);
+	struct nvmet_port *port = ev->port;
+
+	if (!port->enabled)
+		return;
+	mutex_lock(&port->anatt_list_lock);
+	list_del_init(&ev->entry);
+	mutex_unlock(&port->anatt_list_lock);
+
+	down_write(&nvmet_ana_sem);
+	port->ana_state[ev->grpid] = ev->state;
+	nvmet_ana_chgcnt++;
+	up_write(&nvmet_ana_sem);
+	kfree(ev);
+
+	nvmet_port_send_ana_event(port);
+}
+
+void nvmet_port_ana_state_change(struct nvmet_port *port,
+				 u32 grpid, enum nvme_ana_state state)
+{
+	struct nvmet_ana_change_event *ev;
+	enum nvme_ana_state tmp_state = NVME_ANA_CHANGE;
+
+	if (!port->enabled)
+		return;
+
+	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+	if (!ev)
+		tmp_state = state;
+
+	down_write(&nvmet_ana_sem);
+	port->ana_state[grpid] = tmp_state;
+	nvmet_ana_chgcnt++;
+	up_write(&nvmet_ana_sem);
+
+	if (!ev) {
+		nvmet_port_send_ana_event(port);
+		return;
+	}
+
+	INIT_DELAYED_WORK(&ev->work, nvmet_ana_state_change_work);
+	ev->port = port;
+	ev->grpid = grpid;
+	ev->state = state;
+	mutex_lock(&port->anatt_list_lock);
+	list_add_tail(&ev->entry, &port->anatt_list);
+	mutex_unlock(&port->anatt_list_lock);
+	/*
+	 * Reduce the delay by 1 sec to not accidentally trigger an
+	 * ANA transition timeout failure on the host.
+	 */
+	queue_delayed_work(nvmet_ana_wq, &ev->work,
+			   (port->anatt - 1) * HZ);
+}
+
 int nvmet_register_transport(const struct nvmet_fabrics_ops *ops)
 {
 	int ret = 0;
@@ -276,11 +336,22 @@ int nvmet_enable_port(struct nvmet_port *port)
 void nvmet_disable_port(struct nvmet_port *port)
 {
 	const struct nvmet_fabrics_ops *ops;
+	struct nvmet_ana_change_event *ev, *tmp;
 
 	lockdep_assert_held(&nvmet_config_sem);
 
 	port->enabled = false;
 
+	mutex_lock(&port->anatt_list_lock);
+	list_for_each_entry_safe(ev, tmp, &port->anatt_list, entry) {
+		if (ev->port == port) {
+			list_del_init(&ev->entry);
+			cancel_delayed_work_sync(&ev->work);
+			kfree(ev);
+		}
+	}
+	mutex_unlock(&port->anatt_list_lock);
+
 	ops = nvmet_transports[port->disc_addr.trtype];
 	ops->remove_port(port);
 	module_put(ops->owner);
@@ -1162,6 +1233,11 @@ static int __init nvmet_init(void)
 {
 	int error;
 
+	nvmet_ana_wq = alloc_workqueue("nvmet_ana_wq",
+			WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
+	if (!nvmet_ana_wq)
+		return -ENOMEM;
+
 	nvmet_ana_group_enabled[NVMET_DEFAULT_ANA_GRPID] = 1;
 
 	error = nvmet_init_discovery();
@@ -1176,6 +1252,7 @@ static int __init nvmet_init(void)
 out_exit_discovery:
 	nvmet_exit_discovery();
 out:
+	destroy_workqueue(nvmet_ana_wq);
 	return error;
 }
 
@@ -1184,6 +1261,7 @@ static void __exit nvmet_exit(void)
 	nvmet_exit_configfs();
 	nvmet_exit_discovery();
 	ida_destroy(&cntlid_ida);
+	destroy_workqueue(nvmet_ana_wq);
 
 	BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_entry) != 1024);
 	BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_hdr) != 1024);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 948209d5f803..a5a309ca003a 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -129,6 +129,8 @@ struct nvmet_port {
 	struct list_head		referrals;
 	struct config_group		ana_groups_group;
 	struct nvmet_ana_group		ana_group1;
+	struct list_head		anatt_list;
+	struct mutex			anatt_list_lock;
 	enum nvme_ana_state		*ana_state;
 	void				*priv;
 	bool				enabled;
@@ -320,6 +322,14 @@ struct nvmet_async_event {
 	u8			log_page;
 };
 
+struct nvmet_ana_change_event {
+	struct delayed_work	work;
+	struct list_head	entry;
+	struct nvmet_port	*port;
+	enum nvme_ana_state	state;
+	u8			grpid;
+};
+
 u16 nvmet_parse_connect_cmd(struct nvmet_req *req);
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req);
 u16 nvmet_file_parse_io_cmd(struct nvmet_req *req);
@@ -364,6 +374,8 @@ void nvmet_ns_free(struct nvmet_ns *ns);
 
 void nvmet_send_ana_event(struct nvmet_subsys *subsys);
 void nvmet_port_send_ana_event(struct nvmet_port *port);
+void nvmet_port_ana_state_change(struct nvmet_port *port,
+				 u32 grpid, enum nvme_ana_state state);
 
 int nvmet_register_transport(const struct nvmet_fabrics_ops *ops);
 void nvmet_unregister_transport(const struct nvmet_fabrics_ops *ops);
-- 
2.12.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/4] nvme: ANA transition timeout handling
  2018-06-07  7:35 [PATCH 0/4] nvme: ANATT handling Hannes Reinecke
  2018-06-07  7:35 ` [PATCH 1/4] nvmet: make ANATT configurable Hannes Reinecke
  2018-06-07  7:35 ` [PATCH 2/4] nvmet: ANA transition timeout handling Hannes Reinecke
@ 2018-06-07  7:35 ` Hannes Reinecke
  2018-06-07 12:09   ` Christoph Hellwig
  2018-06-07 13:16   ` Sagi Grimberg
  2018-06-07  7:35 ` [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes Hannes Reinecke
  3 siblings, 2 replies; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07  7:35 UTC (permalink / raw)


Add a timer for tracking ANA transition timeout, and reset the
controller once the timeout expires.

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/host/core.c      |  3 +++
 drivers/nvme/host/multipath.c | 34 ++++++++++++++++++++++++++++------
 drivers/nvme/host/nvme.h      |  6 ++++++
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e62de51209b2..3b578ae62cb5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2378,6 +2378,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	ctrl->kas = le16_to_cpu(id->kas);
 	ctrl->max_namespaces = le32_to_cpu(id->mnan);
 	ctrl->anacap = id->anacap;
+	ctrl->anatt = id->anatt;
 	ctrl->nanagrpid = le32_to_cpu(id->nanagrpid);
 	ctrl->anagrpmax = le32_to_cpu(id->anagrpmax);
 
@@ -3025,6 +3026,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
 	ns->queue->queuedata = ns;
 	ns->ctrl = ctrl;
+	timer_setup(&ns->anatt_timer, nvme_anatt_timedout, 0);
 
 	kref_init(&ns->kref);
 	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
@@ -3113,6 +3115,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 			blk_integrity_unregister(ns->disk);
 	}
 
+	del_timer_sync(&ns->anatt_timer);
 	mutex_lock(&ns->ctrl->subsys->lock);
 	nvme_mpath_clear_current_path(ns);
 	list_del_rcu(&ns->siblings);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index d30229a21c97..7a37f5959ad4 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -62,13 +62,12 @@ void nvme_failover_req(struct request *req)
 	 */
 	switch (nvme_req(req)->status & 0x7ff) {
 	case NVME_SC_ANA_TRANSITION:
-		/*
-		 * XXX: We should verify the controller doesn't die on during
-		 * the transition.  But that means we per-group timeout from
-		 * when we first hit the change state, so this won't be
-		 * entirely trivial..
-		 */
 		nvme_update_ana_state(ns, NVME_ANA_CHANGE);
+		if (!timer_pending(&ns->anatt_timer)) {
+			ns->anatt_timer.expires =
+				ns->ctrl->anatt * HZ + jiffies;
+			add_timer(&ns->anatt_timer);
+		}
 		break;
 	case NVME_SC_ANA_PERSISTENT_LOSS:
 		nvme_update_ana_state(ns, NVME_ANA_PERSISTENT_LOSS);
@@ -390,11 +389,34 @@ static int nvme_process_ana_log(struct nvme_ctrl *ctrl, bool groups_only)
 static void nvme_ana_work(struct work_struct *work)
 {
 	struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ana_work);
+	struct nvme_ns *ns;
 
 	nvme_process_ana_log(ctrl, false);
+	down_write(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		enum nvme_ana_state state = nvme_ns_ana_state(ns);
+
+		if (timer_pending(&ns->anatt_timer) &&
+		    state != NVME_ANA_CHANGE)
+			del_timer_sync(&ns->anatt_timer);
+		if (!timer_pending(&ns->anatt_timer) &&
+		    state == NVME_ANA_CHANGE) {
+			ns->anatt_timer.expires =
+				ns->ctrl->anatt * HZ + jiffies;
+			add_timer(&ns->anatt_timer);
+		}
+	}
+	up_write(&ctrl->namespaces_rwsem);
 	nvme_kick_requeue_lists(ctrl);
 }
 
+void nvme_anatt_timedout(struct timer_list *t)
+{
+	struct nvme_ns *ns = from_timer(ns, t, anatt_timer);
+
+	nvme_reset_ctrl(ns->ctrl);
+}
+
 int nvme_configure_ana(struct nvme_ctrl *ctrl)
 {
 	int error;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index be2585576bad..d059b5522993 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -200,6 +200,7 @@ struct nvme_ctrl {
 
 	/* asymmetric namespace access: */
 	u8 anacap;
+	u8 anatt;
 	u32 anagrpmax;
 	u32 nanagrpid;
 	enum nvme_ana_state *ana_state;
@@ -301,6 +302,7 @@ struct nvme_ns {
 	struct nvm_dev *ndev;
 	struct kref kref;
 	struct nvme_ns_head *head;
+	struct timer_list anatt_timer;
 
 	int lba_shift;
 	u16 ms;
@@ -469,6 +471,7 @@ void nvme_mpath_add_disk(struct nvme_ns *ns);
 void nvme_mpath_remove_disk(struct nvme_ns_head *head);
 int nvme_configure_ana(struct nvme_ctrl *ctrl);
 void nvme_deconfigure_ana(struct nvme_ctrl *ctrl);
+void nvme_anatt_timedout(struct timer_list *t);
 
 static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
 {
@@ -531,6 +534,9 @@ static inline int nvme_configure_ana(struct nvme_ctrl *ctrl)
 static inline void nvme_deconfigure_ana(struct nvme_ctrl *ctrl)
 {
 }
+void nvme_anatt_timedout(struct timer_list *t)
+{
+}
 #endif /* CONFIG_NVME_MULTIPATH */
 
 #ifdef CONFIG_NVM
-- 
2.12.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07  7:35 [PATCH 0/4] nvme: ANATT handling Hannes Reinecke
                   ` (2 preceding siblings ...)
  2018-06-07  7:35 ` [PATCH 3/4] nvme: " Hannes Reinecke
@ 2018-06-07  7:35 ` Hannes Reinecke
  2018-06-07 12:11   ` Christoph Hellwig
  2018-06-07 13:20   ` Sagi Grimberg
  3 siblings, 2 replies; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07  7:35 UTC (permalink / raw)


If I/O is terminated due to an invalid ANA state we should be starting
the ANATT timer as we really should have received an ANA AEN signalling
the state change.

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/host/multipath.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 7a37f5959ad4..7f34ae260ca9 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -49,7 +49,10 @@ static void nvme_update_ana_state(struct nvme_ns *ns, enum nvme_ana_state state)
 void nvme_failover_req(struct request *req)
 {
 	struct nvme_ns *ns = req->q->queuedata;
+	struct device *dev = disk_to_dev(ns->disk);
 	unsigned long flags;
+	enum nvme_ana_state ana_state;
+	bool ana_state_changed = false;
 
 	spin_lock_irqsave(&ns->head->requeue_lock, flags);
 	blk_steal_bios(&ns->head->requeue_list, req);
@@ -60,26 +63,45 @@ void nvme_failover_req(struct request *req)
 	 * Reset the controller for any non-ANA error as we don't know what
 	 * caused the error:
 	 */
+	ana_state = READ_ONCE(ns->ctrl->ana_state[ns->anagrpid]);
 	switch (nvme_req(req)->status & 0x7ff) {
 	case NVME_SC_ANA_TRANSITION:
-		nvme_update_ana_state(ns, NVME_ANA_CHANGE);
-		if (!timer_pending(&ns->anatt_timer)) {
-			ns->anatt_timer.expires =
-				ns->ctrl->anatt * HZ + jiffies;
-			add_timer(&ns->anatt_timer);
+		if (ana_state != NVME_ANA_CHANGE) {
+			nvme_update_ana_state(ns, NVME_ANA_CHANGE);
+			ana_state_changed = true;
 		}
 		break;
 	case NVME_SC_ANA_PERSISTENT_LOSS:
-		nvme_update_ana_state(ns, NVME_ANA_PERSISTENT_LOSS);
+		if (ana_state != NVME_ANA_PERSISTENT_LOSS) {
+			nvme_update_ana_state(ns, NVME_ANA_PERSISTENT_LOSS);
+			ana_state_changed = true;
+		}
 		break;
 	case NVME_SC_ANA_INACCESSIBLE:
-		nvme_update_ana_state(ns, NVME_ANA_INACCESSIBLE);
+		if (ana_state != NVME_ANA_INACCESSIBLE) {
+			nvme_update_ana_state(ns, NVME_ANA_INACCESSIBLE);
+			ana_state_changed = true;
+		}
 		break;
 	default:
 		nvme_reset_ctrl(ns->ctrl);
 		break;
 	}
 
+	if (ana_state_changed) {
+		/*
+		 * We should have received an ANA AEN signalling the state
+		 * change. As we haven't (or we wouldn't ever reach here)
+		 * give it a benefit of doubt to start the ANATT timer
+		 * before resetting the controller.
+		 */
+		if (!timer_pending(&ns->anatt_timer)) {
+			ns->anatt_timer.expires =
+				ns->ctrl->anatt * HZ + jiffies;
+			add_timer(&ns->anatt_timer);
+		}
+	}
+
 	kblockd_schedule_work(&ns->head->requeue_work);
 }
 
-- 
2.12.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 1/4] nvmet: make ANATT configurable
  2018-06-07  7:35 ` [PATCH 1/4] nvmet: make ANATT configurable Hannes Reinecke
@ 2018-06-07 12:06   ` Christoph Hellwig
  2018-06-07 12:42     ` Hannes Reinecke
  2018-06-07 13:03   ` Sagi Grimberg
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 12:06 UTC (permalink / raw)


Needs a description.  Especially to explain why you want it
per-port.  In the end the transition timeout is something determined
by the failover backend, which in no way maps to a port.  In fact
it doesn't map to anything, which is why I would suggest to just
make it global.  This still won't cover the case of using different
backed storage types with different failover characteristics, but
at least it ?sn't too confusing.

> +	ret = kstrtou8(page, 0, &ana_tt);
> +	if (ret || ana_tt == 0)
> +		return -EINVAL;

Please propagate the actual error returned by kstrtou8.

> +
> +	if (port->enabled) {
> +		pr_err("Cannot modify ANA transition timeout while enabled\n");
> +		return -EACCES;
> +	}
> +	down_write(&nvmet_config_sem);
> +	port->anatt = ana_tt;
> +	up_write(&nvmet_config_sem);

The enabled check needs to be under nvmet_config_sem to be race free.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/4] nvmet: ANA transition timeout handling
  2018-06-07  7:35 ` [PATCH 2/4] nvmet: ANA transition timeout handling Hannes Reinecke
@ 2018-06-07 12:07   ` Christoph Hellwig
  2018-06-07 13:31     ` Hannes Reinecke
  2018-06-07 13:12   ` Sagi Grimberg
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 12:07 UTC (permalink / raw)


On Thu, Jun 07, 2018@09:35:54AM +0200, Hannes Reinecke wrote:
> Whenever an ANA state change is triggered the ANA state for that
> group ID is set to 'state change' and a delayed work is started,
> which will be setting the port to the actual state after anatt
> has expired.

What is the point?  We want to move to the actual state as soon as we
can, where as soon is defined by a backend.  So we need manual control
over exact states, including change state.

Even for testing it makes much more sense to have entirely manual
control, so that your userspace test framework can trivial inject too
long transition faults.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/4] nvme: ANA transition timeout handling
  2018-06-07  7:35 ` [PATCH 3/4] nvme: " Hannes Reinecke
@ 2018-06-07 12:09   ` Christoph Hellwig
  2018-06-07 12:52     ` Hannes Reinecke
  2018-06-07 13:16   ` Sagi Grimberg
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 12:09 UTC (permalink / raw)


> +	timer_setup(&ns->anatt_timer, nvme_anatt_timedout, 0);

ANATT is per-group, so having a timer in the namespace is wrong.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07  7:35 ` [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes Hannes Reinecke
@ 2018-06-07 12:11   ` Christoph Hellwig
  2018-06-07 12:37     ` Hannes Reinecke
  2018-06-07 13:20   ` Sagi Grimberg
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 12:11 UTC (permalink / raw)


> +	ana_state = READ_ONCE(ns->ctrl->ana_state[ns->anagrpid]);
>  	switch (nvme_req(req)->status & 0x7ff) {
>  	case NVME_SC_ANA_TRANSITION:
> -		nvme_update_ana_state(ns, NVME_ANA_CHANGE);
> -		if (!timer_pending(&ns->anatt_timer)) {
> -			ns->anatt_timer.expires =
> -				ns->ctrl->anatt * HZ + jiffies;
> -			add_timer(&ns->anatt_timer);
> +		if (ana_state != NVME_ANA_CHANGE) {
> +			nvme_update_ana_state(ns, NVME_ANA_CHANGE);
> +			ana_state_changed = true;

Needs an xchg operation to be atomic.

>  	case NVME_SC_ANA_PERSISTENT_LOSS:
> -		nvme_update_ana_state(ns, NVME_ANA_PERSISTENT_LOSS);
> +		if (ana_state != NVME_ANA_PERSISTENT_LOSS) {
> +			nvme_update_ana_state(ns, NVME_ANA_PERSISTENT_LOSS);
> +			ana_state_changed = true;
> +		}
>  		break;
>  	case NVME_SC_ANA_INACCESSIBLE:
> -		nvme_update_ana_state(ns, NVME_ANA_INACCESSIBLE);
> +		if (ana_state != NVME_ANA_INACCESSIBLE) {
> +			nvme_update_ana_state(ns, NVME_ANA_INACCESSIBLE);
> +			ana_state_changed = true;
> +		}
>  		break;

We don't need to start the ANA timer for these states, it should
only happen for the CHANGE state.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07 12:11   ` Christoph Hellwig
@ 2018-06-07 12:37     ` Hannes Reinecke
  2018-06-07 13:10       ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07 12:37 UTC (permalink / raw)


On Thu, 7 Jun 2018 14:11:35 +0200
Christoph Hellwig <hch@lst.de> wrote:

> > +	ana_state = READ_ONCE(ns->ctrl->ana_state[ns->anagrpid]);
> >  	switch (nvme_req(req)->status & 0x7ff) {
> >  	case NVME_SC_ANA_TRANSITION:
> > -		nvme_update_ana_state(ns, NVME_ANA_CHANGE);
> > -		if (!timer_pending(&ns->anatt_timer)) {
> > -			ns->anatt_timer.expires =
> > -				ns->ctrl->anatt * HZ + jiffies;
> > -			add_timer(&ns->anatt_timer);
> > +		if (ana_state != NVME_ANA_CHANGE) {
> > +			nvme_update_ana_state(ns, NVME_ANA_CHANGE);
> > +			ana_state_changed = true;  
> 
> Needs an xchg operation to be atomic.
> 
Okay.

> >  	case NVME_SC_ANA_PERSISTENT_LOSS:
> > -		nvme_update_ana_state(ns,
> > NVME_ANA_PERSISTENT_LOSS);
> > +		if (ana_state != NVME_ANA_PERSISTENT_LOSS) {
> > +			nvme_update_ana_state(ns,
> > NVME_ANA_PERSISTENT_LOSS);
> > +			ana_state_changed = true;
> > +		}
> >  		break;
> >  	case NVME_SC_ANA_INACCESSIBLE:
> > -		nvme_update_ana_state(ns, NVME_ANA_INACCESSIBLE);
> > +		if (ana_state != NVME_ANA_INACCESSIBLE) {
> > +			nvme_update_ana_state(ns,
> > NVME_ANA_INACCESSIBLE);
> > +			ana_state_changed = true;
> > +		}
> >  		break;  
> 
> We don't need to start the ANA timer for these states, it should
> only happen for the CHANGE state.
> 

Hmm. When we're getting these error it means that there is a divergence
between the ANA state on the host and the target.

Which really shouldn't happen, as we _should_ have received an ANA AEN
signalling us this state. As there is a race condition here (ANA AENs
are being send via the admin connection, and it might be that we
haven't gotten around to processing the ANA state yet) we should be
starting the ANATT timer to catch any outstanding AENs.

And if we're not getting an ANA AEN within ANATT the controller should
be considered hosed, and we need to reset it.
Hence the code.

If you disagree, what _should_ be the appropriate handling in this
case, ie we have the most current ANA log page, _and_ get this error
indicating a state mismatch?

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] nvmet: make ANATT configurable
  2018-06-07 12:06   ` Christoph Hellwig
@ 2018-06-07 12:42     ` Hannes Reinecke
  0 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07 12:42 UTC (permalink / raw)


On Thu, 7 Jun 2018 14:06:12 +0200
Christoph Hellwig <hch@lst.de> wrote:

> Needs a description.  Especially to explain why you want it
> per-port.  In the end the transition timeout is something determined
> by the failover backend, which in no way maps to a port.  In fact
> it doesn't map to anything, which is why I would suggest to just
> make it global.  This still won't cover the case of using different
> backed storage types with different failover characteristics, but
> at least it ?sn't too confusing.
> 
Well, I just moved it to the position where the state change actually
happend, and figured it'll give us the most flexibility.

> > +	ret = kstrtou8(page, 0, &ana_tt);
> > +	if (ret || ana_tt == 0)
> > +		return -EINVAL;  
> 
> Please propagate the actual error returned by kstrtou8.
> 
Okay.

> > +
> > +	if (port->enabled) {
> > +		pr_err("Cannot modify ANA transition timeout while
> > enabled\n");
> > +		return -EACCES;
> > +	}
> > +	down_write(&nvmet_config_sem);
> > +	port->anatt = ana_tt;
> > +	up_write(&nvmet_config_sem);  
> 
> The enabled check needs to be under nvmet_config_sem to be race free.
> 

Okay, will be doing so.

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/4] nvme: ANA transition timeout handling
  2018-06-07 12:09   ` Christoph Hellwig
@ 2018-06-07 12:52     ` Hannes Reinecke
  2018-06-07 13:11       ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07 12:52 UTC (permalink / raw)


On Thu, 7 Jun 2018 14:09:41 +0200
Christoph Hellwig <hch@lst.de> wrote:

> > +	timer_setup(&ns->anatt_timer, nvme_anatt_timedout, 0);  
> 
> ANATT is per-group, so having a timer in the namespace is wrong.
> 

But would requiring us to create a separate ana group structure on the
host. Could be looking into that if you insist, but this implementation
is way easier.

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] nvmet: make ANATT configurable
  2018-06-07  7:35 ` [PATCH 1/4] nvmet: make ANATT configurable Hannes Reinecke
  2018-06-07 12:06   ` Christoph Hellwig
@ 2018-06-07 13:03   ` Sagi Grimberg
  1 sibling, 0 replies; 26+ messages in thread
From: Sagi Grimberg @ 2018-06-07 13:03 UTC (permalink / raw)


Usually stuff are implemented before userspace knobs
are exposed for them, maybe move this patch to the
back?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07 12:37     ` Hannes Reinecke
@ 2018-06-07 13:10       ` Christoph Hellwig
  2018-06-07 13:20         ` Hannes Reinecke
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 13:10 UTC (permalink / raw)


On Thu, Jun 07, 2018@02:37:50PM +0200, Hannes Reinecke wrote:
> > We don't need to start the ANA timer for these states, it should
> > only happen for the CHANGE state.
> > 
> 
> Hmm. When we're getting these error it means that there is a divergence
> between the ANA state on the host and the target.
> 
> Which really shouldn't happen, as we _should_ have received an ANA AEN
> signalling us this state. As there is a race condition here (ANA AENs
> are being send via the admin connection, and it might be that we
> haven't gotten around to processing the ANA state yet) we should be
> starting the ANATT timer to catch any outstanding AENs.

We should receive an AEN, yes.  But there is absolutely no ordering
guarantees either in the ANA spec, or in NVMe in general for multiple
queues (admin vs I/O).  But that was intentional in the specification,
and has nothing to do with ANATT at all.

> And if we're not getting an ANA AEN within ANATT the controller should
> be considered hosed, and we need to reset it.

I can't find anything in the spec saying that.

> If you disagree, what _should_ be the appropriate handling in this
> case, ie we have the most current ANA log page, _and_ get this error
> indicating a state mismatch?

Update our in-memory state ASAP, and then just way for the AEN to
happen eventually.  As the current code does.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/4] nvme: ANA transition timeout handling
  2018-06-07 12:52     ` Hannes Reinecke
@ 2018-06-07 13:11       ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 13:11 UTC (permalink / raw)


On Thu, Jun 07, 2018@02:52:59PM +0200, Hannes Reinecke wrote:
> On Thu, 7 Jun 2018 14:09:41 +0200
> Christoph Hellwig <hch@lst.de> wrote:
> 
> > > +	timer_setup(&ns->anatt_timer, nvme_anatt_timedout, 0);  
> > 
> > ANATT is per-group, so having a timer in the namespace is wrong.
> > 
> 
> But would requiring us to create a separate ana group structure on the
> host. Could be looking into that if you insist, but this implementation
> is way easier.

For now just add an array of timers, similar to ctrl->ana_state.
Eventually we might need an actual ana_group structure if the two arrays
become too sparse in practice, but we can defer that for now.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/4] nvmet: ANA transition timeout handling
  2018-06-07  7:35 ` [PATCH 2/4] nvmet: ANA transition timeout handling Hannes Reinecke
  2018-06-07 12:07   ` Christoph Hellwig
@ 2018-06-07 13:12   ` Sagi Grimberg
  1 sibling, 0 replies; 26+ messages in thread
From: Sagi Grimberg @ 2018-06-07 13:12 UTC (permalink / raw)



> +	INIT_DELAYED_WORK(&ev->work, nvmet_ana_state_change_work);
> +	ev->port = port;
> +	ev->grpid = grpid;
> +	ev->state = state;
> +	mutex_lock(&port->anatt_list_lock);
> +	list_add_tail(&ev->entry, &port->anatt_list);
> +	mutex_unlock(&port->anatt_list_lock);
> +	/*
> +	 * Reduce the delay by 1 sec to not accidentally trigger an
> +	 * ANA transition timeout failure on the host.
> +	 */
> +	queue_delayed_work(nvmet_ana_wq, &ev->work,
> +			   (port->anatt - 1) * HZ);

hmm, what is that giving us?

> +}
> +
>   int nvmet_register_transport(const struct nvmet_fabrics_ops *ops)
>   {
>   	int ret = 0;
> @@ -276,11 +336,22 @@ int nvmet_enable_port(struct nvmet_port *port)
>   void nvmet_disable_port(struct nvmet_port *port)
>   {
>   	const struct nvmet_fabrics_ops *ops;
> +	struct nvmet_ana_change_event *ev, *tmp;
>   
>   	lockdep_assert_held(&nvmet_config_sem);
>   
>   	port->enabled = false;
>   
> +	mutex_lock(&port->anatt_list_lock);
> +	list_for_each_entry_safe(ev, tmp, &port->anatt_list, entry) {
> +		if (ev->port == port) {
> +			list_del_init(&ev->entry);
> +			cancel_delayed_work_sync(&ev->work);
> +			kfree(ev);

I'd splice to a local list and cancel+free outside the mutex.

> +		}
> +	}
> +	mutex_unlock(&port->anatt_list_lock);
> +
>   	ops = nvmet_transports[port->disc_addr.trtype];
>   	ops->remove_port(port);
>   	module_put(ops->owner);
> @@ -1162,6 +1233,11 @@ static int __init nvmet_init(void)
>   {
>   	int error;
>   
> +	nvmet_ana_wq = alloc_workqueue("nvmet_ana_wq",
> +			WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
> +	if (!nvmet_ana_wq)
> +		return -ENOMEM;
> +

Why a dedicated workqueue? and why a mem reclaim one?

whats wrong with system_wq/system_unbound_wq/system_long_wq?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/4] nvme: ANA transition timeout handling
  2018-06-07  7:35 ` [PATCH 3/4] nvme: " Hannes Reinecke
  2018-06-07 12:09   ` Christoph Hellwig
@ 2018-06-07 13:16   ` Sagi Grimberg
  2018-06-07 13:26     ` Christoph Hellwig
  1 sibling, 1 reply; 26+ messages in thread
From: Sagi Grimberg @ 2018-06-07 13:16 UTC (permalink / raw)




On 06/07/2018 10:35 AM, Hannes Reinecke wrote:
> Add a timer for tracking ANA transition timeout, and reset the
> controller once the timeout expires.
> 
> Signed-off-by: Hannes Reinecke <hare at suse.com>
> ---
>   drivers/nvme/host/core.c      |  3 +++
>   drivers/nvme/host/multipath.c | 34 ++++++++++++++++++++++++++++------
>   drivers/nvme/host/nvme.h      |  6 ++++++
>   3 files changed, 37 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index e62de51209b2..3b578ae62cb5 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2378,6 +2378,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
>   	ctrl->kas = le16_to_cpu(id->kas);
>   	ctrl->max_namespaces = le32_to_cpu(id->mnan);
>   	ctrl->anacap = id->anacap;
> +	ctrl->anatt = id->anatt;
>   	ctrl->nanagrpid = le32_to_cpu(id->nanagrpid);
>   	ctrl->anagrpmax = le32_to_cpu(id->anagrpmax);
>   
> @@ -3025,6 +3026,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>   	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
>   	ns->queue->queuedata = ns;
>   	ns->ctrl = ctrl;
> +	timer_setup(&ns->anatt_timer, nvme_anatt_timedout, 0);
>   
>   	kref_init(&ns->kref);
>   	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
> @@ -3113,6 +3115,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>   			blk_integrity_unregister(ns->disk);
>   	}
>   
> +	del_timer_sync(&ns->anatt_timer);
>   	mutex_lock(&ns->ctrl->subsys->lock);
>   	nvme_mpath_clear_current_path(ns);
>   	list_del_rcu(&ns->siblings);
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index d30229a21c97..7a37f5959ad4 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -62,13 +62,12 @@ void nvme_failover_req(struct request *req)
>   	 */
>   	switch (nvme_req(req)->status & 0x7ff) {
>   	case NVME_SC_ANA_TRANSITION:
> -		/*
> -		 * XXX: We should verify the controller doesn't die on during
> -		 * the transition.  But that means we per-group timeout from
> -		 * when we first hit the change state, so this won't be
> -		 * entirely trivial..
> -		 */
>   		nvme_update_ana_state(ns, NVME_ANA_CHANGE);
> +		if (!timer_pending(&ns->anatt_timer)) {
> +			ns->anatt_timer.expires =
> +				ns->ctrl->anatt * HZ + jiffies;
> +			add_timer(&ns->anatt_timer);
> +		}

Shouldn't this del and re-add the timer if its pending?

>   		break;
>   	case NVME_SC_ANA_PERSISTENT_LOSS:
>   		nvme_update_ana_state(ns, NVME_ANA_PERSISTENT_LOSS);
> @@ -390,11 +389,34 @@ static int nvme_process_ana_log(struct nvme_ctrl *ctrl, bool groups_only)
>   static void nvme_ana_work(struct work_struct *work)
>   {
>   	struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ana_work);
> +	struct nvme_ns *ns;
>   
>   	nvme_process_ana_log(ctrl, false);
> +	down_write(&ctrl->namespaces_rwsem);
> +	list_for_each_entry(ns, &ctrl->namespaces, list) {
> +		enum nvme_ana_state state = nvme_ns_ana_state(ns);
> +
> +		if (timer_pending(&ns->anatt_timer) &&
> +		    state != NVME_ANA_CHANGE)
> +			del_timer_sync(&ns->anatt_timer);
> +		if (!timer_pending(&ns->anatt_timer) &&
> +		    state == NVME_ANA_CHANGE) {
> +			ns->anatt_timer.expires =
> +				ns->ctrl->anatt * HZ + jiffies;
> +			add_timer(&ns->anatt_timer);
> +		}

Called from couple of sites, can move to helper?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07 13:10       ` Christoph Hellwig
@ 2018-06-07 13:20         ` Hannes Reinecke
  2018-06-07 13:46           ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07 13:20 UTC (permalink / raw)


On Thu, 7 Jun 2018 15:10:26 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Jun 07, 2018@02:37:50PM +0200, Hannes Reinecke wrote:
> > > We don't need to start the ANA timer for these states, it should
> > > only happen for the CHANGE state.
> > >   
> > 
> > Hmm. When we're getting these error it means that there is a
> > divergence between the ANA state on the host and the target.
> > 
> > Which really shouldn't happen, as we _should_ have received an ANA
> > AEN signalling us this state. As there is a race condition here
> > (ANA AENs are being send via the admin connection, and it might be
> > that we haven't gotten around to processing the ANA state yet) we
> > should be starting the ANATT timer to catch any outstanding AENs.  
> 
> We should receive an AEN, yes.  But there is absolutely no ordering
> guarantees either in the ANA spec, or in NVMe in general for multiple
> queues (admin vs I/O).  But that was intentional in the specification,
> and has nothing to do with ANATT at all.
> 
Precisely.
Looking closer, there are not even guarantees within which time any
target should be sending the AEN.
So we have to start somewhere when defining a sensible timeframe during
which we should have received an ANA AEN. And the one timer we already
have is ANATT, so it looked logical to use that.

Especially due to this paragraph:
> If no controllers are reporting ANA Optimized state or ANA
> Non-Optimized state, then a transition may be occurring such that a
> controller reporting the Inaccessible state may become accessible and
> the host should retry the command on the controller reporting
> Inaccessible state for at least ANATT seconds (refer to Figure 109).

which seems to imply to me that we can have an implicit transition on
the target while reporting the inaccessible state error, and the use of
ANATT here is indeed applicable.

> > And if we're not getting an ANA AEN within ANATT the controller
> > should be considered hosed, and we need to reset it.  
> 
> I can't find anything in the spec saying that.
> 
No, surely not; any error recovery action are implementation dependent.
But the only error recovery we have is controller reset, so ...

> > If you disagree, what _should_ be the appropriate handling in this
> > case, ie we have the most current ANA log page, _and_ get this error
> > indicating a state mismatch?  
> 
> Update our in-memory state ASAP, and then just way for the AEN to
> happen eventually.  As the current code does.
> 

This code is essentially error handling.
Yes, I do agree that the spec says it shouldn't happen.
But this doesn't mean it _can't_ happen (lost frames, anyone?), so we
need to prepare for it.

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07  7:35 ` [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes Hannes Reinecke
  2018-06-07 12:11   ` Christoph Hellwig
@ 2018-06-07 13:20   ` Sagi Grimberg
  1 sibling, 0 replies; 26+ messages in thread
From: Sagi Grimberg @ 2018-06-07 13:20 UTC (permalink / raw)




On 06/07/2018 10:35 AM, Hannes Reinecke wrote:
> If I/O is terminated due to an invalid ANA state we should be starting
> the ANATT timer as we really should have received an ANA AEN signalling
> the state change.
> 
> Signed-off-by: Hannes Reinecke <hare at suse.com>
> ---
>   drivers/nvme/host/multipath.c | 36 +++++++++++++++++++++++++++++-------
>   1 file changed, 29 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 7a37f5959ad4..7f34ae260ca9 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -49,7 +49,10 @@ static void nvme_update_ana_state(struct nvme_ns *ns, enum nvme_ana_state state)
>   void nvme_failover_req(struct request *req)
>   {
>   	struct nvme_ns *ns = req->q->queuedata;
> +	struct device *dev = disk_to_dev(ns->disk);
>   	unsigned long flags;
> +	enum nvme_ana_state ana_state;
> +	bool ana_state_changed = false;
>   
>   	spin_lock_irqsave(&ns->head->requeue_lock, flags);
>   	blk_steal_bios(&ns->head->requeue_list, req);
> @@ -60,26 +63,45 @@ void nvme_failover_req(struct request *req)
>   	 * Reset the controller for any non-ANA error as we don't know what
>   	 * caused the error:
>   	 */
> +	ana_state = READ_ONCE(ns->ctrl->ana_state[ns->anagrpid]);
>   	switch (nvme_req(req)->status & 0x7ff) {
>   	case NVME_SC_ANA_TRANSITION:
> -		nvme_update_ana_state(ns, NVME_ANA_CHANGE);
> -		if (!timer_pending(&ns->anatt_timer)) {
> -			ns->anatt_timer.expires =
> -				ns->ctrl->anatt * HZ + jiffies;
> -			add_timer(&ns->anatt_timer);

Wasn't this added in the previous patch? its a bit confusing... can you
re-arrange the patches differently? maybe squash?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/4] nvme: ANA transition timeout handling
  2018-06-07 13:16   ` Sagi Grimberg
@ 2018-06-07 13:26     ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 13:26 UTC (permalink / raw)


On Thu, Jun 07, 2018@04:16:09PM +0300, Sagi Grimberg wrote:
>> -		/*
>> -		 * XXX: We should verify the controller doesn't die on during
>> -		 * the transition.  But that means we per-group timeout from
>> -		 * when we first hit the change state, so this won't be
>> -		 * entirely trivial..
>> -		 */
>>   		nvme_update_ana_state(ns, NVME_ANA_CHANGE);
>> +		if (!timer_pending(&ns->anatt_timer)) {
>> +			ns->anatt_timer.expires =
>> +				ns->ctrl->anatt * HZ + jiffies;
>> +			add_timer(&ns->anatt_timer);
>> +		}
>
> Shouldn't this del and re-add the timer if its pending?

Basically what we need to do is an xchg on the state variable,
at which point it can't ever be pending.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/4] nvmet: ANA transition timeout handling
  2018-06-07 12:07   ` Christoph Hellwig
@ 2018-06-07 13:31     ` Hannes Reinecke
  2018-06-07 13:41       ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07 13:31 UTC (permalink / raw)


On Thu, 7 Jun 2018 14:07:46 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Jun 07, 2018@09:35:54AM +0200, Hannes Reinecke wrote:
> > Whenever an ANA state change is triggered the ANA state for that
> > group ID is set to 'state change' and a delayed work is started,
> > which will be setting the port to the actual state after anatt
> > has expired.  
> 
> What is the point?  We want to move to the actual state as soon as we
> can, where as soon is defined by a backend.  So we need manual control
> over exact states, including change state.
> 
> Even for testing it makes much more sense to have entirely manual
> control, so that your userspace test framework can trivial inject too
> long transition faults.
> 
But we would _require_ userspace interaction to facilitate correct
operation. And with a purely manual configuration on the target any
ANATT value becomes meaningless as we have no way of enforcing it.

And with that patch I don't need any userspace framework; just updating
the state is enough. And I even can set ANATT to 1 to have the state
update done directly.

Hannes

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/4] nvmet: ANA transition timeout handling
  2018-06-07 13:31     ` Hannes Reinecke
@ 2018-06-07 13:41       ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 13:41 UTC (permalink / raw)


On Thu, Jun 07, 2018@03:31:55PM +0200, Hannes Reinecke wrote:
> But we would _require_ userspace interaction to facilitate correct
> operation. And with a purely manual configuration on the target any
> ANATT value becomes meaningless as we have no way of enforcing it.

?  The whole change state is entirely optional.  For 90% of the
users you'll never enter it.  And when you need it, we need to
control when to enter and exit it.  The timer is of absolutely
no use.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07 13:20         ` Hannes Reinecke
@ 2018-06-07 13:46           ` Christoph Hellwig
  2018-06-07 14:01             ` Hannes Reinecke
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 13:46 UTC (permalink / raw)


On Thu, Jun 07, 2018@03:20:36PM +0200, Hannes Reinecke wrote:
> Precisely.
> Looking closer, there are not even guarantees within which time any
> target should be sending the AEN.
> So we have to start somewhere when defining a sensible timeframe during
> which we should have received an ANA AEN. And the one timer we already
> have is ANATT, so it looked logical to use that.

ANATT has no relevance here.  And more importantly not sending the
AEN until some time later is not going to cause us any grave
consequence - we already notice the non-live states by the status,
so we don't send any I/O.

> Especially due to this paragraph:
> > If no controllers are reporting ANA Optimized state or ANA
> > Non-Optimized state, then a transition may be occurring such that a
> > controller reporting the Inaccessible state may become accessible and
> > the host should retry the command on the controller reporting
> > Inaccessible state for at least ANATT seconds (refer to Figure 109).
> 
> which seems to imply to me that we can have an implicit transition on
> the target while reporting the inaccessible state error, and the use of
> ANATT here is indeed applicable.

But we'll still get the AEN.  The whole point is the above is that
if we had a queue_if_no_path=0 mode we should not give up before
ANATT.

> This code is essentially error handling.

Chances are this code is going to add more errors than it handles,
as it doesn't follow the spec.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07 13:46           ` Christoph Hellwig
@ 2018-06-07 14:01             ` Hannes Reinecke
  2018-06-07 14:22               ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Hannes Reinecke @ 2018-06-07 14:01 UTC (permalink / raw)


On Thu, 7 Jun 2018 15:46:12 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Jun 07, 2018@03:20:36PM +0200, Hannes Reinecke wrote:
[ .. ]
> 
> > Especially due to this paragraph:  
> > > If no controllers are reporting ANA Optimized state or ANA
> > > Non-Optimized state, then a transition may be occurring such that
> > > a controller reporting the Inaccessible state may become
> > > accessible and the host should retry the command on the
> > > controller reporting Inaccessible state for at least ANATT
> > > seconds (refer to Figure 109).  
> > 
> > which seems to imply to me that we can have an implicit transition
> > on the target while reporting the inaccessible state error, and the
> > use of ANATT here is indeed applicable.  
> 
> But we'll still get the AEN.  The whole point is the above is that
> if we had a queue_if_no_path=0 mode we should not give up before
> ANATT.
> 
There is no guarantee that we will get the AEN.
Yes, the spec says we should, but it's just a single message which
might easily be lost if the corresponding frame is dropped.
Think of FC, which supposedly is a lossless fabric, yet a sizeable
piece of the spec deals with recovery from lost frames.
It _will_ happen.

To rephrase my problem statement:

What is the appropriate action if our copy of the ANA log page
indicates an optimized or non-optimized ANA state, but we're getting an
I/O error back indicating an inaccessible or persistent-loss ANA state?
If this is a simple active/passive scenario we will have set all paths
to inactive, leaving us with no paths to sent I/O to.

We _could_ try to implement the last-resort handling of sending I/Os to
the inaccessible paths (as per ANA base spec), but as we're not having
a round-robin selector we're ending up always hitting the first
inaccessible path. If that's the 'true' inaccessible one we will never
be able to send I/O even though the target is fully operable.

With this patch we detect this situation, and reset the controller
forcing us to re-read the ANA log page. After which we can do normal
I/O again.

If you have a better solution I'm all ears.

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07 14:01             ` Hannes Reinecke
@ 2018-06-07 14:22               ` Christoph Hellwig
  2018-06-07 15:20                 ` Sagi Grimberg
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2018-06-07 14:22 UTC (permalink / raw)


On Thu, Jun 07, 2018@04:01:52PM +0200, Hannes Reinecke wrote:
> There is no guarantee that we will get the AEN.
> Yes, the spec says we should, but it's just a single message which
> might easily be lost if the corresponding frame is dropped.
> Think of FC, which supposedly is a lossless fabric, yet a sizeable
> piece of the spec deals with recovery from lost frames.
> It _will_ happen.

NVMe requires a reliable transport.  The Fabrics spec very explicitly
states that, and PCIe provides the gurantees as well.

If we lose random admin command replies we have much, much deeper
problems.  E.g. if we lose the AEN reply we would never get any other
AEN ever because we are not going to resubmit the command.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes
  2018-06-07 14:22               ` Christoph Hellwig
@ 2018-06-07 15:20                 ` Sagi Grimberg
  0 siblings, 0 replies; 26+ messages in thread
From: Sagi Grimberg @ 2018-06-07 15:20 UTC (permalink / raw)



>> There is no guarantee that we will get the AEN.
>> Yes, the spec says we should, but it's just a single message which
>> might easily be lost if the corresponding frame is dropped.
>> Think of FC, which supposedly is a lossless fabric, yet a sizeable
>> piece of the spec deals with recovery from lost frames.
>> It _will_ happen.
> 
> NVMe requires a reliable transport.  The Fabrics spec very explicitly
> states that, and PCIe provides the gurantees as well.
> 
> If we lose random admin command replies we have much, much deeper
> problems.  E.g. if we lose the AEN reply we would never get any other
> AEN ever because we are not going to resubmit the command.

I agree here, if the wire is AEN is not getting through, the transport
is responsible that nothing else makes it through either (reliable,
in-order delivery).

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2018-06-07 15:20 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-07  7:35 [PATCH 0/4] nvme: ANATT handling Hannes Reinecke
2018-06-07  7:35 ` [PATCH 1/4] nvmet: make ANATT configurable Hannes Reinecke
2018-06-07 12:06   ` Christoph Hellwig
2018-06-07 12:42     ` Hannes Reinecke
2018-06-07 13:03   ` Sagi Grimberg
2018-06-07  7:35 ` [PATCH 2/4] nvmet: ANA transition timeout handling Hannes Reinecke
2018-06-07 12:07   ` Christoph Hellwig
2018-06-07 13:31     ` Hannes Reinecke
2018-06-07 13:41       ` Christoph Hellwig
2018-06-07 13:12   ` Sagi Grimberg
2018-06-07  7:35 ` [PATCH 3/4] nvme: " Hannes Reinecke
2018-06-07 12:09   ` Christoph Hellwig
2018-06-07 12:52     ` Hannes Reinecke
2018-06-07 13:11       ` Christoph Hellwig
2018-06-07 13:16   ` Sagi Grimberg
2018-06-07 13:26     ` Christoph Hellwig
2018-06-07  7:35 ` [PATCH 4/4] nvme: start ANATT timer on out-of-order state changes Hannes Reinecke
2018-06-07 12:11   ` Christoph Hellwig
2018-06-07 12:37     ` Hannes Reinecke
2018-06-07 13:10       ` Christoph Hellwig
2018-06-07 13:20         ` Hannes Reinecke
2018-06-07 13:46           ` Christoph Hellwig
2018-06-07 14:01             ` Hannes Reinecke
2018-06-07 14:22               ` Christoph Hellwig
2018-06-07 15:20                 ` Sagi Grimberg
2018-06-07 13:20   ` Sagi Grimberg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.