* [PATCH 0/6] nvme: sanitize multipathing
@ 2017-10-02 13:55 Hannes Reinecke
  2017-10-02 13:55 ` [PATCH 1/6] nvme: display 'CMIC' controller attribute Hannes Reinecke
                   ` (6 more replies)
  0 siblings, 7 replies; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-02 13:55 UTC (permalink / raw)


Hi Christoph,

here's a patchset to sanitize nvme multipath handling, following
the discussion we had at ALPSS.
I've settled on the '/dev/nvmsXnZ' device nodes (which you
said I could do :-), and added the necessary bits and pieces to
get them registered nicely with sysfs.
We've also found an issue when retrying I/O after path failures;
we probably should check whether an I/O is multipath I/O before
deciding whether to retry it.

Patches are relative to your nvme-mpath branch.

As usual, comments and reviews are welcome.

Hannes Reinecke (5):
  nvme: display 'CMIC' controller attribute
  nvme: use 'nvmsXnZ' instead of 'nvm-subXnZ'
  nvme: claim block devices
  nvme: Export subsystems to /sys/class/nvme-subsys
  nvme: ignore retries for multipath devices

Johannes Thumshirn (1):
  nvme: display 'NMIC' namespace attribute

 drivers/nvme/host/core.c | 99 +++++++++++++++++++++++++++++++++++++-----------
 drivers/nvme/host/nvme.h |  5 ++-
 2 files changed, 80 insertions(+), 24 deletions(-)

-- 
1.8.5.6

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/6] nvme: display 'CMIC' controller attribute
  2017-10-02 13:55 [PATCH 0/6] nvme: sanitize multipathing Hannes Reinecke
@ 2017-10-02 13:55 ` Hannes Reinecke
  2017-10-02 16:15   ` Christoph Hellwig
  2017-10-02 13:55 ` [PATCH 2/6] nvme: use 'nvmsXnZ' instead of 'nvm-subXnZ' Hannes Reinecke
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-02 13:55 UTC (permalink / raw)


To figure out if a controller is multipath capable (i.e. if the
controller supports namespace sharing) we need to expose the CMIC
value to sysfs.
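
For illustration, once this is applied the attribute could be read from
the controller's class device like this (the 'nvme0' name and the value
shown are only examples):

    $ cat /sys/class/nvme/nvme0/cmic
    3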

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/host/core.c | 12 ++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c296f27..b1e61c9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2061,6 +2061,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	atomic_set(&ctrl->abort_limit, id->acl + 1);
 	ctrl->vwc = id->vwc;
 	ctrl->cntlid = le16_to_cpup(&id->cntlid);
+	ctrl->cmic = id->cmic;
 	if (id->mdts)
 		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
 	else
@@ -2486,6 +2487,16 @@ static ssize_t nvme_sysfs_show_address(struct device *dev,
 }
 static DEVICE_ATTR(address, S_IRUGO, nvme_sysfs_show_address, NULL);
 
+static ssize_t nvme_sysfs_show_cmic(struct device *dev,
+				    struct device_attribute *attr,
+				    char *buf)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", ctrl->cmic);
+}
+static DEVICE_ATTR(cmic, S_IRUGO, nvme_sysfs_show_cmic, NULL);
+
 static struct attribute *nvme_dev_attrs[] = {
 	&dev_attr_reset_controller.attr,
 	&dev_attr_rescan_controller.attr,
@@ -2498,6 +2509,7 @@ static ssize_t nvme_sysfs_show_address(struct device *dev,
 	&dev_attr_subsysnqn.attr,
 	&dev_attr_address.attr,
 	&dev_attr_state.attr,
+	&dev_attr_cmic.attr,
 	NULL
 };
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index febc21d..10fffbc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -170,6 +170,7 @@ struct nvme_ctrl {
 	u16 kas;
 	u8 npss;
 	u8 apsta;
+	u8 cmic;
 	unsigned int shutdown_timeout;
 	unsigned int kato;
 	bool subsystem;
-- 
1.8.5.6

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/6] nvme: use 'nvmsXnZ' instead of 'nvm-subXnZ'
  2017-10-02 13:55 [PATCH 0/6] nvme: sanitize multipathing Hannes Reinecke
  2017-10-02 13:55 ` [PATCH 1/6] nvme: display 'CMIC' controller attribute Hannes Reinecke
@ 2017-10-02 13:55 ` Hannes Reinecke
  2017-10-02 16:16   ` Christoph Hellwig
  2017-10-02 13:55 ` [PATCH 3/6] nvme: claim block devices Hannes Reinecke
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-02 13:55 UTC (permalink / raw)


Align with the overall nvme device naming scheme.
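
As an illustration (instance numbers assumed), the shared namespace
node for subsystem instance 0, NSID 1 would then show up as:

    /dev/nvme0n1     per-controller path device (controller 0, namespace 1)
    /dev/nvms0n1     shared namespace device (subsystem 0, namespace 1)

instead of the previous /dev/nvm-sub0n1.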

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b1e61c9..f52f0ab 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2689,7 +2689,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 	head->disk->private_data = head;
 	head->disk->queue = q;
 	head->disk->flags = GENHD_FL_EXT_DEVT;
-	sprintf(head->disk->disk_name, "nvm-sub%dn%d",
+	sprintf(head->disk->disk_name, "nvms%dn%d",
 			ctrl->subsys->instance, nsid);
 	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
 	return head;
-- 
1.8.5.6

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-02 13:55 [PATCH 0/6] nvme: sanitize multipathing Hannes Reinecke
  2017-10-02 13:55 ` [PATCH 1/6] nvme: display 'CMIC' controller attribute Hannes Reinecke
  2017-10-02 13:55 ` [PATCH 2/6] nvme: use 'nvmsXnZ' instead of 'nvm-subXnZ' Hannes Reinecke
@ 2017-10-02 13:55 ` Hannes Reinecke
  2017-10-02 16:42   ` Christoph Hellwig
  2017-10-02 13:55 ` [PATCH 4/6] nvme: display 'NMIC' namespace attribute Hannes Reinecke
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-02 13:55 UTC (permalink / raw)


When setting up a multipath device we need to claim the underlying
block devices to prevent other subsystems and/or programs from
accessing them. We should also use the standard holders/slaves sysfs
relationship instead of the hand-crafted 'mpath' links.
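
As a sketch of the intended result (device names assumed, two
controllers sharing the namespace), the relationship then shows up via
the usual holders/slaves links:

    $ ls /sys/block/nvms0n1/slaves/
    nvme0n1  nvme1n1
    $ ls /sys/block/nvme0n1/holders/
    nvms0n1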

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/host/core.c | 34 ++++++++++++++++++++++++----------
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f52f0ab..8924f48 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2805,6 +2805,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	char disk_name[DISK_NAME_LEN];
 	int node = dev_to_node(ctrl->dev);
 	bool new = true;
+	static char *_nvme_claim_ptr = "NVMe shared namespace path";
 
 	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
@@ -2885,15 +2886,26 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 				ns->head->disk->disk_name);
 	}
 
-	if (sysfs_create_link(&disk_to_dev(ns->disk)->kobj,
-			&disk_to_dev(ns->head->disk)->kobj, "mpath"))
-		pr_warn("%s: failed to create sysfs link to mpath device\n",
+	ns->bdev = bdget_disk(ns->disk, 0);
+	if (!ns->bdev) {
+		pr_warn("%s: failed to get bdev\n", ns->disk->disk_name);
+		return;
+	}
+	if (blkdev_get(ns->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL,
+		       _nvme_claim_ptr) < 0) {
+		pr_warn("%s: failed to claim bdev\n",
 			ns->disk->disk_name);
-	if (sysfs_create_link(&disk_to_dev(ns->head->disk)->kobj,
-			&disk_to_dev(ns->disk)->kobj, ns->disk->disk_name))
-		pr_warn("%s: failed to create sysfs link from mpath device\n",
+		bdput(ns->bdev);
+		ns->bdev = NULL;
+		nvme_put_ns_head(ns->head);
+		return;
+	}
+	if (bd_link_disk_holder(ns->bdev, ns->head->disk)) {
+		pr_warn("%s: failed to create sysfs link to mpath device\n",
 			ns->disk->disk_name);
-
+		blkdev_put(ns->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+		ns->bdev = NULL;
+	}
 	return;
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
@@ -2921,11 +2933,13 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 			blk_integrity_unregister(ns->disk);
 		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
 					&nvme_ns_id_attr_group);
-		sysfs_remove_link(&disk_to_dev(ns->disk)->kobj, "mpath");
-		sysfs_remove_link(&disk_to_dev(ns->head->disk)->kobj,
-				ns->disk->disk_name);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
+		if (ns->bdev) {
+			bd_unlink_disk_holder(ns->bdev, ns->head->disk);
+			blkdev_put(ns->bdev,
+				   FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+		}
 		del_gendisk(ns->disk);
 		blk_cleanup_queue(ns->queue);
 	}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 10fffbc..a8a79bc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -252,6 +252,7 @@ struct nvme_ns {
 	struct nvme_ctrl *ctrl;
 	struct request_queue *queue;
 	struct gendisk *disk;
+	struct block_device *bdev;
 	struct list_head siblings;
 	struct nvm_dev *ndev;
 	struct kref kref;
-- 
1.8.5.6

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/6] nvme: display 'NMIC' namespace attribute
  2017-10-02 13:55 [PATCH 0/6] nvme: sanitize multipathing Hannes Reinecke
                   ` (2 preceding siblings ...)
  2017-10-02 13:55 ` [PATCH 3/6] nvme: claim block devices Hannes Reinecke
@ 2017-10-02 13:55 ` Hannes Reinecke
  2017-10-02 16:16   ` Christoph Hellwig
  2017-10-02 13:55 ` [PATCH 5/6] nvme: Export subsystems to /sys/class/nvme-subsys Hannes Reinecke
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-02 13:55 UTC (permalink / raw)


From: Johannes Thumshirn <jthumshirn@suse.de>

To figure out if a namespace is shared via multipath we need to expose
the 'NMIC' value to sysfs.
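
For illustration, with this applied a shared namespace could be
identified from userspace like this (the 'nvme0n1' name and the value
are only examples; bit 0 set means the namespace may be attached to
two or more controllers):

    $ cat /sys/block/nvme0n1/nmic
    0x1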

Signed-off-by: Johannes Thumshirn <jthumshirn at suse.de>
---
 drivers/nvme/host/core.c | 12 ++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8924f48..955fedf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2292,6 +2292,16 @@ static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev)
 		return disk->private_data;
 }
 
+static ssize_t nmic_show(struct device *dev, struct device_attribute *attr,
+			 char *buf)
+{
+	struct nvme_ns_head *head = dev_to_ns_head(dev);
+	struct nvme_ns_ids *ids = &head->ids;
+
+	return sprintf(buf, "0x%x\n", ids->nmic);
+}
+static DEVICE_ATTR(nmic, S_IRUGO, nmic_show, NULL);
+
 static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
@@ -2367,6 +2377,7 @@ static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
 	&dev_attr_nguid.attr,
 	&dev_attr_eui.attr,
 	&dev_attr_nsid.attr,
+	&dev_attr_nmic.attr,
 	NULL,
 };
 
@@ -2663,6 +2674,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 	kref_init(&head->ref);
 
 	nvme_report_ns_ids(ctrl, nsid, id, &head->ids);
+	head->ids.nmic = id->nmic;
 
 	ret = __nvme_check_ids(ctrl->subsys, head);
 	if (ret) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a8a79bc..1f195ff 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -222,6 +222,7 @@ struct nvme_ns_ids {
 	u8	eui64[8];
 	u8	nguid[16];
 	uuid_t	uuid;
+	u8 nmic;
 };
 
 /*
-- 
1.8.5.6

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 5/6] nvme: Export subsystems to /sys/class/nvme-subsys
  2017-10-02 13:55 [PATCH 0/6] nvme: sanitize multipathing Hannes Reinecke
                   ` (3 preceding siblings ...)
  2017-10-02 13:55 ` [PATCH 4/6] nvme: display 'NMIC' namespace attribute Hannes Reinecke
@ 2017-10-02 13:55 ` Hannes Reinecke
  2017-10-02 16:18   ` Christoph Hellwig
  2017-10-02 13:55 ` [PATCH 6/6] nvme: ignore retries for multipath devices Hannes Reinecke
  2017-10-02 14:35 ` [PATCH] nvme: reset retries after path failover Johannes Thumshirn
  6 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-02 13:55 UTC (permalink / raw)


Create an 'nvme-subsys' class and register each subsystem with it,
so that every subsystem shows up as a device under /sys/class/nvme-subsys.
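
The intended result (instance and disk names assumed) is that each
subsystem shows up as a device of its own class, with the shared
namespace gendisk registered below it:

    $ ls /sys/class/nvme-subsys/
    subsys0
    $ ls /sys/class/nvme-subsys/subsys0/ | grep nvms
    nvms0n1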

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/host/core.c | 36 +++++++++++++++++++++++++-----------
 drivers/nvme/host/nvme.h |  2 +-
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 955fedf..1ef11ca 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -79,6 +79,7 @@
 static DEFINE_SPINLOCK(dev_list_lock);
 
 static struct class *nvme_class;
+static struct class *nvme_subsys_class;
 static const struct attribute_group nvme_ns_id_attr_group;
 
 static __le32 nvme_get_log_dw10(u8 lid, size_t size)
@@ -1905,10 +1906,10 @@ static void __nvme_free_subsystem(struct nvme_subsystem *subsys)
 	kfree(subsys);
 }
 
-static void nvme_free_subsystem(struct kref *ref)
+static void nvme_free_subsystem(struct device *dev)
 {
 	struct nvme_subsystem *subsys =
-			container_of(ref, struct nvme_subsystem, ref);
+			container_of(dev, struct nvme_subsystem, dev);
 
 	mutex_lock(&nvme_subsystems_lock);
 	list_del(&subsys->entry);
@@ -1916,11 +1917,6 @@ static void nvme_free_subsystem(struct kref *ref)
 	__nvme_free_subsystem(subsys);
 }
 
-static void nvme_put_subsystem(struct nvme_subsystem *subsys)
-{
-	kref_put(&subsys->ref, nvme_free_subsystem);
-}
-
 static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
 {
 	struct nvme_subsystem *subsys;
@@ -1930,7 +1926,7 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
 	list_for_each_entry(subsys, &nvme_subsystems, entry) {
 		if (strcmp(subsys->subnqn, subsysnqn))
 			continue;
-		if (!kref_get_unless_zero(&subsys->ref))
+		if (!get_device(&subsys->dev))
 			continue;
 		return subsys;
 	}
@@ -1953,13 +1949,17 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 
 	INIT_LIST_HEAD(&subsys->ctrls);
 	INIT_LIST_HEAD(&subsys->nsheads);
-	kref_init(&subsys->ref);
 	nvme_init_subnqn(subsys, ctrl, id);
 	memcpy(subsys->serial, id->sn, sizeof(subsys->serial));
 	memcpy(subsys->model, id->mn, sizeof(subsys->model));
 	memcpy(subsys->firmware_rev, id->fr, sizeof(subsys->firmware_rev));
 	subsys->vendor_id = le16_to_cpu(id->vid);
 	mutex_init(&subsys->lock);
+	subsys->dev.class = nvme_subsys_class;
+	subsys->dev.parent = NULL;
+	subsys->dev.release = &nvme_free_subsystem;
+	dev_set_name(&subsys->dev, "subsys%u", subsys->instance);
+	device_initialize(&subsys->dev);
 
 	mutex_lock(&nvme_subsystems_lock);
 	found = __nvme_find_get_subsystem(subsys->subnqn);
@@ -1989,6 +1989,12 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
 	mutex_unlock(&subsys->lock);
 
+	if (!found) {
+		ret = device_add(&subsys->dev);
+		if (ret)
+			dev_err(ctrl->device,
+				"failed to register subsys device\n");
+	}
 	return 0;
 out_free_subsys:
 	kfree(subsys);
@@ -2891,7 +2897,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 			ns->disk->disk_name);
 
 	if (new) {
-		add_disk(ns->head->disk);
+		device_add_disk(&ns->head->subsys->dev, ns->head->disk);
 		if (sysfs_create_group(&disk_to_dev(ns->head->disk)->kobj,
 				&nvme_ns_id_attr_group))
 			pr_warn("%s: failed to create sysfs group for identification\n",
@@ -3309,7 +3315,7 @@ static void nvme_free_ctrl(struct kref *kref)
 	ctrl->ops->free_ctrl(ctrl);
 
 	if (subsys)
-		nvme_put_subsystem(subsys);
+		put_device(&subsys->dev);
 }
 
 void nvme_put_ctrl(struct nvme_ctrl *ctrl)
@@ -3500,8 +3506,15 @@ int __init nvme_core_init(void)
 		goto unregister_chrdev;
 	}
 
+	nvme_subsys_class = class_create(THIS_MODULE, "nvme-subsys");
+	if (IS_ERR(nvme_subsys_class)) {
+		result = PTR_ERR(nvme_subsys_class);
+		goto destroy_class;
+	}
 	return 0;
 
+destroy_class:
+	class_destroy(nvme_class);
 unregister_chrdev:
 	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
 destroy_wq:
@@ -3512,6 +3525,7 @@ int __init nvme_core_init(void)
 void nvme_core_exit(void)
 {
 	ida_destroy(&nvme_subsystems_ida);
+	class_destroy(nvme_subsys_class);
 	class_destroy(nvme_class);
 	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
 	destroy_workqueue(nvme_wq);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1f195ff..878ca87 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -202,11 +202,11 @@ struct nvme_ctrl {
 };
 
 struct nvme_subsystem {
+	struct device		dev;
 	struct list_head	entry;
 	struct mutex		lock;
 	struct list_head	ctrls;
 	struct list_head	nsheads;
-	struct kref		ref;
 	char			subnqn[NVMF_NQN_SIZE];
 	char			serial[20];
 	char			model[40];
-- 
1.8.5.6

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 6/6] nvme: ignore retries for multipath devices
  2017-10-02 13:55 [PATCH 0/6] nvme: sanitize multipathing Hannes Reinecke
                   ` (4 preceding siblings ...)
  2017-10-02 13:55 ` [PATCH 5/6] nvme: Export subsystems to /sys/class/nvme-subsys Hannes Reinecke
@ 2017-10-02 13:55 ` Hannes Reinecke
  2017-10-02 16:22   ` Christoph Hellwig
  2017-10-02 14:35 ` [PATCH] nvme: reset retries after path failover Johannes Thumshirn
  6 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-02 13:55 UTC (permalink / raw)


For multipath devices we should switch paths after the retries
are exhausted, and not return an error.
I.e. we should first retry on the current path, and then switch
paths once the retries are exhausted.
Once all paths are down and all retries are exhausted we will
still retry the command, but then it'll be handled by the
all-paths-down logic in nvme_make_request().
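
For reference, the retry limit in question is the existing nvme_core
module parameter (default value shown; sysfs path assumed):

    $ cat /sys/module/nvme_core/parameters/max_retries
    5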

Signed-off-by: Hannes Reinecke <hare at suse.com>
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1ef11ca..ac7676a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -201,7 +201,8 @@ static inline bool nvme_req_needs_retry(struct request *req)
 		return false;
 	if (nvme_req(req)->status & NVME_SC_DNR)
 		return false;
-	if (nvme_req(req)->retries >= nvme_max_retries)
+	if (nvme_req(req)->retries >= nvme_max_retries &&
+	    !(req->cmd_flags & REQ_NVME_MPATH))
 		return false;
 	return true;
 }
-- 
1.8.5.6

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH] nvme: reset retries after path failover
  2017-10-02 13:55 [PATCH 0/6] nvme: sanitize multipathing Hannes Reinecke
                   ` (5 preceding siblings ...)
  2017-10-02 13:55 ` [PATCH 6/6] nvme: ignore retries for multipath devices Hannes Reinecke
@ 2017-10-02 14:35 ` Johannes Thumshirn
  2017-10-02 14:43   ` Keith Busch
                     ` (3 more replies)
  6 siblings, 4 replies; 42+ messages in thread
From: Johannes Thumshirn @ 2017-10-02 14:35 UTC (permalink / raw)


When we do a path failover we should also reset the retry counter to 0
so we're not accidentally failing I/O after a path failover.

Signed-off-by: Johannes Thumshirn <jthumshirn at suse.de>
---
 drivers/nvme/host/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1ef11ca0fc5e..f1d8b239b7b6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -118,6 +118,7 @@ static void nvme_failover_req(struct request *req)
 	blk_mq_end_request(req, 0);
 
 	nvme_reset_ctrl(ns->ctrl);
+	nvme_req(req)->retries = 0;
 	kblockd_schedule_work(&ns->head->requeue_work);
 }
 
-- 
2.12.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH] nvme: reset retries after path failover
  2017-10-02 14:35 ` [PATCH] nvme: reset retries after path failover Johannes Thumshirn
@ 2017-10-02 14:43   ` Keith Busch
  2017-10-02 15:08     ` Hannes Reinecke
  2017-10-02 14:45   ` Johannes Thumshirn
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2017-10-02 14:43 UTC (permalink / raw)


On Mon, Oct 02, 2017@04:35:03PM +0200, Johannes Thumshirn wrote:
> When we do a path failover we should also reset the retry counter to 0
> so we're not accidentally failing I/O after a path failover.
> 
> Signed-off-by: Johannes Thumshirn <jthumshirn at suse.de>
> ---
>  drivers/nvme/host/core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1ef11ca0fc5e..f1d8b239b7b6 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -118,6 +118,7 @@ static void nvme_failover_req(struct request *req)
>  	blk_mq_end_request(req, 0);
>  
>  	nvme_reset_ctrl(ns->ctrl);
> +	nvme_req(req)->retries = 0;
>  	kblockd_schedule_work(&ns->head->requeue_work);
>  }

Won't this make it possible to retry indefinitely? We've removed the retry
time limit, so the only thing preventing infinite retries on retryable
errors is the max retry limit.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] nvme: reset retries after path failover
  2017-10-02 14:35 ` [PATCH] nvme: reset retries after path failover Johannes Thumshirn
  2017-10-02 14:43   ` Keith Busch
@ 2017-10-02 14:45   ` Johannes Thumshirn
  2017-10-02 16:25     ` Christoph Hellwig
  2017-10-02 14:46   ` Johannes Thumshirn
  2017-10-02 16:19   ` Christoph Hellwig
  3 siblings, 1 reply; 42+ messages in thread
From: Johannes Thumshirn @ 2017-10-02 14:45 UTC (permalink / raw)


Gah, please ignore this one. I tested on the wrong device.

But FYI, when triggering a path failure in my RDMA setup (setting one
switch port down) I get these nice messages:
[ 1148.124063] nvme nvme0: SEND for CQE 0xffff8817c3720180 failed with
status transport retry counter exceeded (12)
[ 1148.180489] nvme nvme0: failed nvme_keep_alive_end_io error=10
[ 1148.187887] nvme nvme0: Reconnecting in 10 seconds...
[ 1148.194356] print_req_error: I/O error, dev nvme0n1, sector 1690128
[ 1148.194361] print_req_error: I/O error, dev nvme0n1, sector 1692168
[ 1148.194367] print_req_error: I/O error, dev nvme0n1, sector 1694208
[ 1148.379058] XFS (nvms0n1): writeback error on sector 1628688

The nvme ones are expected, but I don't really like seeing writeback
errors from the FS here. Something's still a bit off.

Byte,
        Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] nvme: reset retries after path failover
  2017-10-02 14:43   ` Keith Busch
@ 2017-10-02 15:08     ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-02 15:08 UTC (permalink / raw)


On 10/02/2017 04:43 PM, Keith Busch wrote:
> On Mon, Oct 02, 2017@04:35:03PM +0200, Johannes Thumshirn wrote:
>> When we do a path failover we should also reset the retry counter to 0
>> so we're not accidentally failing I/O after a path failover.
>>
>> Signed-off-by: Johannes Thumshirn <jthumshirn at suse.de>
>> ---
>>  drivers/nvme/host/core.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 1ef11ca0fc5e..f1d8b239b7b6 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -118,6 +118,7 @@ static void nvme_failover_req(struct request *req)
>>  	blk_mq_end_request(req, 0);
>>  
>>  	nvme_reset_ctrl(ns->ctrl);
>> +	nvme_req(req)->retries = 0;
>>  	kblockd_schedule_work(&ns->head->requeue_work);
>>  }
> 
> Won't this make it possible to retry indefinitely? We've removed the retry
> time limit, so the only thing preventing inifinite retries on retryable
> errors is the max retry limit.
> 
Please do check my earlier patch series.
The idea is to first retry on the current path, and then fail over to other paths.
Once all paths are exhausted we'll requeue the I/O in nvme_make_request().
So from that we shouldn't have any infinite retries.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/6] nvme: display 'CMIC' controller attribute
  2017-10-02 13:55 ` [PATCH 1/6] nvme: display 'CMIC' controller attribute Hannes Reinecke
@ 2017-10-02 16:15   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-02 16:15 UTC (permalink / raw)


On Mon, Oct 02, 2017@03:55:54PM +0200, Hannes Reinecke wrote:
> To figure out if a controller is multipath capable (ie if the
> controller supports namespace sharing) we need to expose the CMIC
> value to sysfs.

If you care about the multipath capability of a controller you can
get it using nvme-cli, especially given that there is no good reason
to query it except out of curiosity.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 2/6] nvme: use 'nvmsXnZ' instead of 'nvm-subXnZ'
  2017-10-02 13:55 ` [PATCH 2/6] nvme: use 'nvmsXnZ' instead of 'nvm-subXnZ' Hannes Reinecke
@ 2017-10-02 16:16   ` Christoph Hellwig
  2017-10-02 16:20     ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-02 16:16 UTC (permalink / raw)


On Mon, Oct 02, 2017@03:55:55PM +0200, Hannes Reinecke wrote:
> Align with overall nvme device naming scheme.

It is very close to the normal /dev/nvme* devices, so I fear it will
create a lot of confusion.

But given that all device name suggestions so far suck, I'd be happy
to take a straw poll of linux-nvme subscribers :)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 4/6] nvme: display 'NMIC' namespace attribute
  2017-10-02 13:55 ` [PATCH 4/6] nvme: display 'NMIC' namespace attribute Hannes Reinecke
@ 2017-10-02 16:16   ` Christoph Hellwig
  2017-10-03 10:00     ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-02 16:16 UTC (permalink / raw)


On Mon, Oct 02, 2017@03:55:57PM +0200, Hannes Reinecke wrote:
> From: Johannes Thumshirn <jthumshirn at suse.de>
> 
> To figure out if a namespace is shared via multipath we need to expose
> the 'NMIC' value to sysfs.

What for?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 5/6] nvme: Export subsystems to /sys/class/nvme-subsys
  2017-10-02 13:55 ` [PATCH 5/6] nvme: Export subsystems to /sys/class/nvme-subsys Hannes Reinecke
@ 2017-10-02 16:18   ` Christoph Hellwig
  2017-10-02 16:53     ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-02 16:18 UTC (permalink / raw)


On Mon, Oct 02, 2017@03:55:58PM +0200, Hannes Reinecke wrote:
> Create a class for each subsystem to create a 'device' link
> for each subsystem device.

I thought about this as it seems useful.  The only downside is that
if we run into subsystems with duplicate NQNs (or fake NQNs using
MN/SN) we are in deep trouble once this is exposed in sysfs.

Keith was the one most concerned about that, so I'd like him to
chime in.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] nvme: reset retries after path failover
  2017-10-02 14:35 ` [PATCH] nvme: reset retries after path failover Johannes Thumshirn
                     ` (2 preceding siblings ...)
  2017-10-02 14:46   ` Johannes Thumshirn
@ 2017-10-02 16:19   ` Christoph Hellwig
  3 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-02 16:19 UTC (permalink / raw)


On Mon, Oct 02, 2017@04:35:03PM +0200, Johannes Thumshirn wrote:
> When we do a path failover we should also reset the retry counter to 0
> so we're not accidentally failing I/O after a path failover.
> 
> Signed-off-by: Johannes Thumshirn <jthumshirn at suse.de>
> ---
>  drivers/nvme/host/core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1ef11ca0fc5e..f1d8b239b7b6 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -118,6 +118,7 @@ static void nvme_failover_req(struct request *req)
>  	blk_mq_end_request(req, 0);
>  
>  	nvme_reset_ctrl(ns->ctrl);
> +	nvme_req(req)->retries = 0;

Once we called blk_mq_end_request the request is gone, so this is
use after free.  And also very odd given that any multipath resubmit
will allocate a new request, so I can't see how it could help to start
with.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 2/6] nvme: use 'nvmsXnZ' instead of 'nvm-subXnZ'
  2017-10-02 16:16   ` Christoph Hellwig
@ 2017-10-02 16:20     ` Keith Busch
  2017-10-11 14:32       ` Guan Junxiong
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2017-10-02 16:20 UTC (permalink / raw)


On Mon, Oct 02, 2017@06:16:21PM +0200, Christoph Hellwig wrote:
> On Mon, Oct 02, 2017@03:55:55PM +0200, Hannes Reinecke wrote:
> > Align with overall nvme device naming scheme.
> 
> It is very close to the normal /dev/nvme* devices, so I feat it will
> create a lot of confusion.
> 
> But given that all device name suggestions so far suck I'd be happy
> to take a stawpoll of linux-nvme subscribers :)

Of the options available, +1 to Hannes' nvmsXnY.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 6/6] nvme: ignore retries for multipath devices
  2017-10-02 13:55 ` [PATCH 6/6] nvme: ignore retries for multipath devices Hannes Reinecke
@ 2017-10-02 16:22   ` Christoph Hellwig
  2017-10-03 10:02     ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-02 16:22 UTC (permalink / raw)


> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1ef11ca..ac7676a 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -201,7 +201,8 @@ static inline bool nvme_req_needs_retry(struct request *req)
>  		return false;
>  	if (nvme_req(req)->status & NVME_SC_DNR)
>  		return false;
> -	if (nvme_req(req)->retries >= nvme_max_retries)
> +	if (nvme_req(req)->retries >= nvme_max_retries &&
> +	    !(req->cmd_flags & REQ_NVME_MPATH))
>  		return false;
>  	return true;

All failover logic is inside a nvme_req_needs_retry() conditional,
so this change looks completely broken - it basically disables
failover.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] nvme: reset retries after path failover
  2017-10-02 14:45   ` Johannes Thumshirn
@ 2017-10-02 16:25     ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-02 16:25 UTC (permalink / raw)


On Mon, Oct 02, 2017@04:45:34PM +0200, Johannes Thumshirn wrote:
> Gah, please ignore this one. I tested on the wrong device.
> 
> But FYI, when triggering a path failure in my RDMA setup (setting one
> switch port down) I get these nice messages:
> [ 1148.124063] nvme nvme0: SEND for CQE 0xffff8817c3720180 failed with
> status transport retry counter exceeded (12)
> [ 1148.180489] nvme nvme0: failed nvme_keep_alive_end_io error=10
> [ 1148.187887] nvme nvme0: Reconnecting in 10 seconds...
> [ 1148.194356] print_req_error: I/O error, dev nvme0n1, sector 1690128
> [ 1148.194361] print_req_error: I/O error, dev nvme0n1, sector 1692168
> [ 1148.194367] print_req_error: I/O error, dev nvme0n1, sector 1694208
> [ 1148.379058] XFS (nvms0n1): writeback error on sector 1628688
> 
> the nvme ones are expected, but I don't really like to see writeback
> errors from the FS here. Something's still a bit off.

I think the problem is our host-internal aborts that have the DNR
bit set.  My earlier patches excluded DNR as a reason not to fail over
and we'll either need to get back to that or remove DNR from these
sorts of errors.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-02 13:55 ` [PATCH 3/6] nvme: claim block devices Hannes Reinecke
@ 2017-10-02 16:42   ` Christoph Hellwig
  2017-10-03 10:08     ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-02 16:42 UTC (permalink / raw)


On Mon, Oct 02, 2017@03:55:56PM +0200, Hannes Reinecke wrote:
> When setting up a multipath device we need to claim the underlying
> block devices to avoid other systems and/or programs to access it.
> And we should be using the standard holders/slave sysfs relationship
> instead of the hand-crafted 'mpath' links.

This completely breaks backwards compatibility:

root at testvm:~# mkfs.xfs /dev/nvme0n1 -f
mkfs.xfs: cannot open /dev/nvme0n1: Device or resource busy

Also we really do not want and outstanding struct block_device reference
all the time - struct block_device should only have a reference if
the block device node is in use or a file system is mounted.  Avoiding
this case was the whole point of my refactor to store the gendisk
instead of the block_device in struct bio.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 5/6] nvme: Export subsystems to /sys/class/nvme-subsys
  2017-10-02 16:18   ` Christoph Hellwig
@ 2017-10-02 16:53     ` Keith Busch
  2017-10-02 16:56       ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2017-10-02 16:53 UTC (permalink / raw)


On Mon, Oct 02, 2017@06:18:08PM +0200, Christoph Hellwig wrote:
> On Mon, Oct 02, 2017@03:55:58PM +0200, Hannes Reinecke wrote:
> > Create a class for each subsystem to create a 'device' link
> > for each subsystem device.
> 
> I thought about this as it seems useful.  The only downside is that
> iff we run into subsystems with duplicate NQNs (or fake NQNs using
> MN/SN) we are in deep trouble once this is exposed in sysfs.
> 
> Keith was the one most concerned about that, so I'd like him to
> chime in.

The driver will unbind from the controller if it detects an unsupported
duplicate nvme subsystem. The same logic will prevent duplciate sysfs
entries in this patch, so I think we're okay from that standpoint.

My concern was the user can't query or fix the controller that has the
duplicate subsystem name since we unbind from it. Unbinding is probably
the right thing to do, though, so the confused subsystem can be fixed
when the user isolates it from the others.

BTW, I just noticed nvme_init_subsystem needs to call nvme_put_subsystem in
the invalid duplicate subnqn case.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 5/6] nvme: Export subsystems to /sys/class/nvme-subsys
  2017-10-02 16:53     ` Keith Busch
@ 2017-10-02 16:56       ` Christoph Hellwig
  2017-10-02 17:15         ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-02 16:56 UTC (permalink / raw)


On Mon, Oct 02, 2017@10:53:18AM -0600, Keith Busch wrote:
> The driver will unbind from the controller if it detects an unsupported
> duplicate nvme subsystem. The same logic will prevent duplicate sysfs
> entries in this patch, so I think we're okay from that standpoint.
>
> My concern was the user can't query or fix the controller that has the
> duplicate subsystem name since we unbind from it. Unbinding is probably
> the right thing to do, though, so the confused subsystem can be fixed
> when the user isolates it from the others.
> 

Right now that's what we do.  But if this actually shows up in the wild
in e.g. consumer products we'll have to come up with a workaround and
just allow that case and skip matches for it.  It would be a little ugly
but doable - once we expose the name in sysfs we are in deep trouble
with duplicate names.

> BTW, I just notice nvme_init_subsystem needs call nvme_put_subsystem in
> the invalid duplicate subnqn case.

We call __nvme_free_subsystem in that case, which should do the job.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 5/6] nvme: Export subsystems to /sys/class/nvme-subsys
  2017-10-02 16:56       ` Christoph Hellwig
@ 2017-10-02 17:15         ` Keith Busch
  0 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2017-10-02 17:15 UTC (permalink / raw)


On Mon, Oct 02, 2017@06:56:32PM +0200, Christoph Hellwig wrote:
> 
> > BTW, I just notice nvme_init_subsystem needs call nvme_put_subsystem in
> > the invalid duplicate subnqn case.
> 
> We call __nvme_free_subsystem in that case, which should do the job.

We're calling __nvme_free_subsystem on the one that was allocated, not
the one that was found. The 'found' subsystem will have an additional reference
taken on it that isn't tracked.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 4/6] nvme: display 'NMIC' namespace attribute
  2017-10-02 16:16   ` Christoph Hellwig
@ 2017-10-03 10:00     ` Hannes Reinecke
  2017-10-03 11:49       ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-03 10:00 UTC (permalink / raw)


On 10/02/2017 06:16 PM, Christoph Hellwig wrote:
> On Mon, Oct 02, 2017@03:55:57PM +0200, Hannes Reinecke wrote:
>> From: Johannes Thumshirn <jthumshirn at suse.de>
>>
>> To figure out if a namespace is shared via multipath we need to expose
>> the 'NMIC' value to sysfs.
> 
> What for?
> 
To stop systemd/udev doing weird things on the device.
If we have the 'NMIC' attribute in sysfs we can evaluate it during event
handling, and set the 'SYSTEMD_READY=0' flag if NMIC=1.
By setting this attribute udev is instructed not to do any fancy
checking, and systemd will leave the device alone.
Otherwise systemd will happily forward the block device to all attached
services, btrfs will start checking the device, blkid will be scanning
for filesystems, swap will declare this device as a valid swap space,
and the like.

This is one of the painful lessons learned when moving multipath-tools to
systemd: we absolutely need an indicator in sysfs to handle multipath
devices race-free.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 6/6] nvme: ignore retries for multipath devices
  2017-10-02 16:22   ` Christoph Hellwig
@ 2017-10-03 10:02     ` Hannes Reinecke
  2017-10-03 11:53       ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-03 10:02 UTC (permalink / raw)


On 10/02/2017 06:22 PM, Christoph Hellwig wrote:
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 1ef11ca..ac7676a 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -201,7 +201,8 @@ static inline bool nvme_req_needs_retry(struct request *req)
>>  		return false;
>>  	if (nvme_req(req)->status & NVME_SC_DNR)
>>  		return false;
>> -	if (nvme_req(req)->retries >= nvme_max_retries)
>> +	if (nvme_req(req)->retries >= nvme_max_retries &&
>> +	    !(req->cmd_flags & REQ_NVME_MPATH))
>>  		return false;
>>  	return true;
> 
> All failover logic is inside a nvme_req_needs_retry() conditional,
> so this change looks completely broken - it basically disables
> failover.
> 
Not in our tests.
Without this patch we'd been seeing I/O errors during failover; with
this patch I/O continues on the failover path.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-02 16:42   ` Christoph Hellwig
@ 2017-10-03 10:08     ` Hannes Reinecke
  2017-10-03 11:55       ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-03 10:08 UTC (permalink / raw)


On 10/02/2017 06:42 PM, Christoph Hellwig wrote:
> On Mon, Oct 02, 2017@03:55:56PM +0200, Hannes Reinecke wrote:
>> When setting up a multipath device we need to claim the underlying
>> block devices to avoid other systems and/or programs to access it.
>> And we should be using the standard holders/slave sysfs relationship
>> instead of the hand-crafted 'mpath' links.
> 
> This completely breaks backwards compatibility:
> 
> root at testvm:~# mkfs.xfs /dev/nvme0n1 -f
> mkfs.xfs: cannot open /dev/nvme0n1: Device or resource busy
> 
> Also we really do not want and outstanding struct block_device reference
> all the time - struct block_device should only have a reference if
> the block device node is in use or a file system is mounted.  Avoiding
> this case was the whole point of my refactor to store the gendisk
> instead of the block_device in struct bio.
> 
Uh-oh.

And now the fun begins.

With this patch dm-multipath will ignore this device, so it's
_impossible_ to set up any device-mapper or md device on top of the
'nvme' device.
Without this patch md will happily autostart any RAID device on these
devices when it finds a signature.
Or the user can mount it.
Or swap can attach it as a swapspace.

_AND_ you can do the same with the subsystem device, too, without any
indication that anything is amiss.

Do we really want this?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 4/6] nvme: display 'NMIC' namespace attribute
  2017-10-03 10:00     ` Hannes Reinecke
@ 2017-10-03 11:49       ` Christoph Hellwig
  2017-10-03 16:01         ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-03 11:49 UTC (permalink / raw)


On Tue, Oct 03, 2017@12:00:16PM +0200, Hannes Reinecke wrote:
> Do stop systemd/udev doing weird things on the device.

Please define weird things, preferably actual practical weird things
you've observed with this implementation and then document them here
in the patch description.

> If we have the 'NMIC' attribute in sysfs we can evaluate it during event
> handling, and set the 'SYSTEMD_READY=0' flag if NMIC=1.

Bonus points for including such snippets in the patch description,
similar to how I offered udev rules in the main multipath patch.

> One of the painful lessons learned when moving multipath-tools to
> systemd; we absolutely need an indicator in sysfs to handle multipath
> devices race-free.

dm-multipath devices require userspace setup, nvme-mpath ones don't.
That's a huge difference that should make the udev rules a lot simpler.
But if it doesn't, I'd like to see a good explanation here, both to
understand why you'd want this flag, and also if there is a way to
just do the right thing from the kernel.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 6/6] nvme: ignore retries for multipath devices
  2017-10-03 10:02     ` Hannes Reinecke
@ 2017-10-03 11:53       ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-03 11:53 UTC (permalink / raw)


On Tue, Oct 03, 2017@12:02:38PM +0200, Hannes Reinecke wrote:
> >>  	if (nvme_req(req)->status & NVME_SC_DNR)
> >>  		return false;
> >> -	if (nvme_req(req)->retries >= nvme_max_retries)
> >> +	if (nvme_req(req)->retries >= nvme_max_retries &&
> >> +	    !(req->cmd_flags & REQ_NVME_MPATH))
> >>  		return false;
> >>  	return true;
> > 
> > All failover logic is inside a nvme_req_needs_retry() conditional,
> > so this change looks completely broken - it basically disables
> > failover.
> > 
> Not in our tests.
> Without this patch we'd been seeing I/O errors during failover; with
> this patch I/O continues on the failover path.

http://git.infradead.org/users/hch/block.git/blob/refs/heads/nvme-mpath:/drivers/nvme/host/core.c#l208

210	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
211		if (nvme_req_needs_failover(req)) {
212			nvme_failover_req(req);
213			return;
214		 }

The only call to nvme_failover_req is guarded by nvme_req_needs_retry,
and you change needs_retry to return true for MPATH requests that
exceed the number of retries.  I just don't see how we'd hit the
max_retries count, as each retry before should have already taken
nvme_req_needs_failover before.  What error code do you see this
with?  What kinds of device/setup?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-03 10:08     ` Hannes Reinecke
@ 2017-10-03 11:55       ` Christoph Hellwig
  2017-10-04  5:42         ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-03 11:55 UTC (permalink / raw)


On Tue, Oct 03, 2017@12:08:26PM +0200, Hannes Reinecke wrote:
> With this patch dm-multipath will ignore this device, so it's
> _impossible_ to setup any device-mapper or md device on top of the
> 'nvme' device.
> Without this patch md will happily autostart any RAID device on these
> devices when it find a signature.
> Or the user can mount it.
> Or swap can attach it as a swapspace.
> 
> _AND_ you can do the same with the subsystem device, too, without any
> indication that anything is amiss.
> 
> Do we really want this?

I though we agreed last week to do dynamic claiming.

That is if someone claims the multipath-device we extend that claim to
all /dev/nvmeXnY devices, and if someone claims one of the /dev/nvmeXnY
devices we extent it to the multipath node.  That should solve all of
the above issues.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 4/6] nvme: display 'NMIC' namespace attribute
  2017-10-03 11:49       ` Christoph Hellwig
@ 2017-10-03 16:01         ` Hannes Reinecke
  0 siblings, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-03 16:01 UTC (permalink / raw)


On 10/03/2017 01:49 PM, Christoph Hellwig wrote:
> On Tue, Oct 03, 2017@12:00:16PM +0200, Hannes Reinecke wrote:
>> Do stop systemd/udev doing weird things on the device.
> 
> Please define weird things, preferably actual practical wierd things
> you've observed with this implementation and then document them here
> in the patch description.
> 
Any event will be passed to systemd, and ends up being passed as a
'target' to the systemd service evaluation routines.
One of the services happens to be systemd-fsck@.service, which will
be calling fsck on every device found in /etc/fstab.

If this device is now specified via an fs UUID (as is common
nowadays), systemd will invoke fsck on each device carrying this UUID.
As we have _two_ devices with the same UUID (at the very least;
actually it's 'number of paths + 1'), systemd will invoke fsck on
_each_ of those devices.
All pointing to the same underlying namespace.
I'd be very curious to figure out how fsck deals with such a situation.

And then systemd will try to _mount_ each of those devices, which is
where the real fun begins. Depending on the timing it might chose to
mount any of those devices, more often than not at the same time as
another fsck instance is running on another path.
Again, I sincerely doubt that either mount nor fsck is able to handle
these situations.

The errors from these scenarios will vary wildly, from simple I/O errors
to filesystem corruption.

Hence it's advisable to set the 'SYSTEMD_READY=0' flag on all
underlying paths, and only forward the 'real' multipathed device to
systemd for evaluation.
But this needs to be done _before_ any other event has a chance to
interrupt here.
_And_ we cannot call an external userspace program here, as this is
multipathing, and we might be called for an 'add' event when all paths
are down, at which point we cannot access the root-fs and udev will
stall trying to load the said program.
Hence the event will never finish processing, the path will never
recover, and your system is dead with all paths running.
Not good.

>> If we have the 'NMIC' attribute in sysfs we can evaluate it during event
>> handling, and set the 'SYSTEMD_READY=0' flag if NMIC=1.
> 
> Bonus points of including such sniplets in the patch description,
> similar to how I offered udev rules in the main multipath path.
> 
Okay, will be doing so.

>> One of the painful lessons learned when moving multipath-tools to
>> systemd; we absolutely need an indicator in sysfs to handle multipath
>> devices race-free.
> 
> dm-multipath devices require userspace setup, nvme-mpath ones don't.
> That's a huge difference that would make udev rules a lot simpler.
> But if they don't I'd like to see a good explanation here, both to
> understand why you'd want this flag, and also if there is a way to
> just do the right thing from the kernel.
> 

As mentioned above, yes, the udev rules will be massively simpler.
Essentially, the rule just boils down to setting the 'SYSTEMD_READY=0'
flag when NMIC is set.
But for that we need to figure out _if_ that bit is set, and we need to
do so without any external userspace program.
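
A rough, untested sketch of such a rule (attribute name as introduced
in patch 4/6 of this series; the exact match values are an assumption):

    SUBSYSTEM=="block", KERNEL=="nvme*n*", ATTR{nmic}=="0x1", ENV{SYSTEMD_READY}="0"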

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-03 11:55       ` Christoph Hellwig
@ 2017-10-04  5:42         ` Hannes Reinecke
  2017-10-04  6:15           ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-04  5:42 UTC (permalink / raw)


On 10/03/2017 01:55 PM, Christoph Hellwig wrote:
> On Tue, Oct 03, 2017@12:08:26PM +0200, Hannes Reinecke wrote:
>> With this patch dm-multipath will ignore this device, so it's
>> _impossible_ to setup any device-mapper or md device on top of the
>> 'nvme' device.
>> Without this patch md will happily autostart any RAID device on these
>> devices when it find a signature.
>> Or the user can mount it.
>> Or swap can attach it as a swapspace.
>>
>> _AND_ you can do the same with the subsystem device, too, without any
>> indication that anything is amiss.
>>
>> Do we really want this?
> 
> I though we agreed last week to do dynamic claiming.
> 
> That is if someone claims the multipath-device we extend that claim to
> all /dev/nvmeXnY devices, and if someone claims one of the /dev/nvmeXnY
> devices we extent it to the multipath node.  That should solve all of
> the above issues.
> 
Hmm. Not sure how you would be doing that. Who should be doing the
claiming? Typically the claim is done whenever a device is created on
top of the other...

What about an alternative plan: make creation of the subsystem device
fully dynamic.
But if a subsystem device is created it will always claim the underlying
device. Then we can make the creation dependent on the NMIC attribute,
and existing setups would not be affected.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-04  5:42         ` Hannes Reinecke
@ 2017-10-04  6:15           ` Christoph Hellwig
  2017-10-04  6:33             ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-04  6:15 UTC (permalink / raw)


On Wed, Oct 04, 2017@07:42:00AM +0200, Hannes Reinecke wrote:
> Hmm. Not sure how you would be doing that. Who should be doing the
> claiming? Typically the claim is done whenever a device is created on
> top of the other...

We'd need a callback in the driver if it is claimed, and use that
for propagating the claim, or use a shared structure to record the
claim.  I haven't looked into the details yet, though.

> What about an alternative plan: make creation of the subsystem device
> fully dynamic.
> But if a subsystem device is created it will always claim the underlying
> device. Then we can make the creation dependent on the NMIC attribute,
> and existing setups would not be affected.

This doesn't work because a lot of devices can just set NMIC.  E.g.
every namespace exported by the Linux NVMe target.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-04  6:15           ` Christoph Hellwig
@ 2017-10-04  6:33             ` Hannes Reinecke
  2017-10-04  7:13               ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-04  6:33 UTC (permalink / raw)


On 10/04/2017 08:15 AM, Christoph Hellwig wrote:
> On Wed, Oct 04, 2017@07:42:00AM +0200, Hannes Reinecke wrote:
>> Hmm. Not sure how you would be doing that. Who should be doing the
>> claiming? Typically the claim is done whenever a device is created on
>> top of the other...
> 
> We'd need a callback in the driver if it is claimed, and use that
> for propagating the claim, or use a shared structure to record the
> claim.  I haven't looked into the details yet, though.
> 
>> What about an alternative plan: make creation of the subsystem device
>> fully dynamic.
>> But if a subsystem device is created it will always claim the underlying
>> device. Then we can make the creation dependent on the NMIC attribute,
>> and existing setups would not be affected.
> 
> This doesn't work because a lot of devices can just set NMIC.  E.g.
> every namespace exported by the Linux NVMe target.
> 
But as it's fully dynamic we can decide how to handle each device on a
device-by-device basis, with a common default policy.
My idea is to have a default policy (create/not create subsystem devices
if NMIC is set) set via kernel command-line, and udev rules for devices
requiring separate handling.

Will be drafting up a set of patches to demonstrate the idea.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-04  6:33             ` Hannes Reinecke
@ 2017-10-04  7:13               ` Christoph Hellwig
  2017-10-04  7:26                 ` Hannes Reinecke
  2017-10-05  1:08                 ` Martin K. Petersen
  0 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-04  7:13 UTC (permalink / raw)


On Wed, Oct 04, 2017@08:33:07AM +0200, Hannes Reinecke wrote:
> > On Wed, Oct 04, 2017@07:42:00AM +0200, Hannes Reinecke wrote:
> >> Hmm. Not sure how you would be doing that. Who should be doing the
> >> claiming? Typically the claim is done whenever a device is created on
> >> top of the other...
> > 
> > We'd need a callback in the driver if it is claimed, and use that
> > for propagating the claim, or use a shared structure to record the
> > claim.  I haven't looked into the details yet, though.
> > 
> >> What about an alternative plan: make creation of the subsystem device
> >> fully dynamic.
> >> But if a subsystem device is created it will always claim the underlying
> >> device. Then we can make the creation dependent on the NMIC attribute,
> >> and existing setups would not be affected.
> > 
> > This doesn't work because a lot of devices can just set NMIC.  E.g.
> > every namespace exported by the Linux NVMe target.
> > 
> But as it's fully dynamic we can decide how to handle each device on a
> device-by-device basis, with a common default policy.

What do you mean with fully dynamic?

> My idea is to have a default policy (create/not create subsystem devices
> if NMIC is set) set via kernel command-line, and udev rules for devices
> requiring separate handling.

Yikes.  That's exactly where I do _not_ want to go.  No more arcane
magic setup that you need to be in the in-group to know about, like
dm-multipath.  Things must just work, period.

The only other option I could think of would be to turn names around:
make /dev/nvmeX (chardev) and /dev/nvmeXnY the per-subsystem devices
that are multipathed if available.  We'd then need new devices for the
individual controllers.  With a little luck we'd get away with just
creating character devices.  This would however create a problem for
dm-multipath users, which mostly means you and your partners/customers.

Note that in general dm-multipath should continue working but it would
generally just see one path.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-04  7:13               ` Christoph Hellwig
@ 2017-10-04  7:26                 ` Hannes Reinecke
  2017-10-05  1:08                 ` Martin K. Petersen
  1 sibling, 0 replies; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-04  7:26 UTC (permalink / raw)


On 10/04/2017 09:13 AM, Christoph Hellwig wrote:
> On Wed, Oct 04, 2017@08:33:07AM +0200, Hannes Reinecke wrote:
>>> On Wed, Oct 04, 2017@07:42:00AM +0200, Hannes Reinecke wrote:
>>>> Hmm. Not sure how you would be doing that. Who should be doing the
>>>> claiming? Typically the claim is done whenever a device is created on
>>>> top of the other...
>>>
>>> We'd need a callback in the driver if it is claimed, and use that
>>> for propagating the claim, or use a shared structure to record the
>>> claim.  I haven't looked into the details yet, though.
>>>
>>>> What about an alternative plan: make creation of the subsystem device
>>>> fully dynamic.
>>>> But if a subsystem device is created it will always claim the underlying
>>>> device. Then we can make the creation dependent on the NMIC attribute,
>>>> and existing setups would not be affected.
>>>
>>> This doesn't work because a lot of devices can just set NMIC.  E.g.
>>> every namespace exported by the Linux NVMe target.
>>>
>> But as it's fully dynamic we can decide how to handle each device on a
>> device-by-device basis, with a common default policy.
> 
> What do you mean with fully dynamic?
> 
>> My idea is to have a default policy (create/not create subsystem devices
>> if NMIC is set) set via kernel command-line, and udev rules for devices
>> requiring separate handling.
> 
> Yikes.  That's exactly where I do _not_ want to go.  No more arcane
> magic setup that you need to be in the in-group to know about, like
> dm-multipath.  Things must just work, period.
> 
:-)
I knew you'd answer that.
I do agree that we need to make the setup as simple as possible.
But at the same time we should avoid the mistakes we've made with
dm-multipathing.

> The only other option I could think of would be to turn names around:
> make /dev/nvmeX (chardev) and /dev/nvmeXnY the per-subsystem devices
> that are multipathed if available.  We'd then need new devices for the
> individual controllers.  With a little luck we'd get away with just
> creating character devices.  This would however create a problem for
> dm-multipath users, which mostly means you and your partners/customers.
> 
_That_ idea is by far the best approach I've seen so far.
It would essentially move the entire path hierarchy _below_ the nvme
device, and we would always have a single device node to speak to.
That will decrease the maintenance burden for us poor souls having to
make a distribution out of it.

_Plus_ we can rearrange things underneath it even when mounted, so with
this approach it would be possible to move from non-multipath to
multipath and vice versa.

So go for it.

> Note that in general dm-multipath should continue working but it would
> generally just see one path.
> 
Yeah, to be expected. But I guess I can find ways around this.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-04  7:13               ` Christoph Hellwig
  2017-10-04  7:26                 ` Hannes Reinecke
@ 2017-10-05  1:08                 ` Martin K. Petersen
  2017-10-05  6:51                   ` Johannes Thumshirn
  1 sibling, 1 reply; 42+ messages in thread
From: Martin K. Petersen @ 2017-10-05  1:08 UTC (permalink / raw)



Christoph,

> The only other option I could think of would be to turn names around:
> make /dev/nvmeX (chardev) and /dev/nvmeXnY the per-subsystem devices
> that are multipathed if available.  We'd then need new devices for the
> individual controllers.

I like that approach.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-05  1:08                 ` Martin K. Petersen
@ 2017-10-05  6:51                   ` Johannes Thumshirn
  2017-10-05 14:05                     ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Johannes Thumshirn @ 2017-10-05  6:51 UTC (permalink / raw)


On Wed, Oct 04, 2017@09:08:57PM -0400,  Martin K. Petersen  wrote:
> 
> Christoph,
> 
> > The only other option I could think of would be to turn names around:
> > make /dev/nvmeX (chardev) and /dev/nvmeXnY the per-subsystem devices
> > that are multipathed if available.  We'd then need new devices for the
> > individual controllers.
> 
> I like that approach.

/me too

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-05  6:51                   ` Johannes Thumshirn
@ 2017-10-05 14:05                     ` Keith Busch
  2017-10-05 14:39                       ` Hannes Reinecke
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2017-10-05 14:05 UTC (permalink / raw)


On Thu, Oct 05, 2017@08:51:24AM +0200, Johannes Thumshirn wrote:
> On Wed, Oct 04, 2017@09:08:57PM -0400,  Martin K. Petersen  wrote:
> > 
> > Christoph,
> > 
> > > The only other option I could think of would be to turn names around:
> > > make /dev/nvmeX (chardev) and /dev/nvmeXnY the per-subsystem devices
> > > that are multipathed if available.  We'd then need new devices for the
> > > individual controllers.
> > 
> > I like that approach.
> 
> /me too

I can get on board with this approach for the namespaces, but the
controller char devs should stay as they are for compatibility and
existing tools. The I/O commands (including reservations) are not
controller specific, so not knowing which path is used will be okay for
namespaces. Most admin commands are controller specific, so we can't
just shift the controller handle to the subsystem level.
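
To spell that out with a made-up (not proposed) ioctl handler for a
subsystem-level namespace node, where nvme_ns_head and the path
selection helper are assumed names: I/O passthrough can pick any live
path, while admin commands get bounced to the per-controller nodes:

#include <linux/errno.h>
#include <linux/nvme_ioctl.h>

/*
 * Illustration only: I/O passthrough is path-agnostic, admin commands
 * are controller specific and belong on the per-controller chardev.
 */
static long nvme_subsys_ns_ioctl(struct nvme_ns_head *head, unsigned int cmd,
		void __user *argp)
{
	switch (cmd) {
	case NVME_IOCTL_IO_CMD:		/* any live path will do */
		return nvme_submit_io_any_path(head, argp);
	case NVME_IOCTL_ADMIN_CMD:
		return -ENOTTY;		/* use the per-controller node */
	default:
		return -ENOTTY;
	}
}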

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-05 14:05                     ` Keith Busch
@ 2017-10-05 14:39                       ` Hannes Reinecke
  2017-10-05 14:47                         ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Hannes Reinecke @ 2017-10-05 14:39 UTC (permalink / raw)


On 10/05/2017 04:05 PM, Keith Busch wrote:
> On Thu, Oct 05, 2017@08:51:24AM +0200, Johannes Thumshirn wrote:
>> On Wed, Oct 04, 2017@09:08:57PM -0400,  Martin K. Petersen  wrote:
>>>
>>> Christoph,
>>>
>>>> The only other option I could think of would be to turn names around:
>>>> make /dev/nvmeX (chardev) and /dev/nvmeXnY the per-subsystem devices
>>>> that are multipathed if available.  We'd then need new devices for the
>>>> individual controllers.
>>>
>>> I like that approach.
>>
>> /me too
> 
> I can get on board with this approach for the namespaces, but the
> controller char devs should stay as they are for compatibility and
> existing tools. The I/O commands (including reservations) are not
> controller specific, so not knowing which path is used will be okay for
> namespaces. Most admin commands are controller specific, so we can't
> just shift the controller handle to the subsystem level.
> 
I'm fine with having individual char devs for the controller and the
namespaces. That was the plan anyway, and they don't come into play with
systemd et al.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare at suse.com			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 3/6] nvme: claim block devices
  2017-10-05 14:39                       ` Hannes Reinecke
@ 2017-10-05 14:47                         ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2017-10-05 14:47 UTC (permalink / raw)


On Thu, Oct 05, 2017@04:39:23PM +0200, Hannes Reinecke wrote:
> I'm fine with having individual char devs for the controller and the
> namespaces. That was the plan anyway, and they don't come into play with
> systemd et al.

How hardcoded is the assumption that /dev/nvmeX is the chardev for
/dev/nvmeXn*?

I thought that was quite deep and thus would want to keep /dev/nvmeX
as a multipathed device (maybe some lightweight multipathing, aka
always go to the first controller) and then have another chardev
like /dev/nvmeXpN or /dev/nvmeXcN for each controller?
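
A rough sketch of what registering such a per-controller node could
look like; the naming and the helper are made up for illustration,
nothing like this has been posted yet:

#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/err.h>
#include <linux/fs.h>

/*
 * Illustration only: register an extra character device named
 * nvme<subsys>c<ctrl> next to the multipathed /dev/nvmeX node,
 * e.g. /dev/nvme0c1 for the second path to subsystem 0.
 */
static int nvme_add_ctrl_cdev(struct class *nvme_class, dev_t devt,
		struct cdev *cdev, const struct file_operations *fops,
		int subsys_instance, int ctrl_instance)
{
	struct device *dev;
	int ret;

	cdev_init(cdev, fops);
	ret = cdev_add(cdev, devt, 1);
	if (ret)
		return ret;

	dev = device_create(nvme_class, NULL, devt, NULL, "nvme%dc%d",
			subsys_instance, ctrl_instance);
	if (IS_ERR(dev)) {
		cdev_del(cdev);
		return PTR_ERR(dev);
	}
	return 0;
}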

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 2/6] nvme: use 'nvmsXnZ' instead of 'nvm-subXnZ'
  2017-10-02 16:20     ` Keith Busch
@ 2017-10-11 14:32       ` Guan Junxiong
  0 siblings, 0 replies; 42+ messages in thread
From: Guan Junxiong @ 2017-10-11 14:32 UTC (permalink / raw)




On Tuesday, October 03, 2017 12:20 AM, Keith Busch wrote:
> On Mon, Oct 02, 2017@06:16:21PM +0200, Christoph Hellwig wrote:
>> On Mon, Oct 02, 2017@03:55:55PM +0200, Hannes Reinecke wrote:
>>> Align with overall nvme device naming scheme.
>>
>> It is very close to the normal /dev/nvme* devices, so I fear it will
>> create a lot of confusion.
>>
>> But given that all device name suggestions so far suck I'd be happy
>> to take a straw poll of linux-nvme subscribers :)
> 
> Of the options available, +1 to Hannes' nvmsXnY.
> 

+1 for Hannes' nvmsXnY


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2017-10-11 14:32 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-02 13:55 [PATCH 0/6] nvme: sanitize multipathing Hannes Reinecke
2017-10-02 13:55 ` [PATCH 1/6] nvme: display 'CMIC' controller attribute Hannes Reinecke
2017-10-02 16:15   ` Christoph Hellwig
2017-10-02 13:55 ` [PATCH 2/6] nvme: use 'nvmsXnZ' instead of 'nvm-subXnZ' Hannes Reinecke
2017-10-02 16:16   ` Christoph Hellwig
2017-10-02 16:20     ` Keith Busch
2017-10-11 14:32       ` Guan Junxiong
2017-10-02 13:55 ` [PATCH 3/6] nvme: claim block devices Hannes Reinecke
2017-10-02 16:42   ` Christoph Hellwig
2017-10-03 10:08     ` Hannes Reinecke
2017-10-03 11:55       ` Christoph Hellwig
2017-10-04  5:42         ` Hannes Reinecke
2017-10-04  6:15           ` Christoph Hellwig
2017-10-04  6:33             ` Hannes Reinecke
2017-10-04  7:13               ` Christoph Hellwig
2017-10-04  7:26                 ` Hannes Reinecke
2017-10-05  1:08                 ` Martin K. Petersen
2017-10-05  6:51                   ` Johannes Thumshirn
2017-10-05 14:05                     ` Keith Busch
2017-10-05 14:39                       ` Hannes Reinecke
2017-10-05 14:47                         ` Christoph Hellwig
2017-10-02 13:55 ` [PATCH 4/6] nvme: display 'NMIC' namespace attribute Hannes Reinecke
2017-10-02 16:16   ` Christoph Hellwig
2017-10-03 10:00     ` Hannes Reinecke
2017-10-03 11:49       ` Christoph Hellwig
2017-10-03 16:01         ` Hannes Reinecke
2017-10-02 13:55 ` [PATCH 5/6] nvme: Export subsystems to /sys/class/nvme-subsys Hannes Reinecke
2017-10-02 16:18   ` Christoph Hellwig
2017-10-02 16:53     ` Keith Busch
2017-10-02 16:56       ` Christoph Hellwig
2017-10-02 17:15         ` Keith Busch
2017-10-02 13:55 ` [PATCH 6/6] nvme: ignore retries for multipath devices Hannes Reinecke
2017-10-02 16:22   ` Christoph Hellwig
2017-10-03 10:02     ` Hannes Reinecke
2017-10-03 11:53       ` Christoph Hellwig
2017-10-02 14:35 ` [PATCH] nvme: reset retires after path failover Johannes Thumshirn
2017-10-02 14:43   ` Keith Busch
2017-10-02 15:08     ` Hannes Reinecke
2017-10-02 14:45   ` Johannes Thumshirn
2017-10-02 16:25     ` Christoph Hellwig
2017-10-02 14:46   ` Johannes Thumshirn
2017-10-02 16:19   ` Christoph Hellwig
