* RFC: nvme multipath support
@ 2017-08-23 17:58 ` Christoph Hellwig
  0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

Hi all,

this series adds support for multipathing, that is, accessing nvme
namespaces through multiple controllers, to the nvme core driver.

It is a very thin and efficient implementation that relies on
close cooperation with other bits of the nvme driver, and a few
small and simple block layer helpers.

Compared to dm-multipath the important differences are how management
of the paths is done, and how the I/O path works.

Management of the paths is fully integrated into the nvme driver:
for each newly found nvme controller we check if there are other
controllers that refer to the same subsystem, and if so we link them
up in the nvme driver.  Then for each namespace found we compare the
namespace id and identifiers to determine whether multiple controllers
refer to the same namespace.  For now path availability is based
entirely on the controller status, which at least for fabrics will be
continuously updated based on the mandatory keep alive timer.  Once
the Asymmetric Namespace Access (ANA) proposal passes in NVMe we will
also get per-namespace states in addition to that, but for now any
details of that remain confidential to NVMe members.
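
As a rough userspace sketch of the linking logic described above
(hypothetical toy types, not the driver's real structures): a new
controller is matched against the known subsystems by NQN, and two
namespaces are considered the same when their NSID and identifiers
match.

```c
/* Toy model of controller linking and namespace identity matching.
 * All names here are hypothetical; the real driver uses struct
 * nvme_subsystem, struct nvme_ctrl and struct nvme_ns. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NQN_SIZE 64

struct subsys {
	char subnqn[NQN_SIZE];	/* subsystem NQN from Identify Controller */
	int nr_ctrls;		/* number of controllers (paths) linked */
	struct subsys *next;
};

static struct subsys *subsystems;	/* global list, like nvme_subsystems */

/*
 * On controller probe: find an existing subsystem with the same NQN and
 * link this controller to it, or register the freshly allocated one.
 */
static struct subsys *link_ctrl(struct subsys *fresh)
{
	struct subsys *s;

	for (s = subsystems; s; s = s->next) {
		if (!strcmp(s->subnqn, fresh->subnqn)) {
			s->nr_ctrls++;	/* another path to the same subsystem */
			return s;
		}
	}
	fresh->nr_ctrls = 1;
	fresh->next = subsystems;
	subsystems = fresh;
	return fresh;
}

/*
 * Namespace identity check: the same NSID with the same EUI-64 seen
 * through two controllers means both expose the same namespace.
 */
static int same_ns(unsigned nsid_a, const unsigned char *eui_a,
		   unsigned nsid_b, const unsigned char *eui_b)
{
	return nsid_a == nsid_b && !memcmp(eui_a, eui_b, 8);
}
```

The real code additionally compares NGUID and UUID and takes the
subsystem list mutex; this sketch only shows the matching idea.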

The I/O path is very different from the existing multipath drivers,
which is enabled by the fact that NVMe (unlike SCSI) does not support
partial completions - a controller will either complete a whole
command or not, but never complete only parts of it.  Because of that
there is no need to clone bios or requests - the I/O path simply
redirects the I/O to a suitable path.  For successful commands
multipath is not in the completion stack at all.  For failed commands
we decide if the error could be a path failure, and if so we remove
the bios from the request structure and requeue them before completing
the request.  Altogether this means there is no performance
degradation compared to normal nvme operation when using the multipath
device node (at least not until I find a dual-ported DRAM-backed
device :))
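
The failover path can be sketched as follows (again with hypothetical
toy types rather than the real struct bio/struct request): on a path
error the bios are detached from the request and moved to a requeue
list instead of being completed with an error.

```c
/* Toy illustration of failover without cloning: a "request" only
 * points at its bios, so on a path error we steal the bio chain back
 * and park it for resubmission on another path.  Hypothetical types,
 * not the kernel's. */
#include <assert.h>
#include <stddef.h>

struct bio {
	struct bio *next;
};

struct request {
	struct bio *bio;	/* chain of bios, as in struct request */
	int path_error;		/* set when the error looks path-related */
};

static struct bio *requeue_list;	/* bios waiting for another path */

/*
 * Completion path: on a path error, detach the bios and push them onto
 * the requeue list; the request itself completes without any bios, so
 * no error ever reaches the caller.  On success multipath is not
 * involved at all.
 */
static int complete_rq(struct request *rq)
{
	struct bio *b, *next;

	if (!rq->path_error)
		return 0;	/* fast path: nothing multipath-specific */

	b = rq->bio;
	rq->bio = NULL;		/* request completes empty */
	for (; b; b = next) {
		next = b->next;
		b->next = requeue_list;	/* resubmitted on another path later */
		requeue_list = b;
	}
	return -1;		/* signal the caller to retry elsewhere */
}
```

Because nothing is cloned up front, the success case pays no extra
cost; only the (rare) failure case does list manipulation.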

There are a couple of questions left in the individual patches;
comments welcome.

Note that this series requires the previous series that removes
bi_bdev; if in doubt, use the git tree below for testing.

A git tree is available at:

   git://git.infradead.org/users/hch/block.git nvme-mpath

gitweb:

   http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-mpath


* [PATCH 01/10] nvme: report more detailed status codes to the block layer
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1db8de0bee87..b8ecd155be19 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -113,7 +113,16 @@ static blk_status_t nvme_error_status(struct request *req)
 	case NVME_SC_WRITE_FAULT:
 	case NVME_SC_READ_ERROR:
 	case NVME_SC_UNWRITTEN_BLOCK:
+	case NVME_SC_ACCESS_DENIED:
+	case NVME_SC_READ_ONLY:
 		return BLK_STS_MEDIUM;
+	case NVME_SC_GUARD_CHECK:
+	case NVME_SC_APPTAG_CHECK:
+	case NVME_SC_REFTAG_CHECK:
+	case NVME_SC_INVALID_PI:
+		return BLK_STS_PROTECTION;
+	case NVME_SC_RESERVATION_CONFLICT:
+		return BLK_STS_NEXUS;
 	default:
 		return BLK_STS_IOERR;
 	}
-- 
2.11.0


* [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b8ecd155be19..f91c649c9ca5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -176,9 +176,10 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
 	enum nvme_ctrl_state old_state;
+	unsigned long flags;
 	bool changed = false;
 
-	spin_lock_irq(&ctrl->lock);
+	spin_lock_irqsave(&ctrl->lock, flags);
 
 	old_state = ctrl->state;
 	switch (new_state) {
@@ -239,7 +240,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	if (changed)
 		ctrl->state = new_state;
 
-	spin_unlock_irq(&ctrl->lock);
+	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	return changed;
 }
-- 
2.11.0


* [PATCH 03/10] nvme: remove unused struct nvme_ns fields
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

Also move the flag definitions for the flags field next to that field
while touching this area.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/nvme.h | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2c8a02be46fd..82074b68f36f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -210,13 +210,9 @@ struct nvme_ns {
 	bool ext;
 	u8 pi_type;
 	unsigned long flags;
-	u16 noiob;
-
 #define NVME_NS_REMOVING 0
 #define NVME_NS_DEAD     1
-
-	u64 mode_select_num_blocks;
-	u32 mode_select_block_len;
+	u16 noiob;
 };
 
 struct nvme_ctrl_ops {
-- 
2.11.0


* [PATCH 04/10] nvme: remove nvme_revalidate_ns
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

The function is used in two places, and the shared code for those will
diverge later in this series.

Instead factor out a new helper to get the ids for a namespace, simplify
the calling conventions for nvme_identify_ns and just open code the
sequence.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 100 +++++++++++++++++++++++++----------------------
 1 file changed, 53 insertions(+), 47 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f91c649c9ca5..157dbb7b328d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -783,7 +783,8 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 	return error;
 }
 
-static int nvme_identify_ns_descs(struct nvme_ns *ns, unsigned nsid)
+static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
+		u8 *eui64, u8 *nguid, uuid_t *uuid)
 {
 	struct nvme_command c = { };
 	int status;
@@ -799,7 +800,7 @@ static int nvme_identify_ns_descs(struct nvme_ns *ns, unsigned nsid)
 	if (!data)
 		return -ENOMEM;
 
-	status = nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, data,
+	status = nvme_submit_sync_cmd(ctrl->admin_q, &c, data,
 				      NVME_IDENTIFY_DATA_SIZE);
 	if (status)
 		goto free_data;
@@ -813,33 +814,33 @@ static int nvme_identify_ns_descs(struct nvme_ns *ns, unsigned nsid)
 		switch (cur->nidt) {
 		case NVME_NIDT_EUI64:
 			if (cur->nidl != NVME_NIDT_EUI64_LEN) {
-				dev_warn(ns->ctrl->device,
+				dev_warn(ctrl->device,
 					 "ctrl returned bogus length: %d for NVME_NIDT_EUI64\n",
 					 cur->nidl);
 				goto free_data;
 			}
 			len = NVME_NIDT_EUI64_LEN;
-			memcpy(ns->eui, data + pos + sizeof(*cur), len);
+			memcpy(eui64, data + pos + sizeof(*cur), len);
 			break;
 		case NVME_NIDT_NGUID:
 			if (cur->nidl != NVME_NIDT_NGUID_LEN) {
-				dev_warn(ns->ctrl->device,
+				dev_warn(ctrl->device,
 					 "ctrl returned bogus length: %d for NVME_NIDT_NGUID\n",
 					 cur->nidl);
 				goto free_data;
 			}
 			len = NVME_NIDT_NGUID_LEN;
-			memcpy(ns->nguid, data + pos + sizeof(*cur), len);
+			memcpy(nguid, data + pos + sizeof(*cur), len);
 			break;
 		case NVME_NIDT_UUID:
 			if (cur->nidl != NVME_NIDT_UUID_LEN) {
-				dev_warn(ns->ctrl->device,
+				dev_warn(ctrl->device,
 					 "ctrl returned bogus length: %d for NVME_NIDT_UUID\n",
 					 cur->nidl);
 				goto free_data;
 			}
 			len = NVME_NIDT_UUID_LEN;
-			uuid_copy(&ns->uuid, data + pos + sizeof(*cur));
+			uuid_copy(uuid, data + pos + sizeof(*cur));
 			break;
 		default:
 			/* Skip unnkown types */
@@ -864,9 +865,10 @@ static int nvme_identify_ns_list(struct nvme_ctrl *dev, unsigned nsid, __le32 *n
 	return nvme_submit_sync_cmd(dev->admin_q, &c, ns_list, 0x1000);
 }
 
-static int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
-		struct nvme_id_ns **id)
+static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
+		unsigned nsid)
 {
+	struct nvme_id_ns *id;
 	struct nvme_command c = { };
 	int error;
 
@@ -875,15 +877,18 @@ static int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 	c.identify.nsid = cpu_to_le32(nsid);
 	c.identify.cns = NVME_ID_CNS_NS;
 
-	*id = kmalloc(sizeof(struct nvme_id_ns), GFP_KERNEL);
-	if (!*id)
-		return -ENOMEM;
+	id = kmalloc(sizeof(*id), GFP_KERNEL);
+	if (!id)
+		return NULL;
 
-	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
-			sizeof(struct nvme_id_ns));
-	if (error)
-		kfree(*id);
-	return error;
+	error = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id));
+	if (error) {
+		dev_warn(ctrl->device, "Identify namespace failed\n");
+		kfree(id);
+		return NULL;
+	}
+
+	return id;
 }
 
 static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
@@ -1174,32 +1179,21 @@ static void nvme_config_discard(struct nvme_ns *ns)
 		blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX);
 }
 
-static int nvme_revalidate_ns(struct nvme_ns *ns, struct nvme_id_ns **id)
+static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
+		struct nvme_id_ns *id, u8 *eui64, u8 *nguid, uuid_t *uuid)
 {
-	if (nvme_identify_ns(ns->ctrl, ns->ns_id, id)) {
-		dev_warn(ns->ctrl->device, "Identify namespace failed\n");
-		return -ENODEV;
-	}
-
-	if ((*id)->ncap == 0) {
-		kfree(*id);
-		return -ENODEV;
-	}
-
-	if (ns->ctrl->vs >= NVME_VS(1, 1, 0))
-		memcpy(ns->eui, (*id)->eui64, sizeof(ns->eui));
-	if (ns->ctrl->vs >= NVME_VS(1, 2, 0))
-		memcpy(ns->nguid, (*id)->nguid, sizeof(ns->nguid));
-	if (ns->ctrl->vs >= NVME_VS(1, 3, 0)) {
+	if (ctrl->vs >= NVME_VS(1, 1, 0))
+		memcpy(eui64, id->eui64, sizeof(id->eui64));
+	if (ctrl->vs >= NVME_VS(1, 2, 0))
+		memcpy(nguid, id->nguid, sizeof(id->nguid));
+	if (ctrl->vs >= NVME_VS(1, 3, 0)) {
 		 /* Don't treat error as fatal we potentially
 		  * already have a NGUID or EUI-64
 		  */
-		if (nvme_identify_ns_descs(ns, ns->ns_id))
-			dev_warn(ns->ctrl->device,
+		if (nvme_identify_ns_descs(ctrl, nsid, eui64, nguid, uuid))
+			dev_warn(ctrl->device,
 				 "%s: Identify Descriptors failed\n", __func__);
 	}
-
-	return 0;
 }
 
 static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
@@ -1240,22 +1234,28 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 static int nvme_revalidate_disk(struct gendisk *disk)
 {
 	struct nvme_ns *ns = disk->private_data;
-	struct nvme_id_ns *id = NULL;
-	int ret;
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct nvme_id_ns *id;
+	int ret = 0;
 
 	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
 		set_capacity(disk, 0);
 		return -ENODEV;
 	}
 
-	ret = nvme_revalidate_ns(ns, &id);
-	if (ret)
-		return ret;
+	id = nvme_identify_ns(ctrl, ns->ns_id);
+	if (!id)
+		return -ENODEV;
 
-	__nvme_revalidate_disk(disk, id);
-	kfree(id);
+	if (id->ncap == 0) {
+		ret = -ENODEV;
+		goto out;
+	}
 
-	return 0;
+	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
+out:
+	kfree(id);
+	return ret;
 }
 
 static char nvme_pr_type(enum pr_type type)
@@ -2347,9 +2347,15 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance);
 
-	if (nvme_revalidate_ns(ns, &id))
+	id = nvme_identify_ns(ctrl, nsid);
+	if (!id)
 		goto out_free_queue;
 
+	if (id->ncap == 0)
+		goto out_free_id;
+
+	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
+
 	if (nvme_nvm_ns_supported(ns, id) &&
 				nvme_nvm_register(ns, disk_name, node)) {
 		dev_warn(ctrl->device, "%s: LightNVM init failure\n", __func__);
-- 
2.11.0


* [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

Instead validate that these identifiers do not change, as that is
prohibited by the specification.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 157dbb7b328d..179ade01745b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1236,6 +1236,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	struct nvme_ns *ns = disk->private_data;
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	struct nvme_id_ns *id;
+	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
+	uuid_t uuid = uuid_null;
 	int ret = 0;
 
 	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
@@ -1252,7 +1254,15 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto out;
 	}
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
+	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
+	if (!uuid_equal(&ns->uuid, &uuid) ||
+	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
+	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
+		dev_err(ctrl->device,
+			"identifiers changed for nsid %d\n", ns->ns_id);
+		ret = -ENODEV;
+	}
+
 out:
 	kfree(id);
 	return ret;
-- 
2.11.0


* [PATCH 06/10] nvme: track subsystems
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This adds a new nvme_subsystem structure so that we can track multiple
controllers that belong to a single subsystem.  For now we only use it
to store the NQN, and to check that we don't have duplicate NQNs unless
the involved subsystems support multiple controllers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c    | 111 ++++++++++++++++++++++++++++++++++++++++----
 drivers/nvme/host/fabrics.c |   4 +-
 drivers/nvme/host/nvme.h    |  12 ++++-
 3 files changed, 116 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 179ade01745b..8884000dfbdd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
* [PATCH 06/10] nvme: track subsystems
@ 2017-08-23 17:58   ` Christoph Hellwig
  0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)


This adds a new nvme_subsystem structure so that we can track multiple
controllers that belong to a single subsystem.  For now we only use it
to store the NQN, and to check that we don't have duplicate NQNs unless
the involved subsystem supports multiple controllers.
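
As a rough userspace sketch of the find-or-reject logic (all names here are
illustrative stand-ins, not the driver's actual API; the real code uses a
kref, a list_head, and holds nvme_subsystems_lock across the lookup), the
subsystem tracking reduces to a list walk keyed on the NQN, where a duplicate
NQN is only accepted when the controller advertises multi-controller support
in CMIC bit 1:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NQN_SIZE 223	/* plays the role of NVMF_NQN_SIZE */

/* Simplified stand-in for struct nvme_subsystem. */
struct subsystem {
	char nqn[NQN_SIZE];
	int refcount;		/* stands in for the kref */
	struct subsystem *next;	/* stands in for the list_head */
};

static struct subsystem *subsystems;	/* like the global nvme_subsystems list */

/*
 * Find an existing subsystem by NQN and take a reference, or allocate a
 * new one.  A duplicate NQN is only legal when the controller reports
 * multi-controller support (CMIC bit 1).
 */
static struct subsystem *find_or_create_subsystem(const char *nqn,
						  unsigned cmic, int *err)
{
	struct subsystem *s;

	for (s = subsystems; s; s = s->next) {
		if (strcmp(s->nqn, nqn))
			continue;
		if (!(cmic & (1 << 1))) {
			*err = -EINVAL;	/* duplicate subnqn, bail out */
			return NULL;
		}
		s->refcount++;
		return s;
	}

	s = calloc(1, sizeof(*s));
	if (!s) {
		*err = -ENOMEM;
		return NULL;
	}
	snprintf(s->nqn, sizeof(s->nqn), "%s", nqn);
	s->refcount = 1;
	s->next = subsystems;
	subsystems = s;
	return s;
}
```

A second controller with the same NQN and CMIC bit 1 set simply takes a
reference on the existing entry; one without it is rejected with -EINVAL,
mirroring the "ignoring ctrl due to duplicate subnqn" path in the patch.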

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c    | 111 ++++++++++++++++++++++++++++++++++++++++----
 drivers/nvme/host/fabrics.c |   4 +-
 drivers/nvme/host/nvme.h    |  12 ++++-
 3 files changed, 116 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 179ade01745b..8884000dfbdd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -71,6 +71,9 @@ MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
 struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
+static LIST_HEAD(nvme_subsystems);
+static DEFINE_MUTEX(nvme_subsystems_lock);
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -1741,14 +1744,15 @@ static bool quirk_matches(const struct nvme_id_ctrl *id,
 		string_matches(id->fr, q->fr, sizeof(id->fr));
 }
 
-static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+static void nvme_init_subnqn(struct nvme_subsystem *subsys, struct nvme_ctrl *ctrl,
+		struct nvme_id_ctrl *id)
 {
 	size_t nqnlen;
 	int off;
 
 	nqnlen = strnlen(id->subnqn, NVMF_NQN_SIZE);
 	if (nqnlen > 0 && nqnlen < NVMF_NQN_SIZE) {
-		strcpy(ctrl->subnqn, id->subnqn);
+		strcpy(subsys->subnqn, id->subnqn);
 		return;
 	}
 
@@ -1756,14 +1760,91 @@ static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		dev_warn(ctrl->device, "missing or invalid SUBNQN field.\n");
 
 	/* Generate a "fake" NQN per Figure 254 in NVMe 1.3 + ECN 001 */
-	off = snprintf(ctrl->subnqn, NVMF_NQN_SIZE,
+	off = snprintf(subsys->subnqn, NVMF_NQN_SIZE,
 			"nqn.2014.08.org.nvmexpress:%4x%4x",
 			le16_to_cpu(id->vid), le16_to_cpu(id->ssvid));
-	memcpy(ctrl->subnqn + off, id->sn, sizeof(id->sn));
+	memcpy(subsys->subnqn + off, id->sn, sizeof(id->sn));
 	off += sizeof(id->sn);
-	memcpy(ctrl->subnqn + off, id->mn, sizeof(id->mn));
+	memcpy(subsys->subnqn + off, id->mn, sizeof(id->mn));
 	off += sizeof(id->mn);
-	memset(ctrl->subnqn + off, 0, sizeof(ctrl->subnqn) - off);
+	memset(subsys->subnqn + off, 0, sizeof(subsys->subnqn) - off);
+}
+
+static void nvme_destroy_subsystem(struct kref *ref)
+{
+	struct nvme_subsystem *subsys =
+			container_of(ref, struct nvme_subsystem, ref);
+
+	mutex_lock(&nvme_subsystems_lock);
+	list_del(&subsys->entry);
+	mutex_unlock(&nvme_subsystems_lock);
+
+	kfree(subsys);
+}
+
+static void nvme_put_subsystem(struct nvme_subsystem *subsys)
+{
+	kref_put(&subsys->ref, nvme_destroy_subsystem);
+}
+
+static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
+{
+	struct nvme_subsystem *subsys;
+
+	lockdep_assert_held(&nvme_subsystems_lock);
+
+	list_for_each_entry(subsys, &nvme_subsystems, entry) {
+		if (strcmp(subsys->subnqn, subsysnqn))
+			continue;
+		if (!kref_get_unless_zero(&subsys->ref))
+			continue;
+		return subsys;
+	}
+
+	return NULL;
+}
+
+static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+{
+	struct nvme_subsystem *subsys, *found;
+
+	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
+	if (!subsys)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&subsys->ctrls);
+	kref_init(&subsys->ref);
+	nvme_init_subnqn(subsys, ctrl, id);
+	mutex_init(&subsys->lock);
+
+	mutex_lock(&nvme_subsystems_lock);
+	found = __nvme_find_get_subsystem(subsys->subnqn);
+	if (found) {
+		/*
+		 * Verify that the subsystem actually supports multiple
+		 * controllers, else bail out.
+		 */
+		kfree(subsys);
+		if (!(id->cmic & (1 << 1))) {
+			dev_err(ctrl->device,
+				"ignoring ctrl due to duplicate subnqn (%s).\n",
+				found->subnqn);
+			mutex_unlock(&nvme_subsystems_lock);
+			return -EINVAL;
+		}
+
+		subsys = found;
+	} else {
+		list_add_tail(&subsys->entry, &nvme_subsystems);
+	}
+
+	ctrl->subsys = subsys;
+	mutex_unlock(&nvme_subsystems_lock);
+
+	mutex_lock(&subsys->lock);
+	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
+	mutex_unlock(&subsys->lock);
+
+	return 0;
 }
 
 /*
@@ -1801,7 +1882,11 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		return -EIO;
 	}
 
-	nvme_init_subnqn(ctrl, id);
+	ret = nvme_init_subsystem(ctrl, id);
+	if (ret) {
+		kfree(id);
+		return ret;
+	}
 
 	if (!ctrl->identified) {
 		/*
@@ -2219,7 +2304,7 @@ static ssize_t nvme_sysfs_show_subsysnqn(struct device *dev,
 {
 	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
 
-	return snprintf(buf, PAGE_SIZE, "%s\n", ctrl->subnqn);
+	return snprintf(buf, PAGE_SIZE, "%s\n", ctrl->subsys->subnqn);
 }
 static DEVICE_ATTR(subsysnqn, S_IRUGO, nvme_sysfs_show_subsysnqn, NULL);
 
@@ -2757,12 +2842,22 @@ EXPORT_SYMBOL_GPL(nvme_uninit_ctrl);
 static void nvme_free_ctrl(struct kref *kref)
 {
 	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
+	struct nvme_subsystem *subsys = ctrl->subsys;
 
 	put_device(ctrl->device);
 	nvme_release_instance(ctrl);
 	ida_destroy(&ctrl->ns_ida);
 
+	if (subsys) {
+		mutex_lock(&subsys->lock);
+		list_del(&ctrl->subsys_entry);
+		mutex_unlock(&subsys->lock);
+	}
+
 	ctrl->ops->free_ctrl(ctrl);
+
+	if (subsys)
+		nvme_put_subsystem(subsys);
 }
 
 void nvme_put_ctrl(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index cf8c6163db9e..6238452af6a4 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -873,10 +873,10 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
 		goto out_unlock;
 	}
 
-	if (strcmp(ctrl->subnqn, opts->subsysnqn)) {
+	if (strcmp(ctrl->subsys->subnqn, opts->subsysnqn)) {
 		dev_warn(ctrl->device,
 			"controller returned incorrect NQN: \"%s\".\n",
-			ctrl->subnqn);
+			ctrl->subsys->subnqn);
 		mutex_unlock(&nvmf_transports_mutex);
 		ctrl->ops->delete_ctrl(ctrl);
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 82074b68f36f..913eaef6fc33 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -133,13 +133,15 @@ struct nvme_ctrl {
 	struct ida ns_ida;
 	struct work_struct reset_work;
 
+	struct nvme_subsystem *subsys;
+	struct list_head subsys_entry;
+
 	struct opal_dev *opal_dev;
 
 	char name[12];
 	char serial[20];
 	char model[40];
 	char firmware_rev[8];
-	char subnqn[NVMF_NQN_SIZE];
 	u16 cntlid;
 
 	u32 ctrl_config;
@@ -188,6 +190,14 @@ struct nvme_ctrl {
 	struct nvmf_ctrl_options *opts;
 };
 
+struct nvme_subsystem {
+	struct list_head	entry;
+	struct mutex		lock;
+	struct list_head	ctrls;
+	struct kref		ref;
+	char			subnqn[NVMF_NQN_SIZE];
+};
+
 struct nvme_ns {
 	struct list_head list;
 
-- 
2.11.0


* [PATCH 07/10] nvme: track shared namespaces
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

Introduce a new struct nvme_ns_head [1] that holds information about
an actual namespace, unlike struct nvme_ns, which only holds the
per-controller namespace information.  For private namespaces there
is a 1:1 relation between the two, but for shared namespaces this lets
us discover all the paths to a namespace.  For now only the identifiers
are moved to the new structure, but most of the information in struct
nvme_ns should eventually move over.

To allow lockless path lookup, the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.

[1] comments welcome if you have a better name for it, the current one is
    horrible.  One idea would be to rename the current struct nvme_ns
    to struct nvme_ns_link or similar and use the nvme_ns name for the
    new structure.  But that would involve a lot of churn.
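
Ignoring the locking and the SRCU grace-period handling, the find-or-allocate
flow for a namespace head can be sketched in userspace C as below (all names
are illustrative stand-ins for the driver's structures; the real code also
matches NGUID and UUID, not just EUI-64, and holds subsys->lock):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for struct nvme_ns_head: identity of the actual namespace. */
struct ns_head {
	unsigned nsid;
	unsigned char eui64[8];
	int refcount;		/* stands in for the kref */
	struct ns_head *next;	/* subsystem's list of heads */
	int nr_paths;		/* number of nvme_ns "siblings" linked to it */
};

/* Stand-in for the per-controller struct nvme_ns. */
struct ns {
	struct ns_head *head;
};

static struct ns_head *heads;	/* per-subsystem list in the real driver */

static struct ns_head *find_head(unsigned nsid)
{
	struct ns_head *h;

	for (h = heads; h; h = h->next) {
		if (h->nsid == nsid) {
			h->refcount++;
			return h;
		}
	}
	return NULL;
}

/*
 * Shared namespaces reuse an existing head with matching identifiers;
 * private namespaces (and first-seen shared ones) allocate a new head.
 */
static int init_ns_head(struct ns *ns, unsigned nsid, bool shared,
			const unsigned char *eui64)
{
	struct ns_head *h = shared ? find_head(nsid) : NULL;

	if (h) {
		if (memcmp(h->eui64, eui64, sizeof(h->eui64))) {
			h->refcount--;	/* identifiers must match */
			return -EINVAL;
		}
	} else {
		h = calloc(1, sizeof(*h));
		if (!h)
			return -ENOMEM;
		h->nsid = nsid;
		memcpy(h->eui64, eui64, sizeof(h->eui64));
		h->refcount = 1;
		h->next = heads;
		heads = h;
	}
	ns->head = h;
	h->nr_paths++;
	return 0;
}
```

Two controllers reporting the same shared nsid with matching identifiers end
up pointing at one head (two paths); mismatched identifiers for the same nsid
are rejected, like the "IDs don't match for shared namespace" case.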

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c     | 218 +++++++++++++++++++++++++++++++++++--------
 drivers/nvme/host/lightnvm.c |  14 +--
 drivers/nvme/host/nvme.h     |  26 +++++-
 3 files changed, 208 insertions(+), 50 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8884000dfbdd..abc5911a8a66 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -249,10 +249,28 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
 
+static void nvme_destroy_ns_head(struct kref *ref)
+{
+	struct nvme_ns_head *head =
+		container_of(ref, struct nvme_ns_head, ref);
+
+	list_del_init(&head->entry);
+	cleanup_srcu_struct(&head->srcu);
+	kfree(head);
+}
+
+static void nvme_put_ns_head(struct nvme_ns_head *head)
+{
+	kref_put(&head->ref, nvme_destroy_ns_head);
+}
+
 static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
 
+	if (ns->head)
+		nvme_put_ns_head(ns->head);
+
 	if (ns->ndev)
 		nvme_nvm_unregister(ns);
 
@@ -422,7 +440,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 {
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->common.opcode = nvme_cmd_flush;
-	cmnd->common.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
@@ -453,7 +471,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->dsm.opcode = nvme_cmd_dsm;
-	cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->dsm.nr = cpu_to_le32(segments - 1);
 	cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
@@ -492,7 +510,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
-	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
@@ -977,7 +995,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	memset(&c, 0, sizeof(c));
 	c.rw.opcode = io.opcode;
 	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.rw.slba = cpu_to_le64(io.slba);
 	c.rw.length = cpu_to_le16(io.nblocks);
 	c.rw.control = cpu_to_le16(io.control);
@@ -1041,7 +1059,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	switch (cmd) {
 	case NVME_IOCTL_ID:
 		force_successful_syscall_return();
-		return ns->ns_id;
+		return ns->head->ns_id;
 	case NVME_IOCTL_ADMIN_CMD:
 		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
 	case NVME_IOCTL_IO_CMD:
@@ -1248,7 +1266,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		return -ENODEV;
 	}
 
-	id = nvme_identify_ns(ctrl, ns->ns_id);
+	id = nvme_identify_ns(ctrl, ns->head->ns_id);
 	if (!id)
 		return -ENODEV;
 
@@ -1257,12 +1275,12 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto out;
 	}
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
-	if (!uuid_equal(&ns->uuid, &uuid) ||
-	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
-	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
+	nvme_report_ns_ids(ctrl, ns->head->ns_id, id, eui64, nguid, &uuid);
+	if (!uuid_equal(&ns->head->uuid, &uuid) ||
+	    memcmp(&ns->head->nguid, &nguid, sizeof(ns->head->nguid)) ||
+	    memcmp(&ns->head->eui64, &eui64, sizeof(ns->head->eui64))) {
 		dev_err(ctrl->device,
-			"identifiers changed for nsid %d\n", ns->ns_id);
+			"identifiers changed for nsid %d\n", ns->head->ns_id);
 		ret = -ENODEV;
 	}
 
@@ -1303,7 +1321,7 @@ static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = op;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw10[0] = cpu_to_le32(cdw10);
 
 	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
@@ -1812,6 +1830,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	if (!subsys)
 		return -ENOMEM;
 	INIT_LIST_HEAD(&subsys->ctrls);
+	INIT_LIST_HEAD(&subsys->nsheads);
 	kref_init(&subsys->ref);
 	nvme_init_subnqn(subsys, ctrl, id);
 	mutex_init(&subsys->lock);
@@ -2132,14 +2151,14 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 	int serial_len = sizeof(ctrl->serial);
 	int model_len = sizeof(ctrl->model);
 
-	if (!uuid_is_null(&ns->uuid))
-		return sprintf(buf, "uuid.%pU\n", &ns->uuid);
+	if (!uuid_is_null(&ns->head->uuid))
+		return sprintf(buf, "uuid.%pU\n", &ns->head->uuid);
 
-	if (memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
-		return sprintf(buf, "eui.%16phN\n", ns->nguid);
+	if (memchr_inv(ns->head->nguid, 0, sizeof(ns->head->nguid)))
+		return sprintf(buf, "eui.%16phN\n", ns->head->nguid);
 
-	if (memchr_inv(ns->eui, 0, sizeof(ns->eui)))
-		return sprintf(buf, "eui.%8phN\n", ns->eui);
+	if (memchr_inv(ns->head->eui64, 0, sizeof(ns->head->eui64)))
+		return sprintf(buf, "eui.%8phN\n", ns->head->eui64);
 
 	while (serial_len > 0 && (ctrl->serial[serial_len - 1] == ' ' ||
 				  ctrl->serial[serial_len - 1] == '\0'))
@@ -2149,7 +2168,8 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 		model_len--;
 
 	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", ctrl->vid,
-		serial_len, ctrl->serial, model_len, ctrl->model, ns->ns_id);
+		serial_len, ctrl->serial, model_len, ctrl->model,
+		ns->head->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
@@ -2157,7 +2177,7 @@ static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
 			  char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->nguid);
+	return sprintf(buf, "%pU\n", ns->head->nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
@@ -2169,12 +2189,12 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
 	 */
-	if (uuid_is_null(&ns->uuid)) {
+	if (uuid_is_null(&ns->head->uuid)) {
 		printk_ratelimited(KERN_WARNING
 				   "No UUID available providing old NGUID\n");
-		return sprintf(buf, "%pU\n", ns->nguid);
+		return sprintf(buf, "%pU\n", ns->head->nguid);
 	}
-	return sprintf(buf, "%pU\n", &ns->uuid);
+	return sprintf(buf, "%pU\n", &ns->head->uuid);
 }
 static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
 
@@ -2182,7 +2202,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->eui);
+	return sprintf(buf, "%8phd\n", ns->head->eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
@@ -2190,7 +2210,7 @@ static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%d\n", ns->ns_id);
+	return sprintf(buf, "%d\n", ns->head->ns_id);
 }
 static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
 
@@ -2210,16 +2230,16 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
 
 	if (a == &dev_attr_uuid.attr) {
-		if (uuid_is_null(&ns->uuid) ||
-		    !memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
+		if (uuid_is_null(&ns->head->uuid) ||
+		    !memchr_inv(ns->head->nguid, 0, sizeof(ns->head->nguid)))
 			return 0;
 	}
 	if (a == &dev_attr_nguid.attr) {
-		if (!memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
+		if (!memchr_inv(ns->head->nguid, 0, sizeof(ns->head->nguid)))
 			return 0;
 	}
 	if (a == &dev_attr_eui.attr) {
-		if (!memchr_inv(ns->eui, 0, sizeof(ns->eui)))
+		if (!memchr_inv(ns->head->eui64, 0, sizeof(ns->head->eui64)))
 			return 0;
 	}
 	return a->mode;
@@ -2357,12 +2377,122 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
 	NULL,
 };
 
+static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
+		unsigned nsid)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if (h->ns_id == nsid && kref_get_unless_zero(&h->ref))
+			return h;
+	}
+
+	return NULL;
+}
+
+static int __nvme_check_ids(struct nvme_subsystem *subsys,
+		struct nvme_ns_head *new)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if ((!uuid_is_null(&new->uuid) &&
+		     uuid_equal(&new->uuid, &h->uuid)) ||
+		    (memchr_inv(new->nguid, 0, sizeof(new->nguid)) &&
+		     memcmp(&new->nguid, &h->nguid, sizeof(new->nguid))) ||
+		    (memchr_inv(new->eui64, 0, sizeof(new->eui64)) &&
+		     memcmp(&new->eui64, &h->eui64, sizeof(new->eui64))))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
+		unsigned nsid, struct nvme_id_ns *id)
+{
+	struct nvme_ns_head *head;
+	int ret = -ENOMEM;
+
+	head = kzalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		goto out;
+
+	INIT_LIST_HEAD(&head->list);
+	head->ns_id = nsid;
+	init_srcu_struct(&head->srcu);
+	kref_init(&head->ref);
+
+	nvme_report_ns_ids(ctrl, nsid, id, head->eui64, head->nguid,
+			&head->uuid);
+
+	ret = __nvme_check_ids(ctrl->subsys, head);
+	if (ret) {
+		dev_err(ctrl->device,
+			"duplicate IDs for nsid %d\n", nsid);
+		goto out_free_head;
+	}
+
+	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
+	return head;
+out_free_head:
+	cleanup_srcu_struct(&head->srcu);
+	kfree(head);
+out:
+	return ERR_PTR(ret);
+}
+
+static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
+		struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	bool is_shared = id->nmic & (1 << 0);
+	struct nvme_ns_head *head = NULL;
+	int ret = 0;
+
+	mutex_lock(&ctrl->subsys->lock);
+	if (is_shared)
+		head = __nvme_find_ns_head(ctrl->subsys, nsid);
+	if (!head) {
+		head = nvme_alloc_ns_head(ctrl, nsid, id);
+		if (IS_ERR(head)) {
+			ret = PTR_ERR(head);
+			goto out_unlock;
+		}
+	} else {
+		u8 eui64[8] = { 0 }, nguid[16] = { 0 };
+		uuid_t uuid = uuid_null;
+
+		nvme_report_ns_ids(ctrl, nsid, id, eui64, nguid, &uuid);
+		if (!uuid_equal(&head->uuid, &uuid) ||
+		    memcmp(&head->nguid, &nguid, sizeof(head->nguid)) ||
+		    memcmp(&head->eui64, &eui64, sizeof(head->eui64))) {
+			dev_err(ctrl->device,
+				"IDs don't match for shared namespace %d\n",
+					nsid);
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
+	list_add_tail(&ns->siblings, &head->list);
+	ns->head = head;
+
+out_unlock:
+	mutex_unlock(&ctrl->subsys->lock);
+	return ret;
+}
+
 static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
 {
 	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
 	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
 
-	return nsa->ns_id - nsb->ns_id;
+	return nsa->head->ns_id - nsb->head->ns_id;
 }
 
 static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
@@ -2371,12 +2501,12 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->ns_id == nsid) {
+		if (ns->head->ns_id == nsid) {
 			kref_get(&ns->kref);
 			ret = ns;
 			break;
 		}
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			break;
 	}
 	mutex_unlock(&ctrl->namespaces_mutex);
@@ -2391,7 +2521,7 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
 	if (!ctrl->nr_streams)
 		return 0;
 
-	ret = nvme_get_stream_params(ctrl, &s, ns->ns_id);
+	ret = nvme_get_stream_params(ctrl, &s, ns->head->ns_id);
 	if (ret)
 		return ret;
 
@@ -2433,7 +2563,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	ns->ctrl = ctrl;
 
 	kref_init(&ns->kref);
-	ns->ns_id = nsid;
 	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
 
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
@@ -2449,17 +2578,18 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
+	if (nvme_init_ns_head(ns, nsid, id))
+		goto out_free_id;
 
 	if (nvme_nvm_ns_supported(ns, id) &&
 				nvme_nvm_register(ns, disk_name, node)) {
 		dev_warn(ctrl->device, "%s: LightNVM init failure\n", __func__);
-		goto out_free_id;
+		goto out_unlink_ns;
 	}
 
 	disk = alloc_disk_node(0, node);
 	if (!disk)
-		goto out_free_id;
+		goto out_unlink_ns;
 
 	disk->fops = &nvme_fops;
 	disk->private_data = ns;
@@ -2487,6 +2617,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
 	return;
+ out_unlink_ns:
+	mutex_lock(&ctrl->subsys->lock);
+	list_del_rcu(&ns->siblings);
+	mutex_unlock(&ctrl->subsys->lock);
  out_free_id:
 	kfree(id);
  out_free_queue:
@@ -2499,6 +2633,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
+	struct nvme_ns_head *head = ns->head;
+
 	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
 		return;
 
@@ -2513,10 +2649,16 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 		blk_cleanup_queue(ns->queue);
 	}
 
+	mutex_lock(&ns->ctrl->subsys->lock);
+	if (head)
+		list_del_rcu(&ns->siblings);
+	mutex_unlock(&ns->ctrl->subsys->lock);
+
 	mutex_lock(&ns->ctrl->namespaces_mutex);
 	list_del_init(&ns->list);
 	mutex_unlock(&ns->ctrl->namespaces_mutex);
 
+	synchronize_srcu(&head->srcu);
 	nvme_put_ns(ns);
 }
 
@@ -2539,7 +2681,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 	struct nvme_ns *ns, *next;
 
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			nvme_ns_remove(ns);
 	}
 }
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index c1a28569e843..3c9505066b58 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -305,7 +305,7 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct nvm_id *nvm_id)
 	int ret;
 
 	c.identity.opcode = nvme_nvm_admin_identity;
-	c.identity.nsid = cpu_to_le32(ns->ns_id);
+	c.identity.nsid = cpu_to_le32(ns->head->ns_id);
 	c.identity.chnl_off = 0;
 
 	nvme_nvm_id = kmalloc(sizeof(struct nvme_nvm_id), GFP_KERNEL);
@@ -344,7 +344,7 @@ static int nvme_nvm_get_l2p_tbl(struct nvm_dev *nvmdev, u64 slba, u32 nlb,
 	int ret = 0;
 
 	c.l2p.opcode = nvme_nvm_admin_get_l2p_tbl;
-	c.l2p.nsid = cpu_to_le32(ns->ns_id);
+	c.l2p.nsid = cpu_to_le32(ns->head->ns_id);
 	entries = kmalloc(len, GFP_KERNEL);
 	if (!entries)
 		return -ENOMEM;
@@ -402,7 +402,7 @@ static int nvme_nvm_get_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr ppa,
 	int ret = 0;
 
 	c.get_bb.opcode = nvme_nvm_admin_get_bb_tbl;
-	c.get_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.get_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.get_bb.spba = cpu_to_le64(ppa.ppa);
 
 	bb_tbl = kzalloc(tblsz, GFP_KERNEL);
@@ -452,7 +452,7 @@ static int nvme_nvm_set_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr *ppas,
 	int ret = 0;
 
 	c.set_bb.opcode = nvme_nvm_admin_set_bb_tbl;
-	c.set_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.set_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.set_bb.spba = cpu_to_le64(ppas->ppa);
 	c.set_bb.nlb = cpu_to_le16(nr_ppas - 1);
 	c.set_bb.value = type;
@@ -469,7 +469,7 @@ static inline void nvme_nvm_rqtocmd(struct nvm_rq *rqd, struct nvme_ns *ns,
 				    struct nvme_nvm_command *c)
 {
 	c->ph_rw.opcode = rqd->opcode;
-	c->ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c->ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c->ph_rw.spba = cpu_to_le64(rqd->ppa_addr.ppa);
 	c->ph_rw.metadata = cpu_to_le64(rqd->dma_meta_list);
 	c->ph_rw.control = cpu_to_le16(rqd->flags);
@@ -691,7 +691,7 @@ static int nvme_nvm_submit_vio(struct nvme_ns *ns,
 
 	memset(&c, 0, sizeof(c));
 	c.ph_rw.opcode = vio.opcode;
-	c.ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c.ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.ph_rw.control = cpu_to_le16(vio.control);
 	c.ph_rw.length = cpu_to_le16(vio.nppas);
 
@@ -728,7 +728,7 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = vcmd.opcode;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw2[0] = cpu_to_le32(vcmd.cdw2);
 	c.common.cdw2[1] = cpu_to_le32(vcmd.cdw3);
 	/* cdw11-12 */
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 913eaef6fc33..f68a89be654b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -194,25 +194,41 @@ struct nvme_subsystem {
 	struct list_head	entry;
 	struct mutex		lock;
 	struct list_head	ctrls;
+	struct list_head	nsheads;
 	struct kref		ref;
 	char			subnqn[NVMF_NQN_SIZE];
 };
 
+/*
+ * Anchor structure for namespaces.  There is one for each namespace in an
+ * NVMe subsystem that any of our controllers can see, and the namespace
+ * structure for each controller is chained off it.  For private namespaces
+ * there is a 1:1 relation to our namespace structures, that is, ->list
+ * only ever has a single entry.
+ */
+struct nvme_ns_head {
+	struct list_head	list;
+	struct srcu_struct      srcu;
+	unsigned		ns_id;
+	u8			eui64[8];
+	u8			nguid[16];
+	uuid_t			uuid;
+	struct list_head	entry;
+	struct kref		ref;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
 	struct nvme_ctrl *ctrl;
 	struct request_queue *queue;
 	struct gendisk *disk;
+	struct list_head siblings;
 	struct nvm_dev *ndev;
 	struct kref kref;
+	struct nvme_ns_head *head;
 	int instance;
 
-	u8 eui[8];
-	u8 nguid[16];
-	uuid_t uuid;
-
-	unsigned ns_id;
 	int lba_shift;
 	u16 ms;
 	u16 sgs;
-- 
2.11.0


 	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.rw.slba = cpu_to_le64(io.slba);
 	c.rw.length = cpu_to_le16(io.nblocks);
 	c.rw.control = cpu_to_le16(io.control);
@@ -1041,7 +1059,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	switch (cmd) {
 	case NVME_IOCTL_ID:
 		force_successful_syscall_return();
-		return ns->ns_id;
+		return ns->head->ns_id;
 	case NVME_IOCTL_ADMIN_CMD:
 		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
 	case NVME_IOCTL_IO_CMD:
@@ -1248,7 +1266,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		return -ENODEV;
 	}
 
-	id = nvme_identify_ns(ctrl, ns->ns_id);
+	id = nvme_identify_ns(ctrl, ns->head->ns_id);
 	if (!id)
 		return -ENODEV;
 
@@ -1257,12 +1275,12 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto out;
 	}
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
-	if (!uuid_equal(&ns->uuid, &uuid) ||
-	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
-	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
+	nvme_report_ns_ids(ctrl, ns->head->ns_id, id, eui64, nguid, &uuid);
+	if (!uuid_equal(&ns->head->uuid, &uuid) ||
+	    memcmp(&ns->head->nguid, &nguid, sizeof(ns->head->nguid)) ||
+	    memcmp(&ns->head->eui64, &eui64, sizeof(ns->head->eui64))) {
 		dev_err(ctrl->device,
-			"identifiers changed for nsid %d\n", ns->ns_id);
+			"identifiers changed for nsid %d\n", ns->head->ns_id);
 		ret = -ENODEV;
 	}
 
@@ -1303,7 +1321,7 @@ static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = op;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw10[0] = cpu_to_le32(cdw10);
 
 	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
@@ -1812,6 +1830,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	if (!subsys)
 		return -ENOMEM;
 	INIT_LIST_HEAD(&subsys->ctrls);
+	INIT_LIST_HEAD(&subsys->nsheads);
 	kref_init(&subsys->ref);
 	nvme_init_subnqn(subsys, ctrl, id);
 	mutex_init(&subsys->lock);
@@ -2132,14 +2151,14 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 	int serial_len = sizeof(ctrl->serial);
 	int model_len = sizeof(ctrl->model);
 
-	if (!uuid_is_null(&ns->uuid))
-		return sprintf(buf, "uuid.%pU\n", &ns->uuid);
+	if (!uuid_is_null(&ns->head->uuid))
+		return sprintf(buf, "uuid.%pU\n", &ns->head->uuid);
 
-	if (memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
-		return sprintf(buf, "eui.%16phN\n", ns->nguid);
+	if (memchr_inv(ns->head->nguid, 0, sizeof(ns->head->nguid)))
+		return sprintf(buf, "eui.%16phN\n", ns->head->nguid);
 
-	if (memchr_inv(ns->eui, 0, sizeof(ns->eui)))
-		return sprintf(buf, "eui.%8phN\n", ns->eui);
+	if (memchr_inv(ns->head->eui64, 0, sizeof(ns->head->eui64)))
+		return sprintf(buf, "eui.%8phN\n", ns->head->eui64);
 
 	while (serial_len > 0 && (ctrl->serial[serial_len - 1] == ' ' ||
 				  ctrl->serial[serial_len - 1] == '\0'))
@@ -2149,7 +2168,8 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 		model_len--;
 
 	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", ctrl->vid,
-		serial_len, ctrl->serial, model_len, ctrl->model, ns->ns_id);
+		serial_len, ctrl->serial, model_len, ctrl->model,
+		ns->head->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
@@ -2157,7 +2177,7 @@ static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
 			  char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->nguid);
+	return sprintf(buf, "%pU\n", ns->head->nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
@@ -2169,12 +2189,12 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
 	 */
-	if (uuid_is_null(&ns->uuid)) {
+	if (uuid_is_null(&ns->head->uuid)) {
 		printk_ratelimited(KERN_WARNING
 				   "No UUID available providing old NGUID\n");
-		return sprintf(buf, "%pU\n", ns->nguid);
+		return sprintf(buf, "%pU\n", ns->head->nguid);
 	}
-	return sprintf(buf, "%pU\n", &ns->uuid);
+	return sprintf(buf, "%pU\n", &ns->head->uuid);
 }
 static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
 
@@ -2182,7 +2202,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->eui);
+	return sprintf(buf, "%8phd\n", ns->head->eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
@@ -2190,7 +2210,7 @@ static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%d\n", ns->ns_id);
+	return sprintf(buf, "%d\n", ns->head->ns_id);
 }
 static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
 
@@ -2210,16 +2230,16 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
 
 	if (a == &dev_attr_uuid.attr) {
-		if (uuid_is_null(&ns->uuid) ||
-		    !memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
+		if (uuid_is_null(&ns->head->uuid) ||
+		    !memchr_inv(ns->head->nguid, 0, sizeof(ns->head->nguid)))
 			return 0;
 	}
 	if (a == &dev_attr_nguid.attr) {
-		if (!memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
+		if (!memchr_inv(ns->head->nguid, 0, sizeof(ns->head->nguid)))
 			return 0;
 	}
 	if (a == &dev_attr_eui.attr) {
-		if (!memchr_inv(ns->eui, 0, sizeof(ns->eui)))
+		if (!memchr_inv(ns->head->eui64, 0, sizeof(ns->head->eui64)))
 			return 0;
 	}
 	return a->mode;
@@ -2357,12 +2377,122 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
 	NULL,
 };
 
+static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
+		unsigned nsid)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if (h->ns_id == nsid && kref_get_unless_zero(&h->ref))
+			return h;
+	}
+
+	return NULL;
+}
+
+static int __nvme_check_ids(struct nvme_subsystem *subsys,
+		struct nvme_ns_head *new)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if ((!uuid_is_null(&new->uuid) &&
+		     uuid_equal(&new->uuid, &h->uuid)) ||
+		    (memchr_inv(new->nguid, 0, sizeof(new->nguid)) &&
+		     memcmp(&new->nguid, &h->nguid, sizeof(new->nguid))) ||
+		    (memchr_inv(new->eui64, 0, sizeof(new->eui64)) &&
+		     memcmp(&new->eui64, &h->eui64, sizeof(new->eui64))))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
+		unsigned nsid, struct nvme_id_ns *id)
+{
+	struct nvme_ns_head *head;
+	int ret = -ENOMEM;
+
+	head = kzalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		goto out;
+
+	INIT_LIST_HEAD(&head->list);
+	head->ns_id = nsid;
+	init_srcu_struct(&head->srcu);
+	kref_init(&head->ref);
+
+	nvme_report_ns_ids(ctrl, nsid, id, head->eui64, head->nguid,
+			&head->uuid);
+
+	ret = __nvme_check_ids(ctrl->subsys, head);
+	if (ret) {
+		dev_err(ctrl->device,
+			"duplicate IDs for nsid %d\n", nsid);
+		goto out_free_head;
+	}
+
+	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
+	return head;
+out_free_head:
+	cleanup_srcu_struct(&head->srcu);
+	kfree(head);
+out:
+	return ERR_PTR(ret);
+}
+
+static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
+		struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	bool is_shared = id->nmic & (1 << 0);
+	struct nvme_ns_head *head = NULL;
+	int ret = 0;
+
+	mutex_lock(&ctrl->subsys->lock);
+	if (is_shared)
+		head = __nvme_find_ns_head(ctrl->subsys, nsid);
+	if (!head) {
+		head = nvme_alloc_ns_head(ctrl, nsid, id);
+		if (IS_ERR(head)) {
+			ret = PTR_ERR(head);
+			goto out_unlock;
+		}
+	} else {
+		u8 eui64[8] = { 0 }, nguid[16] = { 0 };
+		uuid_t uuid = uuid_null;
+
+		nvme_report_ns_ids(ctrl, nsid, id, eui64, nguid, &uuid);
+		if (!uuid_equal(&head->uuid, &uuid) ||
+		    memcmp(&head->nguid, &nguid, sizeof(head->nguid)) ||
+		    memcmp(&head->eui64, &eui64, sizeof(head->eui64))) {
+			dev_err(ctrl->device,
+				"IDs don't match for shared namespace %d\n",
+					nsid);
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
+	list_add_tail(&ns->siblings, &head->list);
+	ns->head = head;
+
+out_unlock:
+	mutex_unlock(&ctrl->subsys->lock);
+	return ret;
+}
+
 static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
 {
 	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
 	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
 
-	return nsa->ns_id - nsb->ns_id;
+	return nsa->head->ns_id - nsb->head->ns_id;
 }
 
 static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
@@ -2371,12 +2501,12 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->ns_id == nsid) {
+		if (ns->head->ns_id == nsid) {
 			kref_get(&ns->kref);
 			ret = ns;
 			break;
 		}
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			break;
 	}
 	mutex_unlock(&ctrl->namespaces_mutex);
@@ -2391,7 +2521,7 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
 	if (!ctrl->nr_streams)
 		return 0;
 
-	ret = nvme_get_stream_params(ctrl, &s, ns->ns_id);
+	ret = nvme_get_stream_params(ctrl, &s, ns->head->ns_id);
 	if (ret)
 		return ret;
 
@@ -2433,7 +2563,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	ns->ctrl = ctrl;
 
 	kref_init(&ns->kref);
-	ns->ns_id = nsid;
 	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
 
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
@@ -2449,17 +2578,18 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
+	if (nvme_init_ns_head(ns, nsid, id))
+		goto out_free_id;
 
 	if (nvme_nvm_ns_supported(ns, id) &&
 				nvme_nvm_register(ns, disk_name, node)) {
 		dev_warn(ctrl->device, "%s: LightNVM init failure\n", __func__);
-		goto out_free_id;
+		goto out_unlink_ns;
 	}
 
 	disk = alloc_disk_node(0, node);
 	if (!disk)
-		goto out_free_id;
+		goto out_unlink_ns;
 
 	disk->fops = &nvme_fops;
 	disk->private_data = ns;
@@ -2487,6 +2617,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
 	return;
+ out_unlink_ns:
+	mutex_lock(&ctrl->subsys->lock);
+	list_del_rcu(&ns->siblings);
+	mutex_unlock(&ctrl->subsys->lock);
  out_free_id:
 	kfree(id);
  out_free_queue:
@@ -2499,6 +2633,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
+	struct nvme_ns_head *head = ns->head;
+
 	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
 		return;
 
@@ -2513,10 +2649,16 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 		blk_cleanup_queue(ns->queue);
 	}
 
+	mutex_lock(&ns->ctrl->subsys->lock);
+	if (head)
+		list_del_rcu(&ns->siblings);
+	mutex_unlock(&ns->ctrl->subsys->lock);
+
 	mutex_lock(&ns->ctrl->namespaces_mutex);
 	list_del_init(&ns->list);
 	mutex_unlock(&ns->ctrl->namespaces_mutex);
 
+	synchronize_srcu(&head->srcu);
 	nvme_put_ns(ns);
 }
 
@@ -2539,7 +2681,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 	struct nvme_ns *ns, *next;
 
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			nvme_ns_remove(ns);
 	}
 }
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index c1a28569e843..3c9505066b58 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -305,7 +305,7 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct nvm_id *nvm_id)
 	int ret;
 
 	c.identity.opcode = nvme_nvm_admin_identity;
-	c.identity.nsid = cpu_to_le32(ns->ns_id);
+	c.identity.nsid = cpu_to_le32(ns->head->ns_id);
 	c.identity.chnl_off = 0;
 
 	nvme_nvm_id = kmalloc(sizeof(struct nvme_nvm_id), GFP_KERNEL);
@@ -344,7 +344,7 @@ static int nvme_nvm_get_l2p_tbl(struct nvm_dev *nvmdev, u64 slba, u32 nlb,
 	int ret = 0;
 
 	c.l2p.opcode = nvme_nvm_admin_get_l2p_tbl;
-	c.l2p.nsid = cpu_to_le32(ns->ns_id);
+	c.l2p.nsid = cpu_to_le32(ns->head->ns_id);
 	entries = kmalloc(len, GFP_KERNEL);
 	if (!entries)
 		return -ENOMEM;
@@ -402,7 +402,7 @@ static int nvme_nvm_get_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr ppa,
 	int ret = 0;
 
 	c.get_bb.opcode = nvme_nvm_admin_get_bb_tbl;
-	c.get_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.get_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.get_bb.spba = cpu_to_le64(ppa.ppa);
 
 	bb_tbl = kzalloc(tblsz, GFP_KERNEL);
@@ -452,7 +452,7 @@ static int nvme_nvm_set_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr *ppas,
 	int ret = 0;
 
 	c.set_bb.opcode = nvme_nvm_admin_set_bb_tbl;
-	c.set_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.set_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.set_bb.spba = cpu_to_le64(ppas->ppa);
 	c.set_bb.nlb = cpu_to_le16(nr_ppas - 1);
 	c.set_bb.value = type;
@@ -469,7 +469,7 @@ static inline void nvme_nvm_rqtocmd(struct nvm_rq *rqd, struct nvme_ns *ns,
 				    struct nvme_nvm_command *c)
 {
 	c->ph_rw.opcode = rqd->opcode;
-	c->ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c->ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c->ph_rw.spba = cpu_to_le64(rqd->ppa_addr.ppa);
 	c->ph_rw.metadata = cpu_to_le64(rqd->dma_meta_list);
 	c->ph_rw.control = cpu_to_le16(rqd->flags);
@@ -691,7 +691,7 @@ static int nvme_nvm_submit_vio(struct nvme_ns *ns,
 
 	memset(&c, 0, sizeof(c));
 	c.ph_rw.opcode = vio.opcode;
-	c.ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c.ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.ph_rw.control = cpu_to_le16(vio.control);
 	c.ph_rw.length = cpu_to_le16(vio.nppas);
 
@@ -728,7 +728,7 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = vcmd.opcode;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw2[0] = cpu_to_le32(vcmd.cdw2);
 	c.common.cdw2[1] = cpu_to_le32(vcmd.cdw3);
 	/* cdw11-12 */
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 913eaef6fc33..f68a89be654b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -194,25 +194,41 @@ struct nvme_subsystem {
 	struct list_head	entry;
 	struct mutex		lock;
 	struct list_head	ctrls;
+	struct list_head	nsheads;
 	struct kref		ref;
 	char			subnqn[NVMF_NQN_SIZE];
 };
 
+/*
+ * Anchor structure for namespaces.  There is one for each namespace in an
+ * NVMe subsystem that any of our controllers can see, and the namespace
+ * structure for each controller is chained off it.  For private namespaces
+ * there is a 1:1 relation to our namespace structures, that is ->list
+ * only ever has a single entry for private namespaces.
+ */
+struct nvme_ns_head {
+	struct list_head	list;
+	struct srcu_struct      srcu;
+	unsigned		ns_id;
+	u8			eui64[8];
+	u8			nguid[16];
+	uuid_t			uuid;
+	struct list_head	entry;
+	struct kref		ref;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
 	struct nvme_ctrl *ctrl;
 	struct request_queue *queue;
 	struct gendisk *disk;
+	struct list_head siblings;
 	struct nvm_dev *ndev;
 	struct kref kref;
+	struct nvme_ns_head *head;
 	int instance;
 
-	u8 eui[8];
-	u8 nguid[16];
-	uuid_t uuid;
-
-	unsigned ns_id;
 	int lba_shift;
 	u16 ms;
 	u16 sgs;
-- 
2.11.0


* [PATCH 08/10] block: provide a generic_make_request_fast helper
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues.  It does not provide any recursion prevention, so
the caller must not split the bio.
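
The failure semantics of the new helper can be sketched in plain C.  This is
a hedged user-space model with made-up names (queue_enter, submit_fast); the
real blk_queue_enter() also deals with queue freezing and percpu references:

```c
#include <assert.h>
#include <stdbool.h>

enum status { STS_OK, STS_AGAIN, STS_IOERR };

struct queue {
	bool dying;	/* blk_queue_dying() */
	bool frozen;	/* queue temporarily unavailable */
};

/* Model of blk_queue_enter(): fails if the queue is dying, or if it
 * is frozen and the caller asked not to wait. */
int queue_enter(struct queue *q, bool nowait)
{
	if (q->dying)
		return -1;
	if (q->frozen && nowait)
		return -1;
	return 0;	/* the real helper would sleep here if !nowait */
}

/* Sketch of the fast path: a failed entry completes the bio with
 * AGAIN for a nowait submitter on a live queue, IOERR otherwise. */
enum status submit_fast(struct queue *q, bool nowait)
{
	if (queue_enter(q, nowait)) {
		if (nowait && !q->dying)
			return STS_AGAIN;	/* BLK_STS_AGAIN */
		return STS_IOERR;		/* BLK_STS_IOERR */
	}
	/* q->make_request_fn(q, bio) would run here */
	return STS_OK;
}
```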

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c       | 26 ++++++++++++++++++++++++++
 include/linux/blkdev.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0c7050f20a60..e91f3477538f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2224,6 +2224,32 @@ blk_qc_t generic_make_request(struct bio *bio)
 }
 EXPORT_SYMBOL(generic_make_request);
 
+/*
+ * Fast-path version of generic_make_request.  The caller must ensure that
+ * generic_make_request_checks has already been called with the limits that
+ * fit the queue, and no recursion prevention is provided.
+ */
+blk_qc_t generic_make_request_fast(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	bool nowait = bio->bi_opf & REQ_NOWAIT;
+	blk_qc_t ret = BLK_QC_T_NONE;
+
+	if (unlikely(blk_queue_enter(q, nowait))) {
+		if (nowait && !blk_queue_dying(q))
+			bio->bi_status = BLK_STS_AGAIN;
+		else
+			bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+		return BLK_QC_T_NONE;
+	}
+
+	ret = q->make_request_fn(q, bio);
+	blk_queue_exit(q);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(generic_make_request_fast);
+
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The &struct bio which describes the I/O
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25f6a0cb27d3..5f81151d181a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -938,6 +938,7 @@ do {								\
 extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
 extern blk_qc_t generic_make_request(struct bio *bio);
+extern blk_qc_t generic_make_request_fast(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 extern void blk_put_request(struct request *);
-- 
2.11.0


* [PATCH 09/10] blk-mq: add a blk_steal_bios helper
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This helper allows stealing the uncompleted bios from a request so
that they can be reissued on another path.
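
Because a request's bios form a singly linked chain with head and tail
pointers, the helper is a constant-time splice.  A standalone sketch, with
the kernel structures reduced to just their list fields (field names follow
the patch; __data_len is shortened to data_len):

```c
#include <assert.h>
#include <stddef.h>

struct bio {
	struct bio *bi_next;
};

struct bio_list {
	struct bio *head, *tail;
};

struct request {
	struct bio *bio, *biotail;
	unsigned data_len;	/* __data_len in the kernel */
};

/* Mirror of blk_steal_bios(): splice the request's whole bio chain
 * onto the tail of @list and leave the request empty.  Only valid if
 * the request has not been partially completed. */
void steal_bios(struct bio_list *list, struct request *rq)
{
	if (rq->bio) {
		if (list->tail)
			list->tail->bi_next = rq->bio;
		else
			list->head = rq->bio;
		list->tail = rq->biotail;
	}

	rq->bio = NULL;
	rq->biotail = NULL;
	rq->data_len = 0;
}
```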

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c       | 20 ++++++++++++++++++++
 include/linux/blkdev.h |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index e91f3477538f..db18e3befdd5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2702,6 +2702,26 @@ struct request *blk_fetch_request(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_fetch_request);
 
+/*
+ * Steal bios from a request.  The request must not have been partially
+ * completed before.
+ */
+void blk_steal_bios(struct bio_list *list, struct request *rq)
+{
+	if (rq->bio) {
+		if (list->tail)
+			list->tail->bi_next = rq->bio;
+		else
+			list->head = rq->bio;
+		list->tail = rq->biotail;
+	}
+
+	rq->bio = NULL;
+	rq->biotail = NULL;
+	rq->__data_len = 0;
+}
+EXPORT_SYMBOL_GPL(blk_steal_bios);
+
 /**
  * blk_update_request - Special helper function for request stacking drivers
  * @req:      the request being processed
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5f81151d181a..e8c11ad68809 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1112,6 +1112,8 @@ extern struct request *blk_peek_request(struct request_queue *q);
 extern void blk_start_request(struct request *rq);
 extern struct request *blk_fetch_request(struct request_queue *q);
 
+void blk_steal_bios(struct bio_list *list, struct request *rq);
+
 /*
  * Request completion related functions.
  *
-- 
2.11.0


* [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-23 17:58 ` Christoph Hellwig
@ 2017-08-23 17:58   ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-23 17:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This patch adds initial multipath support to the nvme driver.  For each
namespace we create a new block device node, which can be used to access
that namespace through any of the controllers that refer to it.

Currently we will always send I/O to the first available path; this
will be changed once the NVMe Asymmetric Namespace Access (ANA) TP is
ratified and implemented, at which point we will look at the ANA state
for each namespace.  Another possibility that was prototyped is to
use the path that is closest to the submitting NUMA node, which will
be mostly interesting for PCI, but might also be useful for RDMA or FC
transports in the future.  There is no plan to implement round robin
or I/O service time path selectors, as those are not scalable with
the performance rates provided by NVMe.
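
A minimal sketch of the "first available path" policy (hypothetical names;
the driver walks the SRCU-protected sibling list under the head and checks
the controller state instead of a per-path flag):

```c
#include <assert.h>
#include <stddef.h>

enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_DELETING };

struct path {
	enum ctrl_state state;	/* ns->ctrl->state in the driver */
	struct path *next;	/* sibling on the nvme_ns_head list */
};

/* Return the first path whose controller is live; NULL means no
 * usable path, in which case bios sit on the head's requeue list
 * until a controller transitions back to live. */
struct path *find_path(struct path *list)
{
	struct path *p;

	for (p = list; p; p = p->next)
		if (p->state == CTRL_LIVE)
			return p;
	return NULL;
}
```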

The multipath device will go away once all paths to it disappear,
any delay to keep it alive needs to be implemented at the controller
level.

TODO: implement sysfs interfaces for the new subsystem and
subsystem-namespace object.  Unless we can come up with something
better than sysfs here..

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 248 +++++++++++++++++++++++++++++++++++++++++++----
 drivers/nvme/host/nvme.h |   6 ++
 2 files changed, 236 insertions(+), 18 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index abc5911a8a66..feec8a708b7d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -77,6 +77,8 @@ static DEFINE_MUTEX(nvme_subsystems_lock);
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
+static DEFINE_IDA(nvme_disk_ida);
+
 static struct class *nvme_class;
 
 static __le32 nvme_get_log_dw10(u8 lid, size_t size)
@@ -131,16 +133,80 @@ static blk_status_t nvme_error_status(struct request *req)
 	}
 }
 
-static inline bool nvme_req_needs_retry(struct request *req)
+static bool nvme_failover_rq(struct request *req)
 {
-	if (blk_noretry_request(req))
+	struct nvme_ns *ns = req->q->queuedata;
+	unsigned long flags;
+
+	/*
+	 * Only fail over commands that came in through the multipath
+	 * aware submission path.  Note that ns->head might not be set up
+	 * for commands used during controller initialization, but those
+	 * must never set REQ_FAILFAST_TRANSPORT.
+	 */
+	if (!(req->cmd_flags & REQ_FAILFAST_TRANSPORT))
+		return false;
+
+	switch (nvme_req(req)->status & 0x7ff) {
+	/*
+	 * Generic command status:
+	 */
+	case NVME_SC_INVALID_OPCODE:
+	case NVME_SC_INVALID_FIELD:
+	case NVME_SC_INVALID_NS:
+	case NVME_SC_LBA_RANGE:
+	case NVME_SC_CAP_EXCEEDED:
+	case NVME_SC_RESERVATION_CONFLICT:
+		return false;
+
+	/*
+	 * I/O command set specific error.  Unfortunately these values are
+	 * reused for fabrics commands, but those should never get here.
+	 */
+	case NVME_SC_BAD_ATTRIBUTES:
+	case NVME_SC_INVALID_PI:
+	case NVME_SC_READ_ONLY:
+	case NVME_SC_ONCS_NOT_SUPPORTED:
+		WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
+			nvme_fabrics_command);
+		return false;
+
+	/*
+	 * Media and Data Integrity Errors:
+	 */
+	case NVME_SC_WRITE_FAULT:
+	case NVME_SC_READ_ERROR:
+	case NVME_SC_GUARD_CHECK:
+	case NVME_SC_APPTAG_CHECK:
+	case NVME_SC_REFTAG_CHECK:
+	case NVME_SC_COMPARE_FAILED:
+	case NVME_SC_ACCESS_DENIED:
+	case NVME_SC_UNWRITTEN_BLOCK:
 		return false;
+	}
+
+	/* Anything else could be a path failure, so should be retried */
+	spin_lock_irqsave(&ns->head->requeue_lock, flags);
+	blk_steal_bios(&ns->head->requeue_list, req);
+	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
+
+	nvme_reset_ctrl(ns->ctrl);
+	kblockd_schedule_work(&ns->head->requeue_work);
+	return true;
+}
+
+static inline bool nvme_req_needs_retry(struct request *req)
+{
 	if (nvme_req(req)->status & NVME_SC_DNR)
 		return false;
 	if (jiffies - req->start_time >= req->timeout)
 		return false;
 	if (nvme_req(req)->retries >= nvme_max_retries)
 		return false;
+	if (nvme_failover_rq(req))
+		return false;
+	if (blk_noretry_request(req))
+		return false;
 	return true;
 }
 
@@ -175,6 +241,18 @@ void nvme_cancel_request(struct request *req, void *data, bool reserved)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
+static void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (ns->head)
+			kblockd_schedule_work(&ns->head->requeue_work);
+	}
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
@@ -242,9 +320,10 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 
 	if (changed)
 		ctrl->state = new_state;
-
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 
+	if (changed && ctrl->state == NVME_CTRL_LIVE)
+		nvme_kick_requeue_lists(ctrl);
 	return changed;
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -254,6 +333,15 @@ static void nvme_destroy_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
+	del_gendisk(head->disk);
+	blk_set_queue_dying(head->disk->queue);
+	/* make sure all pending bios are cleaned up */
+	kblockd_schedule_work(&head->requeue_work);
+	flush_work(&head->requeue_work);
+	blk_cleanup_queue(head->disk->queue);
+	put_disk(head->disk);
+	ida_simple_remove(&nvme_disk_ida, head->instance);
+
 	list_del_init(&head->entry);
 	cleanup_srcu_struct(&head->srcu);
 	kfree(head);
@@ -1128,8 +1216,10 @@ static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id,
 	if (blk_get_integrity(disk) &&
 	    (ns->pi_type != pi_type || ns->ms != old_ms ||
 	     bs != queue_logical_block_size(disk->queue) ||
-	     (ns->ms && ns->ext)))
+	     (ns->ms && ns->ext))) {
 		blk_integrity_unregister(disk);
+		blk_integrity_unregister(ns->head->disk);
+	}
 
 	ns->pi_type = pi_type;
 }
@@ -1157,7 +1247,9 @@ static void nvme_init_integrity(struct nvme_ns *ns)
 	}
 	integrity.tuple_size = ns->ms;
 	blk_integrity_register(ns->disk, &integrity);
+	blk_integrity_register(ns->head->disk, &integrity);
 	blk_queue_max_integrity_segments(ns->queue, 1);
+	blk_queue_max_integrity_segments(ns->head->disk->queue, 1);
 }
 #else
 static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id,
@@ -1175,7 +1267,7 @@ static void nvme_set_chunk_size(struct nvme_ns *ns)
 	blk_queue_chunk_sectors(ns->queue, rounddown_pow_of_two(chunk_size));
 }
 
-static void nvme_config_discard(struct nvme_ns *ns)
+static void nvme_config_discard(struct nvme_ns *ns, struct request_queue *queue)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	u32 logical_block_size = queue_logical_block_size(ns->queue);
@@ -1186,18 +1278,18 @@ static void nvme_config_discard(struct nvme_ns *ns)
 	if (ctrl->nr_streams && ns->sws && ns->sgs) {
 		unsigned int sz = logical_block_size * ns->sws * ns->sgs;
 
-		ns->queue->limits.discard_alignment = sz;
-		ns->queue->limits.discard_granularity = sz;
+		queue->limits.discard_alignment = sz;
+		queue->limits.discard_granularity = sz;
 	} else {
-		ns->queue->limits.discard_alignment = logical_block_size;
-		ns->queue->limits.discard_granularity = logical_block_size;
+		queue->limits.discard_alignment = logical_block_size;
+		queue->limits.discard_granularity = logical_block_size;
 	}
-	blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
-	blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
-	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+	blk_queue_max_discard_sectors(queue, UINT_MAX);
+	blk_queue_max_discard_segments(queue, NVME_DSM_MAX_RANGES);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, queue);
 
 	if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
-		blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX);
+		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
 static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
@@ -1238,17 +1330,25 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
 		nvme_prep_integrity(disk, id, bs);
 	blk_queue_logical_block_size(ns->queue, bs);
+	blk_queue_logical_block_size(ns->head->disk->queue, bs);
 	if (ns->noiob)
 		nvme_set_chunk_size(ns);
 	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
 		nvme_init_integrity(ns);
-	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
+	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk)) {
 		set_capacity(disk, 0);
-	else
+		if (ns->head)
+			set_capacity(ns->head->disk, 0);
+	} else {
 		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+		if (ns->head)
+			set_capacity(ns->head->disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+	}
 
-	if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
-		nvme_config_discard(ns);
+	if (ctrl->oncs & NVME_CTRL_ONCS_DSM) {
+		nvme_config_discard(ns, ns->queue);
+		nvme_config_discard(ns, ns->head->disk->queue);
+	}
 	blk_mq_unfreeze_queue(disk->queue);
 }
 
@@ -2377,6 +2477,73 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
 	NULL,
 };
 
+static struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *ns;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (ns->ctrl->state == NVME_CTRL_LIVE) {
+			rcu_assign_pointer(head->current_path, ns);
+			return ns;
+		}
+	}
+
+	return NULL;
+}
+
+static blk_qc_t nvme_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct nvme_ns_head *head = q->queuedata;
+	struct nvme_ns *ns;
+	blk_qc_t ret = BLK_QC_T_NONE;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = srcu_dereference(head->current_path, &head->srcu);
+	if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
+		ns = nvme_find_path(head);
+	if (likely(ns)) {
+		bio->bi_disk = ns->disk;
+		bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
+		ret = generic_make_request_fast(bio);
+	} else if (!list_empty_careful(&head->list)) {
+		printk_ratelimited("no path available - requeuing I/O\n");
+
+		spin_lock_irq(&head->requeue_lock);
+		bio_list_add(&head->requeue_list, bio);
+		spin_unlock_irq(&head->requeue_lock);
+	} else {
+		printk_ratelimited("no path - failing I/O\n");
+
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+	}
+
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return ret;
+}
+
+static const struct block_device_operations nvme_subsys_ops = {
+	.owner		= THIS_MODULE,
+};
+
+static void nvme_requeue_work(struct work_struct *work)
+{
+	struct nvme_ns_head *head =
+		container_of(work, struct nvme_ns_head, requeue_work);
+	struct bio *bio, *next;
+
+	spin_lock_irq(&head->requeue_lock);
+	next = bio_list_get(&head->requeue_list);
+	spin_unlock_irq(&head->requeue_lock);
+
+	while ((bio = next) != NULL) {
+		next = bio->bi_next;
+		bio->bi_next = NULL;
+		generic_make_request_fast(bio);
+	}
+}
+
 static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
 		unsigned nsid)
 {
@@ -2416,6 +2583,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 		unsigned nsid, struct nvme_id_ns *id)
 {
 	struct nvme_ns_head *head;
+	struct request_queue *q;
 	int ret = -ENOMEM;
 
 	head = kzalloc(sizeof(*head), GFP_KERNEL);
@@ -2424,6 +2592,9 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 
 	INIT_LIST_HEAD(&head->list);
 	head->ns_id = nsid;
+	bio_list_init(&head->requeue_list);
+	spin_lock_init(&head->requeue_lock);
+	INIT_WORK(&head->requeue_work, nvme_requeue_work);
 	init_srcu_struct(&head->srcu);
 	kref_init(&head->ref);
 
@@ -2437,8 +2608,37 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 		goto out_free_head;
 	}
 
+	ret = -ENOMEM;
+	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
+	if (!q)
+		goto out_free_head;
+	q->queuedata = head;
+	blk_queue_make_request(q, nvme_make_request);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
+	/* set to a default value of 512 until the disk is validated */
+	blk_queue_logical_block_size(q, 512);
+	nvme_set_queue_limits(ctrl, q);
+
+	head->instance = ida_simple_get(&nvme_disk_ida, 1, 0, GFP_KERNEL);
+	if (head->instance < 0)
+		goto out_cleanup_queue;
+
+	head->disk = alloc_disk(0);
+	if (!head->disk)
+		goto out_ida_remove;
+	head->disk->fops = &nvme_subsys_ops;
+	head->disk->private_data = head;
+	head->disk->queue = q;
+	head->disk->flags = GENHD_FL_EXT_DEVT;
+	sprintf(head->disk->disk_name, "nvme/ns%d", head->instance);
+
 	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
 	return head;
+
+out_ida_remove:
+	ida_simple_remove(&nvme_disk_ida, head->instance);
+out_cleanup_queue:
+	blk_cleanup_queue(q);
 out_free_head:
 	cleanup_srcu_struct(&head->srcu);
 	kfree(head);
@@ -2447,7 +2647,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 }
 
 static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
-		struct nvme_id_ns *id)
+		struct nvme_id_ns *id, bool *new)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	bool is_shared = id->nmic & (1 << 0);
@@ -2463,6 +2663,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
 			ret = PTR_ERR(head);
 			goto out_unlock;
 		}
+
+		*new = true;
 	} else {
 		u8 eui64[8] = { 0 }, nguid[16] = { 0 };
 		uuid_t uuid = uuid_null;
@@ -2477,6 +2679,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
 			ret = -EINVAL;
 			goto out_unlock;
 		}
+
+		*new = false;
 	}
 
 	list_add_tail(&ns->siblings, &head->list);
@@ -2546,6 +2750,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	struct nvme_id_ns *id;
 	char disk_name[DISK_NAME_LEN];
 	int node = dev_to_node(ctrl->dev);
+	bool new = true;
 
 	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
@@ -2578,7 +2783,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	if (nvme_init_ns_head(ns, nsid, id))
+	if (nvme_init_ns_head(ns, nsid, id, &new))
 		goto out_free_id;
 
 	if (nvme_nvm_ns_supported(ns, id) &&
@@ -2616,6 +2821,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (ns->ndev && nvme_nvm_register_sysfs(ns))
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
+
+	if (new)
+		add_disk(ns->head->disk);
+
 	return;
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
@@ -2650,8 +2859,10 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	}
 
 	mutex_lock(&ns->ctrl->subsys->lock);
-	if (head)
+	if (head) {
+		rcu_assign_pointer(head->current_path, NULL);
 		list_del_rcu(&ns->siblings);
+	}
 	mutex_unlock(&ns->ctrl->subsys->lock);
 
 	mutex_lock(&ns->ctrl->namespaces_mutex);
@@ -3201,6 +3412,7 @@ int __init nvme_core_init(void)
 
 void nvme_core_exit(void)
 {
+	ida_destroy(&nvme_disk_ida);
 	class_destroy(nvme_class);
 	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
 	destroy_workqueue(nvme_wq);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f68a89be654b..e8b28b7d38e8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -207,14 +207,20 @@ struct nvme_subsystem {
  * only ever has a single entry for private namespaces.
  */
 struct nvme_ns_head {
+	struct nvme_ns		*current_path;
+	struct gendisk		*disk;
 	struct list_head	list;
 	struct srcu_struct      srcu;
+	struct bio_list		requeue_list;
+	spinlock_t		requeue_lock;
+	struct work_struct	requeue_work;
 	unsigned		ns_id;
 	u8			eui64[8];
 	u8			nguid[16];
 	uuid_t			uuid;
 	struct list_head	entry;
 	struct kref		ref;
+	int			instance;
 };
 
 struct nvme_ns {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 122+ messages in thread
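The fast path in nvme_make_request() above reduces to three outcomes: submit down the first live path, park the bio on the requeue list when paths exist but none are live, or fail the I/O when no path remains. That decision can be sketched in self-contained user-space C; the types and names below are simplified stand-ins for the kernel's struct nvme_ns / struct nvme_ns_head, not the driver API:

```c
#include <assert.h>
#include <stddef.h>

enum path_state { PATH_LIVE, PATH_DEAD };
enum submit_result { SUBMITTED, REQUEUED, FAILED };

/* Simplified stand-ins for struct nvme_ns / struct nvme_ns_head. */
struct path {
	enum path_state state;
	struct path *next;	/* sibling paths for the same namespace */
};

struct ns_head {
	struct path *paths;	/* all known paths, live or not */
	int requeued;		/* stand-in for the requeue bio list */
};

/* Like nvme_find_path(): the first path whose controller is live wins. */
static struct path *find_live_path(struct ns_head *head)
{
	struct path *p;

	for (p = head->paths; p; p = p->next)
		if (p->state == PATH_LIVE)
			return p;
	return NULL;
}

/*
 * The three outcomes of nvme_make_request(): submit down a live path,
 * requeue when paths exist but none are currently live (a controller
 * reset may revive one), and fail only when no path remains at all.
 */
static enum submit_result submit_bio_sketch(struct ns_head *head)
{
	if (find_live_path(head))
		return SUBMITTED;
	if (head->paths) {
		head->requeued++;
		return REQUEUED;
	}
	return FAILED;
}
```

Note that, as in the patch, "no live path" and "no path at all" are deliberately distinct cases: only the latter returns an error to the submitter.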

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
@ 2017-08-23 18:21     ` Bart Van Assche
  0 siblings, 0 replies; 122+ messages in thread
From: Bart Van Assche @ 2017-08-23 18:21 UTC (permalink / raw)


On Wed, 2017-08-23 at 19:58 +0200, Christoph Hellwig wrote:
> +static blk_qc_t nvme_make_request(struct request_queue *q, struct bio *bio)
> +{
> +	struct nvme_ns_head *head = q->queuedata;
> +	struct nvme_ns *ns;
> +	blk_qc_t ret = BLK_QC_T_NONE;
> +	int srcu_idx;
> +
> +	srcu_idx = srcu_read_lock(&head->srcu);
> +	ns = srcu_dereference(head->current_path, &head->srcu);
> +	if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
> +		ns = nvme_find_path(head);
> +	if (likely(ns)) {
> +		bio->bi_disk = ns->disk;
> +		bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
> +		ret = generic_make_request_fast(bio);
> +	} else if (!list_empty_careful(&head->list)) {
> +		printk_ratelimited("no path available - requeuing I/O\n");
> +
> +		spin_lock_irq(&head->requeue_lock);
> +		bio_list_add(&head->requeue_list, bio);
> +		spin_unlock_irq(&head->requeue_lock);
> +	} else {
> +		printk_ratelimited("no path - failing I/O\n");
> +
> +		bio->bi_status = BLK_STS_IOERR;
> +		bio_endio(bio);
> +	}
> +
> +	srcu_read_unlock(&head->srcu, srcu_idx);
> +	return ret;
> +}

Hello Christoph,

Since generic_make_request_fast() returns BLK_STS_AGAIN for a dying path:
can the same kind of soft lockups occur with the NVMe multipathing code as
with the current upstream device mapper multipathing code? See e.g.
"[PATCH 3/7] dm-mpath: Do not lock up a CPU with requeuing activity"
(https://www.redhat.com/archives/dm-devel/2017-August/msg00124.html).

Another question about this code is what will happen if
generic_make_request_fast() returns BLK_STS_AGAIN and the submit_bio() or
generic_make_request() caller ignores the return value of the called
function? A quick grep revealed that there is plenty of code that ignores
the return value of these last two functions.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 122+ messages in thread
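The lockup scenario Bart references centers on requeuing activity. The patch's drain side, nvme_requeue_work(), splices the whole bio chain out in one step while holding the lock and then resubmits each bio with the lock dropped, so resubmission can itself requeue without recursing under the lock. A user-space sketch of that splice-and-drain pattern follows; the locking is only noted in comments and a counter stands in for generic_make_request_fast():

```c
#include <assert.h>
#include <stddef.h>

/* Minimal singly linked bio list, analogous to the kernel's struct bio_list. */
struct bio { struct bio *bi_next; };
struct bio_list { struct bio *head, *tail; };

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->bi_next = NULL;
	if (bl->tail)
		bl->tail->bi_next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

/* Detach the whole chain at once, like bio_list_get() in the patch. */
static struct bio *bio_list_get(struct bio_list *bl)
{
	struct bio *bio = bl->head;

	bl->head = bl->tail = NULL;
	return bio;
}

/*
 * Drain loop shaped like nvme_requeue_work(): the list is spliced out
 * in a single step (under spin_lock_irq() in the kernel), then each
 * bio is unlinked and resubmitted individually with the lock dropped.
 * The counter stands in for generic_make_request_fast().
 */
static int requeue_drain(struct bio_list *bl)
{
	struct bio *bio, *next;
	int submitted = 0;

	next = bio_list_get(bl);	/* kernel: done under requeue_lock */

	while ((bio = next) != NULL) {
		next = bio->bi_next;
		bio->bi_next = NULL;
		submitted++;
	}
	return submitted;
}
```

Because each work invocation drains only the bios present at splice time, a bio requeued during the drain waits for the next scheduled work item rather than extending the current loop.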

* Re: [PATCH 06/10] nvme: track subsystems
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-23 22:04     ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-23 22:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

Looks great. A few minor comments below.

On Wed, Aug 23, 2017 at 07:58:11PM +0200, Christoph Hellwig wrote:
> +static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
> +{
> +	struct nvme_subsystem *subsys;
> +
> +	lockdep_assert_held(&nvme_subsystems_lock);
> +
> +	list_for_each_entry(subsys, &nvme_subsystems, entry) {
> +		if (strcmp(subsys->subnqn, subsysnqn))
> +			continue;
> +		if (!kref_get_unless_zero(&subsys->ref))
> +			continue;

You should be able to just return immediately here since there can't be
a duplicated subsysnqn in the list.

> +		return subsys;
> +	}
> +
> +	return NULL;
> +}
> +
> +static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
> +{
> +	struct nvme_subsystem *subsys, *found;
> +
> +	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
> +	if (!subsys)
> +		return -ENOMEM;
> +	INIT_LIST_HEAD(&subsys->ctrls);
> +	kref_init(&subsys->ref);
> +	nvme_init_subnqn(subsys, ctrl, id);
> +	mutex_init(&subsys->lock);
> +
> +	mutex_lock(&nvme_subsystems_lock);

This could be a spinlock instead of a mutex.

> +	found = __nvme_find_get_subsystem(subsys->subnqn);
> +	if (found) {
> +		/*
> +		 * Verify that the subsystem actually supports multiple
> +		 * controllers, else bail out.
> +		 */
> +		kfree(subsys);
> +		if (!(id->cmic & (1 << 1))) {
> +			dev_err(ctrl->device,
> +				"ignoring ctrl due to duplicate subnqn (%s).\n",
> +				found->subnqn);
> +			mutex_unlock(&nvme_subsystems_lock);
> +			return -EINVAL;

Returning -EINVAL here will cause nvme_init_identify to fail. Do we want
that to happen here? I think we want to be able to manage controllers
in such a state, but just checking if there's a good reason to not allow
them.

> +		}
> +
> +		subsys = found;
> +	} else {
> +		list_add_tail(&subsys->entry, &nvme_subsystems);
> +	}
> +
> +	ctrl->subsys = subsys;
> +	mutex_unlock(&nvme_subsystems_lock);
> +
> +	mutex_lock(&subsys->lock);
> +	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
> +	mutex_unlock(&subsys->lock);
> +
> +	return 0;
>  }
>  
>  /*
> @@ -1801,7 +1882,11 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
>  		return -EIO;
>  	}
>  
> -	nvme_init_subnqn(ctrl, id);
> +	ret = nvme_init_subsystem(ctrl, id);
> +	if (ret) {
> +		kfree(id);
> +		return ret;
> +	}

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 06/10] nvme: track subsystems
@ 2017-08-23 22:04     ` Keith Busch
  0 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-23 22:04 UTC (permalink / raw)


Looks great. A few minor comments below.

On Wed, Aug 23, 2017 at 07:58:11PM +0200, Christoph Hellwig wrote:
> +static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
> +{
> +	struct nvme_subsystem *subsys;
> +
> +	lockdep_assert_held(&nvme_subsystems_lock);
> +
> +	list_for_each_entry(subsys, &nvme_subsystems, entry) {
> +		if (strcmp(subsys->subnqn, subsysnqn))
> +			continue;
> +		if (!kref_get_unless_zero(&subsys->ref))
> +			continue;

You should be able to just return immediately here since there can't be
a duplicated subsysnqn in the list.

> +		return subsys;
> +	}
> +
> +	return NULL;
> +}
> +
> +static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
> +{
> +	struct nvme_subsystem *subsys, *found;
> +
> +	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
> +	if (!subsys)
> +		return -ENOMEM;
> +	INIT_LIST_HEAD(&subsys->ctrls);
> +	kref_init(&subsys->ref);
> +	nvme_init_subnqn(subsys, ctrl, id);
> +	mutex_init(&subsys->lock);
> +
> +	mutex_lock(&nvme_subsystems_lock);

This could be a spinlock instead of a mutex.

> +	found = __nvme_find_get_subsystem(subsys->subnqn);
> +	if (found) {
> +		/*
> +		 * Verify that the subsystem actually supports multiple
> +		 * controllers, else bail out.
> +		 */
> +		kfree(subsys);
> +		if (!(id->cmic & (1 << 1))) {
> +			dev_err(ctrl->device,
> +				"ignoring ctrl due to duplicate subnqn (%s).\n",
> +				found->subnqn);
> +			mutex_unlock(&nvme_subsystems_lock);
> +			return -EINVAL;

Returning -EINVAL here will cause nvme_init_identify to fail. Do we want
that to happen here? I think we want to be able to manage controllers
in such a state, but just checking if there's a good reason to not allow
them.

> +		}
> +
> +		subsys = found;
> +	} else {
> +		list_add_tail(&subsys->entry, &nvme_subsystems);
> +	}
> +
> +	ctrl->subsys = subsys;
> +	mutex_unlock(&nvme_subsystems_lock);
> +
> +	mutex_lock(&subsys->lock);
> +	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
> +	mutex_unlock(&subsys->lock);
> +
> +	return 0;
>  }
>  
>  /*
> @@ -1801,7 +1882,11 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
>  		return -EIO;
>  	}
>  
> -	nvme_init_subnqn(ctrl, id);
> +	ret = nvme_init_subsystem(ctrl, id);
> +	if (ret) {
> +		kfree(id);
> +		return ret;
> +	}

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-23 22:53     ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-23 22:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Wed, Aug 23, 2017 at 07:58:15PM +0200, Christoph Hellwig wrote:
> 
> TODO: implement sysfs interfaces for the new subsystem and
> subsystem-namespace object.  Unless we can come up with something
> better than sysfs here..

Can we get symlinks from the multipath'ed nvme block device to the
individual paths? I think it should be something like the following:
 
> @@ -2616,6 +2821,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>  	if (ns->ndev && nvme_nvm_register_sysfs(ns))
>  		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
>  			ns->disk->disk_name);
> +
> +	if (new)
> +		add_disk(ns->head->disk);
+
+	sysfs_create_link(&disk_to_dev(ns->head->disk)->kobj,
+			  &disk_to_dev(ns->disk)->kobj,
+			  ns->disk->name);

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 10/10] nvme: implement multipath access to nvme subsystems
@ 2017-08-23 22:53     ` Keith Busch
  0 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-23 22:53 UTC (permalink / raw)


On Wed, Aug 23, 2017@07:58:15PM +0200, Christoph Hellwig wrote:
> 
> TODO: implement sysfs interfaces for the new subsystem and
> subsystem-namespace object.  Unless we can come up with something
> better than sysfs here..

Can we get symlinks from the multipath'ed nvme block device to the
individual paths? I think it should be something like the following:
 
> @@ -2616,6 +2821,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>  	if (ns->ndev && nvme_nvm_register_sysfs(ns))
>  		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
>  			ns->disk->disk_name);
> +
> +	if (new)
> +		add_disk(ns->head->disk);
+
+	sysfs_create_link(&disk_to_dev(ns->head->disk)->kobj,
+			  &disk_to_dev(ns->disk)->kobj,
+			  ns->disk->name);

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 06/10] nvme: track subsystems
  2017-08-23 22:04     ` Keith Busch
@ 2017-08-24  8:52       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-24  8:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Wed, Aug 23, 2017 at 06:04:34PM -0400, Keith Busch wrote:
> > +	struct nvme_subsystem *subsys;
> > +
> > +	lockdep_assert_held(&nvme_subsystems_lock);
> > +
> > +	list_for_each_entry(subsys, &nvme_subsystems, entry) {
> > +		if (strcmp(subsys->subnqn, subsysnqn))
> > +			continue;
> > +		if (!kref_get_unless_zero(&subsys->ref))
> > +			continue;
> 
> You should be able to just return immediately here since there can't be
> a duplicated subsysnqn in the list.

We could have a race where we tear one subsystem instance down and
are setting up another one.

> > +	INIT_LIST_HEAD(&subsys->ctrls);
> > +	kref_init(&subsys->ref);
> > +	nvme_init_subnqn(subsys, ctrl, id);
> > +	mutex_init(&subsys->lock);
> > +
> > +	mutex_lock(&nvme_subsystems_lock);
> 
> This could be a spinlock instead of a mutex.

We could.  But given that the lock is not in the hot path there seems
to be no point in making it a spinlock.

> > +	found = __nvme_find_get_subsystem(subsys->subnqn);
> > +	if (found) {
> > +		/*
> > +		 * Verify that the subsystem actually supports multiple
> > +		 * controllers, else bail out.
> > +		 */
> > +		kfree(subsys);
> > +		if (!(id->cmic & (1 << 1))) {
> > +			dev_err(ctrl->device,
> > +				"ignoring ctrl due to duplicate subnqn (%s).\n",
> > +				found->subnqn);
> > +			mutex_unlock(&nvme_subsystems_lock);
> > +			return -EINVAL;
> 
> Returning -EINVAL here will cause nvme_init_identify to fail. Do we want
> that to happen here? I think we want to be able to manage controllers
> in such a state, but just checking if there's a good reason to not allow
> them.

Without this we will get duplicate nvme_subsystem structures, messing
up the whole lookup.  We could mark them as buggy with a flag and
make sure controllers without CMIC bit 1 set will never be linked.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 06/10] nvme: track subsystems
@ 2017-08-24  8:52       ` Christoph Hellwig
  0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-24  8:52 UTC (permalink / raw)


On Wed, Aug 23, 2017 at 06:04:34PM -0400, Keith Busch wrote:
> > +	struct nvme_subsystem *subsys;
> > +
> > +	lockdep_assert_held(&nvme_subsystems_lock);
> > +
> > +	list_for_each_entry(subsys, &nvme_subsystems, entry) {
> > +		if (strcmp(subsys->subnqn, subsysnqn))
> > +			continue;
> > +		if (!kref_get_unless_zero(&subsys->ref))
> > +			continue;
> 
> You should be able to just return immediately here since there can't be
> a duplicated subsysnqn in the list.

We could have a race where we tear one subsystem instance down and
are setting up another one.

> > +	INIT_LIST_HEAD(&subsys->ctrls);
> > +	kref_init(&subsys->ref);
> > +	nvme_init_subnqn(subsys, ctrl, id);
> > +	mutex_init(&subsys->lock);
> > +
> > +	mutex_lock(&nvme_subsystems_lock);
> 
> This could be a spinlock instead of a mutex.

We could.  But given that the lock is not in the hot path there seems
to be no point in making it a spinlock.

> > +	found = __nvme_find_get_subsystem(subsys->subnqn);
> > +	if (found) {
> > +		/*
> > +		 * Verify that the subsystem actually supports multiple
> > +		 * controllers, else bail out.
> > +		 */
> > +		kfree(subsys);
> > +		if (!(id->cmic & (1 << 1))) {
> > +			dev_err(ctrl->device,
> > +				"ignoring ctrl due to duplicate subnqn (%s).\n",
> > +				found->subnqn);
> > +			mutex_unlock(&nvme_subsystems_lock);
> > +			return -EINVAL;
> 
> Returning -EINVAL here will cause nvme_init_identify to fail. Do we want
> that to happen here? I think we want to be able to manage controllers
> in such a state, but just checking if there's a good reason to not allow
> them.

Without this we will get duplicate nvme_subsystem structures, messing
up the whole lookup.  We could mark them as buggy with a flag and
make sure controllers without CMIC bit 1 set will never be linked.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-23 22:53     ` Keith Busch
@ 2017-08-24  8:52       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-24  8:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Wed, Aug 23, 2017 at 06:53:00PM -0400, Keith Busch wrote:
> On Wed, Aug 23, 2017 at 07:58:15PM +0200, Christoph Hellwig wrote:
> > 
> > TODO: implement sysfs interfaces for the new subsystem and
> > subsystem-namespace object.  Unless we can come up with something
> > better than sysfs here..
> 
> Can we get symlinks from the multipath'ed nvme block device to the
> individual paths? I think it should be something like the following:
>  
> > @@ -2616,6 +2821,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> >  	if (ns->ndev && nvme_nvm_register_sysfs(ns))
> >  		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
> >  			ns->disk->disk_name);
> > +
> > +	if (new)
> > +		add_disk(ns->head->disk);
> +
> +	sysfs_create_link(&disk_to_dev(ns->head->disk)->kobj,
> +			  &disk_to_dev(ns->disk)->kobj,
> +			  ns->disk->name);

Yeah, probably..

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 10/10] nvme: implement multipath access to nvme subsystems
@ 2017-08-24  8:52       ` Christoph Hellwig
  0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-24  8:52 UTC (permalink / raw)


On Wed, Aug 23, 2017 at 06:53:00PM -0400, Keith Busch wrote:
> On Wed, Aug 23, 2017 at 07:58:15PM +0200, Christoph Hellwig wrote:
> > 
> > TODO: implement sysfs interfaces for the new subsystem and
> > subsystem-namespace object.  Unless we can come up with something
> > better than sysfs here..
> 
> Can we get symlinks from the multipath'ed nvme block device to the
> individual paths? I think it should be something like the following:
>  
> > @@ -2616,6 +2821,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> >  	if (ns->ndev && nvme_nvm_register_sysfs(ns))
> >  		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
> >  			ns->disk->disk_name);
> > +
> > +	if (new)
> > +		add_disk(ns->head->disk);
> +
> +	sysfs_create_link(&disk_to_dev(ns->head->disk)->kobj,
> +			  &disk_to_dev(ns->disk)->kobj,
> +			  ns->disk->name);

Yeah, probably..

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-23 18:21     ` Bart Van Assche
@ 2017-08-24  8:59       ` hch
  -1 siblings, 0 replies; 122+ messages in thread
From: hch @ 2017-08-24  8:59 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, axboe, keith.busch, linux-nvme, linux-block, sagi

On Wed, Aug 23, 2017 at 06:21:55PM +0000, Bart Van Assche wrote:
> On Wed, 2017-08-23 at 19:58 +0200, Christoph Hellwig wrote:
> > +static blk_qc_t nvme_make_request(struct request_queue *q, struct bio *bio)
> > +{
> > +	struct nvme_ns_head *head = q->queuedata;
> > +	struct nvme_ns *ns;
> > +	blk_qc_t ret = BLK_QC_T_NONE;
> > +	int srcu_idx;
> > +
> > +	srcu_idx = srcu_read_lock(&head->srcu);
> > +	ns = srcu_dereference(head->current_path, &head->srcu);
> > +	if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
> > +		ns = nvme_find_path(head);
> > +	if (likely(ns)) {
> > +		bio->bi_disk = ns->disk;
> > +		bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
> > +		ret = generic_make_request_fast(bio);
> > +	} else if (!list_empty_careful(&head->list)) {
> > +		printk_ratelimited("no path available - requeuing I/O\n");
> > +
> > +		spin_lock_irq(&head->requeue_lock);
> > +		bio_list_add(&head->requeue_list, bio);
> > +		spin_unlock_irq(&head->requeue_lock);
> > +	} else {
> > +		printk_ratelimited("no path - failing I/O\n");
> > +
> > +		bio->bi_status = BLK_STS_IOERR;
> > +		bio_endio(bio);
> > +	}
> > +
> > +	srcu_read_unlock(&head->srcu, srcu_idx);
> > +	return ret;
> > +}
> 
> Hello Christoph,
> 
> Since generic_make_request_fast() returns BLK_STS_AGAIN for a dying path:
> can the same kind of soft lockups occur with the NVMe multipathing code as
> with the current upstream device mapper multipathing code? See e.g.
> "[PATCH 3/7] dm-mpath: Do not lock up a CPU with requeuing activity"
> (https://www.redhat.com/archives/dm-devel/2017-August/msg00124.html).

I suspect the code is not going to hit it because we check the controller
state before trying to queue I/O on the lower queue.  But if you point
me to a good reproducer test case I'd like to check.

Also does the "single queue" case in your mail refer to the old
request code?  nvme only uses blk-mq so it would not hit that.
But either way I think get_request should be fixed to return
BLK_STS_IOERR if the queue is dying instead of BLK_STS_AGAIN.

> Another question about this code is what will happen if
> generic_make_request_fast() returns BLK_STS_AGAIN and the submit_bio() or
> generic_make_request() caller ignores the return value of the called
> function? A quick grep revealed that there is plenty of code that ignores
> the return value of these last two functions.

generic_make_request and generic_make_request_fast only return
the polling cookie (blk_qc_t), not a block status.  Note that we do
not use blk_get_request / blk_mq_alloc_request for the request allocation
of the request on the lower device, so unless the caller passed REQ_NOWAIT
and is able to handle BLK_STS_AGAIN we won't ever return it.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 10/10] nvme: implement multipath access to nvme subsystems
@ 2017-08-24  8:59       ` hch
  0 siblings, 0 replies; 122+ messages in thread
From: hch @ 2017-08-24  8:59 UTC (permalink / raw)


On Wed, Aug 23, 2017 at 06:21:55PM +0000, Bart Van Assche wrote:
> On Wed, 2017-08-23 at 19:58 +0200, Christoph Hellwig wrote:
> > +static blk_qc_t nvme_make_request(struct request_queue *q, struct bio *bio)
> > +{
> > +	struct nvme_ns_head *head = q->queuedata;
> > +	struct nvme_ns *ns;
> > +	blk_qc_t ret = BLK_QC_T_NONE;
> > +	int srcu_idx;
> > +
> > +	srcu_idx = srcu_read_lock(&head->srcu);
> > +	ns = srcu_dereference(head->current_path, &head->srcu);
> > +	if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
> > +		ns = nvme_find_path(head);
> > +	if (likely(ns)) {
> > +		bio->bi_disk = ns->disk;
> > +		bio->bi_opf |= REQ_FAILFAST_TRANSPORT;
> > +		ret = generic_make_request_fast(bio);
> > +	} else if (!list_empty_careful(&head->list)) {
> > +		printk_ratelimited("no path available - requeuing I/O\n");
> > +
> > +		spin_lock_irq(&head->requeue_lock);
> > +		bio_list_add(&head->requeue_list, bio);
> > +		spin_unlock_irq(&head->requeue_lock);
> > +	} else {
> > +		printk_ratelimited("no path - failing I/O\n");
> > +
> > +		bio->bi_status = BLK_STS_IOERR;
> > +		bio_endio(bio);
> > +	}
> > +
> > +	srcu_read_unlock(&head->srcu, srcu_idx);
> > +	return ret;
> > +}
> 
> Hello Christoph,
> 
> Since generic_make_request_fast() returns BLK_STS_AGAIN for a dying path:
> can the same kind of soft lockups occur with the NVMe multipathing code as
> with the current upstream device mapper multipathing code? See e.g.
> "[PATCH 3/7] dm-mpath: Do not lock up a CPU with requeuing activity"
> (https://www.redhat.com/archives/dm-devel/2017-August/msg00124.html).

I suspect the code is not going to hit it because we check the controller
state before trying to queue I/O on the lower queue.  But if you point
me to a good reproducer test case I'd like to check.

Also does the "single queue" case in your mail refer to the old
request code?  nvme only uses blk-mq so it would not hit that.
But either way I think get_request should be fixed to return
BLK_STS_IOERR if the queue is dying instead of BLK_STS_AGAIN.

> Another question about this code is what will happen if
> generic_make_request_fast() returns BLK_STS_AGAIN and the submit_bio() or
> generic_make_request() caller ignores the return value of the called
> function? A quick grep revealed that there is plenty of code that ignores
> the return value of these last two functions.

generic_make_request and generic_make_request_fast only return
the polling cookie (blk_qc_t), not a block status.  Note that we do
not use blk_get_request / blk_mq_alloc_request for the request allocation
of the request on the lower device, so unless the caller passed REQ_NOWAIT
and is able to handle BLK_STS_AGAIN we won't ever return it.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-24  8:59       ` hch
@ 2017-08-24 20:17         ` Bart Van Assche
  -1 siblings, 0 replies; 122+ messages in thread
From: Bart Van Assche @ 2017-08-24 20:17 UTC (permalink / raw)
  To: hch; +Cc: keith.busch, linux-nvme, linux-block, axboe, sagi

On Thu, 2017-08-24 at 10:59 +0200, hch@lst.de wrote:
> On Wed, Aug 23, 2017 at 06:21:55PM +0000, Bart Van Assche wrote:
> > Since generic_make_request_fast() returns BLK_STS_AGAIN for a dying path:
> > can the same kind of soft lockups occur with the NVMe multipathing code as
> > with the current upstream device mapper multipathing code? See e.g.
> > "[PATCH 3/7] dm-mpath: Do not lock up a CPU with requeuing activity"
> > (https://www.redhat.com/archives/dm-devel/2017-August/msg00124.html).
> 
> I suspect the code is not going to hit it because we check the controller
> state before trying to queue I/O on the lower queue.  But if you point
> me to a good reproducer test case I'd like to check.

For NVMe over RDMA, how about the simulate_network_failure_loop() function in
https://github.com/bvanassche/srp-test/blob/master/lib/functions? It simulates
a network failure by writing into the reset_controller sysfs attribute.

> Also does the "single queue" case in your mail refer to the old
> request code?  nvme only uses blk-mq so it would not hit that.
> But either way I think get_request should be fixed to return
> BLK_STS_IOERR if the queue is dying instead of BLK_STS_AGAIN.

The description in the patch I referred to indeed refers to the old request
code in the block layer. When I prepared that patch I had analyzed the
behavior of the old request code only.

Bart.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 10/10] nvme: implement multipath access to nvme subsystems
@ 2017-08-24 20:17         ` Bart Van Assche
  0 siblings, 0 replies; 122+ messages in thread
From: Bart Van Assche @ 2017-08-24 20:17 UTC (permalink / raw)


On Thu, 2017-08-24 at 10:59 +0200, hch@lst.de wrote:
> On Wed, Aug 23, 2017 at 06:21:55PM +0000, Bart Van Assche wrote:
> > Since generic_make_request_fast() returns BLK_STS_AGAIN for a dying path:
> > can the same kind of soft lockups occur with the NVMe multipathing code as
> > with the current upstream device mapper multipathing code? See e.g.
> > "[PATCH 3/7] dm-mpath: Do not lock up a CPU with requeuing activity"
> > (https://www.redhat.com/archives/dm-devel/2017-August/msg00124.html).
> 
> I suspect the code is not going to hit it because we check the controller
> state before trying to queue I/O on the lower queue.  But if you point
> me to a good reproducer test case I'd like to check.

For NVMe over RDMA, how about the simulate_network_failure_loop() function in
https://github.com/bvanassche/srp-test/blob/master/lib/functions? It simulates
a network failure by writing into the reset_controller sysfs attribute.

> Also does the "single queue" case in your mail refer to the old
> request code?  nvme only uses blk-mq so it would not hit that.
> But either way I think get_request should be fixed to return
> BLK_STS_IOERR if the queue is dying instead of BLK_STS_AGAIN.

The description in the patch I referred to indeed refers to the old request
code in the block layer. When I prepared that patch I had analyzed the
behavior of the old request code only.

Bart.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 01/10] nvme: report more detailed status codes to the block layer
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  6:06     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:06 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-block, linux-nvme

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 01/10] nvme: report more detailed status codes to the block layer
@ 2017-08-28  6:06     ` Sagi Grimberg
  0 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:06 UTC (permalink / raw)


Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  6:06     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:06 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context
@ 2017-08-28  6:06     ` Sagi Grimberg
  0 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:06 UTC (permalink / raw)


Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 03/10] nvme: remove unused struct nvme_ns fields
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  6:07     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:07 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 03/10] nvme: remove unused struct nvme_ns fields
@ 2017-08-28  6:07     ` Sagi Grimberg
  0 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:07 UTC (permalink / raw)


Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 04/10] nvme: remove nvme_revalidate_ns
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  6:12     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:12 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 04/10] nvme: remove nvme_revalidate_ns
@ 2017-08-28  6:12     ` Sagi Grimberg
  0 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:12 UTC (permalink / raw)


Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  6:17     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:17 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

> Instead validate that these identifiers do not change, as that is
> prohibited by the specification.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/nvme/host/core.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 157dbb7b328d..179ade01745b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1236,6 +1236,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>   	struct nvme_ns *ns = disk->private_data;
>   	struct nvme_ctrl *ctrl = ns->ctrl;
>   	struct nvme_id_ns *id;
> +	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
> +	uuid_t uuid = uuid_null;
>   	int ret = 0;
>   
>   	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
> @@ -1252,7 +1254,15 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>   		goto out;
>   	}
>   
> -	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
> +	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
> +	if (!uuid_equal(&ns->uuid, &uuid) ||
> +	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
> +	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {

Shouldn't uuid,nguid,eui64 record the previous values prior to calling
nvme_report_ns_ids?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
@ 2017-08-28  6:17     ` Sagi Grimberg
  0 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:17 UTC (permalink / raw)


> Instead validate that these identifiers do not change, as that is
> prohibited by the specification.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/nvme/host/core.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 157dbb7b328d..179ade01745b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1236,6 +1236,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>   	struct nvme_ns *ns = disk->private_data;
>   	struct nvme_ctrl *ctrl = ns->ctrl;
>   	struct nvme_id_ns *id;
> +	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
> +	uuid_t uuid = uuid_null;
>   	int ret = 0;
>   
>   	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
> @@ -1252,7 +1254,15 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>   		goto out;
>   	}
>   
> -	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
> +	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
> +	if (!uuid_equal(&ns->uuid, &uuid) ||
> +	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
> +	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {

Shouldn't uuid,nguid,eui64 record the previous values prior to calling
nvme_report_ns_ids?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 06/10] nvme: track subsystems
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  6:22     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:22 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

This looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 06/10] nvme: track subsystems
@ 2017-08-28  6:22     ` Sagi Grimberg
  0 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:22 UTC (permalink / raw)


This looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
  2017-08-28  6:17     ` Sagi Grimberg
@ 2017-08-28  6:23       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-28  6:23 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-nvme, linux-block

On Mon, Aug 28, 2017 at 09:17:44AM +0300, Sagi Grimberg wrote:
>> +	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
>> +	uuid_t uuid = uuid_null;
>>   	int ret = 0;
>>     	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
>> @@ -1252,7 +1254,15 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>>   		goto out;
>>   	}
>>   -	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, 
>> &ns->uuid);
>> +	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
>> +	if (!uuid_equal(&ns->uuid, &uuid) ||
>> +	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
>> +	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
>
> Shouldn't uuid,nguid,eui64 record the previous values prior to calling
> nvme_report_ns_ids?

The previous values are still in the namespace structure, and we use
the on-stack variables for the current ones.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
@ 2017-08-28  6:23       ` Christoph Hellwig
  0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-28  6:23 UTC (permalink / raw)


On Mon, Aug 28, 2017 at 09:17:44AM +0300, Sagi Grimberg wrote:
>> +	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
>> +	uuid_t uuid = uuid_null;
>>   	int ret = 0;
>>     	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
>> @@ -1252,7 +1254,15 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>>   		goto out;
>>   	}
>>   -	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, 
>> &ns->uuid);
>> +	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
>> +	if (!uuid_equal(&ns->uuid, &uuid) ||
>> +	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
>> +	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
>
> Shouldn't uuid,nguid,eui64 record the previous values prior to calling
> nvme_report_ns_ids?

The previous values are still in the namespace structure, and we use
the on-stack variables for the current ones.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
  2017-08-28  6:23       ` Christoph Hellwig
@ 2017-08-28  6:32         ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block


>> Shouldn't uuid,nguid,eui64 record the previous values prior to calling
>> nvme_report_ns_ids?
> 
> The previous values are still in the namespace structure, and we use
> the on-stack variables for the current ones.

Of course, misread that one, thanks.


* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  6:51     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  6:51 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block



On 23/08/17 20:58, Christoph Hellwig wrote:
> Introduce a new struct nvme_ns_head [1] that holds information about
> an actual namespace, unlike struct nvme_ns, which only holds the
> per-controller namespace information.  For private namespaces there
> is a 1:1 relation of the two, but for shared namespaces this lets us
> discover all the paths to it.  For now only the identifiers are moved
> to the new structure, but most of the information in struct nvme_ns
> should eventually move over.
> 
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.

I haven't read the later patches yet, but what requires sleep in the
path selection?
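For context, the submission-path shape the cover letter implies is roughly the following (a kernel-API sketch assumed from the series description, not copied from the posted patches; nvme_find_path is a hypothetical helper). Holding the SRCU read section across the actual resubmission, which may sleep, is presumably what rules out plain RCU:

```c
/* Assumed shape, not from the posted patches: the multipath
 * make_request handler stays inside the SRCU read section while it
 * picks the first live sibling and resubmits the bio; because the
 * resubmission may sleep, sleepable SRCU is needed instead of RCU. */
static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
					  struct bio *bio)
{
	struct nvme_ns_head *head = q->queuedata;
	struct nvme_ns *ns;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	ns = nvme_find_path(head);	/* hypothetical: first live sibling */
	if (ns)
		generic_make_request_fast(bio);	/* retargeted at ns->queue;
						 * details omitted; may sleep */
	srcu_read_unlock(&head->srcu, srcu_idx);
	return BLK_QC_T_NONE;
}
```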

> 
> [1] comments welcome if you have a better name for it, the current one is
>      horrible.  One idea would be to rename the current struct nvme_ns
>      to struct nvme_ns_link or similar and use the nvme_ns name for the
>      new structure.  But that would involve a lot of churn.

maybe nvme_ns_primary?

> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/nvme/host/core.c     | 218 +++++++++++++++++++++++++++++++++++--------
>   drivers/nvme/host/lightnvm.c |  14 +--
>   drivers/nvme/host/nvme.h     |  26 +++++-
>   3 files changed, 208 insertions(+), 50 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 8884000dfbdd..abc5911a8a66 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -249,10 +249,28 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>   }
>   EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
>   
> +static void nvme_destroy_ns_head(struct kref *ref)
> +{
> +	struct nvme_ns_head *head =
> +		container_of(ref, struct nvme_ns_head, ref);
> +
> +	list_del_init(&head->entry);
> +	cleanup_srcu_struct(&head->srcu);
> +	kfree(head);
> +}
> +
> +static void nvme_put_ns_head(struct nvme_ns_head *head)
> +{
> +	kref_put(&head->ref, nvme_destroy_ns_head);
> +}
> +
>   static void nvme_free_ns(struct kref *kref)
>   {
>   	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>   
> +	if (ns->head)
> +		nvme_put_ns_head(ns->head);
> +
>   	if (ns->ndev)
>   		nvme_nvm_unregister(ns);
>   
> @@ -422,7 +440,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
>   {
>   	memset(cmnd, 0, sizeof(*cmnd));
>   	cmnd->common.opcode = nvme_cmd_flush;
> -	cmnd->common.nsid = cpu_to_le32(ns->ns_id);
> +	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
>   }
>   
>   static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
> @@ -453,7 +471,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
>   
>   	memset(cmnd, 0, sizeof(*cmnd));
>   	cmnd->dsm.opcode = nvme_cmd_dsm;
> -	cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
> +	cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id);
>   	cmnd->dsm.nr = cpu_to_le32(segments - 1);
>   	cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
>   
> @@ -492,7 +510,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>   
>   	memset(cmnd, 0, sizeof(*cmnd));
>   	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
> -	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
> +	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
>   	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
>   	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>   
> @@ -977,7 +995,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
>   	memset(&c, 0, sizeof(c));
>   	c.rw.opcode = io.opcode;
>   	c.rw.flags = io.flags;
> -	c.rw.nsid = cpu_to_le32(ns->ns_id);
> +	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
>   	c.rw.slba = cpu_to_le64(io.slba);
>   	c.rw.length = cpu_to_le16(io.nblocks);
>   	c.rw.control = cpu_to_le16(io.control);
> @@ -1041,7 +1059,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
>   	switch (cmd) {
>   	case NVME_IOCTL_ID:
>   		force_successful_syscall_return();
> -		return ns->ns_id;
> +		return ns->head->ns_id;
>   	case NVME_IOCTL_ADMIN_CMD:
>   		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
>   	case NVME_IOCTL_IO_CMD:
> @@ -1248,7 +1266,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>   		return -ENODEV;
>   	}
>   
> -	id = nvme_identify_ns(ctrl, ns->ns_id);
> +	id = nvme_identify_ns(ctrl, ns->head->ns_id);
>   	if (!id)
>   		return -ENODEV;
>   
> @@ -1257,12 +1275,12 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>   		goto out;
>   	}
>   
> -	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
> -	if (!uuid_equal(&ns->uuid, &uuid) ||
> -	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
> -	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
> +	nvme_report_ns_ids(ctrl, ns->head->ns_id, id, eui64, nguid, &uuid);
> +	if (!uuid_equal(&ns->head->uuid, &uuid) ||
> +	    memcmp(&ns->head->nguid, &nguid, sizeof(ns->head->nguid)) ||
> +	    memcmp(&ns->head->eui64, &eui64, sizeof(ns->head->eui64))) {
>   		dev_err(ctrl->device,
> -			"identifiers changed for nsid %d\n", ns->ns_id);
> +			"identifiers changed for nsid %d\n", ns->head->ns_id);
>   		ret = -ENODEV;
>   	}
>   
> @@ -1303,7 +1321,7 @@ static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
>   
>   	memset(&c, 0, sizeof(c));
>   	c.common.opcode = op;
> -	c.common.nsid = cpu_to_le32(ns->ns_id);
> +	c.common.nsid = cpu_to_le32(ns->head->ns_id);
>   	c.common.cdw10[0] = cpu_to_le32(cdw10);
>   
>   	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
> @@ -1812,6 +1830,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
>   	if (!subsys)
>   		return -ENOMEM;
>   	INIT_LIST_HEAD(&subsys->ctrls);
> +	INIT_LIST_HEAD(&subsys->nsheads);
>   	kref_init(&subsys->ref);
>   	nvme_init_subnqn(subsys, ctrl, id);
>   	mutex_init(&subsys->lock);
> @@ -2132,14 +2151,14 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
>   	int serial_len = sizeof(ctrl->serial);
>   	int model_len = sizeof(ctrl->model);
>   
> -	if (!uuid_is_null(&ns->uuid))
> -		return sprintf(buf, "uuid.%pU\n", &ns->uuid);
> +	if (!uuid_is_null(&ns->head->uuid))
> +		return sprintf(buf, "uuid.%pU\n", &ns->head->uuid);
>   
> -	if (memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
> -		return sprintf(buf, "eui.%16phN\n", ns->nguid);
> +	if (memchr_inv(ns->head->nguid, 0, sizeof(ns->head->nguid)))
> +		return sprintf(buf, "eui.%16phN\n", ns->head->nguid);
>   
> -	if (memchr_inv(ns->eui, 0, sizeof(ns->eui)))
> -		return sprintf(buf, "eui.%8phN\n", ns->eui);
> +	if (memchr_inv(ns->head->eui64, 0, sizeof(ns->head->eui64)))
> +		return sprintf(buf, "eui.%8phN\n", ns->head->eui64);
>   
>   	while (serial_len > 0 && (ctrl->serial[serial_len - 1] == ' ' ||
>   				  ctrl->serial[serial_len - 1] == '\0'))
> @@ -2149,7 +2168,8 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
>   		model_len--;
>   
>   	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", ctrl->vid,
> -		serial_len, ctrl->serial, model_len, ctrl->model, ns->ns_id);
> +		serial_len, ctrl->serial, model_len, ctrl->model,
> +		ns->head->ns_id);
>   }
>   static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
>   
> @@ -2157,7 +2177,7 @@ static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
>   			  char *buf)
>   {
>   	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> -	return sprintf(buf, "%pU\n", ns->nguid);
> +	return sprintf(buf, "%pU\n", ns->head->nguid);
>   }
>   static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
>   
> @@ -2169,12 +2189,12 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>   	/* For backward compatibility expose the NGUID to userspace if
>   	 * we have no UUID set
>   	 */
> -	if (uuid_is_null(&ns->uuid)) {
> +	if (uuid_is_null(&ns->head->uuid)) {
>   		printk_ratelimited(KERN_WARNING
>   				   "No UUID available providing old NGUID\n");
> -		return sprintf(buf, "%pU\n", ns->nguid);
> +		return sprintf(buf, "%pU\n", ns->head->nguid);
>   	}
> -	return sprintf(buf, "%pU\n", &ns->uuid);
> +	return sprintf(buf, "%pU\n", &ns->head->uuid);
>   }
>   static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
>   
> @@ -2182,7 +2202,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
>   								char *buf)
>   {
>   	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> -	return sprintf(buf, "%8phd\n", ns->eui);
> +	return sprintf(buf, "%8phd\n", ns->head->eui64);
>   }
>   static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
>   
> @@ -2190,7 +2210,7 @@ static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
>   								char *buf)
>   {
>   	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> -	return sprintf(buf, "%d\n", ns->ns_id);
> +	return sprintf(buf, "%d\n", ns->head->ns_id);
>   }
>   static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
>   
> @@ -2210,16 +2230,16 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
>   	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>   
>   	if (a == &dev_attr_uuid.attr) {
> -		if (uuid_is_null(&ns->uuid) ||
> -		    !memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
> +		if (uuid_is_null(&ns->head->uuid) ||
> +		    !memchr_inv(ns->head->nguid, 0, sizeof(ns->head->nguid)))
>   			return 0;
>   	}
>   	if (a == &dev_attr_nguid.attr) {
> -		if (!memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
> +		if (!memchr_inv(ns->head->nguid, 0, sizeof(ns->head->nguid)))
>   			return 0;
>   	}
>   	if (a == &dev_attr_eui.attr) {
> -		if (!memchr_inv(ns->eui, 0, sizeof(ns->eui)))
> +		if (!memchr_inv(ns->head->eui64, 0, sizeof(ns->head->eui64)))
>   			return 0;
>   	}
>   	return a->mode;
> @@ -2357,12 +2377,122 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
>   	NULL,
>   };
>   
> +static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
> +		unsigned nsid)
> +{
> +	struct nvme_ns_head *h;
> +
> +	lockdep_assert_held(&subsys->lock);
> +
> +	list_for_each_entry(h, &subsys->nsheads, entry) {
> +		if (h->ns_id == nsid && kref_get_unless_zero(&h->ref))
> +			return h;
> +	}
> +
> +	return NULL;
> +}
> +
> +static int __nvme_check_ids(struct nvme_subsystem *subsys,
> +		struct nvme_ns_head *new)
> +{
> +	struct nvme_ns_head *h;
> +
> +	lockdep_assert_held(&subsys->lock);
> +
> +	list_for_each_entry(h, &subsys->nsheads, entry) {
> +		if ((!uuid_is_null(&new->uuid) &&
> +		     uuid_equal(&new->uuid, &h->uuid)) ||
> +		    (memchr_inv(new->nguid, 0, sizeof(new->nguid)) &&
> +		     memcmp(&new->nguid, &h->nguid, sizeof(new->nguid))) ||
> +		    (memchr_inv(new->eui64, 0, sizeof(new->eui64)) &&
> +		     memcmp(&new->eui64, &h->eui64, sizeof(new->eui64))))
> +			return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
> +		unsigned nsid, struct nvme_id_ns *id)
> +{
> +	struct nvme_ns_head *head;
> +	int ret = -ENOMEM;
> +
> +	head = kzalloc(sizeof(*head), GFP_KERNEL);
> +	if (!head)
> +		goto out;
> +
> +	INIT_LIST_HEAD(&head->list);
> +	head->ns_id = nsid;
> +	init_srcu_struct(&head->srcu);
> +	kref_init(&head->ref);
> +
> +	nvme_report_ns_ids(ctrl, nsid, id, head->eui64, head->nguid,
> +			&head->uuid);
> +
> +	ret = __nvme_check_ids(ctrl->subsys, head);
> +	if (ret) {
> +		dev_err(ctrl->device,
> +			"duplicate IDs for nsid %d\n", nsid);
> +		goto out_free_head;
> +	}
> +
> +	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
> +	return head;
> +out_free_head:
> +	cleanup_srcu_struct(&head->srcu);
> +	kfree(head);
> +out:
> +	return ERR_PTR(ret);
> +}
> +
> +static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
> +		struct nvme_id_ns *id)
> +{
> +	struct nvme_ctrl *ctrl = ns->ctrl;
> +	bool is_shared = id->nmic & (1 << 0);
> +	struct nvme_ns_head *head = NULL;
> +	int ret = 0;
> +
> +	mutex_lock(&ctrl->subsys->lock);
> +	if (is_shared)
> +		head = __nvme_find_ns_head(ctrl->subsys, nsid);
> +	if (!head) {
> +		head = nvme_alloc_ns_head(ctrl, nsid, id);
> +		if (IS_ERR(head)) {
> +			ret = PTR_ERR(head);
> +			goto out_unlock;
> +		}
> +	} else {
> +		u8 eui64[8] = { 0 }, nguid[16] = { 0 };
> +		uuid_t uuid = uuid_null;
> +
> +		nvme_report_ns_ids(ctrl, nsid, id, eui64, nguid, &uuid);
> +		if (!uuid_equal(&head->uuid, &uuid) ||
> +		    memcmp(&head->nguid, &nguid, sizeof(head->nguid)) ||
> +		    memcmp(&head->eui64, &eui64, sizeof(head->eui64))) {

Suggestion: given that this matching pattern recurs in several places,
would it be better to move it to a helper such as nvme_ns_match_id()?

>   
> +/*
> + * Anchor structure for namespaces.  There is one for each namespace in a
> + * NVMe subsystem that any of our controllers can see, and the namespace
> + * structure for each controller is chained off it.  For private namespaces
> + * there is a 1:1 relation to our namespace structures, that is ->list
> + * only ever has a single entry for private namespaces.
> + */
> +struct nvme_ns_head {
> +	struct list_head	list;

Maybe siblings is a better name than list,
and the nvme_ns list_head should be called
sibling_entry (or just sibling)?

> +	struct srcu_struct      srcu;
> +	unsigned		ns_id;
> +	u8			eui64[8];
> +	u8			nguid[16];
> +	uuid_t			uuid;
> +	struct list_head	entry;
> +	struct kref		ref;
> +};
> +
>   struct nvme_ns {
>   	struct list_head list;
>   
>   	struct nvme_ctrl *ctrl;
>   	struct request_queue *queue;
>   	struct gendisk *disk;
> +	struct list_head siblings;
>   	struct nvm_dev *ndev;
>   	struct kref kref;
> +	struct nvme_ns_head *head;
>   	int instance;
>   
> -	u8 eui[8];
> -	u8 nguid[16];
> -	uuid_t uuid;
> -
> -	unsigned ns_id;
>   	int lba_shift;
>   	u16 ms;
>   	u16 sgs;
> 

Overall this looks good.


* Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  7:00     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  7:00 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

> This helper allows reinserting a bio into a new queue without much
> overhead, but requires all queue limits to be the same for the upper
> and lower queues, and it does not provide any recursion preventions,
> which requires the caller to not split the bio.

Isn't the same limits constraint too restrictive?

Say I have two paths to the same namespace via two different HBAs, each
with its own virt_boundary capability, for example. Wouldn't that
require us to split the failover bio?

> +/*
> + * Fast-path version of generic_make_request.

generic_make_request is also called in the fast-path, maybe reword it
to: "Fast version of generic_make_request"


* Re: [PATCH 09/10] blk-mq: add a blk_steal_bios helper
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  7:04     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  7:04 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28  7:23     ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28  7:23 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block


> This patch adds initial multipath support to the nvme driver.  For each
> namespace we create a new block device node, which can be used to access
> that namespace through any of the controllers that refer to it.
> 
> Currently we will always send I/O to the first available path, this will
> be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
> 
> The multipath device will go away once all paths to it disappear,
> any delay to keep it alive needs to be implemented at the controller
> level.
> 
> TODO: implement sysfs interfaces for the new subsystem and
> subsystem-namespace object.  Unless we can come up with something
> better than sysfs here..
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Christoph,

This is really taking a lot into the nvme driver. I'm not sure if
this approach will be used in other block drivers, but would it
make sense to place the block_device node creation, the make_request
and failover logic, and maybe the path selection in the block layer,
leaving just the construction of the path mappings in nvme?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-28  6:51     ` Sagi Grimberg
@ 2017-08-28  8:50       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-28  8:50 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-nvme, linux-block

On Mon, Aug 28, 2017 at 09:51:14AM +0300, Sagi Grimberg wrote:
>> To allow lockless path lookup the list of nvme_ns structures per
>> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
>> structure through call_srcu.
>
> I haven't read the later patches yet, but what requires sleep in the
> path selection?

->make_request is allowed to sleep, and often will.

>> +	} else {
>> +		u8 eui64[8] = { 0 }, nguid[16] = { 0 };
>> +		uuid_t uuid = uuid_null;
>> +
>> +		nvme_report_ns_ids(ctrl, nsid, id, eui64, nguid, &uuid);
>> +		if (!uuid_equal(&head->uuid, &uuid) ||
>> +		    memcmp(&head->nguid, &nguid, sizeof(head->nguid)) ||
>> +		    memcmp(&head->eui64, &eui64, sizeof(head->eui64))) {
>
> Suggestion, given that this matching pattern returns in several places
> would it be better to move it to nvme_ns_match_id()?

I'll look into it.  Maybe we'll need a nvme_ns_ids structure to avoid
having tons of parameters, though.
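The nvme_ns_ids idea floated here could look roughly like the sketch below. This is a hedged userspace model, not driver API: the field sizes follow the Identify Namespace data, and the struct and helper names are only the ones proposed in this exchange.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the suggested nvme_ns_ids container: bundle the three
 * namespace identifiers into one struct instead of passing eui64,
 * nguid and uuid around as separate parameters.
 */
struct nvme_ns_ids {
	uint8_t eui64[8];	/* IEEE Extended Unique Identifier */
	uint8_t nguid[16];	/* Namespace Globally Unique Identifier */
	uint8_t uuid[16];	/* Namespace UUID (NVMe 1.3 and later) */
};

/* Two controller-level namespaces refer to the same actual namespace
 * only if every reported identifier matches. */
static bool nvme_ns_match_id(const struct nvme_ns_ids *a,
			     const struct nvme_ns_ids *b)
{
	return memcmp(a->eui64, b->eui64, sizeof(a->eui64)) == 0 &&
	       memcmp(a->nguid, b->nguid, sizeof(a->nguid)) == 0 &&
	       memcmp(a->uuid, b->uuid, sizeof(a->uuid)) == 0;
}
```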

>>   +/*
>> + * Anchor structure for namespaces.  There is one for each namespace in a
>> + * NVMe subsystem that any of our controllers can see, and the namespace
>> + * structure for each controller is chained off it.  For private namespaces
>> + * there is a 1:1 relation to our namespace structures, that is ->list
>> + * only ever has a single entry for private namespaces.
>> + */
>> +struct nvme_ns_head {
>> +	struct list_head	list;
>
> Maybe siblings is a better name than list,
> and the nvme_ns list_head should be called
> sibling_entry (or just sibling)?

Yeah.


* Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
  2017-08-28  7:00     ` Sagi Grimberg
@ 2017-08-28  8:54       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-28  8:54 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-nvme, linux-block

On Mon, Aug 28, 2017 at 10:00:35AM +0300, Sagi Grimberg wrote:
>> This helper allows reinserting a bio into a new queue without much
>> overhead, but requires all queue limits to be the same for the upper
>> and lower queues, and it does not provide any recursion prevention,
>> which requires the caller to not split the bio.
>
> Isn't the same limits constraint too restrictive?
>
> Say I have two paths to the same namespace via two different HBAs, each
> with its own virt_boundary capability for example? That would require us
> to split failover bio wouldn't it?

Uh oh - different transports for the same subsystem will be interesting.
For one, it's not specified anywhere, so I'd like to kick off a discussion
on the working group mailing list about it.

That being said, ->make_request basically doesn't care about actual
limits at all; it mostly cares about supported features (e.g. discard, fua,
etc.).  So I think a lot of these limits could probably be lifted,
but I'd need to add the check back to generic_make_request_checks.

>> +/*
>> + * Fast-path version of generic_make_request.
>
> generic_make_request is also called in the fast-path, maybe reword it
> to: "Fast version of generic_make_request"

Yeah.  Maybe generic_make_request_direct or direct_make_request
is a better name as it describes the recursion avoidance bypassing
a little better.


* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-28  7:23     ` Sagi Grimberg
@ 2017-08-28  9:06       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-28  9:06 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-nvme, linux-block

On Mon, Aug 28, 2017 at 10:23:33AM +0300, Sagi Grimberg wrote:
> This is really taking a lot into the nvme driver.

This patch adds about 100 lines of code, out of which about a quarter is the
parsing of status codes.  I don't think it's all that much.

> I'm not sure if
> this approach will be used in other block driver, but would it
> make sense to place the block_device node creation,

I really want to reduce the amount of boilerplate code for creating
a block queue and I have some ideas for that.  But that's not in
any way related to this multi-path code.

> the make_request

That basically just does a lookup (in a data structure we need for
tracking the siblings inside nvme anyway) and then submits the
I/O again.  All the generic code is in generic_make_request_fast.
There are two more things which could move into common code, one
reasonable, the other not:

 (1) the whole requeue_list logic.  I thought about moving this
     to the block layer as I'd also have some other use cases for it,
     but decided to wait until those materialize to see if I'd really
     need it.
 (2) turning the logic inside out, e.g. providing a generic make_request
     and supplying a find_path callback to it.  This would require (1)
     but even then we'd save basically no code, add an indirect call
     in the fast path and make things harder to read.  This actually
     is how I started out and didn't like it.
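The requeue_list logic in (1) is essentially a per-ns_head chain that failed bios are parked on and later stolen wholesale for resubmission. A minimal userspace model of just that list handling (the 'struct bio' here is a stand-in, and the locking that would protect the list is omitted):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Minimal model of the requeue_list handling: bios that failed on one
 * path are chained up, and a requeue worker later steals the whole
 * chain in one shot and resubmits each bio once a usable path exists.
 */
struct bio {
	struct bio *bi_next;
};

static void requeue_bio(struct bio **list, struct bio *bio)
{
	bio->bi_next = *list;	/* push; real code holds a lock here */
	*list = bio;
}

/* Take everything off the list at once, as the requeue work would
 * before resubmitting each bio. */
static struct bio *steal_bios(struct bio **list)
{
	struct bio *all = *list;

	*list = NULL;
	return all;
}

static int bio_count(const struct bio *b)
{
	int n = 0;

	for (; b; b = b->bi_next)
		n++;
	return n;
}
```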

> and failover logic

The big thing about the failover logic is looking at the detailed error
codes and changing the controller state, all of which is fundamentally
driver specific.  The only thing that could reasonably be common is
the requeue_list handling as mentioned above.  Note that this will
become even more integrated with the nvme driver once ANA support
lands.
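The shape of that decision can be sketched as follows. The Do Not Retry bit (0x4000 in the kernel's shifted status representation) is real NVMe; treating status code type 0x3 as "path related" follows the later path/ANA work and is illustrative here, not this patch set's exact classification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the failover decision: retry a failed command on another
 * path only if the error looks like a path problem rather than a
 * namespace problem, and the controller did not forbid retries.
 */
#define NVME_SC_DNR	0x4000	/* Do Not Retry bit in the status field */

static bool nvme_req_needs_failover(uint16_t status)
{
	if (status & NVME_SC_DNR)
		return false;	/* controller says: do not retry */
	/* status code type lives in bits 10:8 of the shifted status;
	 * 0x3 is the path-related class (illustrative here) */
	return ((status >> 8) & 0x7) == 0x3;
}
```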

> and maybe the path selection in the block layer
> leaving just the construction of the path mappings in nvme?

The point is that path selection fundamentally depends on your
actual protocol.  E.g. for NVMe we now look at the controller state,
as that's what our keep alives work on, and what reflects the status
in PCIe.  We could try to communicate this up, but it would just lead
to more data structures that get out of sync.  And this will become
much worse with ANA, where we have even more fine-grained state.
In the end path selection right now is less than 20 lines of code.

Path selection is complicated when you want non-trivial path selector
algorithms like round robin or service time, but I'd really avoid having
those in nvme - we already have multiple queues to go beyond the limits
of a single queue and use blk-mq for that, and once we're beyond the
limits of a single transport path for performance reasons I'd much
rather rely on ANA, numa nodes or static partitioning of namespaces
than trying to do dynamic decisions in the fast path.  But if I don't
get what I want there and we'll need more complicated algorithms, they
absolutely should be in common code.  In fact one of my earlier prototypes
moved the dm-mpath path selector algorithms to core block code and
tried to use them - it just turned out enabling them was pretty much
pointless.
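The "less than 20 lines" selector mentioned above amounts to walking the siblings and taking the first live one. A simplified userspace sketch (the real lookup is SRCU protected and keyed on the nvme controller state; the types here are stand-ins):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the trivial selector: walk the sibling namespaces hanging
 * off the ns_head and take the first one whose controller is live.
 */
enum ctrl_state {
	CTRL_LIVE,
	CTRL_RESETTING,
	CTRL_DELETING,
};

struct path {
	enum ctrl_state state;
	struct path *sibling;	/* next path to the same namespace */
};

static struct path *find_path(struct path *head)
{
	struct path *p;

	for (p = head; p; p = p->sibling)
		if (p->state == CTRL_LIVE)
			return p;
	return NULL;	/* no usable path: queue the bio for requeue */
}
```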


* Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
  2017-08-28  8:54       ` Christoph Hellwig
@ 2017-08-28 11:01         ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28 11:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block


>>> This helper allows reinserting a bio into a new queue without much
>>> overhead, but requires all queue limits to be the same for the upper
>>> and lower queues, and it does not provide any recursion prevention,
>>> which requires the caller to not split the bio.
>>
>> Isn't the same limits constraint too restrictive?
>>
>> Say I have two paths to the same namespace via two different HBAs, each
>> with its own virt_boundary capability for example? That would require us
>> to split failover bio wouldn't it?
> 
> Uh oh - different transports for the same subsystem will be intereting.
> For one it's not specified anywhere so I'd like to kick off a discussion
> on the working group mailing list about it.

Indeed that would be interesting, but I wasn't referring to different
transports; even the same transport can have different capabilities
(for example CX4 and CX4 devices on the same host).

> That being said, ->make_request basically doesn't care about actual
> limits at all; it mostly cares about supported features (e.g. discard, fua,
> etc.).  So I think a lot of these limits could probably be lifted,
> but I'd need to add the check back to generic_make_request_checks.

Different virt_boundary capabilities will trigger bio splits which
can make make_request blocking (due to lack of tags).


* Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
  2017-08-28 11:01         ` Sagi Grimberg
@ 2017-08-28 11:54           ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-28 11:54 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-nvme, linux-block

On Mon, Aug 28, 2017 at 02:01:16PM +0300, Sagi Grimberg wrote:
>> That being said, ->make_request basically doesn't care about actual
>> limits at all; it mostly cares about supported features (e.g. discard, fua,
>> etc.).  So I think a lot of these limits could probably be lifted,
>> but I'd need to add the check back to generic_make_request_checks.
>
> Different virt_boundary capabilities will trigger bio splits which
> can make make_request blocking (due to lack of tags).

All the bio splitting is done in blk_queue_split, and other things
related to limits are done even later in blk_mq_make_request when
building the request.  For normal make_request based stacking drivers
none of this matters, although a few drivers do call blk_queue_split
manually from their make_request method.


* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28 12:04     ` javigon
  -1 siblings, 0 replies; 122+ messages in thread
From: javigon @ 2017-08-28 12:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-block, Sagi Grimberg, linux-nvme

> On 23 Aug 2017, at 19.58, Christoph Hellwig <hch@lst.de> wrote:
> 
> Introduce a new struct nvme_ns_head [1] that holds information about
> an actual namespace, unlike struct nvme_ns, which only holds the
> per-controller namespace information.  For private namespaces there
> is a 1:1 relation of the two, but for shared namespaces this lets us
> discover all the paths to it.  For now only the identifiers are moved
> to the new structure, but most of the information in struct nvme_ns
> should eventually move over.
> 
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.
> 
> [1] comments welcome if you have a better name for it, the current one is
>    horrible.  One idea would be to rename the current struct nvme_ns
>    to struct nvme_ns_link or similar and use the nvme_ns name for the
>    new structure.  But that would involve a lot of churn.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/core.c     | 218 +++++++++++++++++++++++++++++++++++--------
> drivers/nvme/host/lightnvm.c |  14 +--

Nothing big here. Looks good.

Reviewed-by: Javier González <javier@cnexlabs.com>



* Re: [PATCH 08/10] block: provide a generic_make_request_fast helper
  2017-08-28 11:54           ` Christoph Hellwig
@ 2017-08-28 12:38             ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28 12:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block


>>> That being said, ->make_request basically doesn't care about actual
>>> limits at all; it mostly cares about supported features (e.g. discard, fua,
>>> etc.).  So I think a lot of these limits could probably be lifted,
>>> but I'd need to add the check back to generic_make_request_checks.
>>
>> Different virt_boundary capabilities will trigger bio splits which
>> can make make_request blocking (due to lack of tags).
> 
> All the bio splitting is done in blk_queue_split, and other things
> related to limits are done even later in blk_mq_make_request when
> building the request.  For normal make_request based stacking drivers
> nothing of this matters, although a few drivers t call blk_queue_split
> manually from their make_request method.

Maybe I misunderstood the changelog comment, but didn't it say that
the caller is not allowed to submit a bio that might split?


* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28 12:41     ` Guan Junxiong
  -1 siblings, 0 replies; 122+ messages in thread
From: Guan Junxiong @ 2017-08-28 12:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, linux-block, Sagi Grimberg,
	linux-nvme, Shenhong (C),
	niuhaoxin



On 2017/8/24 1:58, Christoph Hellwig wrote:
> +static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
> +		struct nvme_id_ns *id)
> +{
> +	struct nvme_ctrl *ctrl = ns->ctrl;
> +	bool is_shared = id->nmic & (1 << 0);
> +	struct nvme_ns_head *head = NULL;
> +	int ret = 0;
> +
> +	mutex_lock(&ctrl->subsys->lock);
> +	if (is_shared)
> +		head = __nvme_find_ns_head(ctrl->subsys, nsid);

If a namespace can be accessed by another subsystem, the above line
will ignore such a namespace.

Or does the NVMe/NVMf specification require that any namespace
can only be accessed by a single subsystem?

More comments after testing will be sent later.

Thanks
Guan Junxiong


* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-28  9:06       ` Christoph Hellwig
@ 2017-08-28 13:40         ` Sagi Grimberg
  -1 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2017-08-28 13:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block


>> the make_request
> 
> That basically just does a lookup (in a data structure we need for
> tracking the siblings inside nvme anyway) and then submits the
> I/O again.  All the generic code is in generic_make_request_fast.
> There are two more things which could move into common code, one
> reasonable, the other not:
> 
>   (1) the whole requeue_list list logic.  I thought about moving this
>       to the block layer as I'd also have some other use cases for it,
>       but decided to wait until those materialize to see if I'd really
>       need it.
>   (2) turning the logic inside out, e.g. providing a generic make_request
>       and supplying a find_path callback to it.  This would require (1)
>       but even then we'd save basically no code, add an indirect call
>       in the fast path and make things harder to read.  This actually
>       is how I started out and didn't like it.
> 
>> and failover logic
> 
> The big thing about the failover logic is looking at the detailed error
> codes and changing the controller state, all of which is fundamentally
> driver specific.  The only thing that could reasonably be common is
> the requeue_list handling as mentioned above.  Note that this will
> become even more integrated with the nvme driver once ANA support
> lands.

Not arguing with you at all, you obviously gave it a lot of thought.

I thought your multipathing code would really live in the block
layer and only require something really basic from nvme (which could
easily be applied on other drivers). But I do understand it might
create a lot of churn.

btw, why are partial completions something that can't be done
without cloning the bio? is it possible to clone the bio once from the
completion flow when you see that you got a partial completion?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-28 13:40         ` Sagi Grimberg
@ 2017-08-28 14:24           ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-28 14:24 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-nvme, linux-block

On Mon, Aug 28, 2017 at 04:40:43PM +0300, Sagi Grimberg wrote:
> I thought your multipathing code would really live in the block
> layer and only require something really basic from nvme (which could
> easily be applied on other drivers). But I do understand it might
> create a lot of churn.

The earlier versions did a lot more in common code, but I gradually
moved away from that:

 - first I didn't have a separate queue, but just bounced I/O between
   sibling queues.  So there was no new make_request based queue,
   and we had to track the relations in the block layer, with a
   callback to check the path status
 - I got rid of the non-trivial path selector

So in the end very little block layer code remained.  But then again
very little nvme code remained either.

> btw, why are partial completions something that can't be done
> without cloning the bio? is it possible to clone the bio once from the
> completion flow when you see that you got a partial completion?

The problem with partial completions is that blk_update_request
completes bios as soon as it gets enough bytes to finish them.

This should not be an unsolvable problem, but it will be a bit messy
at least.  But then again I hope that no new protocols will be designed
with partial completions - SCSI is pretty special in that regard.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-28 12:41     ` Guan Junxiong
@ 2017-08-28 14:30       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-28 14:30 UTC (permalink / raw)
  To: Guan Junxiong
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, linux-block,
	Sagi Grimberg, linux-nvme, Shenhong (C),
	niuhaoxin

On Mon, Aug 28, 2017 at 08:41:23PM +0800, Guan Junxiong wrote:
> If a namespace can be accessed by another subsystem, the above line
> will ignore such a namespace.

And that's intentional.

> Or does the NVMe/NVMf specification constrain that any namespace
> can only be accessed by a single subsystem?

Yes, inside the NVMe spec a Namespace is contained inside a Subsystem.
That doesn't preclude other ways to access the LBAs, but they are outside
the specification (e.g. also exporting them as SCSI).

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 01/10] nvme: report more detailed status codes to the block layer
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28 18:50     ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-28 18:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme

Looks good. 

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28 18:50     ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-28 18:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 03/10] nvme: remove unused struct nvme_ns fields
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28 19:13     ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-28 19:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 04/10] nvme: remove nvme_revalidate_ns
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28 19:14     ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-28 19:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28 19:15     ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-28 19:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-28 19:18     ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-28 19:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-28  6:51     ` Sagi Grimberg
@ 2017-08-28 20:21       ` J Freyensee
  -1 siblings, 0 replies; 122+ messages in thread
From: J Freyensee @ 2017-08-28 20:21 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, linux-block, linux-nvme

On Mon, 2017-08-28 at 09:51 +0300, Sagi Grimberg wrote:
> 
> On 23/08/17 20:58, Christoph Hellwig wrote:
> > Introduce a new struct nvme_ns_head [1] that holds information about
> > an actual namespace, unlike struct nvme_ns, which only holds the
> > per-controller namespace information.  For private namespaces there
> > is a 1:1 relation of the two, but for shared namespaces this lets us
> > discover all the paths to it.  For now only the identifiers are moved
> > to the new structure, but most of the information in struct nvme_ns
> > should eventually move over.
> > 
> > To allow lockless path lookup the list of nvme_ns structures per
> > nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> > structure through call_srcu.
> 
> I haven't read the later patches yet, but what requires sleep in the
> path selection?
> 
> > 
> > [1] comments welcome if you have a better name for it, the current one
> > is
> >      horrible.  One idea would be to rename the current struct nvme_ns
> >      to struct nvme_ns_link or similar and use the nvme_ns name for the
> >      new structure.  But that would involve a lot of churn.
> 
> maybe nvme_ns_primary?

Since it looks like it holds all unique identifier values and should hold
other namespace characteristics later, maybe:

nvme_ns_item?
Or nvme_ns_entry?
Or nvme_ns_element?
Or nvme_ns_unit?
Or nvme_ns_entity?
Or nvme_ns_container?

> > +/*
> > + * Anchor structure for namespaces.  There is one for each namespace
> > in a
> > + * NVMe subsystem that any of our controllers can see, and the
> > namespace
> > + * structure for each controller is chained off it.  For private
> > namespaces
> > + * there is a 1:1 relation to our namespace structures, that is ->list
> > + * only ever has a single entry for private namespaces.
> > + */
> > +struct nvme_ns_head {
> > +	struct list_head	list;
> 
> Maybe siblings is a better name than list,
> and the nvme_ns list_head should be called
> sibling_entry (or just sibling)?


I think that sounds good too.


> 
> > +	struct srcu_struct      srcu;
> > +	unsigned		ns_id;
> > +	u8			eui64[8];
> > +	u8			nguid[16];
> > +	uuid_t			uuid;
> > +	struct list_head	entry;
> > +	struct kref		ref;
> > +};
> > +
> >   struct nvme_ns {
> >   	struct list_head list;
> >   
> >   	struct nvme_ctrl *ctrl;
> >   	struct request_queue *queue;
> >   	struct gendisk *disk;
> > +	struct list_head siblings;
> >   	struct nvm_dev *ndev;
> >   	struct kref kref;
> > +	struct nvme_ns_head *head;
> >   	int instance;
> >   
> > -	u8 eui[8];
> > -	u8 nguid[16];
> > -	uuid_t uuid;
> > -
> > -	unsigned ns_id;
> >   	int lba_shift;
> >   	u16 ms;
> >   	u16 sgs;
> > 
> 
> 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-28 14:30       ` Christoph Hellwig
@ 2017-08-29  2:42         ` Guan Junxiong
  -1 siblings, 0 replies; 122+ messages in thread
From: Guan Junxiong @ 2017-08-29  2:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-block, Sagi Grimberg, linux-nvme,
	Shenhong (C),
	niuhaoxin, Lilangbo

Hi, Christoph

On 2017/8/28 22:30, Christoph Hellwig wrote:
> On Mon, Aug 28, 2017 at 08:41:23PM +0800, Guan Junxiong wrote:
>> If a namespace can be accessed by another subsystem, the above line
>> will ignore such a namespace.
> And that's intentional.
>

As for the __nvme_find_ns_head function, can it look up the namespace
globally, not just in the current subsystem?  Take the hypermetro scenario
for example: two namespaces which should be viewed as the same namespace
by the database application but exist in two different cities.
Some vendors may specify those two namespaces with the same UUID.

In addition, could you add a switch to turn on/off finding namespaces at
a subsystem-wide level or globally?

>> Or does the NVMe/NVMf specification constrain that any namespace
>> can only be accessed by a single subsystem?
> Yes, inside the NVMe spec a Namespace is contained inside a Subsystem.
> That doesn't preclude other ways to access the LBAs, but they are outside
> the specification (e.g. also exporting them as SCSI).

Can a namespace be shared between two subsystems?  If not, for the hypermetro
scenario, we need to keep more information of the storage array in city A
synchronized with the one in city B.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-28  6:51     ` Sagi Grimberg
@ 2017-08-29  6:54       ` Guan Junxiong
  -1 siblings, 0 replies; 122+ messages in thread
From: Guan Junxiong @ 2017-08-29  6:54 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, linux-block, linux-nvme, Shenhong (C), niuhaoxin



On 2017/8/28 14:51, Sagi Grimberg wrote:
> +static int __nvme_check_ids(struct nvme_subsystem *subsys,
> +        struct nvme_ns_head *new)
> +{
> +    struct nvme_ns_head *h;
> +
> +    lockdep_assert_held(&subsys->lock);
> +
> +    list_for_each_entry(h, &subsys->nsheads, entry) {
> +        if ((!uuid_is_null(&new->uuid) &&
> +             uuid_equal(&new->uuid, &h->uuid)) ||
> +            (memchr_inv(new->nguid, 0, sizeof(new->nguid)) &&
> +             memcmp(&new->nguid, &h->nguid, sizeof(new->nguid))) ||

memcmp() -> !memcmp

> +            (memchr_inv(new->eui64, 0, sizeof(new->eui64)) &&
> +             memcmp(&new->eui64, &h->eui64, sizeof(new->eui64))))

memcmp() -> !memcmp

Otherwise in this patch, looks good.
Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com>

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-28 20:21       ` J Freyensee
@ 2017-08-29  8:25         ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-29  8:25 UTC (permalink / raw)
  To: J Freyensee
  Cc: Sagi Grimberg, Christoph Hellwig, Jens Axboe, Keith Busch,
	linux-block, linux-nvme

On Mon, Aug 28, 2017 at 01:21:20PM -0700, J Freyensee wrote:
> > >      horrible.  One idea would be to rename the current struct nvme_ns
> > >      to struct nvme_ns_link or similar and use the nvme_ns name for the
> > >      new structure.  But that would involve a lot of churn.
> > 
> > maybe nvme_ns_primary?
> 
> Since it looks like it holds all unique identifier values and should hold
> other namespace characteristics later, maybe:
> 
> nvme_ns_item?
> Or nvme_ns_entry?
> Or nvme_ns_element?
> Or nvme_ns_unit?
> Or nvme_ns_entity?
> Or nvme_ns_container?

I hate them all (including the current ns_head name :)).

I suspect the only way that would make my taste happy is to call
this new one nvme_ns, but that would lead to a lot of churn.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-28 12:41     ` Guan Junxiong
@ 2017-08-29  8:29       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-29  8:29 UTC (permalink / raw)
  To: Guan Junxiong
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, linux-block,
	Sagi Grimberg, linux-nvme, Shenhong (C),
	niuhaoxin

On Mon, Aug 28, 2017 at 08:41:23PM +0800, Guan Junxiong wrote:
> If a namespace can be accessed by another subsystem, the above line
> will ignore such a namespace.
>
> Or does the NVMe/NVMf specification constrain that any namespace
> can only be accessed by a single subsystem?

A namespace is part of an NVMe subsystem.  You must not reuse the
unique identifiers outside the subsystem scope, or your implementation
will be non-compliant.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 07/10] nvme: track shared namespaces
  2017-08-29  2:42         ` Guan Junxiong
@ 2017-08-29  8:30           ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-29  8:30 UTC (permalink / raw)
  To: Guan Junxiong
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-block,
	Sagi Grimberg, linux-nvme, Shenhong (C),
	niuhaoxin, Lilangbo

On Tue, Aug 29, 2017 at 10:42:18AM +0800, Guan Junxiong wrote:
> As for the __nvme_find_ns_head function, can it lookup the namespace
> globally, not in the current subsytem.

No.

> Take hypermetro scenario for

Please define "hypermetro"

> example, two namespaces which should be viewed as the same namespace
> by the database application but exist in two different cities.
> Some vendors may specify those two namespaces with the same UUID.

Then these vendors are non-compliant IFF the controllers don't belong
to the same subsystem.

> In addition, could you add a switch to turn on/off finding namespaces in
> a subsystem-wide level or globally?

No.

> Can namespace be shared between two subsystem?

No - if you share namespace access you are in the same subsystem.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-29 10:22     ` Guan Junxiong
  -1 siblings, 0 replies; 122+ messages in thread
From: Guan Junxiong @ 2017-08-29 10:22 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, linux-block, Sagi Grimberg, linux-nvme, niuhaoxin,
	Shenhong (C)



On 2017/8/24 1:58, Christoph Hellwig wrote:
> -static inline bool nvme_req_needs_retry(struct request *req)
> +static bool nvme_failover_rq(struct request *req)
>  {
> -	if (blk_noretry_request(req))
> +	struct nvme_ns *ns = req->q->queuedata;
> +	unsigned long flags;
> +
> +	/*
> +	 * Only fail over commands that came in through the multipath
> +	 * aware submissions path.  Note that ns->head might not be set up
> +	 * for commands used during controller initialization, but those
> +	 * must never set REQ_FAILFAST_TRANSPORT.
> +	 */
> +	if (!(req->cmd_flags & REQ_FAILFAST_TRANSPORT))
> +		return false;
> +
> +	switch (nvme_req(req)->status & 0x7ff) {
> +	/*
> +	 * Generic command status:
> +	 */
> +	case NVME_SC_INVALID_OPCODE:
> +	case NVME_SC_INVALID_FIELD:
> +	case NVME_SC_INVALID_NS:
> +	case NVME_SC_LBA_RANGE:
> +	case NVME_SC_CAP_EXCEEDED:
> +	case NVME_SC_RESERVATION_CONFLICT:
> +		return false;
> +
> +	/*
> +	 * I/O command set specific error.  Unfortunately these values are
> +	 * reused for fabrics commands, but those should never get here.
> +	 */
> +	case NVME_SC_BAD_ATTRIBUTES:
> +	case NVME_SC_INVALID_PI:
> +	case NVME_SC_READ_ONLY:
> +	case NVME_SC_ONCS_NOT_SUPPORTED:
> +		WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
> +			nvme_fabrics_command);
> +		return false;
> +
> +	/*
> +	 * Media and Data Integrity Errors:
> +	 */
> +	case NVME_SC_WRITE_FAULT:
> +	case NVME_SC_READ_ERROR:
> +	case NVME_SC_GUARD_CHECK:
> +	case NVME_SC_APPTAG_CHECK:
> +	case NVME_SC_REFTAG_CHECK:
> +	case NVME_SC_COMPARE_FAILED:
> +	case NVME_SC_ACCESS_DENIED:
> +	case NVME_SC_UNWRITTEN_BLOCK:
>  		return false;
> +	}
> +
> +	/* Anything else could be a path failure, so should be retried */
> +	spin_lock_irqsave(&ns->head->requeue_lock, flags);
> +	blk_steal_bios(&ns->head->requeue_list, req);
> +	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> +
> +	nvme_reset_ctrl(ns->ctrl);
> +	kblockd_schedule_work(&ns->head->requeue_work);
> +	return true;
> +}
> +
> +static inline bool nvme_req_needs_retry(struct request *req)
> +{
>  	if (nvme_req(req)->status & NVME_SC_DNR)
>  		return false;
>  	if (jiffies - req->start_time >= req->timeout)
>  		return false;
>  	if (nvme_req(req)->retries >= nvme_max_retries)
>  		return false;
> +	if (nvme_failover_rq(req))
> +		return false;
> +	if (blk_noretry_request(req))
> +		return false;
>  	return true;
>  }

Does this conflict with DM-multipath as currently used for NVMe/NVMf
when a path I/O error occurs?  Such I/O will be retried not only on the
nvme-mpath internal path, but also on the dm-mpath path.

In general, I wonder whether nvme-mpath can co-exist with DM-multipath
in a well-defined fashion.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-29 10:22     ` Guan Junxiong
@ 2017-08-29 14:51       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-29 14:51 UTC (permalink / raw)
  To: Guan Junxiong
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-block,
	Sagi Grimberg, linux-nvme, niuhaoxin, Shenhong (C)

On Tue, Aug 29, 2017 at 06:22:50PM +0800, Guan Junxiong wrote:
> Does this introduce conflicts with current DM-Multipath used for NVMe/NVMeF
> when path IO error occurs?  Such IO will be retried not only on the nvme-mpath
> internal path, but also on the dm-mpath path.

It will not reach back to dm-multipath if we fail over here.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-23 17:58   ` Christoph Hellwig
@ 2017-08-29 14:54     ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-29 14:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Wed, Aug 23, 2017 at 07:58:15PM +0200, Christoph Hellwig wrote:
> +	/* Anything else could be a path failure, so should be retried */
> +	spin_lock_irqsave(&ns->head->requeue_lock, flags);
> +	blk_steal_bios(&ns->head->requeue_list, req);
> +	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> +
> +	nvme_reset_ctrl(ns->ctrl);
> +	kblockd_schedule_work(&ns->head->requeue_work);
> +	return true;
> +}

It appears this isn't going to cause path selection to fail over for
the requeued work.  The bio's bi_disk is unchanged from the failed path
when the requeue_work submits the bio again, so it will use the same
path, right?

It also looks like new submissions will get a new path only from the
fact that the original/primary is being reset. The controller reset
itself seems a bit heavy-handed. Can we just set head->current_path to
the next active controller in the list?


> +static void nvme_requeue_work(struct work_struct *work)
> +{
> +	struct nvme_ns_head *head =
> +		container_of(work, struct nvme_ns_head, requeue_work);
> +	struct bio *bio, *next;
> +
> +	spin_lock_irq(&head->requeue_lock);
> +	next = bio_list_get(&head->requeue_list);
> +	spin_unlock_irq(&head->requeue_lock);
> +
> +	while ((bio = next) != NULL) {
> +		next = bio->bi_next;
> +		bio->bi_next = NULL;
> +		generic_make_request_fast(bio);
> +	}
> +}

Here, I think we need to reevaluate the path (nvme_find_path) and set
bio->bi_disk accordingly.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-29 14:54     ` Keith Busch
@ 2017-08-29 14:55       ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-08-29 14:55 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Tue, Aug 29, 2017 at 10:54:17AM -0400, Keith Busch wrote:
> On Wed, Aug 23, 2017 at 07:58:15PM +0200, Christoph Hellwig wrote:
> > +	/* Anything else could be a path failure, so should be retried */
> > +	spin_lock_irqsave(&ns->head->requeue_lock, flags);
> > +	blk_steal_bios(&ns->head->requeue_list, req);
> > +	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> > +
> > +	nvme_reset_ctrl(ns->ctrl);
> > +	kblockd_schedule_work(&ns->head->requeue_work);
> > +	return true;
> > +}
> 
> It appears this isn't going to cause the path selection to failover for
> the requeued work. The bio's bi_disk is unchanged from the failed path when the
> requeue_work submits the bio again so it will use the same path, right? 

Oh.  This did indeed break with the bi_bdev -> bi_disk refactoring
I did just before sending this out.

> It also looks like new submissions will get a new path only from the
> fact that the original/primary is being reset. The controller reset
> itself seems a bit heavy-handed. Can we just set head->current_path to
> the next active controller in the list?

For ANA we'll have to do that anyway, but if we got a failure
that clearly indicates a path failure, what benefit is there in not
resetting the controller?  But yeah, maybe we can just switch the
path for non-ANA controllers and wait for timeouts to do their work.

> 
> > +static void nvme_requeue_work(struct work_struct *work)
> > +{
> > +	struct nvme_ns_head *head =
> > +		container_of(work, struct nvme_ns_head, requeue_work);
> > +	struct bio *bio, *next;
> > +
> > +	spin_lock_irq(&head->requeue_lock);
> > +	next = bio_list_get(&head->requeue_list);
> > +	spin_unlock_irq(&head->requeue_lock);
> > +
> > +	while ((bio = next) != NULL) {
> > +		next = bio->bi_next;
> > +		bio->bi_next = NULL;
> > +		generic_make_request_fast(bio);
> > +	}
> > +}
> 
> Here, I think we need to reevaluate the path (nvme_find_path) and set
> bio->bi_disk accordingly.

Yes.  Previously this was open-coded and always used head->disk, but
I messed it up at the last minute.  In the end it still worked for my
cases because the controller would either already be reset or fail all
I/O, but this behavior is clearly unintended and suboptimal.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-29 14:55       ` Christoph Hellwig
@ 2017-08-29 15:41         ` Keith Busch
  -1 siblings, 0 replies; 122+ messages in thread
From: Keith Busch @ 2017-08-29 15:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Tue, Aug 29, 2017 at 04:55:59PM +0200, Christoph Hellwig wrote:
> On Tue, Aug 29, 2017 at 10:54:17AM -0400, Keith Busch wrote:
> > It also looks like new submissions will get a new path only from the
> > fact that the original/primary is being reset. The controller reset
> > itself seems a bit heavy-handed. Can we just set head->current_path to
> > the next active controller in the list?
> 
> For ANA we'll have to do that anyway, but if we got a failure
> that clearly indicates a path failure what benefit is there in not
> resetting the controller?  But yeah, maybe we can just switch the
> path for non-ANA controllers and wait for timeouts to do their work.

Okay, sounds reasonable.

Speaking of timeouts, nvme_req_needs_retry will fail the command
immediately rather than try the alternate path if it was cancelled due
to timeout handling. Should we create a new construct for a command's
total time separate from recovery timeout so we may try an alternate
path?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-24 20:17         ` Bart Van Assche
@ 2017-09-05 11:53           ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-09-05 11:53 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, keith.busch, linux-block, axboe, linux-nvme, sagi

On Thu, Aug 24, 2017 at 08:17:32PM +0000, Bart Van Assche wrote:
> For NVMe over RDMA, how about the simulate_network_failure_loop() function in
> https://github.com/bvanassche/srp-test/blob/master/lib/functions? It simulates
> a network failure by writing into the reset_controller sysfs attribute.

FYI, I've tested lots of reset_controllers.  But automating them is of
course even better.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-28  9:06       ` Christoph Hellwig
  (?)
  (?)
@ 2017-09-07 15:17       ` Tony Yang
  -1 siblings, 0 replies; 122+ messages in thread
From: Tony Yang @ 2017-09-07 15:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Jens Axboe, Keith Busch, linux-block, linux-nvme


Hi, Where can I download this patch? Thanks

2017-08-28 17:06 GMT+08:00 Christoph Hellwig <hch@lst.de>:

> On Mon, Aug 28, 2017 at 10:23:33AM +0300, Sagi Grimberg wrote:
> > This is really taking a lot into the nvme driver.
>
> This patch adds about 100 lines of code, out of which 1/4th is the
> parsing of status codes.  I don't think it's all that much..
>
> > I'm not sure if
> > this approach will be used in other block driver, but would it
> > make sense to place the block_device node creation,
>
> I really want to reduce the amount of boilerplate code for creating
> a block queue and I have some ideas for that.  But that's not in
> any way related to this multi-path code.
>
> > the make_request
>
> That basically just does a lookup (in a data structure we need for
> tracking the siblings inside nvme anyway) and then submits the
> I/O again.  The generic code is all in generic_make_request_fast.
> There are two more things which could move into common code, one
> reasonable, the other not:
>
>  (1) the whole requeue_list list logic.  I thought about moving this
>      to the block layer as I'd also have some other use cases for it,
>      but decided to wait until those materialize to see if I'd really
>      need it.
>  (2) turning the logic inside out, e.g. providing a generic make_request
>      and supplying a find_path callback to it.  This would require (1)
>      but even then we'd save basically no code, add an indirect call
>      in the fast path and make things harder to read.  This actually
>      is how I started out and didn't like it.
>
> > and failover logic
>
> The big thing about the failover logic is looking a the detailed error
> codes and changing the controller state, all of which is fundamentally
> driver specific.  The only thing that could reasonably be common is
> the requeue_list handling as mentioned above.  Note that this will
> become even more integrated with the nvme driver once ANA support
> lands.
>
> > and maybe the path selection in the block layer
> > leaving just the construction of the path mappings in nvme?
>
> The point is that path selection fundamentally depends on your
> actual protocol.  E.g. for NVMe we now look at the controller state
> as that's what our keep alive work on, and what reflects that status
> in PCIe. We could try to communicate this up, but it would just lead
> to more data structure that get out of sync.  And this will become
> much worse with ANA where we have even more fine grained state.
> In the end path selection right now is less than 20 lines of code.
>
> Path selection is complicated when you want non-trivial path selector
> algorithms like round robin or service time, but I'd really avoid having
> those on nvme - we already have multiple queues to go beyong the limits
> of a single queue and use blk-mq for that, and once we're beyond the
> limits of a single transport path for performance reasons I'd much
> rather rely on ANA, NUMA nodes or static partitioning of namespaces
> than trying to make dynamic decisions in the fast path.  But if I don't
> get what I want there and we need more complicated algorithms, they
> absolutely should be in common code.  In fact one of my earlier prototypes
> moved the dm-mpath path selector algorithms to core block code and
> tried to use them - it just turned out enabling them was pretty much
> pointless.
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
>


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-09-05 11:53           ` Christoph Hellwig
  (?)
@ 2017-09-11  6:34           ` Tony Yang
  -1 siblings, 0 replies; 122+ messages in thread
From: Tony Yang @ 2017-09-11  6:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, axboe, linux-block, sagi, linux-nvme, keith.busch, hch


Hi, All

I used this URL, http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-mpath,
to download core.c and nvme.h, and then tried to compile the kernel.
It fails with the following errors and the build cannot complete.
What is the problem?  Thanks

drivers/net/wireless/wl3501_cs.c: In function ‘wl3501_receive’:
drivers/net/wireless/wl3501_cs.c:765: warning: ‘next_addr’ may be used
uninitialized in this function
drivers/net/wireless/wl3501_cs.c:794: warning: ‘next_addr1’ may be used
uninitialized in this function
  CC [M]  drivers/net/wireless/rndis_wlan.o
  CC [M]  drivers/net/wireless/mac80211_hwsim.o
  LD      drivers/net/built-in.o
  CC [M]  drivers/net/dummy.o
  CC [M]  drivers/net/ifb.o
  CC [M]  drivers/net/macvlan.o
drivers/net/macvlan.c: In function ‘macvlan_changelink’:
drivers/net/macvlan.c:1387: warning: ‘mode’ may be used uninitialized in
this function
  CC [M]  drivers/net/macvtap.o
  CC [M]  drivers/net/mii.o
  CC [M]  drivers/net/mdio.o
  CC [M]  drivers/net/netconsole.o
  CC [M]  drivers/net/tun.o
  CC [M]  drivers/net/veth.o
  CC [M]  drivers/net/virtio_net.o
  CC [M]  drivers/net/vxlan.o
  CC [M]  drivers/net/sungem_phy.o
  LD      drivers/nfc/built-in.o
  LD      drivers/nvme/host/built-in.o
  CC [M]  drivers/nvme/host/core.o
In file included from drivers/nvme/host/core.c:32:
drivers/nvme/host/nvme.h:22:28: error: linux/sed-opal.h: No such file or
directory
In file included from drivers/nvme/host/core.c:32:
drivers/nvme/host/nvme.h:86: error: field ‘result’ has incomplete type
drivers/nvme/host/nvme.h:220: error: expected specifier-qualifier-list
before ‘uuid_t’
drivers/nvme/host/nvme.h: In function ‘nvme_cleanup_cmd’:
drivers/nvme/host/nvme.h:288: error: ‘struct request’ has no member named
‘rq_flags’
drivers/nvme/host/nvme.h:288: error: ‘RQF_SPECIAL_PAYLOAD’ undeclared
(first use in this function)
drivers/nvme/host/nvme.h:288: error: (Each undeclared identifier is
reported only once
drivers/nvme/host/nvme.h:288: error: for each function it appears in.)
drivers/nvme/host/nvme.h:289: error: ‘struct request’ has no member named
‘special_vec’
drivers/nvme/host/nvme.h:290: error: ‘struct request’ has no member named
‘special_vec’
drivers/nvme/host/nvme.h: At top level:
drivers/nvme/host/nvme.h:295: error: parameter 3 (‘result’) has incomplete
type
drivers/nvme/host/nvme.h: In function ‘nvme_end_request’:
drivers/nvme/host/nvme.h:301: error: too few arguments to function
‘blk_mq_complete_request’
drivers/nvme/host/nvme.h: At top level:
drivers/nvme/host/nvme.h:341: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or
‘__attribute__’ before ‘nvme_setup_cmd’
drivers/nvme/host/core.c:109: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or
‘__attribute__’ before ‘nvme_error_status’
drivers/nvme/host/core.c: In function ‘nvme_failover_rq’:
drivers/nvme/host/core.c:169: error: ‘NVME_SC_ONCS_NOT_SUPPORTED’
undeclared (first use in this function)
drivers/nvme/host/core.c:184: error: ‘NVME_SC_UNWRITTEN_BLOCK’ undeclared
(first use in this function)
drivers/nvme/host/core.c:190: error: implicit declaration of function
‘blk_steal_bios’
drivers/nvme/host/core.c: In function ‘nvme_complete_rq’:
drivers/nvme/host/core.c:217: error: too many arguments to function
‘blk_mq_requeue_request’
drivers/nvme/host/core.c:221: error: implicit declaration of function
‘nvme_error_status’
drivers/nvme/host/core.c: In function ‘nvme_cancel_request’:
drivers/nvme/host/core.c:239: error: too few arguments to function
‘blk_mq_complete_request’
drivers/nvme/host/core.c: In function ‘nvme_destroy_ns_head’:
drivers/nvme/host/core.c:334: error: ‘struct nvme_ns_head’ has no member
named ‘ref’
drivers/nvme/host/core.c:334: error: type defaults to ‘int’ in declaration
of ‘__mptr’
drivers/nvme/host/core.c:334: warning: initialization from incompatible
pointer type
drivers/nvme/host/core.c:334: error: ‘struct nvme_ns_head’ has no member
named ‘ref’
drivers/nvme/host/core.c:343: error: ‘struct nvme_ns_head’ has no member
named ‘instance’
drivers/nvme/host/core.c:345: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c: In function ‘nvme_put_ns_head’:
drivers/nvme/host/core.c:352: error: ‘struct nvme_ns_head’ has no member
named ‘ref’
drivers/nvme/host/core.c: In function ‘nvme_alloc_request’:
drivers/nvme/host/core.c:408: error: ‘REQ_OP_DRV_OUT’ undeclared (first use
in this function)
drivers/nvme/host/core.c:408: error: ‘REQ_OP_DRV_IN’ undeclared (first use
in this function)
drivers/nvme/host/core.c: In function ‘nvme_toggle_streams’:
drivers/nvme/host/core.c:433: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:433: error: ‘nvme_admin_directive_send’ undeclared
(first use in this function)
drivers/nvme/host/core.c:434: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:434: error: ‘NVME_NSID_ALL’ undeclared (first use
in this function)
drivers/nvme/host/core.c:435: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:435: error: ‘NVME_DIR_SND_ID_OP_ENABLE’ undeclared
(first use in this function)
drivers/nvme/host/core.c:436: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:436: error: ‘NVME_DIR_IDENTIFY’ undeclared (first
use in this function)
drivers/nvme/host/core.c:437: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:437: error: ‘NVME_DIR_STREAMS’ undeclared (first
use in this function)
drivers/nvme/host/core.c:438: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:438: error: ‘NVME_DIR_ENDIR’ undeclared (first use
in this function)
drivers/nvme/host/core.c: At top level:
drivers/nvme/host/core.c:454: warning: ‘struct streams_directive_params’
declared inside parameter list
drivers/nvme/host/core.c:454: warning: its scope is only this definition or
declaration, which is probably not what you want
drivers/nvme/host/core.c: In function ‘nvme_get_stream_params’:
drivers/nvme/host/core.c:459: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:461: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:461: error: ‘nvme_admin_directive_recv’ undeclared
(first use in this function)
drivers/nvme/host/core.c:462: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:463: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:463: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:464: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:464: error: ‘NVME_DIR_RCV_ST_OP_PARAM’ undeclared
(first use in this function)
drivers/nvme/host/core.c:465: error: ‘struct nvme_command’ has no member
named ‘directive’
drivers/nvme/host/core.c:465: error: ‘NVME_DIR_STREAMS’ undeclared (first
use in this function)
drivers/nvme/host/core.c:467: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c: In function ‘nvme_configure_directives’:
drivers/nvme/host/core.c:472: error: storage size of ‘s’ isn’t known
drivers/nvme/host/core.c:475: error: ‘NVME_CTRL_OACS_DIRECTIVES’ undeclared
(first use in this function)
drivers/nvme/host/core.c:484: error: ‘NVME_NSID_ALL’ undeclared (first use
in this function)
drivers/nvme/host/core.c:489: error: ‘BLK_MAX_WRITE_HINTS’ undeclared
(first use in this function)
drivers/nvme/host/core.c:472: warning: unused variable ‘s’
drivers/nvme/host/core.c: In function ‘nvme_assign_write_stream’:
drivers/nvme/host/core.c:509: error: variable ‘streamid’ has initializer
but incomplete type
drivers/nvme/host/core.c:509: error: ‘struct request’ has no member named
‘write_hint’
drivers/nvme/host/core.c:509: error: storage size of ‘streamid’ isn’t known
drivers/nvme/host/core.c:511: error: ‘WRITE_LIFE_NOT_SET’ undeclared (first
use in this function)
drivers/nvme/host/core.c:511: error: ‘WRITE_LIFE_NONE’ undeclared (first
use in this function)
drivers/nvme/host/core.c:518: error: ‘NVME_RW_DTYPE_STREAMS’ undeclared
(first use in this function)
drivers/nvme/host/core.c:522: error: ‘struct request_queue’ has no member
named ‘write_hints’
drivers/nvme/host/core.c:522: error: ‘struct request_queue’ has no member
named ‘write_hints’
drivers/nvme/host/core.c:522: error: ‘struct request_queue’ has no member
named ‘write_hints’
drivers/nvme/host/core.c:522: error: ‘struct request_queue’ has no member
named ‘write_hints’
drivers/nvme/host/core.c:522: error: type defaults to ‘int’ in declaration
of ‘type name’
drivers/nvme/host/core.c:522: error: type defaults to ‘int’ in declaration
of ‘type name’
drivers/nvme/host/core.c:522: error: negative width in bit-field
‘<anonymous>’
drivers/nvme/host/core.c:523: error: ‘struct request_queue’ has no member
named ‘write_hints’
drivers/nvme/host/core.c:509: warning: unused variable ‘streamid’
drivers/nvme/host/core.c: At top level:
drivers/nvme/host/core.c:534: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or
‘__attribute__’ before ‘nvme_setup_discard’
drivers/nvme/host/core.c:574: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or
‘__attribute__’ before ‘nvme_setup_rw’
drivers/nvme/host/core.c:630: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or
‘__attribute__’ before ‘nvme_setup_cmd’
drivers/nvme/host/core.c:666: error: ‘nvme_setup_cmd’ undeclared here (not
in a function)
drivers/nvme/host/core.c:666: error: type defaults to ‘int’ in declaration
of ‘nvme_setup_cmd’
drivers/nvme/host/core.c: In function ‘__nvme_submit_sync_cmd’:
drivers/nvme/host/core.c:693: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c: In function ‘__nvme_submit_user_cmd’:
drivers/nvme/host/core.c:740: error: ‘struct bio’ has no member named
‘bi_disk’
drivers/nvme/host/core.c: At top level:
drivers/nvme/host/core.c:806: error: expected declaration specifiers or
‘...’ before ‘blk_status_t’
drivers/nvme/host/core.c: In function ‘nvme_keep_alive_end_io’:
drivers/nvme/host/core.c:812: error: ‘status’ undeclared (first use in this
function)
drivers/nvme/host/core.c: In function ‘nvme_keep_alive’:
drivers/nvme/host/core.c:838: warning: passing argument 5 of
‘blk_execute_rq_nowait’ from incompatible pointer type
./include/linux/blkdev.h:837: note: expected ‘void (*)(struct request *,
int)’ but argument is of type ‘void (*)(struct request *)’
drivers/nvme/host/core.c: In function ‘nvme_identify_ctrl’:
drivers/nvme/host/core.c:882: error: ‘NVME_ID_CNS_CTRL’ undeclared (first
use in this function)
drivers/nvme/host/core.c: At top level:
drivers/nvme/host/core.c:896: error: expected declaration specifiers or
‘...’ before ‘uuid_t’
drivers/nvme/host/core.c: In function ‘nvme_identify_ns_descs’:
drivers/nvme/host/core.c:906: error: ‘NVME_ID_CNS_NS_DESC_LIST’ undeclared
(first use in this function)
drivers/nvme/host/core.c:908: error: ‘NVME_IDENTIFY_DATA_SIZE’ undeclared
(first use in this function)
drivers/nvme/host/core.c:920: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:923: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:924: error: ‘NVME_NIDT_EUI64’ undeclared (first
use in this function)
drivers/nvme/host/core.c:925: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:925: error: ‘NVME_NIDT_EUI64_LEN’ undeclared
(first use in this function)
drivers/nvme/host/core.c:928: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:932: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:934: error: ‘NVME_NIDT_NGUID’ undeclared (first
use in this function)
drivers/nvme/host/core.c:935: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:935: error: ‘NVME_NIDT_NGUID_LEN’ undeclared
(first use in this function)
drivers/nvme/host/core.c:938: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:942: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:944: error: ‘NVME_NIDT_UUID’ undeclared (first use
in this function)
drivers/nvme/host/core.c:945: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:945: error: ‘NVME_NIDT_UUID_LEN’ undeclared (first
use in this function)
drivers/nvme/host/core.c:948: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:952: error: implicit declaration of function
‘uuid_copy’
drivers/nvme/host/core.c:952: error: ‘uuid’ undeclared (first use in this
function)
drivers/nvme/host/core.c:952: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:956: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:960: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c: In function ‘nvme_identify_ns_list’:
drivers/nvme/host/core.c:972: error: ‘NVME_ID_CNS_NS_ACTIVE_LIST’
undeclared (first use in this function)
drivers/nvme/host/core.c: In function ‘nvme_identify_ns’:
drivers/nvme/host/core.c:987: error: ‘NVME_ID_CNS_NS’ undeclared (first use
in this function)
drivers/nvme/host/core.c: In function ‘nvme_set_features’:
drivers/nvme/host/core.c:1007: error: storage size of ‘res’ isn’t known
drivers/nvme/host/core.c:1007: warning: unused variable ‘res’
drivers/nvme/host/core.c: In function ‘nvme_ioctl’:
drivers/nvme/host/core.c:1162: error: implicit declaration of function
‘is_sed_ioctl’
drivers/nvme/host/core.c:1163: error: implicit declaration of function
‘sed_ioctl’
drivers/nvme/host/core.c: In function ‘nvme_config_discard’:
drivers/nvme/host/core.c:1275: error: ‘NVME_DSM_MAX_RANGES’ undeclared
(first use in this function)
drivers/nvme/host/core.c:1288: error: implicit declaration of function
‘blk_queue_max_discard_segments’
drivers/nvme/host/core.c:1292: error: implicit declaration of function
‘blk_queue_max_write_zeroes_sectors’
drivers/nvme/host/core.c: At top level:
drivers/nvme/host/core.c:1296: error: expected declaration specifiers or
‘...’ before ‘uuid_t’
drivers/nvme/host/core.c:1298:33: error: macro "NVME_VS" passed 3
arguments, but takes just 2
drivers/nvme/host/core.c: In function ‘nvme_report_ns_ids’:
drivers/nvme/host/core.c:1298: error: ‘NVME_VS’ undeclared (first use in
this function)
drivers/nvme/host/core.c:1300:33: error: macro "NVME_VS" passed 3
arguments, but takes just 2
drivers/nvme/host/core.c:1302:33: error: macro "NVME_VS" passed 3
arguments, but takes just 2
drivers/nvme/host/core.c:1306: error: ‘uuid’ undeclared (first use in this
function)
drivers/nvme/host/core.c:1306: error: too many arguments to function
‘nvme_identify_ns_descs’
drivers/nvme/host/core.c: In function ‘__nvme_revalidate_disk’:
drivers/nvme/host/core.c:1326: error: ‘struct nvme_id_ns’ has no member
named ‘noiob’
drivers/nvme/host/core.c: In function ‘nvme_revalidate_disk’:
drivers/nvme/host/core.c:1361: error: ‘uuid_t’ undeclared (first use in
this function)
drivers/nvme/host/core.c:1361: error: expected ‘;’ before ‘uuid’
drivers/nvme/host/core.c:1362: warning: ISO C90 forbids mixed declarations
and code
drivers/nvme/host/core.c:1378: error: ‘uuid’ undeclared (first use in this
function)
drivers/nvme/host/core.c:1378: error: too many arguments to function
‘nvme_report_ns_ids’
drivers/nvme/host/core.c:1379: error: implicit declaration of function
‘uuid_equal’
drivers/nvme/host/core.c:1379: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c: In function ‘nvme_enable_ctrl’:
drivers/nvme/host/core.c:1589: error: ‘NVME_CC_AMS_RR’ undeclared (first
use in this function)
drivers/nvme/host/core.c: In function ‘nvme_configure_timestamp’:
drivers/nvme/host/core.c:1656: error: ‘NVME_CTRL_ONCS_TIMESTAMP’ undeclared
(first use in this function)
drivers/nvme/host/core.c:1660: error: ‘NVME_FEAT_TIMESTAMP’ undeclared
(first use in this function)
drivers/nvme/host/core.c: In function ‘nvme_configure_apst’:
drivers/nvme/host/core.c:1704: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:1726: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:1777: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:1783: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:1877:33: error: macro "NVME_VS" passed 3
arguments, but takes just 2
drivers/nvme/host/core.c: In function ‘nvme_init_subnqn’:
drivers/nvme/host/core.c:1877: error: ‘NVME_VS’ undeclared (first use in
this function)
drivers/nvme/host/core.c: In function ‘nvme_init_subsystem’:
drivers/nvme/host/core.c:1946: error: ‘struct nvme_id_ctrl’ has no member
named ‘cmic’
drivers/nvme/host/core.c:1995:33: error: macro "NVME_VS" passed 3
arguments, but takes just 2
drivers/nvme/host/core.c: In function ‘nvme_init_identify’:
drivers/nvme/host/core.c:1995: error: ‘NVME_VS’ undeclared (first use in
this function)
drivers/nvme/host/core.c:2091: error: ‘struct nvme_id_ctrl’ has no member
named ‘hmpre’
drivers/nvme/host/core.c:2092: error: ‘struct nvme_id_ctrl’ has no member
named ‘hmmin’
drivers/nvme/host/core.c: In function ‘wwid_show’:
drivers/nvme/host/core.c:2254: error: implicit declaration of function
‘uuid_is_null’
drivers/nvme/host/core.c:2254: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c:2255: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c: In function ‘uuid_show’:
drivers/nvme/host/core.c:2292: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c:2297: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c: In function ‘nvme_ns_attrs_are_visible’:
drivers/nvme/host/core.c:2333: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c: In function ‘nvme_make_request’:
drivers/nvme/host/core.c:2506: error: ‘struct bio’ has no member named
‘bi_disk’
drivers/nvme/host/core.c:2508: error: implicit declaration of function
‘generic_make_request_fast’
drivers/nvme/host/core.c:2518: error: ‘struct bio’ has no member named
‘bi_status’
drivers/nvme/host/core.c:2518: error: ‘BLK_STS_IOERR’ undeclared (first use
in this function)
drivers/nvme/host/core.c: In function ‘__nvme_find_ns_head’:
drivers/nvme/host/core.c:2554: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2554: error: type defaults to ‘int’ in declaration
of ‘__mptr’
drivers/nvme/host/core.c:2554: warning: initialization from incompatible
pointer type
drivers/nvme/host/core.c:2554: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2554: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2554: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2554: error: type defaults to ‘int’ in declaration
of ‘__mptr’
drivers/nvme/host/core.c:2554: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2554: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2555: error: ‘struct nvme_ns_head’ has no member
named ‘ref’
drivers/nvme/host/core.c: In function ‘__nvme_check_ids’:
drivers/nvme/host/core.c:2569: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2569: error: type defaults to ‘int’ in declaration
of ‘__mptr’
drivers/nvme/host/core.c:2569: warning: initialization from incompatible
pointer type
drivers/nvme/host/core.c:2569: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2569: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2569: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2569: error: type defaults to ‘int’ in declaration
of ‘__mptr’
drivers/nvme/host/core.c:2569: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2569: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2570: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c:2571: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c:2571: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c: In function ‘nvme_alloc_ns_head’:
drivers/nvme/host/core.c:2599: error: ‘struct nvme_ns_head’ has no member
named ‘ref’
drivers/nvme/host/core.c:2602: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c:2602: error: too many arguments to function
‘nvme_report_ns_ids’
drivers/nvme/host/core.c:2622: error: ‘struct nvme_ns_head’ has no member
named ‘instance’
drivers/nvme/host/core.c:2623: error: ‘struct nvme_ns_head’ has no member
named ‘instance’
drivers/nvme/host/core.c:2633: error: ‘struct nvme_ns_head’ has no member
named ‘instance’
drivers/nvme/host/core.c:2635: error: ‘struct nvme_ns_head’ has no member
named ‘entry’
drivers/nvme/host/core.c:2639: error: ‘struct nvme_ns_head’ has no member
named ‘instance’
drivers/nvme/host/core.c: In function ‘nvme_init_ns_head’:
drivers/nvme/host/core.c:2670: error: ‘uuid_t’ undeclared (first use in
this function)
drivers/nvme/host/core.c:2670: error: expected ‘;’ before ‘uuid’
drivers/nvme/host/core.c:2672: error: ‘uuid’ undeclared (first use in this
function)
drivers/nvme/host/core.c:2672: error: too many arguments to function
‘nvme_report_ns_ids’
drivers/nvme/host/core.c:2673: error: ‘struct nvme_ns_head’ has no member
named ‘uuid’
drivers/nvme/host/core.c: In function ‘nvme_setup_streams_ns’:
drivers/nvme/host/core.c:2722: error: storage size of ‘s’ isn’t known
drivers/nvme/host/core.c:2722: warning: unused variable ‘s’
drivers/nvme/host/core.c:2964:33: error: macro "NVME_VS" passed 3
arguments, but takes just 2
drivers/nvme/host/core.c: In function ‘nvme_scan_work’:
drivers/nvme/host/core.c:2964: error: ‘NVME_VS’ undeclared (first use in
this function)
drivers/nvme/host/core.c: In function ‘nvme_ctrl_pp_status’:
drivers/nvme/host/core.c:3038: error: ‘NVME_CSTS_PP’ undeclared (first use
in this function)
drivers/nvme/host/core.c: In function ‘nvme_get_fw_slot_info’:
drivers/nvme/host/core.c:3046: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:3051: error: ‘NVME_NSID_ALL’ undeclared (first use
in this function)
drivers/nvme/host/core.c:3052: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:3054: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c: In function ‘nvme_complete_async_event’:
drivers/nvme/host/core.c:3095: error: dereferencing pointer to incomplete
type
drivers/nvme/host/core.c:3118: error: ‘NVME_AER_NOTICE_FW_ACT_STARTING’
undeclared (first use in this function)
drivers/nvme/host/core.c: In function ‘nvme_kill_queues’:
drivers/nvme/host/core.c:3294: error: implicit declaration of function
‘blk_mq_unquiesce_queue’
drivers/nvme/host/core.c: In function ‘nvme_wait_freeze_timeout’:
drivers/nvme/host/core.c:3330: error: implicit declaration of function
‘blk_mq_freeze_queue_wait_timeout’
drivers/nvme/host/core.c: In function ‘nvme_wait_freeze’:
drivers/nvme/host/core.c:3344: error: implicit declaration of function
‘blk_mq_freeze_queue_wait’
drivers/nvme/host/core.c: In function ‘nvme_start_freeze’:
drivers/nvme/host/core.c:3355: error: implicit declaration of function
‘blk_freeze_queue_start’
drivers/nvme/host/core.c: In function ‘nvme_stop_queues’:
drivers/nvme/host/core.c:3366: error: implicit declaration of function
‘blk_mq_quiesce_queue’
make[3]: *** [drivers/nvme/host/core.o] Error 1
make[2]: *** [drivers/nvme/host] Error 2
make[1]: *** [drivers/nvme] Error 2
make: *** [drivers] Error 2
[root@cescel01 linux-4.8.17]#

2017-09-05 19:53 GMT+08:00 Christoph Hellwig <hch@infradead.org>:

> On Thu, Aug 24, 2017 at 08:17:32PM +0000, Bart Van Assche wrote:
> > For NVMe over RDMA, how about the simulate_network_failure_loop()
> function in
> > https://github.com/bvanassche/srp-test/blob/master/lib/functions? It
> simulates
> > a network failure by writing into the reset_controller sysfs attribute.
>
> FYI, I've tested lots of reset_controllers.  But automating them is of
> course even better.
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
>
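The reset loop Bart points at (srp-test's simulate_network_failure_loop) boils down to writing the reset_controller sysfs attribute in a loop while I/O runs. A minimal sketch; the controller name and timing in the usage example are illustrative assumptions, not taken from the thread:

```shell
# Repeatedly kick a controller's reset_controller sysfs attribute to
# simulate path/network failures while I/O is in flight.
# Arguments: attribute path, iteration count, sleep interval (seconds).
reset_loop() {
    attr=$1
    count=$2
    interval=${3:-5}
    i=0
    while [ "$i" -lt "$count" ]; do
        echo 1 > "$attr"       # any write triggers a controller reset
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example usage (hypothetical controller name):
# reset_loop /sys/class/nvme/nvme0/reset_controller 10 5
```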

[-- Attachment #2: Type: text/html, Size: 29709 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems
  2017-08-29 15:41         ` Keith Busch
@ 2017-09-18  0:17           ` Christoph Hellwig
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2017-09-18  0:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Tue, Aug 29, 2017 at 11:41:38AM -0400, Keith Busch wrote:
> Speaking of timeouts, nvme_req_needs_retry will fail the command
> immediately rather than try the alternate path if it was cancelled due
> to timeout handling. Should we create a new construct for a command's
> total time separate from recovery timeout so we may try an alternate
> path?

We probably need to anyway, see the discussion with James in the
other thread.

^ permalink raw reply	[flat|nested] 122+ messages in thread

end of thread, other threads:[~2017-09-18  0:17 UTC | newest]

Thread overview: 122+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-23 17:58 RFC: nvme multipath support Christoph Hellwig
2017-08-23 17:58 ` [PATCH 01/10] nvme: report more detailed status codes to the block layer Christoph Hellwig
2017-08-28  6:06   ` Sagi Grimberg
2017-08-28 18:50   ` Keith Busch
2017-08-23 17:58 ` [PATCH 02/10] nvme: allow calling nvme_change_ctrl_state from irq context Christoph Hellwig
2017-08-28  6:06   ` Sagi Grimberg
2017-08-28 18:50   ` Keith Busch
2017-08-23 17:58 ` [PATCH 03/10] nvme: remove unused struct nvme_ns fields Christoph Hellwig
2017-08-28  6:07   ` Sagi Grimberg
2017-08-28 19:13   ` Keith Busch
2017-08-23 17:58 ` [PATCH 04/10] nvme: remove nvme_revalidate_ns Christoph Hellwig
2017-08-28  6:12   ` Sagi Grimberg
2017-08-28 19:14   ` Keith Busch
2017-08-23 17:58 ` [PATCH 05/10] nvme: don't blindly overwrite identifiers on disk revalidate Christoph Hellwig
2017-08-28  6:17   ` Sagi Grimberg
2017-08-28  6:23     ` Christoph Hellwig
2017-08-28  6:32       ` Sagi Grimberg
2017-08-28 19:15   ` Keith Busch
2017-08-23 17:58 ` [PATCH 06/10] nvme: track subsystems Christoph Hellwig
2017-08-23 22:04   ` Keith Busch
2017-08-24  8:52     ` Christoph Hellwig
2017-08-28  6:22   ` Sagi Grimberg
2017-08-23 17:58 ` [PATCH 07/10] nvme: track shared namespaces Christoph Hellwig
2017-08-28  6:51   ` Sagi Grimberg
2017-08-28  8:50     ` Christoph Hellwig
2017-08-28 20:21     ` J Freyensee
2017-08-29  8:25       ` Christoph Hellwig
2017-08-29  6:54     ` Guan Junxiong
2017-08-28 12:04   ` javigon
2017-08-28 12:41   ` Guan Junxiong
2017-08-28 14:30     ` Christoph Hellwig
2017-08-29  2:42       ` Guan Junxiong
2017-08-29  8:30         ` Christoph Hellwig
2017-08-29  8:29     ` Christoph Hellwig
2017-08-28 19:18   ` Keith Busch
2017-08-23 17:58 ` [PATCH 08/10] block: provide a generic_make_request_fast helper Christoph Hellwig
2017-08-28  7:00   ` Sagi Grimberg
2017-08-28  8:54     ` Christoph Hellwig
2017-08-28 11:01       ` Sagi Grimberg
2017-08-28 11:54         ` Christoph Hellwig
2017-08-28 12:38           ` Sagi Grimberg
2017-08-23 17:58 ` [PATCH 09/10] blk-mq: add a blk_steal_bios helper Christoph Hellwig
2017-08-28  7:04   ` Sagi Grimberg
2017-08-23 17:58 ` [PATCH 10/10] nvme: implement multipath access to nvme subsystems Christoph Hellwig
2017-08-23 18:21   ` Bart Van Assche
2017-08-24  8:59     ` hch
2017-08-24 20:17       ` Bart Van Assche
2017-09-05 11:53         ` Christoph Hellwig
2017-09-11  6:34           ` Tony Yang
2017-08-23 22:53   ` Keith Busch
2017-08-24  8:52     ` Christoph Hellwig
2017-08-28  7:23   ` Sagi Grimberg
2017-08-28  9:06     ` Christoph Hellwig
2017-08-28 13:40       ` Sagi Grimberg
2017-08-28 14:24         ` Christoph Hellwig
2017-09-07 15:17       ` Tony Yang
2017-08-29 10:22   ` Guan Junxiong
2017-08-29 14:51     ` Christoph Hellwig
2017-08-29 14:54   ` Keith Busch
2017-08-29 14:55     ` Christoph Hellwig
2017-08-29 15:41       ` Keith Busch
2017-09-18  0:17         ` Christoph Hellwig
