* nvme multipath support V3
@ 2017-09-25 13:40 ` Christoph Hellwig
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

Hi all,

this series adds support for multipathing, that is, accessing nvme
namespaces through multiple controllers, to the nvme core driver.

It is a very thin and efficient implementation that relies on
close cooperation with other bits of the nvme driver, and a few
small and simple block helpers.

Compared to dm-multipath the important differences are how management
of the paths is done, and how the I/O path works.

Management of the paths is fully integrated into the nvme driver:
for each newly found nvme controller we check if there are other
controllers that refer to the same subsystem, and if so we link them
up in the nvme driver.  Then for each namespace found we check if
the namespace id and identifiers match to see whether we have multiple
controllers that refer to the same namespace.  For now path
availability is based entirely on the controller status, which at
least for fabrics will be continuously updated based on the mandatory
keep alive timer.  Once the Asymmetric Namespace Access (ANA)
proposal passes in NVMe we will also get per-namespace states in
addition to that, but for now any details of that remain confidential
to NVMe members.

The I/O path is very different from the existing multipath drivers,
which is enabled by the fact that NVMe (unlike SCSI) does not support
partial completions - a controller will either complete a whole
command or not, but never only complete parts of it.  Because of that
there is no need to clone bios or requests - the I/O path simply
redirects the I/O to a suitable path.  For successful commands
multipath is not in the completion stack at all.  For failed commands
we decide if the error could be a path failure, and if so remove
the bios from the request structure and requeue them before completing
the request.  Altogether this means there is no performance
degradation compared to normal nvme operation when using the multipath
device node (at least not until I find a dual ported DRAM backed
device :))
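
The completion-side logic described above can be sketched as a small userspace model (all types and the is_path_error() predicate are simplified stand-ins, not the driver's actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in types: just enough fields to model the failover decision. */
struct bio { struct bio *bi_next; };

struct request {
	struct bio *bio;      /* first uncompleted bio */
	struct bio *biotail;  /* last bio */
	int status;           /* 0 on success, nonzero on error */
};

struct bio_list { struct bio *head, *tail; };

/* Hypothetical predicate: could another path cure this error?  The real
 * driver inspects the NVMe status code; here any error counts. */
static bool is_path_error(int status)
{
	return status != 0;
}

/*
 * Completion-side logic from the cover letter: a successful request (or a
 * non-path error) completes as usual; a path failure detaches the bios
 * from the request and appends them to a requeue list, so they can be
 * resubmitted on another path before the request itself is completed.
 */
static bool failover_needed(struct request *rq, struct bio_list *requeue)
{
	if (!is_path_error(rq->status))
		return false;

	if (rq->bio) {
		if (requeue->tail)
			requeue->tail->bi_next = rq->bio;
		else
			requeue->head = rq->bio;
		requeue->tail = rq->biotail;
	}
	rq->bio = rq->biotail = NULL;
	return true;
}
```

Stealing the bios on failure rather than cloning them up front is what keeps the successful-I/O path free of any multipath overhead.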

There are a couple questions left in the individual patches, comments
welcome.

A git tree is available at:

   git://git.infradead.org/users/hch/block.git nvme-mpath

gitweb:

   http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-mpath

Changes since V2:
  - don't create duplicate subsystems on reset (Keith Busch)
  - free requests properly when failing over in I/O completion (Keith Busch)
  - new device names: /dev/nvm-sub%dn%d
  - expose the namespace identification sysfs files for the mpath nodes

Changes since V1:
  - introduce new nvme_ns_ids structure to clean up identifier handling
  - generic_make_request_fast is now named direct_make_request and calls
    generic_make_request_checks
  - reset bi_disk on resubmission
  - create sysfs links between the existing nvme namespace block devices and
    the new shared mpath device
  - temporarily added the timeout patches from James, this should go into
    nvme-4.14, though

* [PATCH 1/9] block: move REQ_NOWAIT
  2017-09-25 13:40 ` Christoph Hellwig
@ 2017-09-25 13:40   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This flag should be before the operation-specific REQ_NOUNMAP bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/blk_types.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a2d2aa709cef..acc2f3cdc2fc 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -224,11 +224,11 @@ enum req_flag_bits {
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
+	__REQ_NOWAIT,           /* Don't wait if request will block */
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
-	__REQ_NOWAIT,           /* Don't wait if request will block */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -245,9 +245,9 @@ enum req_flag_bits {
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
+#define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
-#define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.14.1

* [PATCH 2/9] block: add REQ_DRV bit
  2017-09-25 13:40 ` Christoph Hellwig
@ 2017-09-25 13:40   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

Set aside a bit in the request/bio flags for driver use.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/blk_types.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index acc2f3cdc2fc..7ec2ed097a8a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -229,6 +229,9 @@ enum req_flag_bits {
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
+	/* for driver use */
+	__REQ_DRV,
+
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -249,6 +252,8 @@ enum req_flag_bits {
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 
+#define REQ_DRV			(1ULL << __REQ_DRV)
+
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
-- 
2.14.1

* [PATCH 3/9] block: provide a direct_make_request helper
  2017-09-25 13:40 ` Christoph Hellwig
@ 2017-09-25 13:40   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues, and it does not provide any recursion prevention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c       | 32 ++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 048be4aa6024..d7b5d392b550 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2241,6 +2241,38 @@ blk_qc_t generic_make_request(struct bio *bio)
 }
 EXPORT_SYMBOL(generic_make_request);
 
+/**
+ * direct_make_request - hand a buffer directly to its device driver for I/O
+ * @bio:  The bio describing the location in memory and on the device.
+ *
+ * This function behaves like generic_make_request(), but does not protect
+ * against recursion.  Must only be used if the called driver is known
+ * to not re-issue this bio or a split to itself.
+ */
+blk_qc_t direct_make_request(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	bool nowait = bio->bi_opf & REQ_NOWAIT;
+	blk_qc_t ret;
+
+	if (!generic_make_request_checks(bio))
+		return BLK_QC_T_NONE;
+
+	if (unlikely(blk_queue_enter(q, nowait))) {
+		if (nowait && !blk_queue_dying(q))
+			bio->bi_status = BLK_STS_AGAIN;
+		else
+			bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+		return BLK_QC_T_NONE;
+	}
+
+	ret = q->make_request_fn(q, bio);
+	blk_queue_exit(q);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(direct_make_request);
+
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The &struct bio which describes the I/O
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 02fa42d24b52..780f01db5899 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -936,6 +936,7 @@ do {								\
 extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
 extern blk_qc_t generic_make_request(struct bio *bio);
+extern blk_qc_t direct_make_request(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 extern void blk_put_request(struct request *);
-- 
2.14.1
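
A small userspace model of just the error-selection branch in the helper above (the STS_* names and the function are stand-ins, not kernel symbols): when blk_queue_enter() fails, a REQ_NOWAIT bio on a queue that is not dying gets an "again" status so the submitter may retry later, while anything else is a hard I/O error.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel's BLK_STS_* values. */
enum sts { STS_AGAIN, STS_IOERR };

/* Models only the error selection in direct_make_request() once
 * blk_queue_enter() has failed: REQ_NOWAIT on a live queue means the
 * caller asked not to block and may retry; a dying queue (or a blocking
 * submitter) gets a hard I/O error instead. */
static enum sts enter_failed_status(bool nowait, bool queue_dying)
{
	if (nowait && !queue_dying)
		return STS_AGAIN;
	return STS_IOERR;
}
```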

* [PATCH 4/9] block: add a blk_steal_bios helper
  2017-09-25 13:40 ` Christoph Hellwig
@ 2017-09-25 13:40   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This helper allows stealing the uncompleted bios from a request so
that they can be reissued on another path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/blk-core.c       | 20 ++++++++++++++++++++
 include/linux/blkdev.h |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index d7b5d392b550..856fb626d9fe 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2727,6 +2727,26 @@ struct request *blk_fetch_request(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_fetch_request);
 
+/*
+ * Steal bios from a request.  The request must not have been partially
+ * completed before.
+ */
+void blk_steal_bios(struct bio_list *list, struct request *rq)
+{
+	if (rq->bio) {
+		if (list->tail)
+			list->tail->bi_next = rq->bio;
+		else
+			list->head = rq->bio;
+		list->tail = rq->biotail;
+	}
+
+	rq->bio = NULL;
+	rq->biotail = NULL;
+	rq->__data_len = 0;
+}
+EXPORT_SYMBOL_GPL(blk_steal_bios);
+
 /**
  * blk_update_request - Special helper function for request stacking drivers
  * @req:      the request being processed
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 780f01db5899..45c63764a14e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1110,6 +1110,8 @@ extern struct request *blk_peek_request(struct request_queue *q);
 extern void blk_start_request(struct request *rq);
 extern struct request *blk_fetch_request(struct request_queue *q);
 
+void blk_steal_bios(struct bio_list *list, struct request *rq);
+
 /*
  * Request completion related functions.
  *
-- 
2.14.1
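
The splice semantics can be exercised outside the kernel with stand-in structs carrying just the fields blk_steal_bios() touches.  This sketch repeats the patch's logic so the intended behavior — bios from successive requests are appended to the list in order, and each request is left empty — can be checked directly:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the fields blk_steal_bios() touches. */
struct bio { struct bio *bi_next; };
struct bio_list { struct bio *head, *tail; };
struct request {
	struct bio *bio, *biotail;
	unsigned int __data_len;
};

/* Same splice logic as the patch, on the stand-in types: append the
 * request's bio chain to the list and leave the request empty. */
static void steal_bios(struct bio_list *list, struct request *rq)
{
	if (rq->bio) {
		if (list->tail)
			list->tail->bi_next = rq->bio;
		else
			list->head = rq->bio;
		list->tail = rq->biotail;
	}
	rq->bio = NULL;
	rq->biotail = NULL;
	rq->__data_len = 0;
}
```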

* [PATCH 5/9] nvme: track subsystems
  2017-09-25 13:40 ` Christoph Hellwig
@ 2017-09-25 13:40   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This adds a new nvme_subsystem structure so that we can track multiple
controllers that belong to a single subsystem.  For now we only use it
to store the NQN, and to check that we don't have duplicate NQNs unless
the involved subsystems support multiple controllers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c    | 153 +++++++++++++++++++++++++++++++++++---------
 drivers/nvme/host/fabrics.c |   4 +-
 drivers/nvme/host/nvme.h    |  20 ++++--
 3 files changed, 141 insertions(+), 36 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bb2aad078637..15cecb708334 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -71,6 +71,9 @@ MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
 struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
+static LIST_HEAD(nvme_subsystems);
+static DEFINE_MUTEX(nvme_subsystems_lock);
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -1736,14 +1739,15 @@ static bool quirk_matches(const struct nvme_id_ctrl *id,
 		string_matches(id->fr, q->fr, sizeof(id->fr));
 }
 
-static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+static void nvme_init_subnqn(struct nvme_subsystem *subsys, struct nvme_ctrl *ctrl,
+		struct nvme_id_ctrl *id)
 {
 	size_t nqnlen;
 	int off;
 
 	nqnlen = strnlen(id->subnqn, NVMF_NQN_SIZE);
 	if (nqnlen > 0 && nqnlen < NVMF_NQN_SIZE) {
-		strcpy(ctrl->subnqn, id->subnqn);
+		strcpy(subsys->subnqn, id->subnqn);
 		return;
 	}
 
@@ -1751,14 +1755,95 @@ static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		dev_warn(ctrl->device, "missing or invalid SUBNQN field.\n");
 
 	/* Generate a "fake" NQN per Figure 254 in NVMe 1.3 + ECN 001 */
-	off = snprintf(ctrl->subnqn, NVMF_NQN_SIZE,
+	off = snprintf(subsys->subnqn, NVMF_NQN_SIZE,
 			"nqn.2014.08.org.nvmexpress:%4x%4x",
 			le16_to_cpu(id->vid), le16_to_cpu(id->ssvid));
-	memcpy(ctrl->subnqn + off, id->sn, sizeof(id->sn));
+	memcpy(subsys->subnqn + off, id->sn, sizeof(id->sn));
 	off += sizeof(id->sn);
-	memcpy(ctrl->subnqn + off, id->mn, sizeof(id->mn));
+	memcpy(subsys->subnqn + off, id->mn, sizeof(id->mn));
 	off += sizeof(id->mn);
-	memset(ctrl->subnqn + off, 0, sizeof(ctrl->subnqn) - off);
+	memset(subsys->subnqn + off, 0, sizeof(subsys->subnqn) - off);
+}
+
+static void nvme_free_subsystem(struct kref *ref)
+{
+	struct nvme_subsystem *subsys =
+			container_of(ref, struct nvme_subsystem, ref);
+
+	mutex_lock(&nvme_subsystems_lock);
+	list_del(&subsys->entry);
+	mutex_unlock(&nvme_subsystems_lock);
+
+	kfree(subsys);
+}
+
+static void nvme_put_subsystem(struct nvme_subsystem *subsys)
+{
+	kref_put(&subsys->ref, nvme_free_subsystem);
+}
+
+static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
+{
+	struct nvme_subsystem *subsys;
+
+	lockdep_assert_held(&nvme_subsystems_lock);
+
+	list_for_each_entry(subsys, &nvme_subsystems, entry) {
+		if (strcmp(subsys->subnqn, subsysnqn))
+			continue;
+		if (!kref_get_unless_zero(&subsys->ref))
+			continue;
+		return subsys;
+	}
+
+	return NULL;
+}
+
+static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+{
+	struct nvme_subsystem *subsys, *found;
+
+	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
+	if (!subsys)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&subsys->ctrls);
+	kref_init(&subsys->ref);
+	nvme_init_subnqn(subsys, ctrl, id);
+	memcpy(subsys->serial, id->sn, sizeof(subsys->serial));
+	memcpy(subsys->model, id->mn, sizeof(subsys->model));
+	memcpy(subsys->firmware_rev, id->fr, sizeof(subsys->firmware_rev));
+	subsys->vendor_id = le16_to_cpu(id->vid);
+	mutex_init(&subsys->lock);
+
+	mutex_lock(&nvme_subsystems_lock);
+	found = __nvme_find_get_subsystem(subsys->subnqn);
+	if (found) {
+		/*
+		 * Verify that the subsystem actually supports multiple
+		 * controllers, else bail out.
+		 */
+		kfree(subsys);
+		if (!(id->cmic & (1 << 1))) {
+			dev_err(ctrl->device,
+				"ignoring ctrl due to duplicate subnqn (%s).\n",
+				found->subnqn);
+			mutex_unlock(&nvme_subsystems_lock);
+			return -EINVAL;
+		}
+
+		subsys = found;
+	} else {
+		list_add_tail(&subsys->entry, &nvme_subsystems);
+	}
+
+	ctrl->subsys = subsys;
+	mutex_unlock(&nvme_subsystems_lock);
+
+	mutex_lock(&subsys->lock);
+	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
+	mutex_unlock(&subsys->lock);
+
+	return 0;
 }
 
 /*
@@ -1796,9 +1881,13 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		return -EIO;
 	}
 
-	nvme_init_subnqn(ctrl, id);
-
 	if (!ctrl->identified) {
+		int i;
+
+		ret = nvme_init_subsystem(ctrl, id);
+		if (ret)
+			goto out_free;
+
 		/*
 		 * Check for quirks.  Quirk can depend on firmware version,
 		 * so, in principle, the set of quirks present can change
@@ -1807,9 +1896,6 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		 * the device, but we'd have to make sure that the driver
 		 * behaves intelligently if the quirks change.
 		 */
-
-		int i;
-
 		for (i = 0; i < ARRAY_SIZE(core_quirks); i++) {
 			if (quirk_matches(id, &core_quirks[i]))
 				ctrl->quirks |= core_quirks[i].quirks;
@@ -1822,14 +1908,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	}
 
 	ctrl->oacs = le16_to_cpu(id->oacs);
-	ctrl->vid = le16_to_cpu(id->vid);
 	ctrl->oncs = le16_to_cpup(&id->oncs);
 	atomic_set(&ctrl->abort_limit, id->acl + 1);
 	ctrl->vwc = id->vwc;
 	ctrl->cntlid = le16_to_cpup(&id->cntlid);
-	memcpy(ctrl->serial, id->sn, sizeof(id->sn));
-	memcpy(ctrl->model, id->mn, sizeof(id->mn));
-	memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
 	if (id->mdts)
 		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
 	else
@@ -2054,9 +2136,9 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ctrl *ctrl = ns->ctrl;
-	int serial_len = sizeof(ctrl->serial);
-	int model_len = sizeof(ctrl->model);
+	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+	int serial_len = sizeof(subsys->serial);
+	int model_len = sizeof(subsys->model);
 
 	if (!uuid_is_null(&ns->uuid))
 		return sprintf(buf, "uuid.%pU\n", &ns->uuid);
@@ -2067,15 +2149,16 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 	if (memchr_inv(ns->eui, 0, sizeof(ns->eui)))
 		return sprintf(buf, "eui.%8phN\n", ns->eui);
 
-	while (serial_len > 0 && (ctrl->serial[serial_len - 1] == ' ' ||
-				  ctrl->serial[serial_len - 1] == '\0'))
+	while (serial_len > 0 && (subsys->serial[serial_len - 1] == ' ' ||
+				  subsys->serial[serial_len - 1] == '\0'))
 		serial_len--;
-	while (model_len > 0 && (ctrl->model[model_len - 1] == ' ' ||
-				 ctrl->model[model_len - 1] == '\0'))
+	while (model_len > 0 && (subsys->model[model_len - 1] == ' ' ||
+				 subsys->model[model_len - 1] == '\0'))
 		model_len--;
 
-	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", ctrl->vid,
-		serial_len, ctrl->serial, model_len, ctrl->model, ns->ns_id);
+	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
+		serial_len, subsys->serial, model_len, subsys->model,
+		ns->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
@@ -2161,10 +2244,15 @@ static ssize_t  field##_show(struct device *dev,				\
 			    struct device_attribute *attr, char *buf)		\
 {										\
         struct nvme_ctrl *ctrl = dev_get_drvdata(dev);				\
-        return sprintf(buf, "%.*s\n", (int)sizeof(ctrl->field), ctrl->field);	\
+        return sprintf(buf, "%.*s\n",						\
+		(int)sizeof(ctrl->subsys->field), ctrl->subsys->field);		\
 }										\
 static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
 
+nvme_show_str_function(model);
+nvme_show_str_function(serial);
+nvme_show_str_function(firmware_rev);
+
 #define nvme_show_int_function(field)						\
 static ssize_t  field##_show(struct device *dev,				\
 			    struct device_attribute *attr, char *buf)		\
@@ -2174,9 +2262,6 @@ static ssize_t  field##_show(struct device *dev,				\
 }										\
 static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
 
-nvme_show_str_function(model);
-nvme_show_str_function(serial);
-nvme_show_str_function(firmware_rev);
 nvme_show_int_function(cntlid);
 
 static ssize_t nvme_sysfs_delete(struct device *dev,
@@ -2230,7 +2315,7 @@ static ssize_t nvme_sysfs_show_subsysnqn(struct device *dev,
 {
 	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
 
-	return snprintf(buf, PAGE_SIZE, "%s\n", ctrl->subnqn);
* [PATCH 5/9] nvme: track subsystems
@ 2017-09-25 13:40   ` Christoph Hellwig
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)


This adds a new nvme_subsystem structure so that we can track multiple
controllers that belong to a single subsystem.  For now we only use it
to store the NQN, and to check that we don't have duplicate NQNs unless
the involved subsystems support multiple controllers.

Signed-off-by: Christoph Hellwig <hch at lst.de>
---
 drivers/nvme/host/core.c    | 153 +++++++++++++++++++++++++++++++++++---------
 drivers/nvme/host/fabrics.c |   4 +-
 drivers/nvme/host/nvme.h    |  20 ++++--
 3 files changed, 141 insertions(+), 36 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bb2aad078637..15cecb708334 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -71,6 +71,9 @@ MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
 struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
+static LIST_HEAD(nvme_subsystems);
+static DEFINE_MUTEX(nvme_subsystems_lock);
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -1736,14 +1739,15 @@ static bool quirk_matches(const struct nvme_id_ctrl *id,
 		string_matches(id->fr, q->fr, sizeof(id->fr));
 }
 
-static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+static void nvme_init_subnqn(struct nvme_subsystem *subsys, struct nvme_ctrl *ctrl,
+		struct nvme_id_ctrl *id)
 {
 	size_t nqnlen;
 	int off;
 
 	nqnlen = strnlen(id->subnqn, NVMF_NQN_SIZE);
 	if (nqnlen > 0 && nqnlen < NVMF_NQN_SIZE) {
-		strcpy(ctrl->subnqn, id->subnqn);
+		strcpy(subsys->subnqn, id->subnqn);
 		return;
 	}
 
@@ -1751,14 +1755,95 @@ static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		dev_warn(ctrl->device, "missing or invalid SUBNQN field.\n");
 
 	/* Generate a "fake" NQN per Figure 254 in NVMe 1.3 + ECN 001 */
-	off = snprintf(ctrl->subnqn, NVMF_NQN_SIZE,
+	off = snprintf(subsys->subnqn, NVMF_NQN_SIZE,
 			"nqn.2014.08.org.nvmexpress:%4x%4x",
 			le16_to_cpu(id->vid), le16_to_cpu(id->ssvid));
-	memcpy(ctrl->subnqn + off, id->sn, sizeof(id->sn));
+	memcpy(subsys->subnqn + off, id->sn, sizeof(id->sn));
 	off += sizeof(id->sn);
-	memcpy(ctrl->subnqn + off, id->mn, sizeof(id->mn));
+	memcpy(subsys->subnqn + off, id->mn, sizeof(id->mn));
 	off += sizeof(id->mn);
-	memset(ctrl->subnqn + off, 0, sizeof(ctrl->subnqn) - off);
+	memset(subsys->subnqn + off, 0, sizeof(subsys->subnqn) - off);
+}
+
+static void nvme_free_subsystem(struct kref *ref)
+{
+	struct nvme_subsystem *subsys =
+			container_of(ref, struct nvme_subsystem, ref);
+
+	mutex_lock(&nvme_subsystems_lock);
+	list_del(&subsys->entry);
+	mutex_unlock(&nvme_subsystems_lock);
+
+	kfree(subsys);
+}
+
+static void nvme_put_subsystem(struct nvme_subsystem *subsys)
+{
+	kref_put(&subsys->ref, nvme_free_subsystem);
+}
+
+static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
+{
+	struct nvme_subsystem *subsys;
+
+	lockdep_assert_held(&nvme_subsystems_lock);
+
+	list_for_each_entry(subsys, &nvme_subsystems, entry) {
+		if (strcmp(subsys->subnqn, subsysnqn))
+			continue;
+		if (!kref_get_unless_zero(&subsys->ref))
+			continue;
+		return subsys;
+	}
+
+	return NULL;
+}
+
+static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+{
+	struct nvme_subsystem *subsys, *found;
+
+	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
+	if (!subsys)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&subsys->ctrls);
+	kref_init(&subsys->ref);
+	nvme_init_subnqn(subsys, ctrl, id);
+	memcpy(subsys->serial, id->sn, sizeof(subsys->serial));
+	memcpy(subsys->model, id->mn, sizeof(subsys->model));
+	memcpy(subsys->firmware_rev, id->fr, sizeof(subsys->firmware_rev));
+	subsys->vendor_id = le16_to_cpu(id->vid);
+	mutex_init(&subsys->lock);
+
+	mutex_lock(&nvme_subsystems_lock);
+	found = __nvme_find_get_subsystem(subsys->subnqn);
+	if (found) {
+		/*
+		 * Verify that the subsystem actually supports multiple
+		 * controllers, else bail out.
+		 */
+		kfree(subsys);
+		if (!(id->cmic & (1 << 1))) {
+			dev_err(ctrl->device,
+				"ignoring ctrl due to duplicate subnqn (%s).\n",
+				found->subnqn);
+			mutex_unlock(&nvme_subsystems_lock);
+			return -EINVAL;
+		}
+
+		subsys = found;
+	} else {
+		list_add_tail(&subsys->entry, &nvme_subsystems);
+	}
+
+	ctrl->subsys = subsys;
+	mutex_unlock(&nvme_subsystems_lock);
+
+	mutex_lock(&subsys->lock);
+	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
+	mutex_unlock(&subsys->lock);
+
+	return 0;
 }
 
 /*
@@ -1796,9 +1881,13 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		return -EIO;
 	}
 
-	nvme_init_subnqn(ctrl, id);
-
 	if (!ctrl->identified) {
+		int i;
+
+		ret = nvme_init_subsystem(ctrl, id);
+		if (ret)
+			goto out_free;
+
 		/*
 		 * Check for quirks.  Quirk can depend on firmware version,
 		 * so, in principle, the set of quirks present can change
@@ -1807,9 +1896,6 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		 * the device, but we'd have to make sure that the driver
 		 * behaves intelligently if the quirks change.
 		 */
-
-		int i;
-
 		for (i = 0; i < ARRAY_SIZE(core_quirks); i++) {
 			if (quirk_matches(id, &core_quirks[i]))
 				ctrl->quirks |= core_quirks[i].quirks;
@@ -1822,14 +1908,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	}
 
 	ctrl->oacs = le16_to_cpu(id->oacs);
-	ctrl->vid = le16_to_cpu(id->vid);
 	ctrl->oncs = le16_to_cpup(&id->oncs);
 	atomic_set(&ctrl->abort_limit, id->acl + 1);
 	ctrl->vwc = id->vwc;
 	ctrl->cntlid = le16_to_cpup(&id->cntlid);
-	memcpy(ctrl->serial, id->sn, sizeof(id->sn));
-	memcpy(ctrl->model, id->mn, sizeof(id->mn));
-	memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
 	if (id->mdts)
 		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
 	else
@@ -2054,9 +2136,9 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ctrl *ctrl = ns->ctrl;
-	int serial_len = sizeof(ctrl->serial);
-	int model_len = sizeof(ctrl->model);
+	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+	int serial_len = sizeof(subsys->serial);
+	int model_len = sizeof(subsys->model);
 
 	if (!uuid_is_null(&ns->uuid))
 		return sprintf(buf, "uuid.%pU\n", &ns->uuid);
@@ -2067,15 +2149,16 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 	if (memchr_inv(ns->eui, 0, sizeof(ns->eui)))
 		return sprintf(buf, "eui.%8phN\n", ns->eui);
 
-	while (serial_len > 0 && (ctrl->serial[serial_len - 1] == ' ' ||
-				  ctrl->serial[serial_len - 1] == '\0'))
+	while (serial_len > 0 && (subsys->serial[serial_len - 1] == ' ' ||
+				  subsys->serial[serial_len - 1] == '\0'))
 		serial_len--;
-	while (model_len > 0 && (ctrl->model[model_len - 1] == ' ' ||
-				 ctrl->model[model_len - 1] == '\0'))
+	while (model_len > 0 && (subsys->model[model_len - 1] == ' ' ||
+				 subsys->model[model_len - 1] == '\0'))
 		model_len--;
 
-	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", ctrl->vid,
-		serial_len, ctrl->serial, model_len, ctrl->model, ns->ns_id);
+	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
+		serial_len, subsys->serial, model_len, subsys->model,
+		ns->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
@@ -2161,10 +2244,15 @@ static ssize_t  field##_show(struct device *dev,				\
 			    struct device_attribute *attr, char *buf)		\
 {										\
         struct nvme_ctrl *ctrl = dev_get_drvdata(dev);				\
-        return sprintf(buf, "%.*s\n", (int)sizeof(ctrl->field), ctrl->field);	\
+        return sprintf(buf, "%.*s\n",						\
+		(int)sizeof(ctrl->subsys->field), ctrl->subsys->field);		\
 }										\
 static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
 
+nvme_show_str_function(model);
+nvme_show_str_function(serial);
+nvme_show_str_function(firmware_rev);
+
 #define nvme_show_int_function(field)						\
 static ssize_t  field##_show(struct device *dev,				\
 			    struct device_attribute *attr, char *buf)		\
@@ -2174,9 +2262,6 @@ static ssize_t  field##_show(struct device *dev,				\
 }										\
 static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
 
-nvme_show_str_function(model);
-nvme_show_str_function(serial);
-nvme_show_str_function(firmware_rev);
 nvme_show_int_function(cntlid);
 
 static ssize_t nvme_sysfs_delete(struct device *dev,
@@ -2230,7 +2315,7 @@ static ssize_t nvme_sysfs_show_subsysnqn(struct device *dev,
 {
 	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
 
-	return snprintf(buf, PAGE_SIZE, "%s\n", ctrl->subnqn);
+	return snprintf(buf, PAGE_SIZE, "%s\n", ctrl->subsys->subnqn);
 }
 static DEVICE_ATTR(subsysnqn, S_IRUGO, nvme_sysfs_show_subsysnqn, NULL);
 
@@ -2770,12 +2855,22 @@ EXPORT_SYMBOL_GPL(nvme_uninit_ctrl);
 static void nvme_free_ctrl(struct kref *kref)
 {
 	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
+	struct nvme_subsystem *subsys = ctrl->subsys;
 
 	put_device(ctrl->device);
 	nvme_release_instance(ctrl);
 	ida_destroy(&ctrl->ns_ida);
 
+	if (subsys) {
+		mutex_lock(&subsys->lock);
+		list_del(&ctrl->subsys_entry);
+		mutex_unlock(&subsys->lock);
+	}
+
 	ctrl->ops->free_ctrl(ctrl);
+
+	if (subsys)
+		nvme_put_subsystem(subsys);
 }
 
 void nvme_put_ctrl(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 555c976cc2ee..8744ee38ee3d 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -874,10 +874,10 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
 		goto out_unlock;
 	}
 
-	if (strcmp(ctrl->subnqn, opts->subsysnqn)) {
+	if (strcmp(ctrl->subsys->subnqn, opts->subsysnqn)) {
 		dev_warn(ctrl->device,
 			"controller returned incorrect NQN: \"%s\".\n",
-			ctrl->subnqn);
+			ctrl->subsys->subnqn);
 		up_read(&nvmf_transports_rwsem);
 		ctrl->ops->delete_ctrl(ctrl);
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index d3f3c4447515..33677b2e104b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -138,13 +138,12 @@ struct nvme_ctrl {
 	struct ida ns_ida;
 	struct work_struct reset_work;
 
+	struct nvme_subsystem *subsys;
+	struct list_head subsys_entry;
+
 	struct opal_dev *opal_dev;
 
 	char name[12];
-	char serial[20];
-	char model[40];
-	char firmware_rev[8];
-	char subnqn[NVMF_NQN_SIZE];
 	u16 cntlid;
 
 	u32 ctrl_config;
@@ -155,7 +154,6 @@ struct nvme_ctrl {
 	u32 page_size;
 	u32 max_hw_sectors;
 	u16 oncs;
-	u16 vid;
 	u16 oacs;
 	u16 nssa;
 	u16 nr_streams;
@@ -197,6 +195,18 @@ struct nvme_ctrl {
 	struct nvmf_ctrl_options *opts;
 };
 
+struct nvme_subsystem {
+	struct list_head	entry;
+	struct mutex		lock;
+	struct list_head	ctrls;
+	struct kref		ref;
+	char			subnqn[NVMF_NQN_SIZE];
+	char			serial[20];
+	char			model[40];
+	char			firmware_rev[8];
+	u16			vendor_id;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread
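The find-or-create logic in nvme_init_subsystem() above can be reduced to a short userspace sketch. Everything below except the function name it mirrors is invented for illustration: the plain `refcount` field stands in for `struct kref`, the `next` pointer for the kernel's `list_head` chain, and the `nvme_subsystems_lock` mutex is indicated only in comments. The CMIC duplicate-NQN rejection is omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace sketch (not kernel code) of the find-or-create pattern
 * nvme_init_subsystem() uses: allocate a candidate subsystem, then
 * under the global lock either take a reference on an existing entry
 * with the same NQN or publish the new one. */
struct subsystem {
	char nqn[224];
	int refcount;           /* stand-in for struct kref */
	struct subsystem *next; /* stand-in for the nvme_subsystems list */
};

static struct subsystem *subsystems; /* global list, normally lock-protected */

static struct subsystem *find_get_subsystem(const char *nqn)
{
	struct subsystem *s;

	for (s = subsystems; s; s = s->next) {
		if (strcmp(s->nqn, nqn))
			continue;
		s->refcount++; /* kref_get_unless_zero() in the kernel */
		return s;
	}
	return NULL;
}

struct subsystem *get_or_create_subsystem(const char *nqn)
{
	struct subsystem *found, *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	strncpy(s->nqn, nqn, sizeof(s->nqn) - 1);
	s->refcount = 1;

	/* kernel: mutex_lock(&nvme_subsystems_lock); */
	found = find_get_subsystem(nqn);
	if (found) {
		free(s); /* another controller got here first: reuse its subsystem */
		s = found;
	} else {
		s->next = subsystems;
		subsystems = s;
	}
	/* kernel: mutex_unlock(&nvme_subsystems_lock); */
	return s;
}
```

The allocate-first, lock-and-check-second ordering keeps the allocation out of the critical section; the loser of a race simply frees its candidate and shares the published subsystem, which is exactly what lets two controllers end up on one `subsys->ctrls` list.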

* [PATCH 6/9] nvme: introduce a nvme_ns_ids structure
  2017-09-25 13:40 ` Christoph Hellwig
@ 2017-09-25 13:40   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This allows us to manage the various unique namespace identifiers
together instead of needing separate variables and arguments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 69 +++++++++++++++++++++++++++---------------------
 drivers/nvme/host/nvme.h | 14 +++++++---
 2 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 15cecb708334..bcfe8ea1c2bb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -781,7 +781,7 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 }
 
 static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
-		u8 *eui64, u8 *nguid, uuid_t *uuid)
+		struct nvme_ns_ids *ids)
 {
 	struct nvme_command c = { };
 	int status;
@@ -817,7 +817,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 				goto free_data;
 			}
 			len = NVME_NIDT_EUI64_LEN;
-			memcpy(eui64, data + pos + sizeof(*cur), len);
+			memcpy(ids->eui64, data + pos + sizeof(*cur), len);
 			break;
 		case NVME_NIDT_NGUID:
 			if (cur->nidl != NVME_NIDT_NGUID_LEN) {
@@ -827,7 +827,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 				goto free_data;
 			}
 			len = NVME_NIDT_NGUID_LEN;
-			memcpy(nguid, data + pos + sizeof(*cur), len);
+			memcpy(ids->nguid, data + pos + sizeof(*cur), len);
 			break;
 		case NVME_NIDT_UUID:
 			if (cur->nidl != NVME_NIDT_UUID_LEN) {
@@ -837,7 +837,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 				goto free_data;
 			}
 			len = NVME_NIDT_UUID_LEN;
-			uuid_copy(uuid, data + pos + sizeof(*cur));
+			uuid_copy(&ids->uuid, data + pos + sizeof(*cur));
 			break;
 		default:
 			/* Skip unknown types */
@@ -1178,22 +1178,31 @@ static void nvme_config_discard(struct nvme_ns *ns)
 }
 
 static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
-		struct nvme_id_ns *id, u8 *eui64, u8 *nguid, uuid_t *uuid)
+		struct nvme_id_ns *id, struct nvme_ns_ids *ids)
 {
+	memset(ids, 0, sizeof(*ids));
+
 	if (ctrl->vs >= NVME_VS(1, 1, 0))
-		memcpy(eui64, id->eui64, sizeof(id->eui64));
+		memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
 	if (ctrl->vs >= NVME_VS(1, 2, 0))
-		memcpy(nguid, id->nguid, sizeof(id->nguid));
+		memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
 	if (ctrl->vs >= NVME_VS(1, 3, 0)) {
 		 /* Don't treat error as fatal we potentially
 		  * already have a NGUID or EUI-64
 		  */
-		if (nvme_identify_ns_descs(ctrl, nsid, eui64, nguid, uuid))
+		if (nvme_identify_ns_descs(ctrl, nsid, ids))
 			dev_warn(ctrl->device,
 				 "%s: Identify Descriptors failed\n", __func__);
 	}
 }
 
+static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
+{
+	return uuid_equal(&a->uuid, &b->uuid) &&
+		memcmp(&a->nguid, &b->nguid, sizeof(a->nguid)) == 0 &&
+		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0;
+}
+
 static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 {
 	struct nvme_ns *ns = disk->private_data;
@@ -1234,8 +1243,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	struct nvme_ns *ns = disk->private_data;
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	struct nvme_id_ns *id;
-	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
-	uuid_t uuid = uuid_null;
+	struct nvme_ns_ids ids;
 	int ret = 0;
 
 	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
@@ -1252,10 +1260,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto out;
 	}
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
-	if (!uuid_equal(&ns->uuid, &uuid) ||
-	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
-	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
+	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ids);
+	if (!nvme_ns_ids_equal(&ns->ids, &ids)) {
 		dev_err(ctrl->device,
 			"identifiers changed for nsid %d\n", ns->ns_id);
 		ret = -ENODEV;
@@ -2136,18 +2142,19 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+	struct nvme_ns_ids *ids = &ns->ids;
 	struct nvme_subsystem *subsys = ns->ctrl->subsys;
 	int serial_len = sizeof(subsys->serial);
 	int model_len = sizeof(subsys->model);
 
-	if (!uuid_is_null(&ns->uuid))
-		return sprintf(buf, "uuid.%pU\n", &ns->uuid);
+	if (!uuid_is_null(&ids->uuid))
+		return sprintf(buf, "uuid.%pU\n", &ids->uuid);
 
-	if (memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
-		return sprintf(buf, "eui.%16phN\n", ns->nguid);
+	if (memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
+		return sprintf(buf, "eui.%16phN\n", ids->nguid);
 
-	if (memchr_inv(ns->eui, 0, sizeof(ns->eui)))
-		return sprintf(buf, "eui.%8phN\n", ns->eui);
+	if (memchr_inv(ids->eui64, 0, sizeof(ids->eui64)))
+		return sprintf(buf, "eui.%8phN\n", ids->eui64);
 
 	while (serial_len > 0 && (subsys->serial[serial_len - 1] == ' ' ||
 				  subsys->serial[serial_len - 1] == '\0'))
@@ -2166,7 +2173,7 @@ static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
 			  char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->nguid);
+	return sprintf(buf, "%pU\n", ns->ids.nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
@@ -2174,16 +2181,17 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+	struct nvme_ns_ids *ids = &ns->ids;
 
 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
 	 */
-	if (uuid_is_null(&ns->uuid)) {
+	if (uuid_is_null(&ids->uuid)) {
 		printk_ratelimited(KERN_WARNING
 				   "No UUID available providing old NGUID\n");
-		return sprintf(buf, "%pU\n", ns->nguid);
+		return sprintf(buf, "%pU\n", ids->nguid);
 	}
-	return sprintf(buf, "%pU\n", &ns->uuid);
+	return sprintf(buf, "%pU\n", &ids->uuid);
 }
 static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
 
@@ -2191,7 +2199,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->eui);
+	return sprintf(buf, "%8phd\n", ns->ids.eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
@@ -2217,18 +2225,19 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+	struct nvme_ns_ids *ids = &ns->ids;
 
 	if (a == &dev_attr_uuid.attr) {
-		if (uuid_is_null(&ns->uuid) ||
-		    !memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
+		if (uuid_is_null(&ids->uuid) ||
+		    !memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
 			return 0;
 	}
 	if (a == &dev_attr_nguid.attr) {
-		if (!memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
+		if (!memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
 			return 0;
 	}
 	if (a == &dev_attr_eui.attr) {
-		if (!memchr_inv(ns->eui, 0, sizeof(ns->eui)))
+		if (!memchr_inv(ids->eui64, 0, sizeof(ids->eui64)))
 			return 0;
 	}
 	return a->mode;
@@ -2460,7 +2469,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
+	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ns->ids);
 
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
 		if (nvme_nvm_register(ns, disk_name, node)) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 33677b2e104b..505cd310dc34 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -207,6 +207,15 @@ struct nvme_subsystem {
 	u16			vendor_id;
 };
 
+/*
+ * Container structure for unique namespace identifiers.
+ */
+struct nvme_ns_ids {
+	u8	eui64[8];
+	u8	nguid[16];
+	uuid_t	uuid;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
@@ -217,11 +226,8 @@ struct nvme_ns {
 	struct kref kref;
 	int instance;
 
-	u8 eui[8];
-	u8 nguid[16];
-	uuid_t uuid;
-
 	unsigned ns_id;
+	struct nvme_ns_ids ids;
 	int lba_shift;
 	u16 ms;
 	u16 sgs;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread
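The value of the container is easiest to see in a few lines of plain C. The sketch below is illustrative, not the kernel definitions: `uuid_t` is flattened to a raw byte array, and the `uuid_equal()` plus two `memcmp()` calls that the patch folds into nvme_ns_ids_equal() become one helper over the grouped fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Userspace sketch of the identifier container: grouping EUI-64,
 * NGUID and UUID lets callers pass and compare one struct instead of
 * three separate buffers. */
struct ns_ids {
	uint8_t eui64[8];
	uint8_t nguid[16];
	uint8_t uuid[16]; /* the kernel uses uuid_t and uuid_equal() */
};

static bool ns_ids_equal(const struct ns_ids *a, const struct ns_ids *b)
{
	return memcmp(a->eui64, b->eui64, sizeof(a->eui64)) == 0 &&
	       memcmp(a->nguid, b->nguid, sizeof(a->nguid)) == 0 &&
	       memcmp(a->uuid, b->uuid, sizeof(a->uuid)) == 0;
}
```

With the three identifiers in one struct, revalidation only has to fill a stack-local copy and compare it against the stored one, which is what nvme_revalidate_disk() does after this patch.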

* [PATCH 7/9] nvme: track shared namespaces
  2017-09-25 13:40 ` Christoph Hellwig
@ 2017-09-25 13:40   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

Introduce a new struct nvme_ns_head [1] that holds information about
an actual namespace, unlike struct nvme_ns, which only holds the
per-controller namespace information.  For private namespaces there
is a 1:1 relation between the two, but for shared namespaces this lets us
discover all the paths to it.  For now only the identifiers are moved
to the new structure, but most of the information in struct nvme_ns
should eventually move over.

To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.

[1] comments welcome if you have a better name for it, the current one is
    horrible.  One idea would be to rename the current struct nvme_ns
    to struct nvme_ns_link or similar and use the nvme_ns name for the
    new structure.  But that would involve a lot of churn.
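The lockless lookup the paragraph above describes can be reduced to a hypothetical userspace sketch. All names below are invented; the SRCU read-side protection that makes the kernel walk safe against concurrent path removal is only indicated in comments, since that is precisely the part plain userspace C does not reproduce here.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal sketch of the path lookup: an ns_head keeps a list of
 * per-controller paths, and the I/O path walks it to pick a live one.
 * In the kernel the walk runs under srcu_read_lock(), so lookups
 * never take the head's lock; writers remove entries and free them
 * via call_srcu() only after all readers have moved on. */
struct path {
	int live;          /* 1 if the controller behind this path is usable */
	struct path *next; /* SRCU-protected list in the kernel */
};

struct ns_head {
	struct path *paths;
};

static struct path *find_live_path(struct ns_head *head)
{
	struct path *p;

	/* kernel: idx = srcu_read_lock(&head->srcu); */
	for (p = head->paths; p; p = p->next) {
		if (p->live)
			return p;
	}
	/* kernel: srcu_read_unlock(&head->srcu, idx); */
	return NULL;
}
```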

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c     | 192 +++++++++++++++++++++++++++++++++++++------
 drivers/nvme/host/lightnvm.c |  14 ++--
 drivers/nvme/host/nvme.h     |  21 ++++-
 3 files changed, 192 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bcfe8ea1c2bb..c1be5a0a69b1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -247,10 +247,28 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
 
+static void nvme_free_ns_head(struct kref *ref)
+{
+	struct nvme_ns_head *head =
+		container_of(ref, struct nvme_ns_head, ref);
+
+	list_del_init(&head->entry);
+	cleanup_srcu_struct(&head->srcu);
+	kfree(head);
+}
+
+static void nvme_put_ns_head(struct nvme_ns_head *head)
+{
+	kref_put(&head->ref, nvme_free_ns_head);
+}
+
 static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
 
+	if (ns->head)
+		nvme_put_ns_head(ns->head);
+
 	if (ns->ndev)
 		nvme_nvm_unregister(ns);
 
@@ -420,7 +438,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 {
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->common.opcode = nvme_cmd_flush;
-	cmnd->common.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
@@ -451,7 +469,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->dsm.opcode = nvme_cmd_dsm;
-	cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->dsm.nr = cpu_to_le32(segments - 1);
 	cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
@@ -490,7 +508,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
-	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
@@ -971,7 +989,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	memset(&c, 0, sizeof(c));
 	c.rw.opcode = io.opcode;
 	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.rw.slba = cpu_to_le64(io.slba);
 	c.rw.length = cpu_to_le16(io.nblocks);
 	c.rw.control = cpu_to_le16(io.control);
@@ -1036,7 +1054,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	switch (cmd) {
 	case NVME_IOCTL_ID:
 		force_successful_syscall_return();
-		return ns->ns_id;
+		return ns->head->ns_id;
 	case NVME_IOCTL_ADMIN_CMD:
 		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
 	case NVME_IOCTL_IO_CMD:
@@ -1196,6 +1214,13 @@ static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
 	}
 }
 
+static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
+{
+	return !uuid_is_null(&ids->uuid) ||
+		memchr_inv(ids->nguid, 0, sizeof(ids->nguid)) ||
+		memchr_inv(ids->eui64, 0, sizeof(ids->eui64));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1251,7 +1276,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		return -ENODEV;
 	}
 
-	id = nvme_identify_ns(ctrl, ns->ns_id);
+	id = nvme_identify_ns(ctrl, ns->head->ns_id);
 	if (!id)
 		return -ENODEV;
 
@@ -1260,10 +1285,10 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto out;
 	}
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ids);
-	if (!nvme_ns_ids_equal(&ns->ids, &ids)) {
+	nvme_report_ns_ids(ctrl, ns->head->ns_id, id, &ids);
+	if (!nvme_ns_ids_equal(&ns->head->ids, &ids)) {
 		dev_err(ctrl->device,
-			"identifiers changed for nsid %d\n", ns->ns_id);
+			"identifiers changed for nsid %d\n", ns->head->ns_id);
 		ret = -ENODEV;
 	}
 
@@ -1304,7 +1329,7 @@ static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = op;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw10[0] = cpu_to_le32(cdw10);
 
 	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
@@ -1813,6 +1838,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	if (!subsys)
 		return -ENOMEM;
 	INIT_LIST_HEAD(&subsys->ctrls);
+	INIT_LIST_HEAD(&subsys->nsheads);
 	kref_init(&subsys->ref);
 	nvme_init_subnqn(subsys, ctrl, id);
 	memcpy(subsys->serial, id->sn, sizeof(subsys->serial));
@@ -2142,7 +2168,7 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->ids;
+	struct nvme_ns_ids *ids = &ns->head->ids;
 	struct nvme_subsystem *subsys = ns->ctrl->subsys;
 	int serial_len = sizeof(subsys->serial);
 	int model_len = sizeof(subsys->model);
@@ -2165,7 +2191,7 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 
 	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
 		serial_len, subsys->serial, model_len, subsys->model,
-		ns->ns_id);
+		ns->head->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
@@ -2173,7 +2199,7 @@ static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
 			  char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->ids.nguid);
+	return sprintf(buf, "%pU\n", ns->head->ids.nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
@@ -2181,7 +2207,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->ids;
+	struct nvme_ns_ids *ids = &ns->head->ids;
 
 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
@@ -2199,7 +2225,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->ids.eui64);
+	return sprintf(buf, "%8phd\n", ns->head->ids.eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
@@ -2207,7 +2233,7 @@ static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%d\n", ns->ns_id);
+	return sprintf(buf, "%d\n", ns->head->ns_id);
 }
 static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
 
@@ -2225,7 +2251,7 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->ids;
+	struct nvme_ns_ids *ids = &ns->head->ids;
 
 	if (a == &dev_attr_uuid.attr) {
 		if (uuid_is_null(&ids->uuid) ||
@@ -2377,12 +2403,114 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
 	NULL,
 };
 
+static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
+		unsigned nsid)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if (h->ns_id == nsid && kref_get_unless_zero(&h->ref))
+			return h;
+	}
+
+	return NULL;
+}
+
+static int __nvme_check_ids(struct nvme_subsystem *subsys,
+		struct nvme_ns_head *new)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if (nvme_ns_ids_valid(&new->ids) &&
+		    nvme_ns_ids_equal(&new->ids, &h->ids))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
+		unsigned nsid, struct nvme_id_ns *id)
+{
+	struct nvme_ns_head *head;
+	int ret = -ENOMEM;
+
+	head = kzalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		goto out;
+
+	INIT_LIST_HEAD(&head->list);
+	head->ns_id = nsid;
+	init_srcu_struct(&head->srcu);
+	kref_init(&head->ref);
+
+	nvme_report_ns_ids(ctrl, nsid, id, &head->ids);
+
+	ret = __nvme_check_ids(ctrl->subsys, head);
+	if (ret) {
+		dev_err(ctrl->device,
+			"duplicate IDs for nsid %d\n", nsid);
+		goto out_free_head;
+	}
+
+	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
+	return head;
+out_free_head:
+	cleanup_srcu_struct(&head->srcu);
+	kfree(head);
+out:
+	return ERR_PTR(ret);
+}
+
+static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
+		struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	bool is_shared = id->nmic & (1 << 0);
+	struct nvme_ns_head *head = NULL;
+	int ret = 0;
+
+	mutex_lock(&ctrl->subsys->lock);
+	if (is_shared)
+		head = __nvme_find_ns_head(ctrl->subsys, nsid);
+	if (!head) {
+		head = nvme_alloc_ns_head(ctrl, nsid, id);
+		if (IS_ERR(head)) {
+			ret = PTR_ERR(head);
+			goto out_unlock;
+		}
+	} else {
+		struct nvme_ns_ids ids;
+
+		nvme_report_ns_ids(ctrl, nsid, id, &ids);
+		if (!nvme_ns_ids_equal(&head->ids, &ids)) {
+			dev_err(ctrl->device,
+				"IDs don't match for shared namespace %d\n",
+					nsid);
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
+	list_add_tail(&ns->siblings, &head->list);
+	ns->head = head;
+
+out_unlock:
+	mutex_unlock(&ctrl->subsys->lock);
+	return ret;
+}
+
 static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
 {
 	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
 	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
 
-	return nsa->ns_id - nsb->ns_id;
+	return nsa->head->ns_id - nsb->head->ns_id;
 }
 
 static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
@@ -2391,12 +2519,12 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->ns_id == nsid) {
+		if (ns->head->ns_id == nsid) {
 			kref_get(&ns->kref);
 			ret = ns;
 			break;
 		}
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			break;
 	}
 	mutex_unlock(&ctrl->namespaces_mutex);
@@ -2411,7 +2539,7 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
 	if (!ctrl->nr_streams)
 		return 0;
 
-	ret = nvme_get_stream_params(ctrl, &s, ns->ns_id);
+	ret = nvme_get_stream_params(ctrl, &s, ns->head->ns_id);
 	if (ret)
 		return ret;
 
@@ -2453,7 +2581,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	ns->ctrl = ctrl;
 
 	kref_init(&ns->kref);
-	ns->ns_id = nsid;
 	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
 
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
@@ -2469,18 +2596,19 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ns->ids);
+	if (nvme_init_ns_head(ns, nsid, id))
+		goto out_free_id;
 
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
 		if (nvme_nvm_register(ns, disk_name, node)) {
 			dev_warn(ctrl->device, "LightNVM init failure\n");
-			goto out_free_id;
+			goto out_unlink_ns;
 		}
 	}
 
 	disk = alloc_disk_node(0, node);
 	if (!disk)
-		goto out_free_id;
+		goto out_unlink_ns;
 
 	disk->fops = &nvme_fops;
 	disk->private_data = ns;
@@ -2508,6 +2636,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
 	return;
+ out_unlink_ns:
+	mutex_lock(&ctrl->subsys->lock);
+	list_del_rcu(&ns->siblings);
+	mutex_unlock(&ctrl->subsys->lock);
  out_free_id:
 	kfree(id);
  out_free_queue:
@@ -2520,6 +2652,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
+	struct nvme_ns_head *head = ns->head;
+
 	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
 		return;
 
@@ -2534,10 +2668,16 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 		blk_cleanup_queue(ns->queue);
 	}
 
+	mutex_lock(&ns->ctrl->subsys->lock);
+	if (head)
+		list_del_rcu(&ns->siblings);
+	mutex_unlock(&ns->ctrl->subsys->lock);
+
 	mutex_lock(&ns->ctrl->namespaces_mutex);
 	list_del_init(&ns->list);
 	mutex_unlock(&ns->ctrl->namespaces_mutex);
 
+	synchronize_srcu(&head->srcu);
 	nvme_put_ns(ns);
 }
 
@@ -2560,7 +2700,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 	struct nvme_ns *ns, *next;
 
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			nvme_ns_remove(ns);
 	}
 }
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 1f79e3f141e6..44e46276319c 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -305,7 +305,7 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct nvm_id *nvm_id)
 	int ret;
 
 	c.identity.opcode = nvme_nvm_admin_identity;
-	c.identity.nsid = cpu_to_le32(ns->ns_id);
+	c.identity.nsid = cpu_to_le32(ns->head->ns_id);
 	c.identity.chnl_off = 0;
 
 	nvme_nvm_id = kmalloc(sizeof(struct nvme_nvm_id), GFP_KERNEL);
@@ -344,7 +344,7 @@ static int nvme_nvm_get_l2p_tbl(struct nvm_dev *nvmdev, u64 slba, u32 nlb,
 	int ret = 0;
 
 	c.l2p.opcode = nvme_nvm_admin_get_l2p_tbl;
-	c.l2p.nsid = cpu_to_le32(ns->ns_id);
+	c.l2p.nsid = cpu_to_le32(ns->head->ns_id);
 	entries = kmalloc(len, GFP_KERNEL);
 	if (!entries)
 		return -ENOMEM;
@@ -402,7 +402,7 @@ static int nvme_nvm_get_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr ppa,
 	int ret = 0;
 
 	c.get_bb.opcode = nvme_nvm_admin_get_bb_tbl;
-	c.get_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.get_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.get_bb.spba = cpu_to_le64(ppa.ppa);
 
 	bb_tbl = kzalloc(tblsz, GFP_KERNEL);
@@ -452,7 +452,7 @@ static int nvme_nvm_set_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr *ppas,
 	int ret = 0;
 
 	c.set_bb.opcode = nvme_nvm_admin_set_bb_tbl;
-	c.set_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.set_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.set_bb.spba = cpu_to_le64(ppas->ppa);
 	c.set_bb.nlb = cpu_to_le16(nr_ppas - 1);
 	c.set_bb.value = type;
@@ -469,7 +469,7 @@ static inline void nvme_nvm_rqtocmd(struct nvm_rq *rqd, struct nvme_ns *ns,
 				    struct nvme_nvm_command *c)
 {
 	c->ph_rw.opcode = rqd->opcode;
-	c->ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c->ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c->ph_rw.spba = cpu_to_le64(rqd->ppa_addr.ppa);
 	c->ph_rw.metadata = cpu_to_le64(rqd->dma_meta_list);
 	c->ph_rw.control = cpu_to_le16(rqd->flags);
@@ -691,7 +691,7 @@ static int nvme_nvm_submit_vio(struct nvme_ns *ns,
 
 	memset(&c, 0, sizeof(c));
 	c.ph_rw.opcode = vio.opcode;
-	c.ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c.ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.ph_rw.control = cpu_to_le16(vio.control);
 	c.ph_rw.length = cpu_to_le16(vio.nppas);
 
@@ -728,7 +728,7 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = vcmd.opcode;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw2[0] = cpu_to_le32(vcmd.cdw2);
 	c.common.cdw2[1] = cpu_to_le32(vcmd.cdw3);
 	/* cdw11-12 */
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 505cd310dc34..da30df2668f5 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -199,6 +199,7 @@ struct nvme_subsystem {
 	struct list_head	entry;
 	struct mutex		lock;
 	struct list_head	ctrls;
+	struct list_head	nsheads;
 	struct kref		ref;
 	char			subnqn[NVMF_NQN_SIZE];
 	char			serial[20];
@@ -216,18 +217,34 @@ struct nvme_ns_ids {
 	uuid_t	uuid;
 };
 
+/*
+ * Anchor structure for namespaces.  There is one for each namespace in an
+ * NVMe subsystem that any of our controllers can see, and the namespace
+ * structure for each controller is chained off it.  For private namespaces
+ * there is a 1:1 relation to our namespace structures, that is ->list
+ * only ever has a single entry.
+ */
+struct nvme_ns_head {
+	struct list_head	list;
+	struct srcu_struct      srcu;
+	unsigned		ns_id;
+	struct nvme_ns_ids	ids;
+	struct list_head	entry;
+	struct kref		ref;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
 	struct nvme_ctrl *ctrl;
 	struct request_queue *queue;
 	struct gendisk *disk;
+	struct list_head siblings;
 	struct nvm_dev *ndev;
 	struct kref kref;
+	struct nvme_ns_head *head;
 	int instance;
 
-	unsigned ns_id;
-	struct nvme_ns_ids ids;
 	int lba_shift;
 	u16 ms;
 	u16 sgs;
-- 
2.14.1


* [PATCH 7/9] nvme: track shared namespaces
@ 2017-09-25 13:40   ` Christoph Hellwig
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)


Introduce a new struct nvme_ns_head [1] that holds information about
an actual namespace, unlike struct nvme_ns, which only holds the
per-controller namespace information.  For private namespaces there
is a 1:1 relation between the two, but for shared namespaces this lets
us discover all the paths to it.  For now only the identifiers are moved
to the new structure, but most of the information in struct nvme_ns
should eventually move over.

To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.

[1] Comments welcome if you have a better name for it; the current one is
    horrible.  One idea would be to rename the current struct nvme_ns
    to struct nvme_ns_link or similar and use the nvme_ns name for the
    new structure.  But that would involve a lot of churn.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c     | 192 +++++++++++++++++++++++++++++++++++++------
 drivers/nvme/host/lightnvm.c |  14 ++--
 drivers/nvme/host/nvme.h     |  21 ++++-
 3 files changed, 192 insertions(+), 35 deletions(-)


* [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-25 13:40 ` Christoph Hellwig
@ 2017-09-25 13:40   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

This patch adds initial multipath support to the nvme driver.  For each
namespace we create a new block device node, which can be used to access
that namespace through any of the controllers that refer to it.

Currently we will always send I/O to the first available path; this will
be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
ratified and implemented, at which point we will look at the ANA state
for each namespace.  Another possibility that was prototyped is to
use the path that is closest to the submitting NUMA node, which will be
mostly interesting for PCI, but might also be useful for RDMA or FC
transports in the future.  There is no plan to implement round robin
or I/O service time path selectors, as those are not scalable with
the performance rates provided by NVMe.

The multipath device will go away once all paths to it disappear;
any delay to keep it alive needs to be implemented at the controller
level.

The new block device nodes for multipath access will show up as

	/dev/nvm-subXnZ

where X is the local instance number of the subsystem, and Z is the index
for the namespace.

To get persistent device names, the following lines can be added to
/lib/udev/rules.d/60-persistent-storage.rules:

---------------------------------- snip ----------------------------------
KERNEL=="nvm-sub*[0-9]n*[0-9]", ATTR{wwid}=="?*", SYMLINK+="disk/by-id/nvme-sub-$attr{wwid}"
KERNEL=="nvm-sub*[0-9]n*[0-9]p*[0-9]", ENV{DEVTYPE}=="partition", ATTRS{wwid}=="?*", SYMLINK+="disk/by-id/nvme-sub-$attr{wwid}-part%n"
KERNEL=="nvm-sub*[0-9]n*[0-9]", ENV{DEVTYPE}=="disk", ATTRS{wwid}=="?*", ENV{ID_WWN}="$attr{wwid}"
---------------------------------- snip ----------------------------------

Note that these create the new persistent names.  Overriding the existing
nvme ones would be nicer, but while that works for the first path, the
normal rule will override it again for each subsequent path.
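The KERNEL== matches in those rules use shell-style globbing.  As a rough
sketch of which device names they select (the device names below are
made-up examples, and fnmatch only approximates udev's matcher):

```python
from fnmatch import fnmatch

# Glob patterns copied from the udev rules above; KERNEL== matching in
# udev is shell-glob style, which fnmatch approximates closely enough.
DISK = "nvm-sub*[0-9]n*[0-9]"
PART = "nvm-sub*[0-9]n*[0-9]p*[0-9]"

# Hypothetical names for subsystem instance 0, namespace 1:
assert fnmatch("nvm-sub0n1", DISK)       # whole-disk multipath node
assert fnmatch("nvm-sub0n1p2", PART)     # a partition on that node
# A partition name also satisfies the whole-disk glob, which is why the
# rules additionally check ENV{DEVTYPE} to tell disks and partitions apart:
assert fnmatch("nvm-sub0n1p2", DISK)
# The regular per-controller nodes keep their existing names and rules:
assert not fnmatch("nvme0n1", DISK)
print("glob checks passed")
```

This is only an illustration of the name patterns; the actual symlink
creation depends on the wwid attribute being present as described above.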

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 277 +++++++++++++++++++++++++++++++++++++++++++----
 drivers/nvme/host/nvme.h |  11 ++
 2 files changed, 269 insertions(+), 19 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c1be5a0a69b1..faef0241c6b5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -72,6 +72,7 @@ struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
 static LIST_HEAD(nvme_subsystems);
+static DEFINE_IDA(nvme_subsystems_ida);
 static DEFINE_MUTEX(nvme_subsystems_lock);
 
 static LIST_HEAD(nvme_ctrl_list);
@@ -104,6 +105,20 @@ static int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl)
 	return ret;
 }
 
+static void nvme_failover_req(struct request *req)
+{
+	struct nvme_ns *ns = req->q->queuedata;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ns->head->requeue_lock, flags);
+	blk_steal_bios(&ns->head->requeue_list, req);
+	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
+	blk_mq_end_request(req, 0);
+
+	nvme_reset_ctrl(ns->ctrl);
+	kblockd_schedule_work(&ns->head->requeue_work);
+}
+
 static blk_status_t nvme_error_status(struct request *req)
 {
 	switch (nvme_req(req)->status & 0x7ff) {
@@ -131,6 +146,53 @@ static blk_status_t nvme_error_status(struct request *req)
 	}
 }
 
+static bool nvme_req_needs_failover(struct request *req)
+{
+	if (!(req->cmd_flags & REQ_NVME_MPATH))
+		return false;
+
+	switch (nvme_req(req)->status & 0x7ff) {
+	/*
+	 * Generic command status:
+	 */
+	case NVME_SC_INVALID_OPCODE:
+	case NVME_SC_INVALID_FIELD:
+	case NVME_SC_INVALID_NS:
+	case NVME_SC_LBA_RANGE:
+	case NVME_SC_CAP_EXCEEDED:
+	case NVME_SC_RESERVATION_CONFLICT:
+		return false;
+
+	/*
+	 * I/O command set specific error.  Unfortunately these values are
+	 * reused for fabrics commands, but those should never get here.
+	 */
+	case NVME_SC_BAD_ATTRIBUTES:
+	case NVME_SC_INVALID_PI:
+	case NVME_SC_READ_ONLY:
+	case NVME_SC_ONCS_NOT_SUPPORTED:
+		WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
+			nvme_fabrics_command);
+		return false;
+
+	/*
+	 * Media and Data Integrity Errors:
+	 */
+	case NVME_SC_WRITE_FAULT:
+	case NVME_SC_READ_ERROR:
+	case NVME_SC_GUARD_CHECK:
+	case NVME_SC_APPTAG_CHECK:
+	case NVME_SC_REFTAG_CHECK:
+	case NVME_SC_COMPARE_FAILED:
+	case NVME_SC_ACCESS_DENIED:
+	case NVME_SC_UNWRITTEN_BLOCK:
+		return false;
+	}
+
+	/* Everything else could be a path failure, so should be retried */
+	return true;
+}
+
 static inline bool nvme_req_needs_retry(struct request *req)
 {
 	if (blk_noretry_request(req))
@@ -145,6 +207,11 @@ static inline bool nvme_req_needs_retry(struct request *req)
 void nvme_complete_rq(struct request *req)
 {
 	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
+		if (nvme_req_needs_failover(req)) {
+			nvme_failover_req(req);
+			return;
+		}
+
 		nvme_req(req)->retries++;
 		blk_mq_requeue_request(req, true);
 		return;
@@ -173,6 +240,18 @@ void nvme_cancel_request(struct request *req, void *data, bool reserved)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
+static void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (ns->head)
+			kblockd_schedule_work(&ns->head->requeue_work);
+	}
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
@@ -240,9 +319,10 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 
 	if (changed)
 		ctrl->state = new_state;
-
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 
+	if (changed && ctrl->state == NVME_CTRL_LIVE)
+		nvme_kick_requeue_lists(ctrl);
 	return changed;
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -252,6 +332,14 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
+	del_gendisk(head->disk);
+	blk_set_queue_dying(head->disk->queue);
+	/* make sure all pending bios are cleaned up */
+	kblockd_schedule_work(&head->requeue_work);
+	flush_work(&head->requeue_work);
+	blk_cleanup_queue(head->disk->queue);
+	put_disk(head->disk);
+
 	list_del_init(&head->entry);
 	cleanup_srcu_struct(&head->srcu);
 	kfree(head);
@@ -1123,8 +1211,10 @@ static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id,
 	if (blk_get_integrity(disk) &&
 	    (ns->pi_type != pi_type || ns->ms != old_ms ||
 	     bs != queue_logical_block_size(disk->queue) ||
-	     (ns->ms && ns->ext)))
+	     (ns->ms && ns->ext))) {
 		blk_integrity_unregister(disk);
+		blk_integrity_unregister(ns->head->disk);
+	}
 
 	ns->pi_type = pi_type;
 }
@@ -1152,7 +1242,9 @@ static void nvme_init_integrity(struct nvme_ns *ns)
 	}
 	integrity.tuple_size = ns->ms;
 	blk_integrity_register(ns->disk, &integrity);
+	blk_integrity_register(ns->head->disk, &integrity);
 	blk_queue_max_integrity_segments(ns->queue, 1);
+	blk_queue_max_integrity_segments(ns->head->disk->queue, 1);
 }
 #else
 static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id,
@@ -1170,7 +1262,7 @@ static void nvme_set_chunk_size(struct nvme_ns *ns)
 	blk_queue_chunk_sectors(ns->queue, rounddown_pow_of_two(chunk_size));
 }
 
-static void nvme_config_discard(struct nvme_ns *ns)
+static void nvme_config_discard(struct nvme_ns *ns, struct request_queue *queue)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	u32 logical_block_size = queue_logical_block_size(ns->queue);
@@ -1181,18 +1273,18 @@ static void nvme_config_discard(struct nvme_ns *ns)
 	if (ctrl->nr_streams && ns->sws && ns->sgs) {
 		unsigned int sz = logical_block_size * ns->sws * ns->sgs;
 
-		ns->queue->limits.discard_alignment = sz;
-		ns->queue->limits.discard_granularity = sz;
+		queue->limits.discard_alignment = sz;
+		queue->limits.discard_granularity = sz;
 	} else {
 		ns->queue->limits.discard_alignment = logical_block_size;
 		ns->queue->limits.discard_granularity = logical_block_size;
 	}
-	blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
-	blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
-	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+	blk_queue_max_discard_sectors(queue, UINT_MAX);
+	blk_queue_max_discard_segments(queue, NVME_DSM_MAX_RANGES);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, queue);
 
 	if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
-		blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX);
+		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
 static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
@@ -1249,17 +1341,25 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
 		nvme_prep_integrity(disk, id, bs);
 	blk_queue_logical_block_size(ns->queue, bs);
+	blk_queue_logical_block_size(ns->head->disk->queue, bs);
 	if (ns->noiob)
 		nvme_set_chunk_size(ns);
 	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
 		nvme_init_integrity(ns);
-	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
+	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk)) {
 		set_capacity(disk, 0);
-	else
+		if (ns->head)
+			set_capacity(ns->head->disk, 0);
+	} else {
 		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+		if (ns->head)
+			set_capacity(ns->head->disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+	}
 
-	if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
-		nvme_config_discard(ns);
+	if (ctrl->oncs & NVME_CTRL_ONCS_DSM) {
+		nvme_config_discard(ns, ns->queue);
+		nvme_config_discard(ns, ns->head->disk->queue);
+	}
 	blk_mq_unfreeze_queue(disk->queue);
 }
 
@@ -1796,6 +1896,12 @@ static void nvme_init_subnqn(struct nvme_subsystem *subsys, struct nvme_ctrl *ct
 	memset(subsys->subnqn + off, 0, sizeof(subsys->subnqn) - off);
 }
 
+static void __nvme_free_subsystem(struct nvme_subsystem *subsys)
+{
+	ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
+	kfree(subsys);
+}
+
 static void nvme_free_subsystem(struct kref *ref)
 {
 	struct nvme_subsystem *subsys =
@@ -1804,8 +1910,7 @@ static void nvme_free_subsystem(struct kref *ref)
 	mutex_lock(&nvme_subsystems_lock);
 	list_del(&subsys->entry);
 	mutex_unlock(&nvme_subsystems_lock);
-
-	kfree(subsys);
+	__nvme_free_subsystem(subsys);
 }
 
 static void nvme_put_subsystem(struct nvme_subsystem *subsys)
@@ -1833,10 +1938,16 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
 static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 {
 	struct nvme_subsystem *subsys, *found;
+	int ret;
 
 	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
 	if (!subsys)
 		return -ENOMEM;
+	subsys->instance = ret = ida_simple_get(&nvme_subsystems_ida, 0, 0,
+			GFP_KERNEL);
+	if (ret < 0)
+		goto out_free_subsys;
+
 	INIT_LIST_HEAD(&subsys->ctrls);
 	INIT_LIST_HEAD(&subsys->nsheads);
 	kref_init(&subsys->ref);
@@ -1854,7 +1965,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		 * Verify that the subsystem actually supports multiple
 		 * controllers, else bail out.
 		 */
-		kfree(subsys);
+		__nvme_free_subsystem(subsys);
 		if (!(id->cmic & (1 << 1))) {
 			dev_err(ctrl->device,
 				"ignoring ctrl due to duplicate subnqn (%s).\n",
@@ -1876,6 +1987,9 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	mutex_unlock(&subsys->lock);
 
 	return 0;
+out_free_subsys:
+	kfree(subsys);
+	return ret;
 }
 
 /*
@@ -2403,6 +2517,80 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
 	NULL,
 };
 
+static struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *ns;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (ns->ctrl->state == NVME_CTRL_LIVE) {
+			rcu_assign_pointer(head->current_path, ns);
+			return ns;
+		}
+	}
+
+	return NULL;
+}
+
+static blk_qc_t nvme_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct nvme_ns_head *head = q->queuedata;
+	struct device *dev = disk_to_dev(head->disk);
+	struct nvme_ns *ns;
+	blk_qc_t ret = BLK_QC_T_NONE;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = srcu_dereference(head->current_path, &head->srcu);
+	if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
+		ns = nvme_find_path(head);
+	if (likely(ns)) {
+		bio->bi_disk = ns->disk;
+		bio->bi_opf |= REQ_NVME_MPATH;
+		ret = direct_make_request(bio);
+	} else if (!list_empty_careful(&head->list)) {
+		dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
+
+		spin_lock_irq(&head->requeue_lock);
+		bio_list_add(&head->requeue_list, bio);
+		spin_unlock_irq(&head->requeue_lock);
+	} else {
+		dev_warn_ratelimited(dev, "no path - failing I/O\n");
+
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+	}
+
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return ret;
+}
+
+static const struct block_device_operations nvme_subsys_ops = {
+	.owner		= THIS_MODULE,
+};
+
+static void nvme_requeue_work(struct work_struct *work)
+{
+	struct nvme_ns_head *head =
+		container_of(work, struct nvme_ns_head, requeue_work);
+	struct bio *bio, *next;
+
+	spin_lock_irq(&head->requeue_lock);
+	next = bio_list_get(&head->requeue_list);
+	spin_unlock_irq(&head->requeue_lock);
+
+	while ((bio = next) != NULL) {
+		next = bio->bi_next;
+		bio->bi_next = NULL;
+
+		/*
+		 * Reset disk to the mpath node and resubmit to select a new
+		 * path.
+		 */
+		bio->bi_disk = head->disk;
+		generic_make_request(bio);
+	}
+}
+
 static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
 		unsigned nsid)
 {
@@ -2438,6 +2626,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 		unsigned nsid, struct nvme_id_ns *id)
 {
 	struct nvme_ns_head *head;
+	struct request_queue *q;
 	int ret = -ENOMEM;
 
 	head = kzalloc(sizeof(*head), GFP_KERNEL);
@@ -2446,6 +2635,9 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 
 	INIT_LIST_HEAD(&head->list);
 	head->ns_id = nsid;
+	bio_list_init(&head->requeue_list);
+	spin_lock_init(&head->requeue_lock);
+	INIT_WORK(&head->requeue_work, nvme_requeue_work);
 	init_srcu_struct(&head->srcu);
 	kref_init(&head->ref);
 
@@ -2458,8 +2650,31 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 		goto out_free_head;
 	}
 
+	ret = -ENOMEM;
+	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
+	if (!q)
+		goto out_free_head;
+	q->queuedata = head;
+	blk_queue_make_request(q, nvme_make_request);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
+	/* set to a default value of 512 until the disk is validated */
+	blk_queue_logical_block_size(q, 512);
+	nvme_set_queue_limits(ctrl, q);
+
+	head->disk = alloc_disk(0);
+	if (!head->disk)
+		goto out_cleanup_queue;
+	head->disk->fops = &nvme_subsys_ops;
+	head->disk->private_data = head;
+	head->disk->queue = q;
+	head->disk->flags = GENHD_FL_EXT_DEVT;
+	sprintf(head->disk->disk_name, "nvm-sub%dn%d",
+			ctrl->subsys->instance, nsid);
 	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
 	return head;
+
+out_cleanup_queue:
+	blk_cleanup_queue(q);
 out_free_head:
 	cleanup_srcu_struct(&head->srcu);
 	kfree(head);
@@ -2468,7 +2683,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 }
 
 static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
-		struct nvme_id_ns *id)
+		struct nvme_id_ns *id, bool *new)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	bool is_shared = id->nmic & (1 << 0);
@@ -2484,6 +2699,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
 			ret = PTR_ERR(head);
 			goto out_unlock;
 		}
+
+		*new = true;
 	} else {
 		struct nvme_ns_ids ids;
 
@@ -2495,6 +2712,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
 			ret = -EINVAL;
 			goto out_unlock;
 		}
+
+		*new = false;
 	}
 
 	list_add_tail(&ns->siblings, &head->list);
@@ -2564,6 +2783,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	struct nvme_id_ns *id;
 	char disk_name[DISK_NAME_LEN];
 	int node = dev_to_node(ctrl->dev);
+	bool new = true;
 
 	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
@@ -2596,7 +2816,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	if (nvme_init_ns_head(ns, nsid, id))
+	if (nvme_init_ns_head(ns, nsid, id, &new))
 		goto out_free_id;
 
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
@@ -2635,6 +2855,19 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (ns->ndev && nvme_nvm_register_sysfs(ns))
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
+
+	if (new)
+		add_disk(ns->head->disk);
+
+	if (sysfs_create_link(&disk_to_dev(ns->disk)->kobj,
+			&disk_to_dev(ns->head->disk)->kobj, "mpath"))
+		pr_warn("%s: failed to create sysfs link to mpath device\n",
+			ns->disk->disk_name);
+	if (sysfs_create_link(&disk_to_dev(ns->head->disk)->kobj,
+			&disk_to_dev(ns->disk)->kobj, ns->disk->disk_name))
+		pr_warn("%s: failed to create sysfs link from mpath device\n",
+			ns->disk->disk_name);
+
 	return;
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
@@ -2662,6 +2895,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 			blk_integrity_unregister(ns->disk);
 		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
 					&nvme_ns_attr_group);
+		sysfs_remove_link(&disk_to_dev(ns->disk)->kobj, "mpath");
+		sysfs_remove_link(&disk_to_dev(ns->head->disk)->kobj,
+				ns->disk->disk_name);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
 		del_gendisk(ns->disk);
@@ -2669,8 +2905,10 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	}
 
 	mutex_lock(&ns->ctrl->subsys->lock);
-	if (head)
+	if (head) {
+		rcu_assign_pointer(head->current_path, NULL);
 		list_del_rcu(&ns->siblings);
+	}
 	mutex_unlock(&ns->ctrl->subsys->lock);
 
 	mutex_lock(&ns->ctrl->namespaces_mutex);
@@ -3221,6 +3459,7 @@ int __init nvme_core_init(void)
 
 void nvme_core_exit(void)
 {
+	ida_destroy(&nvme_subsystems_ida);
 	class_destroy(nvme_class);
 	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
 	destroy_workqueue(nvme_wq);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index da30df2668f5..7e21cce0aefe 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -94,6 +94,11 @@ struct nvme_request {
 	u16			status;
 };
 
+/*
+ * Mark a bio as coming in through the mpath node.
+ */
+#define REQ_NVME_MPATH		REQ_DRV
+
 enum {
 	NVME_REQ_CANCELLED		= (1 << 0),
 };
@@ -206,6 +211,7 @@ struct nvme_subsystem {
 	char			model[40];
 	char			firmware_rev[8];
 	u16			vendor_id;
+	int			instance;
 };
 
 /*
@@ -225,8 +231,13 @@ struct nvme_ns_ids {
  * only ever has a single entry for private namespaces.
  */
 struct nvme_ns_head {
+	struct nvme_ns __rcu	*current_path;
+	struct gendisk		*disk;
 	struct list_head	list;
 	struct srcu_struct      srcu;
+	struct bio_list		requeue_list;
+	spinlock_t		requeue_lock;
+	struct work_struct	requeue_work;
 	unsigned		ns_id;
 	struct nvme_ns_ids	ids;
 	struct list_head	entry;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 8/9] nvme: implement multipath access to nvme subsystems
@ 2017-09-25 13:40   ` Christoph Hellwig
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)


This patch adds initial multipath support to the nvme driver.  For each
namespace we create a new block device node, which can be used to access
that namespace through any of the controllers that refer to it.

Currently we will always send I/O to the first available path, this will
be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
ratified and implemented, at which point we will look at the ANA state
for each namespace.  Another possibility that was prototyped is to
use the path that is closes to the submitting NUMA code, which will be
mostly interesting for PCI, but might also be useful for RDMA or FC
transports in the future.  There is not plan to implement round robin
or I/O service time path selectors, as those are not scalable with
the performance rates provided by NVMe.

The multipath device will go away once all paths to it disappear,
any delay to keep it alive needs to be implemented at the controller
level.

The new block devices nodes for multipath access will show up as

	/dev/nvm-subXnZ

where X is the local instance number of the subsystems, and Z is the index
for the namespace.

To get persistent devices names the following lines can be added to
/lib/udev/rules.d/60-persistent-storage.rules:

---------------------------------- snip ----------------------------------
KERNEL=="nvm-sub*[0-9]n*[0-9]", ATTR{wwid}=="?*", SYMLINK+="disk/by-id/nvme-sub-$attr{wwid}"
KERNEL=="nvm-sub*[0-9]n*[0-9]p*[0-9]", ENV{DEVTYPE}=="partition", ATTRS{wwid}=="?*", SYMLINK+="disk/by-id/nvme-sub-$attr{wwid}-part%n"
KERNEL=="nvm-sub*[0-9]n*[0-9]", ENV{DEVTYPE}=="disk", ATTRS{wwid}=="?*", ENV{ID_WWN}="$attr{wwid}"
---------------------------------- snip ----------------------------------

Note that these create the new persistent names.  Overriding the existing
nvme ones would be nicer, but while that works for the first path, the
normal rule will override it again for each subsequent path.

Signed-off-by: Christoph Hellwig <hch at lst.de>
---
 drivers/nvme/host/core.c | 277 +++++++++++++++++++++++++++++++++++++++++++----
 drivers/nvme/host/nvme.h |  11 ++
 2 files changed, 269 insertions(+), 19 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c1be5a0a69b1..faef0241c6b5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -72,6 +72,7 @@ struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
 static LIST_HEAD(nvme_subsystems);
+static DEFINE_IDA(nvme_subsystems_ida);
 static DEFINE_MUTEX(nvme_subsystems_lock);
 
 static LIST_HEAD(nvme_ctrl_list);
@@ -104,6 +105,20 @@ static int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl)
 	return ret;
 }
 
+static void nvme_failover_req(struct request *req)
+{
+	struct nvme_ns *ns = req->q->queuedata;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ns->head->requeue_lock, flags);
+	blk_steal_bios(&ns->head->requeue_list, req);
+	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
+	blk_mq_end_request(req, 0);
+
+	nvme_reset_ctrl(ns->ctrl);
+	kblockd_schedule_work(&ns->head->requeue_work);
+}
+
 static blk_status_t nvme_error_status(struct request *req)
 {
 	switch (nvme_req(req)->status & 0x7ff) {
@@ -131,6 +146,53 @@ static blk_status_t nvme_error_status(struct request *req)
 	}
 }
 
+static bool nvme_req_needs_failover(struct request *req)
+{
+	if (!(req->cmd_flags & REQ_NVME_MPATH))
+		return false;
+
+	switch (nvme_req(req)->status & 0x7ff) {
+	/*
+	 * Generic command status:
+	 */
+	case NVME_SC_INVALID_OPCODE:
+	case NVME_SC_INVALID_FIELD:
+	case NVME_SC_INVALID_NS:
+	case NVME_SC_LBA_RANGE:
+	case NVME_SC_CAP_EXCEEDED:
+	case NVME_SC_RESERVATION_CONFLICT:
+		return false;
+
+	/*
+	 * I/O command set specific error.  Unfortunately these values are
+	 * reused for fabrics commands, but those should never get here.
+	 */
+	case NVME_SC_BAD_ATTRIBUTES:
+	case NVME_SC_INVALID_PI:
+	case NVME_SC_READ_ONLY:
+	case NVME_SC_ONCS_NOT_SUPPORTED:
+		WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
+			nvme_fabrics_command);
+		return false;
+
+	/*
+	 * Media and Data Integrity Errors:
+	 */
+	case NVME_SC_WRITE_FAULT:
+	case NVME_SC_READ_ERROR:
+	case NVME_SC_GUARD_CHECK:
+	case NVME_SC_APPTAG_CHECK:
+	case NVME_SC_REFTAG_CHECK:
+	case NVME_SC_COMPARE_FAILED:
+	case NVME_SC_ACCESS_DENIED:
+	case NVME_SC_UNWRITTEN_BLOCK:
+		return false;
+	}
+
+	/* Everything else could be a path failure, so should be retried */
+	return true;
+}
+
 static inline bool nvme_req_needs_retry(struct request *req)
 {
 	if (blk_noretry_request(req))
@@ -145,6 +207,11 @@ static inline bool nvme_req_needs_retry(struct request *req)
 void nvme_complete_rq(struct request *req)
 {
 	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
+		if (nvme_req_needs_failover(req)) {
+			nvme_failover_req(req);
+			return;
+		}
+
 		nvme_req(req)->retries++;
 		blk_mq_requeue_request(req, true);
 		return;
@@ -173,6 +240,18 @@ void nvme_cancel_request(struct request *req, void *data, bool reserved)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
+static void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (ns->head)
+			kblockd_schedule_work(&ns->head->requeue_work);
+	}
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
@@ -240,9 +319,10 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 
 	if (changed)
 		ctrl->state = new_state;
-
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 
+	if (changed && ctrl->state == NVME_CTRL_LIVE)
+		nvme_kick_requeue_lists(ctrl);
 	return changed;
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -252,6 +332,14 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
+	del_gendisk(head->disk);
+	blk_set_queue_dying(head->disk->queue);
+	/* make sure all pending bios are cleaned up */
+	kblockd_schedule_work(&head->requeue_work);
+	flush_work(&head->requeue_work);
+	blk_cleanup_queue(head->disk->queue);
+	put_disk(head->disk);
+
 	list_del_init(&head->entry);
 	cleanup_srcu_struct(&head->srcu);
 	kfree(head);
@@ -1123,8 +1211,10 @@ static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id,
 	if (blk_get_integrity(disk) &&
 	    (ns->pi_type != pi_type || ns->ms != old_ms ||
 	     bs != queue_logical_block_size(disk->queue) ||
-	     (ns->ms && ns->ext)))
+	     (ns->ms && ns->ext))) {
 		blk_integrity_unregister(disk);
+		blk_integrity_unregister(ns->head->disk);
+	}
 
 	ns->pi_type = pi_type;
 }
@@ -1152,7 +1242,9 @@ static void nvme_init_integrity(struct nvme_ns *ns)
 	}
 	integrity.tuple_size = ns->ms;
 	blk_integrity_register(ns->disk, &integrity);
+	blk_integrity_register(ns->head->disk, &integrity);
 	blk_queue_max_integrity_segments(ns->queue, 1);
+	blk_queue_max_integrity_segments(ns->head->disk->queue, 1);
 }
 #else
 static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id,
@@ -1170,7 +1262,7 @@ static void nvme_set_chunk_size(struct nvme_ns *ns)
 	blk_queue_chunk_sectors(ns->queue, rounddown_pow_of_two(chunk_size));
 }
 
-static void nvme_config_discard(struct nvme_ns *ns)
+static void nvme_config_discard(struct nvme_ns *ns, struct request_queue *queue)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	u32 logical_block_size = queue_logical_block_size(ns->queue);
@@ -1181,18 +1273,18 @@ static void nvme_config_discard(struct nvme_ns *ns)
 	if (ctrl->nr_streams && ns->sws && ns->sgs) {
 		unsigned int sz = logical_block_size * ns->sws * ns->sgs;
 
-		ns->queue->limits.discard_alignment = sz;
-		ns->queue->limits.discard_granularity = sz;
+		queue->limits.discard_alignment = sz;
+		queue->limits.discard_granularity = sz;
 	} else {
 		ns->queue->limits.discard_alignment = logical_block_size;
 		ns->queue->limits.discard_granularity = logical_block_size;
 	}
-	blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
-	blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
-	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+	blk_queue_max_discard_sectors(queue, UINT_MAX);
+	blk_queue_max_discard_segments(queue, NVME_DSM_MAX_RANGES);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, queue);
 
 	if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
-		blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX);
+		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
 static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
@@ -1249,17 +1341,25 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
 		nvme_prep_integrity(disk, id, bs);
 	blk_queue_logical_block_size(ns->queue, bs);
+	blk_queue_logical_block_size(ns->head->disk->queue, bs);
 	if (ns->noiob)
 		nvme_set_chunk_size(ns);
 	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
 		nvme_init_integrity(ns);
-	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
+	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk)) {
 		set_capacity(disk, 0);
-	else
+		if (ns->head)
+			set_capacity(ns->head->disk, 0);
+	} else {
 		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+		if (ns->head)
+			set_capacity(ns->head->disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+	}
 
-	if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
-		nvme_config_discard(ns);
+	if (ctrl->oncs & NVME_CTRL_ONCS_DSM) {
+		nvme_config_discard(ns, ns->queue);
+		nvme_config_discard(ns, ns->head->disk->queue);
+	}
 	blk_mq_unfreeze_queue(disk->queue);
 }
 
@@ -1796,6 +1896,12 @@ static void nvme_init_subnqn(struct nvme_subsystem *subsys, struct nvme_ctrl *ct
 	memset(subsys->subnqn + off, 0, sizeof(subsys->subnqn) - off);
 }
 
+static void __nvme_free_subsystem(struct nvme_subsystem *subsys)
+{
+	ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
+	kfree(subsys);
+}
+
 static void nvme_free_subsystem(struct kref *ref)
 {
 	struct nvme_subsystem *subsys =
@@ -1804,8 +1910,7 @@ static void nvme_free_subsystem(struct kref *ref)
 	mutex_lock(&nvme_subsystems_lock);
 	list_del(&subsys->entry);
 	mutex_unlock(&nvme_subsystems_lock);
-
-	kfree(subsys);
+	__nvme_free_subsystem(subsys);
 }
 
 static void nvme_put_subsystem(struct nvme_subsystem *subsys)
@@ -1833,10 +1938,16 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
 static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 {
 	struct nvme_subsystem *subsys, *found;
+	int ret;
 
 	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
 	if (!subsys)
 		return -ENOMEM;
+	subsys->instance = ret = ida_simple_get(&nvme_subsystems_ida, 0, 0,
+			GFP_KERNEL);
+	if (ret < 0)
+		goto out_free_subsys;
+
 	INIT_LIST_HEAD(&subsys->ctrls);
 	INIT_LIST_HEAD(&subsys->nsheads);
 	kref_init(&subsys->ref);
@@ -1854,7 +1965,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		 * Verify that the subsystem actually supports multiple
 		 * controllers, else bail out.
 		 */
-		kfree(subsys);
+		__nvme_free_subsystem(subsys);
 		if (!(id->cmic & (1 << 1))) {
 			dev_err(ctrl->device,
 				"ignoring ctrl due to duplicate subnqn (%s).\n",
@@ -1876,6 +1987,9 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	mutex_unlock(&subsys->lock);
 
 	return 0;
+out_free_subsys:
+	kfree(subsys);
+	return ret;
 }
 
 /*
@@ -2403,6 +2517,80 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
 	NULL,
 };
 
+static struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *ns;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (ns->ctrl->state == NVME_CTRL_LIVE) {
+			rcu_assign_pointer(head->current_path, ns);
+			return ns;
+		}
+	}
+
+	return NULL;
+}
+
+static blk_qc_t nvme_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct nvme_ns_head *head = q->queuedata;
+	struct device *dev = disk_to_dev(head->disk);
+	struct nvme_ns *ns;
+	blk_qc_t ret = BLK_QC_T_NONE;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = srcu_dereference(head->current_path, &head->srcu);
+	if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
+		ns = nvme_find_path(head);
+	if (likely(ns)) {
+		bio->bi_disk = ns->disk;
+		bio->bi_opf |= REQ_NVME_MPATH;
+		ret = direct_make_request(bio);
+	} else if (!list_empty_careful(&head->list)) {
+		dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
+
+		spin_lock_irq(&head->requeue_lock);
+		bio_list_add(&head->requeue_list, bio);
+		spin_unlock_irq(&head->requeue_lock);
+	} else {
+		dev_warn_ratelimited(dev, "no path - failing I/O\n");
+
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+	}
+
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return ret;
+}
+
+static const struct block_device_operations nvme_subsys_ops = {
+	.owner		= THIS_MODULE,
+};
+
+static void nvme_requeue_work(struct work_struct *work)
+{
+	struct nvme_ns_head *head =
+		container_of(work, struct nvme_ns_head, requeue_work);
+	struct bio *bio, *next;
+
+	spin_lock_irq(&head->requeue_lock);
+	next = bio_list_get(&head->requeue_list);
+	spin_unlock_irq(&head->requeue_lock);
+
+	while ((bio = next) != NULL) {
+		next = bio->bi_next;
+		bio->bi_next = NULL;
+
+		/*
+		 * Reset disk to the mpath node and resubmit to select a new
+		 * path.
+		 */
+		bio->bi_disk = head->disk;
+		generic_make_request(bio);
+	}
+}
+
 static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
 		unsigned nsid)
 {
@@ -2438,6 +2626,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 		unsigned nsid, struct nvme_id_ns *id)
 {
 	struct nvme_ns_head *head;
+	struct request_queue *q;
 	int ret = -ENOMEM;
 
 	head = kzalloc(sizeof(*head), GFP_KERNEL);
@@ -2446,6 +2635,9 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 
 	INIT_LIST_HEAD(&head->list);
 	head->ns_id = nsid;
+	bio_list_init(&head->requeue_list);
+	spin_lock_init(&head->requeue_lock);
+	INIT_WORK(&head->requeue_work, nvme_requeue_work);
 	init_srcu_struct(&head->srcu);
 	kref_init(&head->ref);
 
@@ -2458,8 +2650,31 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 		goto out_free_head;
 	}
 
+	ret = -ENOMEM;
+	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
+	if (!q)
+		goto out_free_head;
+	q->queuedata = head;
+	blk_queue_make_request(q, nvme_make_request);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
+	/* set to a default value for 512 until disk is validated */
+	blk_queue_logical_block_size(q, 512);
+	nvme_set_queue_limits(ctrl, q);
+
+	head->disk = alloc_disk(0);
+	if (!head->disk)
+		goto out_cleanup_queue;
+	head->disk->fops = &nvme_subsys_ops;
+	head->disk->private_data = head;
+	head->disk->queue = q;
+	head->disk->flags = GENHD_FL_EXT_DEVT;
+	sprintf(head->disk->disk_name, "nvm-sub%dn%d",
+			ctrl->subsys->instance, nsid);
 	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
 	return head;
+
+out_cleanup_queue:
+	blk_cleanup_queue(q);
 out_free_head:
 	cleanup_srcu_struct(&head->srcu);
 	kfree(head);
@@ -2468,7 +2683,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 }
 
 static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
-		struct nvme_id_ns *id)
+		struct nvme_id_ns *id, bool *new)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	bool is_shared = id->nmic & (1 << 0);
@@ -2484,6 +2699,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
 			ret = PTR_ERR(head);
 			goto out_unlock;
 		}
+
+		*new = true;
 	} else {
 		struct nvme_ns_ids ids;
 
@@ -2495,6 +2712,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
 			ret = -EINVAL;
 			goto out_unlock;
 		}
+
+		*new = false;
 	}
 
 	list_add_tail(&ns->siblings, &head->list);
@@ -2564,6 +2783,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	struct nvme_id_ns *id;
 	char disk_name[DISK_NAME_LEN];
 	int node = dev_to_node(ctrl->dev);
+	bool new = true;
 
 	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
@@ -2596,7 +2816,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	if (nvme_init_ns_head(ns, nsid, id))
+	if (nvme_init_ns_head(ns, nsid, id, &new))
 		goto out_free_id;
 
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
@@ -2635,6 +2855,19 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (ns->ndev && nvme_nvm_register_sysfs(ns))
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
+
+	if (new)
+		add_disk(ns->head->disk);
+
+	if (sysfs_create_link(&disk_to_dev(ns->disk)->kobj,
+			&disk_to_dev(ns->head->disk)->kobj, "mpath"))
+		pr_warn("%s: failed to create sysfs link to mpath device\n",
+			ns->disk->disk_name);
+	if (sysfs_create_link(&disk_to_dev(ns->head->disk)->kobj,
+			&disk_to_dev(ns->disk)->kobj, ns->disk->disk_name))
+		pr_warn("%s: failed to create sysfs link from mpath device\n",
+			ns->disk->disk_name);
+
 	return;
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
@@ -2662,6 +2895,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 			blk_integrity_unregister(ns->disk);
 		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
 					&nvme_ns_attr_group);
+		sysfs_remove_link(&disk_to_dev(ns->disk)->kobj, "mpath");
+		sysfs_remove_link(&disk_to_dev(ns->head->disk)->kobj,
+				ns->disk->disk_name);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
 		del_gendisk(ns->disk);
@@ -2669,8 +2905,10 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	}
 
 	mutex_lock(&ns->ctrl->subsys->lock);
-	if (head)
+	if (head) {
+		rcu_assign_pointer(head->current_path, NULL);
 		list_del_rcu(&ns->siblings);
+	}
 	mutex_unlock(&ns->ctrl->subsys->lock);
 
 	mutex_lock(&ns->ctrl->namespaces_mutex);
@@ -3221,6 +3459,7 @@ int __init nvme_core_init(void)
 
 void nvme_core_exit(void)
 {
+	ida_destroy(&nvme_subsystems_ida);
 	class_destroy(nvme_class);
 	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
 	destroy_workqueue(nvme_wq);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index da30df2668f5..7e21cce0aefe 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -94,6 +94,11 @@ struct nvme_request {
 	u16			status;
 };
 
+/*
+ * Mark a bio as coming in through the mpath node.
+ */
+#define REQ_NVME_MPATH		REQ_DRV
+
 enum {
 	NVME_REQ_CANCELLED		= (1 << 0),
 };
@@ -206,6 +211,7 @@ struct nvme_subsystem {
 	char			model[40];
 	char			firmware_rev[8];
 	u16			vendor_id;
+	int			instance;
 };
 
 /*
@@ -225,8 +231,13 @@ struct nvme_ns_ids {
  * only ever has a single entry for private namespaces.
  */
 struct nvme_ns_head {
+	struct nvme_ns __rcu	*current_path;
+	struct gendisk		*disk;
 	struct list_head	list;
 	struct srcu_struct      srcu;
+	struct bio_list		requeue_list;
+	spinlock_t		requeue_lock;
+	struct work_struct	requeue_work;
 	unsigned		ns_id;
 	struct nvme_ns_ids	ids;
 	struct list_head	entry;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 9/9] nvme: also expose the namespace identification sysfs files for mpath nodes
  2017-09-25 13:40 ` Christoph Hellwig
@ 2017-09-25 13:40   ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

We do this by adding a helper that returns the ns_head for a device that
can belong to either the per-controller or per-subsystem block device
nodes, and otherwise reuse all the existing code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 68 +++++++++++++++++++++++++++++-------------------
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 42 insertions(+), 27 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index faef0241c6b5..c296f275b475 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -79,6 +79,7 @@ static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
 static struct class *nvme_class;
+static const struct attribute_group nvme_ns_id_attr_group;
 
 static __le32 nvme_get_log_dw10(u8 lid, size_t size)
 {
@@ -332,6 +333,8 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
+	sysfs_remove_group(&disk_to_dev(head->disk)->kobj,
+			   &nvme_ns_id_attr_group);
 	del_gendisk(head->disk);
 	blk_set_queue_dying(head->disk->queue);
 	/* make sure all pending bios are cleaned up */
@@ -2278,12 +2281,22 @@ static ssize_t nvme_sysfs_rescan(struct device *dev,
 }
 static DEVICE_ATTR(rescan_controller, S_IWUSR, NULL, nvme_sysfs_rescan);
 
+static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	if (disk->fops == &nvme_fops)
+		return nvme_get_ns_from_dev(dev)->head;
+	else
+		return disk->private_data;
+}
+
 static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
-								char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->head->ids;
-	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+	struct nvme_ns_head *head = dev_to_ns_head(dev);
+	struct nvme_ns_ids *ids = &head->ids;
+	struct nvme_subsystem *subsys = head->subsys;
 	int serial_len = sizeof(subsys->serial);
 	int model_len = sizeof(subsys->model);
 
@@ -2305,23 +2318,21 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 
 	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
 		serial_len, subsys->serial, model_len, subsys->model,
-		ns->head->ns_id);
+		head->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
 static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
-			  char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->head->ids.nguid);
+	return sprintf(buf, "%pU\n", dev_to_ns_head(dev)->ids.nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
 static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
-								char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->head->ids;
+	struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
 
 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
@@ -2336,22 +2347,20 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
 
 static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
-								char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->head->ids.eui64);
+	return sprintf(buf, "%8phd\n", dev_to_ns_head(dev)->ids.eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
 static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
-								char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%d\n", ns->head->ns_id);
+	return sprintf(buf, "%d\n", dev_to_ns_head(dev)->ns_id);
 }
 static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
 
-static struct attribute *nvme_ns_attrs[] = {
+static struct attribute *nvme_ns_id_attrs[] = {
 	&dev_attr_wwid.attr,
 	&dev_attr_uuid.attr,
 	&dev_attr_nguid.attr,
@@ -2360,12 +2369,11 @@ static struct attribute *nvme_ns_attrs[] = {
 	NULL,
 };
 
-static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
+static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj,
 		struct attribute *a, int n)
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->head->ids;
+	struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
 
 	if (a == &dev_attr_uuid.attr) {
 		if (uuid_is_null(&ids->uuid) ||
@@ -2383,9 +2391,9 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 	return a->mode;
 }
 
-static const struct attribute_group nvme_ns_attr_group = {
-	.attrs		= nvme_ns_attrs,
-	.is_visible	= nvme_ns_attrs_are_visible,
+static const struct attribute_group nvme_ns_id_attr_group = {
+	.attrs		= nvme_ns_id_attrs,
+	.is_visible	= nvme_ns_id_attrs_are_visible,
 };
 
 #define nvme_show_str_function(field)						\
@@ -2639,6 +2647,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 	spin_lock_init(&head->requeue_lock);
 	INIT_WORK(&head->requeue_work, nvme_requeue_work);
 	init_srcu_struct(&head->srcu);
+	head->subsys = ctrl->subsys;
 	kref_init(&head->ref);
 
 	nvme_report_ns_ids(ctrl, nsid, id, &head->ids);
@@ -2849,15 +2858,20 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	device_add_disk(ctrl->device, ns->disk);
 	if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_attr_group))
+					&nvme_ns_id_attr_group))
 		pr_warn("%s: failed to create sysfs group for identification\n",
 			ns->disk->disk_name);
 	if (ns->ndev && nvme_nvm_register_sysfs(ns))
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
 
-	if (new)
+	if (new) {
 		add_disk(ns->head->disk);
+		if (sysfs_create_group(&disk_to_dev(ns->head->disk)->kobj,
+				&nvme_ns_id_attr_group))
+			pr_warn("%s: failed to create sysfs group for identification\n",
+				ns->head->disk->disk_name);
+	}
 
 	if (sysfs_create_link(&disk_to_dev(ns->disk)->kobj,
 			&disk_to_dev(ns->head->disk)->kobj, "mpath"))
@@ -2894,7 +2908,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 		if (blk_get_integrity(ns->disk))
 			blk_integrity_unregister(ns->disk);
 		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_attr_group);
+					&nvme_ns_id_attr_group);
 		sysfs_remove_link(&disk_to_dev(ns->disk)->kobj, "mpath");
 		sysfs_remove_link(&disk_to_dev(ns->head->disk)->kobj,
 				ns->disk->disk_name);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7e21cce0aefe..febc21d8935c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -235,6 +235,7 @@ struct nvme_ns_head {
 	struct gendisk		*disk;
 	struct list_head	list;
 	struct srcu_struct      srcu;
+	struct nvme_subsystem	*subsys;
 	struct bio_list		requeue_list;
 	spinlock_t		requeue_lock;
 	struct work_struct	requeue_work;
-- 
2.14.1


* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-09-25 13:47     ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2017-09-25 13:47 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block

On 09/25/2017 03:40 PM, Christoph Hellwig wrote:
> This patch adds initial multipath support to the nvme driver.  For each
> namespace we create a new block device node, which can be used to access
> that namespace through any of the controllers that refer to it.
> 
> Currently we will always send I/O to the first available path; this will
> be changed once the NVMe Asymmetric Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
> 
> The multipath device will go away once all paths to it disappear;
> any delay to keep it alive needs to be implemented at the controller
> level.
> 
> The new block device nodes for multipath access will show up as
> 
> 	/dev/nvm-subXnZ
> 
> where X is the local instance number of the subsystem, and Z is the index
> for the namespace.
> 
Can't we make the multipath support invisible to the host?
I.e., check the shared namespaces before creating the device node, and just
move them under the existing namespaces if one exists?

Thing is, this device multiplication was (and is) the #1 objection to
dm-multipath, and we should be looking to avoid that if we do a new
implementation.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-25 13:47     ` Hannes Reinecke
@ 2017-09-25 13:50       ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 13:50 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	linux-nvme, linux-block

On Mon, Sep 25, 2017 at 03:47:43PM +0200, Hannes Reinecke wrote:
> Can't we make the multipath support invisible to the host?
> IE check the shared namespaces before creating the device node, and just
> move them under the existing namespaces if one exists?

That was what my first version did, but various people talked me
out of it.  Unfortunately just multiplexing breaks a few things,
including the userspace passthrough ioctls.


* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-25 13:50       ` Christoph Hellwig
@ 2017-09-25 14:05         ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2017-09-25 14:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, linux-block

On 09/25/2017 03:50 PM, Christoph Hellwig wrote:
> On Mon, Sep 25, 2017 at 03:47:43PM +0200, Hannes Reinecke wrote:
>> Can't we make the multipath support invisible to the host?
>> IE check the shared namespaces before creating the device node, and just
>> move them under the existing namespaces if one exists?
> 
> That was what my first version did, but various people talked me
> out of it.  Unfortunately just multiplexing breaks a few things,
> including the userspace passthrough ioctls.
> 
Care to give some specifics?
How would userspace passthrough be affected?
I would've thought that we're sending the ioctl down one path, and we'd
be getting the completion back on the same path/queue/whatever.
(Unless we're talking multi-command ioctls, but those are evil anyway)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-25 14:05         ` Hannes Reinecke
@ 2017-09-25 14:45           ` Keith Busch
  -1 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2017-09-25 14:45 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Mon, Sep 25, 2017 at 04:05:17PM +0200, Hannes Reinecke wrote:
> On 09/25/2017 03:50 PM, Christoph Hellwig wrote:
> > On Mon, Sep 25, 2017 at 03:47:43PM +0200, Hannes Reinecke wrote:
> >> Can't we make the multipath support invisible to the host?
> >> IE check the shared namespaces before creating the device node, and just
> >> move them under the existing namespaces if one exists?
> > 
> > That was what my first version did, but various people talked me
> > out of it.  Unfortunately just multiplexing breaks a few things,
> > including the userspace passthrough ioctls.
> > 
> Care to give some specifics?
> How would userspace passthrough be affected?

Ioctls aside, if you've only one disk handle, it's a little difficult to
compare iostats on each path, which can be useful for diagnosing issues.

> I would've thought that we're sending the ioctl down one path, and we'd
> be getting the completion back on the same path/queue/whatever.
> (Unless we're talking multi-command ioctls, but those are evil anyway)


* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-25 14:05         ` Hannes Reinecke
@ 2017-09-25 15:07           ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 15:07 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	linux-nvme, linux-block, Martin K. Petersen

On Mon, Sep 25, 2017 at 04:05:17PM +0200, Hannes Reinecke wrote:
> On 09/25/2017 03:50 PM, Christoph Hellwig wrote:
> > On Mon, Sep 25, 2017 at 03:47:43PM +0200, Hannes Reinecke wrote:
> >> Can't we make the multipath support invisible to the host?
> >> IE check the shared namespaces before creating the device node, and just
> >> move them under the existing namespaces if one exists?
> > 
> > That was what my first version did, but various people talked me
> > out of it.  Unfortunately just multiplexing breaks a few things,
> > including the userspace passthrough ioctls.
> > 
> Care to give some specifics?
> How would userspace passthrough be affected?

Because specific commands will have to go to the right controller and
not just any available controller.  If we only expose a single node
for multipath there is no way to choose the specific one.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/9] nvme: track subsystems
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-09-25 16:07     ` Keith Busch
  -1 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2017-09-25 16:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Mon, Sep 25, 2017 at 03:40:27PM +0200, Christoph Hellwig wrote:
> This adds a new nvme_subsystem structure so that we can track multiple
> controllers that belong to a single subsystem.  For now we only use it
> to store the NQN, and to check that we don't have duplicate NQNs unless
> the involved subsystems support multiple controllers.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 6/9] nvme: introduce a nvme_ns_ids structure
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-09-25 16:08     ` Keith Busch
  -1 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2017-09-25 16:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Mon, Sep 25, 2017 at 03:40:28PM +0200, Christoph Hellwig wrote:
> This allows us to manage the various unique namespace identifiers
> together instead of needing various variables and arguments.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 7/9] nvme: track shared namespaces
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-09-25 16:11     ` Keith Busch
  -1 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2017-09-25 16:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Mon, Sep 25, 2017 at 03:40:29PM +0200, Christoph Hellwig wrote:
> Introduce a new struct nvme_ns_head [1] that holds information about
> an actual namespace, unlike struct nvme_ns, which only holds the
> per-controller namespace information.  For private namespaces there
> is a 1:1 relation of the two, but for shared namespaces this lets us
> discover all the paths to it.  For now only the identifiers are moved
> to the new structure, but most of the information in struct nvme_ns
> should eventually move over.
> 
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.
> 
> [1] comments welcome if you have a better name for it, the current one is
>     horrible.  One idea would be to rename the current struct nvme_ns
>     to struct nvme_ns_link or similar and use the nvme_ns name for the
>     new structure.  But that would involve a lot of churn.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good; I can live with 'nvme_ns_head'.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-09-25 16:18     ` Keith Busch
  -1 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2017-09-25 16:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Mon, Sep 25, 2017 at 03:40:30PM +0200, Christoph Hellwig wrote:
> This patch adds initial multipath support to the nvme driver.  For each
> namespace we create a new block device node, which can be used to access
> that namespace through any of the controllers that refer to it.
> 
> Currently we will always send I/O to the first available path, this will
> be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
> 
> The multipath device will go away once all paths to it disappear,
> any delay to keep it alive needs to be implemented at the controller
> level.
> 
> The new block device nodes for multipath access will show up as
> 
> 	/dev/nvm-subXnZ
> 
> where X is the local instance number of the subsystem, and Z is the index
> for the namespace.
> 
> To get persistent device names the following lines can be added to
> /lib/udev/rules.d/60-persistent-storage.rules:
> 
> ---------------------------------- snip ----------------------------------
> KERNEL=="nvm-sub*[0-9]n*[0-9]", ATTR{wwid}=="?*", SYMLINK+="disk/by-id/nvme-sub-$attr{wwid}"
> KERNEL=="nvm-sub*[0-9]n*[0-9]p*[0-9]", ENV{DEVTYPE}=="partition", ATTRS{wwid}=="?*", SYMLINK+="disk/by-id/nvme-sub-$attr{wwid}-part%n"
> KERNEL=="nvm-sub*[0-9]n*[0-9]", ENV{DEVTYPE}=="disk", ATTRS{wwid}=="?*", ENV{ID_WWN}="$attr{wwid}"
> ---------------------------------- snip ----------------------------------
> 
> Note that these create the new persistent names.  Overriding the existing
> nvme ones would be nicer, but while that works for the first path, the
> normal rule will override it again for each subsequent path.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 9/9] nvme: also expose the namespace identification sysfs files for mpath nodes
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-09-25 16:19     ` Keith Busch
  -1 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2017-09-25 16:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Mon, Sep 25, 2017 at 03:40:31PM +0200, Christoph Hellwig wrote:
> We do this by adding a helper that returns the ns_head for a device that
> can belong to either the per-controller or per-subsystem block device
> nodes, and otherwise reuse all the existing code.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 7/9] nvme: track shared namespaces
  2017-09-25 16:11     ` Keith Busch
@ 2017-09-25 16:19       ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-09-25 16:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Mon, Sep 25, 2017 at 10:11:55AM -0600, Keith Busch wrote:
> Looks good; I can live with 'nvme_ns_head'.

I hate it :)  But so far no one has come up with a good name.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-09-28 15:53     ` Keith Busch
  -1 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2017-09-28 15:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block

On Mon, Sep 25, 2017 at 03:40:30PM +0200, Christoph Hellwig wrote:
> The new block devices nodes for multipath access will show up as
> 
> 	/dev/nvm-subXnZ

Just thinking ahead ... Once this goes in, someone will want to boot their
OS from a multipath target. It was a pain getting installers to recognize
/dev/nvmeXnY as an install destination. I'm not sure if installers have
gotten any better in the last 5 years about recognizing new block names.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-28 15:53     ` Keith Busch
@ 2017-09-29 14:21       ` Tony Yang
  -1 siblings, 0 replies; 74+ messages in thread
From: Tony Yang @ 2017-09-29 14:21 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme

Hi, All

     Because of my environment's requirements, the kernel must be 4.8.17.
I would like to ask: how can I use NVMe multipath with kernel 4.8.17?
From what I can see, multipath support only exists in 4.13 and above.

Thanks in advance for everyone's help.

2017-09-28 23:53 GMT+08:00 Keith Busch <keith.busch@intel.com>:
> On Mon, Sep 25, 2017 at 03:40:30PM +0200, Christoph Hellwig wrote:
>> The new block devices nodes for multipath access will show up as
>>
>>       /dev/nvm-subXnZ
>
> Just thinking ahead ... Once this goes in, someone will want to boot their
> OS from a multipath target. It was a pain getting installers to recognize
> /dev/nvmeXnY as an install destination. I'm not sure if installers have
> gotten any better in the last 5 years about recognizing new block names.
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-28 15:53     ` Keith Busch
@ 2017-09-30 19:37       ` Johannes Thumshirn
  -1 siblings, 0 replies; 74+ messages in thread
From: Johannes Thumshirn @ 2017-09-30 19:37 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Sagi Grimberg,
	linux-nvme, Hannes Reinecke


[+Cc Hannes ]

Keith Busch <keith.busch@intel.com> writes:

> On Mon, Sep 25, 2017 at 03:40:30PM +0200, Christoph Hellwig wrote:
>> The new block devices nodes for multipath access will show up as
>> 
>> 	/dev/nvm-subXnZ
>
> Just thinking ahead ... Once this goes in, someone will want to boot their
> OS from a multipath target. It was a pain getting installers to recognize
> /dev/nvmeXnY as an install destination. I'm not sure if installers have
> gotten any better in the last 5 years about recognizing new block names.

We discussed (Hannes, Christoph and me) this as well offline this week
and it should either be possible with some udev magic (fake a
/dev/mapper/WWID symlink) or a shim device-mapper target that prevents
dm-mpath from attaching to the underlying block devices and creates a
real /dev/mapper/WWID device node (Christoph especially dislikes
this). None of them are pretty, but probably all we can do so far.

Byte,
        Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-30 19:37       ` Johannes Thumshirn
@ 2017-10-01  8:07         ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-10-01  8:07 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme, Hannes Reinecke

On Sat, Sep 30, 2017 at 09:37:38PM +0200, Johannes Thumshirn wrote:
> We discussed (Hannes, Christoph and me) this as well offline this week
> and it should eithr be possible with some udev magic (fake a
> /dev/mapper/WWID symlink) or a shim device-mapper target that prevents
> dm-mpath from attaching to the underlying block devices and creates a
> real /dev/mapper/WWID device node (Christoph especially dislikes
> this). None of them are pretty, but probably all we can do so far.

I think the former is the way to go.

That being said, I think you really need to kick the installer people's
asses - at least for local bus devices there is no reason ever not to
just support any node in /dev/.  We can trivially go from that to the
modules required in sysfs.  Anything networkish will need additional
work for setup that is more than bus probing, but we really should
not work around people being lazy and/or stupid too much.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-29 14:21       ` Tony Yang
@ 2017-10-01  8:07         ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-10-01  8:07 UTC (permalink / raw)
  To: Tony Yang
  Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-block,
	Sagi Grimberg, linux-nvme

On Fri, Sep 29, 2017 at 10:21:53PM +0800, Tony Yang wrote:
> Hi, All
> 
>      Because my environment requirements, the kernel must use 4.8.17,
> I would like to ask, how to use the kernel 4.8.17 nvme multi-path?
> Because I see support for multi-path versions are above 4.13

In that case we simply can't help you.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-30 19:37       ` Johannes Thumshirn
@ 2017-10-02  6:19         ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2017-10-02  6:19 UTC (permalink / raw)
  To: Johannes Thumshirn, Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme

On 09/30/2017 09:37 PM, Johannes Thumshirn wrote:
> 
> [+Cc Hannes ]
> 
> Keith Busch <keith.busch@intel.com> writes:
> 
>> On Mon, Sep 25, 2017 at 03:40:30PM +0200, Christoph Hellwig wrote:
>>> The new block devices nodes for multipath access will show up as
>>>
>>> 	/dev/nvm-subXnZ
>>
>> Just thinking ahead ... Once this goes in, someone will want to boot their
>> OS from a multipath target. It was a pain getting installers to recognize
>> /dev/nvmeXnY as an install destination. I'm not sure if installers have
>> gotten any better in the last 5 years about recognizing new block names.
> 
> We discussed (Hannes, Christoph and me) this as well offline this week
> and it should eithr be possible with some udev magic (fake a
> /dev/mapper/WWID symlink) or a shim device-mapper target that prevents
> dm-mpath from attaching to the underlying block devices and creates a
> real /dev/mapper/WWID device node (Christoph especially dislikes
> this). None of them are pretty, but probably all we can do so far.
> 
There are several issues here.
The one is the device-naming issue; I'm heavily against the nvm-subX
thing. Why don't we just name them

/dev/nvmsXnZ

which nicely aligns with the existing /dev/nvmeXnZ naming scheme.

The other issue is how to enable multipath booting. Painful experience
tells us we _need_ to be able to figure out if a device is multipath
capable without having to invoke any external tools, i.e. the device
itself needs to carry some marker 'this device is multipath capable'.
In NVMe speak this means we have to export the CMIC bit from the
Identify controller data structure to somewhere in sysfs.

Then it's easy to set up a udev rule which sets 'SYSTEMD_READY=0'
whenever this setting is detected, and the entire udev mechanism will
leave it alone.

Once this is done we have the paths nicely separated in udev, and then
setting up the symlinks pointing to the correct device is trivial.
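
As a rough sketch, the rule set could look something like the below.
Note this is hypothetical: it assumes the CMIC field gets exported as a
'cmic' sysfs attribute on the controller device, and it reuses the
nvm-subXnZ node names from this series; neither is in mainline yet.

	# Hypothetical: hide per-controller paths from systemd when the
	# controller advertises multi-port/multi-host capability, assuming
	# the driver exposes Identify Controller CMIC as ATTRS{cmic}.
	KERNEL=="nvme*[0-9]n*[0-9]", ATTRS{cmic}=="?*", ATTRS{cmic}!="0", ENV{SYSTEMD_READY}="0"

	# Then point the persistent symlink at the subsystem-level
	# multipath node instead of the individual paths.
	KERNEL=="nvm-sub*[0-9]n*[0-9]", ATTR{wwid}=="?*", SYMLINK+="disk/by-id/nvme-$attr{wwid}"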

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] block: move REQ_NOWAIT
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-10-11 11:51     ` Sagi Grimberg
  -1 siblings, 0 replies; 74+ messages in thread
From: Sagi Grimberg @ 2017-10-11 11:51 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/9] block: add REQ_DRV bit
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-10-11 11:51     ` Sagi Grimberg
  -1 siblings, 0 replies; 74+ messages in thread
From: Sagi Grimberg @ 2017-10-11 11:51 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/9] block: provide a direct_make_request helper
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-10-11 11:54     ` Sagi Grimberg
  -1 siblings, 0 replies; 74+ messages in thread
From: Sagi Grimberg @ 2017-10-11 11:54 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block


> +/**
> + * direct_make_request - hand a buffer directly to its device driver for I/O
> + * @bio:  The bio describing the location in memory and on the device.
> + *
> + * This function behaves like generic_make_request(), but does not protect
> + * against recursion.  Must only be used if the called driver is known
> + * to not re-issue this bio or a split to itself.
> + */

Can you explain a bit on the comment? What happens if nvme attempts
to failover a split if the original IO didn't comply to the
virt_boundary?

Would that cause any problems?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/9] nvme: track subsystems
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-10-11 11:57     ` Sagi Grimberg
  -1 siblings, 0 replies; 74+ messages in thread
From: Sagi Grimberg @ 2017-10-11 11:57 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

> @@ -2230,7 +2315,7 @@ static ssize_t nvme_sysfs_show_subsysnqn(struct device *dev,
>   {
>   	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
>   
> -	return snprintf(buf, PAGE_SIZE, "%s\n", ctrl->subnqn);
> +	return snprintf(buf, PAGE_SIZE, "%s\n", ctrl->subsys->subnqn);
>   }
>   static DEVICE_ATTR(subsysnqn, S_IRUGO, nvme_sysfs_show_subsysnqn, NULL);
>   
> @@ -2770,12 +2855,22 @@ EXPORT_SYMBOL_GPL(nvme_uninit_ctrl);
>   static void nvme_free_ctrl(struct kref *kref)
>   {
>   	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
> +	struct nvme_subsystem *subsys = ctrl->subsys;
>   
>   	put_device(ctrl->device);
>   	nvme_release_instance(ctrl);
>   	ida_destroy(&ctrl->ns_ida);
>   
> +	if (subsys) {
> +		mutex_lock(&subsys->lock);
> +		list_del(&ctrl->subsys_entry);
> +		mutex_unlock(&subsys->lock);
> +	}
> +

When can this happen? can a controller not have a subsys?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 6/9] nvme: introduce a nvme_ns_ids structure
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-10-11 11:59     ` Sagi Grimberg
  -1 siblings, 0 replies; 74+ messages in thread
From: Sagi Grimberg @ 2017-10-11 11:59 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 7/9] nvme: track shared namespaces
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-10-11 12:04     ` Sagi Grimberg
  -1 siblings, 0 replies; 74+ messages in thread
From: Sagi Grimberg @ 2017-10-11 12:04 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block


> +/*
> + * Anchor structure for namespaces.  There is one for each namespace in a
> + * NVMe subsystem that any of our controllers can see, and the namespace
> + * structure for each controller is chained of it.  For private namespaces
> + * there is a 1:1 relation to our namespace structures, that is ->list
> + * only ever has a single entry for private namespaces.
> + */
> +struct nvme_ns_head {
> +	struct list_head	list;

Didn't we agree to call this list_head siblings?

> +	struct srcu_struct      srcu;
> +	unsigned		ns_id;
> +	struct nvme_ns_ids	ids;
> +	struct list_head	entry;
> +	struct kref		ref;
> +};
> +
>   struct nvme_ns {
>   	struct list_head list;
>   
>   	struct nvme_ctrl *ctrl;
>   	struct request_queue *queue;
>   	struct gendisk *disk;
> +	struct list_head siblings;

And this list_head list (or entry, or sibling, or sibling_entry)?

>   	struct nvm_dev *ndev;
>   	struct kref kref;
> +	struct nvme_ns_head *head;
>   	int instance;
>   
> -	unsigned ns_id;
> -	struct nvme_ns_ids ids;
>   	int lba_shift;
>   	u16 ms;
>   	u16 sgs;
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
  2017-09-25 13:40   ` Christoph Hellwig
@ 2017-10-11 12:24     ` Sagi Grimberg
  -1 siblings, 0 replies; 74+ messages in thread
From: Sagi Grimberg @ 2017-10-11 12:24 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: Keith Busch, linux-nvme, linux-block


> +static const struct block_device_operations nvme_subsys_ops = {
> +	.owner		= THIS_MODULE,
> +};

Nit - maybe better to name this nvme_ns_head_ops

> +	ret = -ENOMEM;
> +	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
> +	if (!q)
> +		goto out_free_head;
> +	q->queuedata = head;
> +	blk_queue_make_request(q, nvme_make_request);
> +	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
> +	/* set to a default value for 512 until disk is validated */
> +	blk_queue_logical_block_size(q, 512);
> +	nvme_set_queue_limits(ctrl, q);
> +
> +	head->disk = alloc_disk(0);
> +	if (!head->disk)
> +		goto out_cleanup_queue;
> +	head->disk->fops = &nvme_subsys_ops;
> +	head->disk->private_data = head;
> +	head->disk->queue = q;
> +	head->disk->flags = GENHD_FL_EXT_DEVT;
> +	sprintf(head->disk->disk_name, "nvm-sub%dn%d",
> +			ctrl->subsys->instance, nsid);

Did we end up in agreement on nvmsXnY?


> @@ -2669,8 +2905,10 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>   	}
>   
>   	mutex_lock(&ns->ctrl->subsys->lock);
> -	if (head)
> +	if (head) {
> +		rcu_assign_pointer(head->current_path, NULL);

Doesn't this assignment need to happen only when the ns *is* the
current_path?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/9] nvme: track subsystems
  2017-10-11 11:57     ` Sagi Grimberg
@ 2017-10-18 10:12       ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-10-18 10:12 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-nvme, linux-block

>>   static void nvme_free_ctrl(struct kref *kref)
>>   {
>>   	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
>> +	struct nvme_subsystem *subsys = ctrl->subsys;
>>     	put_device(ctrl->device);
>>   	nvme_release_instance(ctrl);
>>   	ida_destroy(&ctrl->ns_ida);
>>   +	if (subsys) {
>> +		mutex_lock(&subsys->lock);
>> +		list_del(&ctrl->subsys_entry);
>> +		mutex_unlock(&subsys->lock);
>> +	}
>> +
>
> When can this happen? can a controller not have a subsys?

When we fail early enough to not have set up the subsystem yet.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 7/9] nvme: track shared namespaces
  2017-10-11 12:04     ` Sagi Grimberg
@ 2017-10-18 10:15       ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2017-10-18 10:15 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-nvme, linux-block

On Wed, Oct 11, 2017 at 03:04:14PM +0300, Sagi Grimberg wrote:
>
>> +/*
>> + * Anchor structure for namespaces.  There is one for each namespace in a
>> + * NVMe subsystem that any of our controllers can see, and the namespace
>> + * structure for each controller is chained of it.  For private namespaces
>> + * there is a 1:1 relation to our namespace structures, that is ->list
>> + * only ever has a single entry for private namespaces.
>> + */
>> +struct nvme_ns_head {
>> +	struct list_head	list;
>
> Didn't we agree to call this list_head siblings?

Don't think I did.

>>     	struct nvme_ctrl *ctrl;
>>   	struct request_queue *queue;
>>   	struct gendisk *disk;
>> +	struct list_head siblings;
>
> And this list_head list (or entry, or sibling, or sibling_entry)?

I think siblings fits perfectly fine.  It is a lot more descriptive
than entry..

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/9] nvme: track subsystems
  2017-10-18 10:12       ` Christoph Hellwig
@ 2017-10-18 11:16         ` Sagi Grimberg
  -1 siblings, 0 replies; 74+ messages in thread
From: Sagi Grimberg @ 2017-10-18 11:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Keith Busch, linux-nvme, linux-block


>>>    static void nvme_free_ctrl(struct kref *kref)
>>>    {
>>>    	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
>>> +	struct nvme_subsystem *subsys = ctrl->subsys;
>>>      	put_device(ctrl->device);
>>>    	nvme_release_instance(ctrl);
>>>    	ida_destroy(&ctrl->ns_ida);
>>>    +	if (subsys) {
>>> +		mutex_lock(&subsys->lock);
>>> +		list_del(&ctrl->subsys_entry);
>>> +		mutex_unlock(&subsys->lock);
>>> +	}
>>> +
>>
>> When can this happen? can a controller not have a subsys?
> 
> When we fail early enough to not have set up the subsystem yet.

I think that the fact the nvme_free_ctrl cleans up after everything
that is setup in a random place is not a very good approach (just sent
a patchset that moves the uninit_ctrl cleanup back to its natural
place).

However here, given that each individual driver calls nvme_init_identify
on its own as part of its private admin configuration I don't see how
we can avoid it for now...

Hopefully we can sort this out when we centralize some more
setup/teardown logic and this subsys detach would move to where
we teardown the admin_queue (as it was added when we configured it)...

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2017-10-18 11:16 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-25 13:40 nvme multipath support V3 Christoph Hellwig
2017-09-25 13:40 ` [PATCH 1/9] block: move REQ_NOWAIT Christoph Hellwig
2017-10-11 11:51   ` Sagi Grimberg
2017-09-25 13:40 ` [PATCH 2/9] block: add REQ_DRV bit Christoph Hellwig
2017-10-11 11:51   ` Sagi Grimberg
2017-09-25 13:40 ` [PATCH 3/9] block: provide a direct_make_request helper Christoph Hellwig
2017-10-11 11:54   ` Sagi Grimberg
2017-09-25 13:40 ` [PATCH 4/9] block: add a blk_steal_bios helper Christoph Hellwig
2017-09-25 13:40 ` [PATCH 5/9] nvme: track subsystems Christoph Hellwig
2017-09-25 16:07   ` Keith Busch
2017-10-11 11:57   ` Sagi Grimberg
2017-10-18 10:12     ` Christoph Hellwig
2017-10-18 11:16       ` Sagi Grimberg
2017-09-25 13:40 ` [PATCH 6/9] nvme: introduce a nvme_ns_ids structure Christoph Hellwig
2017-09-25 16:08   ` Keith Busch
2017-10-11 11:59   ` Sagi Grimberg
2017-09-25 13:40 ` [PATCH 7/9] nvme: track shared namespaces Christoph Hellwig
2017-09-25 16:11   ` Keith Busch
2017-09-25 16:19     ` Christoph Hellwig
2017-10-11 12:04   ` Sagi Grimberg
2017-10-18 10:15     ` Christoph Hellwig
2017-09-25 13:40 ` [PATCH 8/9] nvme: implement multipath access to nvme subsystems Christoph Hellwig
2017-09-25 13:47   ` Hannes Reinecke
2017-09-25 13:50     ` Christoph Hellwig
2017-09-25 14:05       ` Hannes Reinecke
2017-09-25 14:45         ` Keith Busch
2017-09-25 15:07         ` Christoph Hellwig
2017-09-25 16:18   ` Keith Busch
2017-09-28 15:53   ` Keith Busch
2017-09-29 14:21     ` Tony Yang
2017-10-01  8:07       ` Christoph Hellwig
2017-09-30 19:37     ` Johannes Thumshirn
2017-10-01  8:07       ` Christoph Hellwig
2017-10-02  6:19       ` Hannes Reinecke
2017-10-11 12:24   ` Sagi Grimberg
2017-09-25 13:40 ` [PATCH 9/9] nvme: also expose the namespace identification sysfs files for mpath nodes Christoph Hellwig
2017-09-25 16:19   ` Keith Busch
