* nvme multipath support V5
@ 2017-10-23 14:51 ` Christoph Hellwig
  0 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

Hi all,

this series adds support for multipathing, that is, accessing nvme
namespaces through multiple controllers, to the nvme core driver.

It is a very thin and efficient implementation that relies on
close cooperation with other bits of the nvme driver and a few
small and simple block helpers.

Compared to dm-multipath the important differences are how management
of the paths is done, and how the I/O path works.

Management of the paths is fully integrated into the nvme driver:
for each newly found nvme controller we check if there are other
controllers that refer to the same subsystem, and if so we link them
up in the nvme driver.  Then for each namespace found we check if
the namespace id and identifiers match, to determine whether multiple
controllers refer to the same namespace.  For now path availability
is based entirely on the controller status, which at least for
fabrics is continuously updated based on the mandatory keep alive
timer.  Once the Asymmetric Namespace Access (ANA) proposal passes
in NVMe we will also get per-namespace states in addition to that,
but for now any details of that remain confidential to NVMe members.
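
For illustration, the namespace matching described above boils down to
comparing all reported identifiers.  A rough sketch (struct nvme_ns_ids
is introduced by this series per the V1 changelog; the exact layout and
helper name here are assumptions):

#include <linux/string.h>
#include <linux/uuid.h>

struct nvme_ns_ids {
	u8	eui64[8];	/* IEEE Extended Unique Identifier */
	u8	nguid[16];	/* Namespace Globally Unique Identifier */
	uuid_t	uuid;		/* RFC 4122 namespace UUID */
};

/* Two paths refer to the same namespace if all identifiers agree. */
static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
{
	return uuid_equal(&a->uuid, &b->uuid) &&
		memcmp(a->nguid, b->nguid, sizeof(a->nguid)) == 0 &&
		memcmp(a->eui64, b->eui64, sizeof(a->eui64)) == 0;
}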

The I/O path is very different from the existing multipath drivers,
which is enabled by the fact that NVMe (unlike SCSI) does not support
partial completions - a controller either completes a whole command
or fails it, but never completes only parts of it.  Because of that
there is no need to clone bios or requests - the I/O path simply
redirects the I/O to a suitable path.  For successful commands
multipath is not in the completion stack at all.  For failed commands
we decide if the error could be a path failure, and if so remove
the bios from the request structure and requeue them before completing
the request.  Altogether this means there is no performance
degradation compared to normal nvme operation when using the multipath
device node (at least not until I find a dual-ported, DRAM-backed
device :))
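
To make the failover concrete, here is a rough sketch of the
completion-side handling described above (the function name and the
requeue_list/requeue_work fields are assumptions for illustration;
only blk_steal_bios is actually added by this series):

/* Sketch only: move bios off a failed request and resubmit them. */
static void nvme_failover_req(struct request *req)
{
	struct nvme_ns *ns = req->q->queuedata;
	unsigned long flags;

	/* Detach the uncompleted bios from the failed request. */
	spin_lock_irqsave(&ns->head->requeue_lock, flags);
	blk_steal_bios(&ns->head->requeue_list, req);
	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);

	/* Complete the now bio-less request without surfacing the error. */
	blk_mq_end_request(req, BLK_STS_OK);

	/* Resubmit the bios; path selection will pick another path. */
	kblockd_schedule_work(&ns->head->requeue_work);
}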

A git tree is available at:

   git://git.infradead.org/users/hch/block.git nvme-mpath

gitweb:

   http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-mpath

Changes since V4:
  - add a refcount to release the device in struct nvme_subsystem
  - use the instance to name the nvme_subsystems in sysfs
  - remove a NULL check before nvme_put_ns_head
  - take a ns_head reference in ->open
  - improve open protection for GENHD_FL_HIDDEN
  - add poll support for the mpath device

Changes since V3:
  - new block layer support for hidden gendisks
  - a couple new patches to refactor device handling before the
    actual multipath support
  - don't expose per-controller block device nodes
  - use /dev/nvmeXnZ as the device nodes for the whole subsystem.
  - expose subsystems in sysfs (Hannes Reinecke)
  - fix a subsystem leak when duplicate NQNs are found
  - fix up some names
  - don't clear current_path if freeing a different namespace

Changes since V2:
  - don't create duplicate subsystems on reset (Keith Busch)
  - free requests properly when failing over in I/O completion (Keith Busch)
  - new devices names: /dev/nvm-sub%dn%d
  - expose the namespace identification sysfs files for the mpath nodes

Changes since V1:
  - introduce new nvme_ns_ids structure to clean up identifier handling
  - generic_make_request_fast is now named direct_make_request and calls
    generic_make_request_checks
  - reset bi_disk on resubmission
  - create sysfs links between the existing nvme namespace block devices and
    the new shared mpath device
  - temporarily added the timeout patches from James, this should go into
    nvme-4.14, though

* [PATCH 01/17] block: move REQ_NOWAIT
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

This generic flag should be placed before the operation-specific
REQ_NOUNMAP bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 include/linux/blk_types.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a2d2aa709cef..acc2f3cdc2fc 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -224,11 +224,11 @@ enum req_flag_bits {
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
+	__REQ_NOWAIT,           /* Don't wait if request will block */
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
-	__REQ_NOWAIT,           /* Don't wait if request will block */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -245,9 +245,9 @@ enum req_flag_bits {
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
+#define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
-#define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.14.2

* [PATCH 02/17] block: add REQ_DRV bit
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

Set aside a bit in the request/bio flags for driver use.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 include/linux/blk_types.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index acc2f3cdc2fc..7ec2ed097a8a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -229,6 +229,9 @@ enum req_flag_bits {
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
+	/* for driver use */
+	__REQ_DRV,
+
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -249,6 +252,8 @@ enum req_flag_bits {
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 
+#define REQ_DRV			(1ULL << __REQ_DRV)
+
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 
-- 
2.14.2
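
As a hedged usage sketch (not part of this patch): a driver aliases the
reserved bit to a private flag, along the lines of what the NVMe
multipath code can now do (the names below are illustrative):

/* Driver-private request flag, carved out of the reserved bit. */
#define REQ_NVME_MPATH		REQ_DRV

static inline void nvme_mpath_mark_bio(struct bio *bio)
{
	/* Tag the bio so completion can tell it came via the mpath node. */
	bio->bi_opf |= REQ_NVME_MPATH;
}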

* [PATCH 03/17] block: provide a direct_make_request helper
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues, and it does not provide any recursion prevention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/blk-core.c       | 34 ++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |  1 +
 2 files changed, 35 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14f7674fa0b1..b8c80f39f5fe 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2241,6 +2241,40 @@ blk_qc_t generic_make_request(struct bio *bio)
 }
 EXPORT_SYMBOL(generic_make_request);
 
+/**
+ * direct_make_request - hand a buffer directly to its device driver for I/O
+ * @bio:  The bio describing the location in memory and on the device.
+ *
+ * This function behaves like generic_make_request(), but does not protect
+ * against recursion.  Must only be used if the called driver is known
+ * to not call generic_make_request (or direct_make_request) again from
+ * its make_request function.  (Calling direct_make_request again from
+ * a workqueue is perfectly fine as that doesn't recurse).
+ */
+blk_qc_t direct_make_request(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	bool nowait = bio->bi_opf & REQ_NOWAIT;
+	blk_qc_t ret;
+
+	if (!generic_make_request_checks(bio))
+		return BLK_QC_T_NONE;
+
+	if (unlikely(blk_queue_enter(q, nowait))) {
+		if (nowait && !blk_queue_dying(q))
+			bio->bi_status = BLK_STS_AGAIN;
+		else
+			bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+		return BLK_QC_T_NONE;
+	}
+
+	ret = q->make_request_fn(q, bio);
+	blk_queue_exit(q);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(direct_make_request);
+
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The &struct bio which describes the I/O
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 02fa42d24b52..780f01db5899 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -936,6 +936,7 @@ do {								\
 extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
 extern blk_qc_t generic_make_request(struct bio *bio);
+extern blk_qc_t direct_make_request(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 extern void blk_put_request(struct request *);
-- 
2.14.2
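
A hedged sketch of the intended caller: a stacking make_request function
that retargets each bio at the chosen lower device and hands it off
directly (nvme_find_path and struct nvme_ns_head are assumptions here):

static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
		struct bio *bio)
{
	struct nvme_ns_head *head = q->queuedata;
	struct nvme_ns *ns = nvme_find_path(head);

	if (likely(ns)) {
		/*
		 * Retarget the bio at the lower queue, then submit it
		 * without the recursion protection - safe because the
		 * lower driver won't recurse back into make_request.
		 */
		bio->bi_disk = ns->disk;
		return direct_make_request(bio);
	}

	/* No usable path: fail the I/O. */
	bio->bi_status = BLK_STS_IOERR;
	bio_endio(bio);
	return BLK_QC_T_NONE;
}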

* [PATCH 04/17] block: add a blk_steal_bios helper
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

This helper allows stealing the uncompleted bios from a request so
that they can be reissued on another path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/blk-core.c       | 20 ++++++++++++++++++++
 include/linux/blkdev.h |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b8c80f39f5fe..e804529e65a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2767,6 +2767,26 @@ struct request *blk_fetch_request(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_fetch_request);
 
+/*
+ * Steal bios from a request.  The request must not have been partially
+ * completed before.
+ */
+void blk_steal_bios(struct bio_list *list, struct request *rq)
+{
+	if (rq->bio) {
+		if (list->tail)
+			list->tail->bi_next = rq->bio;
+		else
+			list->head = rq->bio;
+		list->tail = rq->biotail;
+	}
+
+	rq->bio = NULL;
+	rq->biotail = NULL;
+	rq->__data_len = 0;
+}
+EXPORT_SYMBOL_GPL(blk_steal_bios);
+
 /**
  * blk_update_request - Special helper function for request stacking drivers
  * @req:      the request being processed
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 780f01db5899..45c63764a14e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1110,6 +1110,8 @@ extern struct request *blk_peek_request(struct request_queue *q);
 extern void blk_start_request(struct request *rq);
 extern struct request *blk_fetch_request(struct request_queue *q);
 
+void blk_steal_bios(struct bio_list *list, struct request *rq);
+
 /*
  * Request completion related functions.
  *
-- 
2.14.2
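
For context, a hedged sketch of the consumer side: a work item that
resubmits the stolen bios on the multipath disk (the field and function
names are assumptions, matching the failover sketch in the cover
letter):

static void nvme_requeue_work(struct work_struct *work)
{
	struct nvme_ns_head *head =
		container_of(work, struct nvme_ns_head, requeue_work);
	struct bio *bio, *next;

	/* Atomically grab everything that failover queued up. */
	spin_lock_irq(&head->requeue_lock);
	next = bio_list_get(&head->requeue_list);
	spin_unlock_irq(&head->requeue_lock);

	while ((bio = next) != NULL) {
		next = bio->bi_next;
		bio->bi_next = NULL;
		/* Point the bio back at the multipath disk and resubmit. */
		bio->bi_disk = head->disk;
		generic_make_request(bio);
	}
}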

* [PATCH 05/17] block: don't look at the struct device dev_t in disk_devt
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

The hidden gendisks introduced in the next patch need to keep the devt
field in their struct device empty so that udev won't try to create
block device nodes for them.  To support that, rewrite disk_devt to
look at the major and first_minor fields in the gendisk itself instead
of looking into the struct device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/genhd.c         | 4 ----
 include/linux/genhd.h | 2 +-
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index dd305c65ffb0..1174d24e405e 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -649,10 +649,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 		return;
 	}
 	disk_to_dev(disk)->devt = devt;
-
-	/* ->major and ->first_minor aren't supposed to be
-	 * dereferenced from here on, but set them just in case.
-	 */
 	disk->major = MAJOR(devt);
 	disk->first_minor = MINOR(devt);
 
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index ea652bfcd675..5c0ed5db33c2 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -234,7 +234,7 @@ static inline bool disk_part_scan_enabled(struct gendisk *disk)
 
 static inline dev_t disk_devt(struct gendisk *disk)
 {
-	return disk_to_dev(disk)->devt;
+	return MKDEV(disk->major, disk->first_minor);
 }
 
 static inline dev_t part_devt(struct hd_struct *part)
-- 
2.14.2

* [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as a
user-facing block device.  This will be useful for the NVMe multipath
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c         | 57 +++++++++++++++++++++++++++++++++++----------------
 include/linux/genhd.h |  1 +
 2 files changed, 40 insertions(+), 18 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 1174d24e405e..11a41cca3475 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -585,6 +585,11 @@ static void register_disk(struct device *parent, struct gendisk *disk)
 	 */
 	pm_runtime_set_memalloc_noio(ddev, true);
 
+	if (disk->flags & GENHD_FL_HIDDEN) {
+		dev_set_uevent_suppress(ddev, 0);
+		return;
+	}
+
 	disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
 	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
 
@@ -616,6 +621,11 @@ static void register_disk(struct device *parent, struct gendisk *disk)
 	while ((part = disk_part_iter_next(&piter)))
 		kobject_uevent(&part_to_dev(part)->kobj, KOBJ_ADD);
 	disk_part_iter_exit(&piter);
+
+	err = sysfs_create_link(&ddev->kobj,
+				&disk->queue->backing_dev_info->dev->kobj,
+				"bdi");
+	WARN_ON(err);
 }
 
 /**
@@ -630,7 +640,6 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  */
 void device_add_disk(struct device *parent, struct gendisk *disk)
 {
-	struct backing_dev_info *bdi;
 	dev_t devt;
 	int retval;
 
@@ -639,7 +648,8 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 	 * parameters make sense.
 	 */
 	WARN_ON(disk->minors && !(disk->major || disk->first_minor));
-	WARN_ON(!disk->minors && !(disk->flags & GENHD_FL_EXT_DEVT));
+	WARN_ON(!disk->minors &&
+		!(disk->flags & (GENHD_FL_EXT_DEVT | GENHD_FL_HIDDEN)));
 
 	disk->flags |= GENHD_FL_UP;
 
@@ -648,18 +658,26 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 		WARN_ON(1);
 		return;
 	}
-	disk_to_dev(disk)->devt = devt;
 	disk->major = MAJOR(devt);
 	disk->first_minor = MINOR(devt);
 
 	disk_alloc_events(disk);
 
-	/* Register BDI before referencing it from bdev */
-	bdi = disk->queue->backing_dev_info;
-	bdi_register_owner(bdi, disk_to_dev(disk));
-
-	blk_register_region(disk_devt(disk), disk->minors, NULL,
-			    exact_match, exact_lock, disk);
+	if (disk->flags & GENHD_FL_HIDDEN) {
+		/*
+		 * Don't let hidden disks show up in /proc/partitions,
+		 * and don't bother scanning for partitions either.
+		 */
+		disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
+		disk->flags |= GENHD_FL_NO_PART_SCAN;
+	} else {
+		/* Register BDI before referencing it from bdev */
+		disk_to_dev(disk)->devt = devt;
+		bdi_register_owner(disk->queue->backing_dev_info,
+				disk_to_dev(disk));
+		blk_register_region(disk_devt(disk), disk->minors, NULL,
+				    exact_match, exact_lock, disk);
+	}
 	register_disk(parent, disk);
 	blk_register_queue(disk);
 
@@ -669,10 +687,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 	 */
 	WARN_ON_ONCE(!blk_get_queue(disk->queue));
 
-	retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
-				   "bdi");
-	WARN_ON(retval);
-
 	disk_add_events(disk);
 	blk_integrity_add(disk);
 }
@@ -701,7 +715,8 @@ void del_gendisk(struct gendisk *disk)
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
-	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
+	if (!(disk->flags & GENHD_FL_HIDDEN))
+		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
 	if (disk->queue) {
 		/*
 		 * Unregister bdi before releasing device numbers (as they can
@@ -712,13 +727,15 @@ void del_gendisk(struct gendisk *disk)
 	} else {
 		WARN_ON(1);
 	}
-	blk_unregister_region(disk_devt(disk), disk->minors);
+
+	if (!(disk->flags & GENHD_FL_HIDDEN)) {
+		blk_unregister_region(disk_devt(disk), disk->minors);
+		kobject_put(disk->part0.holder_dir);
+		kobject_put(disk->slave_dir);
+	}
 
 	part_stat_set_all(&disk->part0, 0);
 	disk->part0.stamp = 0;
-
-	kobject_put(disk->part0.holder_dir);
-	kobject_put(disk->slave_dir);
 	if (!sysfs_deprecated)
 		sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
@@ -781,6 +798,10 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
 		spin_unlock_bh(&ext_devt_lock);
 	}
 
+	if (unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+		put_disk(disk);
+		disk = NULL;
+	}
 	return disk;
 }
 EXPORT_SYMBOL(get_gendisk);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5c0ed5db33c2..93aae3476f58 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -140,6 +140,7 @@ struct hd_struct {
 #define GENHD_FL_NATIVE_CAPACITY		128
 #define GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE	256
 #define GENHD_FL_NO_PART_SCAN			512
+#define GENHD_FL_HIDDEN				1024
 
 enum {
 	DISK_EVENT_MEDIA_CHANGE			= 1 << 0, /* media changed */
-- 
2.14.2
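
A hedged usage sketch: allocating a gendisk that can take in-kernel I/O
but never appears in /dev (all names below are illustrative):

static struct gendisk *alloc_hidden_disk(struct request_queue *q,
		struct device *parent, int instance)
{
	/*
	 * No minors: GENHD_FL_HIDDEN now satisfies device_add_disk's
	 * sanity check just like GENHD_FL_EXT_DEVT would.
	 */
	struct gendisk *disk = alloc_disk(0);

	if (!disk)
		return NULL;

	disk->flags |= GENHD_FL_HIDDEN;		/* no devt, no udev node */
	disk->queue = q;
	snprintf(disk->disk_name, sizeof(disk->disk_name),
		 "hidden%d", instance);
	device_add_disk(parent, disk);
	return disk;
}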

* [PATCH 07/17] block: add a poll_fn callback to struct request_queue
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

That way we can also poll non-blk-mq queues.  This is mostly needed for
the NVMe multipath code, but could also be useful elsewhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c             | 11 +++++++++++
 block/blk-mq.c               | 14 +++++---------
 drivers/nvme/target/io-cmd.c |  2 +-
 fs/block_dev.c               |  4 ++--
 fs/direct-io.c               |  2 +-
 fs/iomap.c                   |  2 +-
 include/linux/blkdev.h       |  4 +++-
 mm/page_io.c                 |  2 +-
 8 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e804529e65a5..8e7e12e5ffa2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2319,6 +2319,17 @@ blk_qc_t submit_bio(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio);
 
+bool blk_poll(struct request_queue *q, blk_qc_t cookie)
+{
+	if (!q->poll_fn || !blk_qc_t_valid(cookie))
+		return false;
+
+	if (current->plug)
+		blk_flush_plug_list(current->plug, false);
+	return q->poll_fn(q, cookie);
+}
+EXPORT_SYMBOL_GPL(blk_poll);
+
 /**
  * blk_cloned_rq_check_limits - Helper function to check a cloned request
  *                              for new the queue limits
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7f01d69879d6..10c99cf6fd71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -37,6 +37,7 @@
 #include "blk-wbt.h"
 #include "blk-mq-sched.h"
 
+static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
@@ -2401,6 +2402,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	spin_lock_init(&q->requeue_lock);
 
 	blk_queue_make_request(q, blk_mq_make_request);
+	if (q->mq_ops->poll)
+		q->poll_fn = blk_mq_poll;
 
 	/*
 	 * Do this after blk_queue_make_request() overrides it...
@@ -2860,20 +2863,14 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	return false;
 }
 
-bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
+static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 {
 	struct blk_mq_hw_ctx *hctx;
-	struct blk_plug *plug;
 	struct request *rq;
 
-	if (!q->mq_ops || !q->mq_ops->poll || !blk_qc_t_valid(cookie) ||
-	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		return false;
 
-	plug = current->plug;
-	if (plug)
-		blk_flush_plug_list(plug, false);
-
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
 	if (!blk_qc_t_is_internal(cookie))
 		rq = blk_mq_tag_to_rq(hctx->tags, blk_qc_t_to_tag(cookie));
@@ -2891,7 +2888,6 @@ bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 
 	return __blk_mq_poll(hctx, rq);
 }
-EXPORT_SYMBOL_GPL(blk_mq_poll);
 
 static int __init blk_mq_init(void)
 {
diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
index 0d4c23dc4532..db632818777d 100644
--- a/drivers/nvme/target/io-cmd.c
+++ b/drivers/nvme/target/io-cmd.c
@@ -94,7 +94,7 @@ static void nvmet_execute_rw(struct nvmet_req *req)
 
 	cookie = submit_bio(bio);
 
-	blk_mq_poll(bdev_get_queue(req->ns->bdev), cookie);
+	blk_poll(bdev_get_queue(req->ns->bdev), cookie);
 }
 
 static void nvmet_execute_flush(struct nvmet_req *req)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088ffc05c..49a55246ba50 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -249,7 +249,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		if (!READ_ONCE(bio.bi_private))
 			break;
 		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_mq_poll(bdev_get_queue(bdev), qc))
+		    !blk_poll(bdev_get_queue(bdev), qc))
 			io_schedule();
 	}
 	__set_current_state(TASK_RUNNING);
@@ -414,7 +414,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			break;
 
 		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_mq_poll(bdev_get_queue(bdev), qc))
+		    !blk_poll(bdev_get_queue(bdev), qc))
 			io_schedule();
 	}
 	__set_current_state(TASK_RUNNING);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 62cf812ed0e5..d2bc339cb1e9 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -486,7 +486,7 @@ static struct bio *dio_await_one(struct dio *dio)
 		dio->waiter = current;
 		spin_unlock_irqrestore(&dio->bio_lock, flags);
 		if (!(dio->iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_mq_poll(dio->bio_disk->queue, dio->bio_cookie))
+		    !blk_poll(dio->bio_disk->queue, dio->bio_cookie))
 			io_schedule();
 		/* wake up sets us TASK_RUNNING */
 		spin_lock_irqsave(&dio->bio_lock, flags);
diff --git a/fs/iomap.c b/fs/iomap.c
index 8194d30bdca0..4241bac905b1 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1049,7 +1049,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 			if (!(iocb->ki_flags & IOCB_HIPRI) ||
 			    !dio->submit.last_queue ||
-			    !blk_mq_poll(dio->submit.last_queue,
+			    !blk_poll(dio->submit.last_queue,
 					 dio->submit.cookie))
 				io_schedule();
 		}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 45c63764a14e..a18ea9b9b8f7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -266,6 +266,7 @@ struct blk_queue_ctx;
 
 typedef void (request_fn_proc) (struct request_queue *q);
 typedef blk_qc_t (make_request_fn) (struct request_queue *q, struct bio *bio);
+typedef bool (poll_q_fn) (struct request_queue *q, blk_qc_t);
 typedef int (prep_rq_fn) (struct request_queue *, struct request *);
 typedef void (unprep_rq_fn) (struct request_queue *, struct request *);
 
@@ -408,6 +409,7 @@ struct request_queue {
 
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
+	poll_q_fn		*poll_fn;
 	prep_rq_fn		*prep_rq_fn;
 	unprep_rq_fn		*unprep_rq_fn;
 	softirq_done_fn		*softirq_done_fn;
@@ -991,7 +993,7 @@ extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
 
-bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
+bool blk_poll(struct request_queue *q, blk_qc_t cookie);
 
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
diff --git a/mm/page_io.c b/mm/page_io.c
index 21502d341a67..ff04de630c46 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -407,7 +407,7 @@ int swap_readpage(struct page *page, bool do_poll)
 		if (!READ_ONCE(bio->bi_private))
 			break;
 
-		if (!blk_mq_poll(disk->queue, qc))
+		if (!blk_poll(disk->queue, qc))
 			break;
 	}
 	__set_current_state(TASK_RUNNING);
-- 
2.14.2
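
A hedged sketch of how the converted callers use the new entry point:
submit a bio, then poll the queue until the end_io handler signals
completion (names are illustrative, modeled on the
__blkdev_direct_IO_simple pattern):

static bool poll_done;

static void poll_end_io(struct bio *bio)
{
	WRITE_ONCE(poll_done, true);
}

static void submit_and_poll(struct block_device *bdev, struct bio *bio)
{
	blk_qc_t cookie;

	bio->bi_end_io = poll_end_io;
	cookie = submit_bio(bio);

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (READ_ONCE(poll_done))
			break;
		/* Poll if the queue supports it, otherwise just sleep. */
		if (!blk_poll(bdev_get_queue(bdev), cookie))
			io_schedule();
	}
	__set_current_state(TASK_RUNNING);
}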

* [PATCH 08/17] nvme: use kref_get_unless_zero in nvme_find_get_ns
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

For kref_get_unless_zero to protect against lookup vs free races we need
to use it in all places where we aren't guaranteed to already hold a
reference.  There is no such guarantee in nvme_find_get_ns, so switch to
kref_get_unless_zero in this function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7fae42d595d5..1d931deac83b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2290,7 +2290,8 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		if (ns->ns_id == nsid) {
-			kref_get(&ns->kref);
+			if (!kref_get_unless_zero(&ns->kref))
+				continue;
 			ret = ns;
 			break;
 		}
-- 
2.14.2
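
For reference, a hedged sketch of the general pattern this change
enforces (generic names, not from the patch): under the list lock, a
lookup may only take a reference if the count has not already dropped
to zero:

struct obj {
	struct list_head node;
	unsigned int id;
	struct kref ref;
};

/* Caller must hold the lock protecting @list. */
static struct obj *obj_lookup(struct list_head *list, unsigned int id)
{
	struct obj *o;

	list_for_each_entry(o, list, node) {
		if (o->id != id)
			continue;
		if (!kref_get_unless_zero(&o->ref))
			continue;	/* lost the race with the final put */
		return o;
	}
	return NULL;
}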

* [PATCH 09/17] nvme: simplify nvme_open
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

Now that we are protected against lookup vs free races for the namespace
by using kref_get_unless_zero we don't need the hack of NULLing out the
disk private data during removal.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/nvme/host/core.c | 40 ++++++++++------------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1d931deac83b..9f8ae15c9fe8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -253,12 +253,6 @@ static void nvme_free_ns(struct kref *kref)
 	if (ns->ndev)
 		nvme_nvm_unregister(ns);
 
-	if (ns->disk) {
-		spin_lock(&dev_list_lock);
-		ns->disk->private_data = NULL;
-		spin_unlock(&dev_list_lock);
-	}
-
 	put_disk(ns->disk);
 	ida_simple_remove(&ns->ctrl->ns_ida, ns->instance);
 	nvme_put_ctrl(ns->ctrl);
@@ -270,29 +264,6 @@ static void nvme_put_ns(struct nvme_ns *ns)
 	kref_put(&ns->kref, nvme_free_ns);
 }
 
-static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
-{
-	struct nvme_ns *ns;
-
-	spin_lock(&dev_list_lock);
-	ns = disk->private_data;
-	if (ns) {
-		if (!kref_get_unless_zero(&ns->kref))
-			goto fail;
-		if (!try_module_get(ns->ctrl->ops->module))
-			goto fail_put_ns;
-	}
-	spin_unlock(&dev_list_lock);
-
-	return ns;
-
-fail_put_ns:
-	kref_put(&ns->kref, nvme_free_ns);
-fail:
-	spin_unlock(&dev_list_lock);
-	return NULL;
-}
-
 struct request *nvme_alloc_request(struct request_queue *q,
 		struct nvme_command *cmd, unsigned int flags, int qid)
 {
@@ -1056,7 +1027,16 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 
 static int nvme_open(struct block_device *bdev, fmode_t mode)
 {
-	return nvme_get_ns_from_disk(bdev->bd_disk) ? 0 : -ENXIO;
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+
+	if (!kref_get_unless_zero(&ns->kref))
+		return -ENXIO;
+	if (!try_module_get(ns->ctrl->ops->module)) {
+		kref_put(&ns->kref, nvme_free_ns);
+		return -ENXIO;
+	}
+
+	return 0;
 }
 
 static void nvme_release(struct gendisk *disk, fmode_t mode)
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 10/17] nvme: switch controller refcounting to use struct device
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

Instead of allocating a separate struct device for the character device
handle, embed it into struct nvme_ctrl and use it for the main controller
refcounting.  This removes the double refcounting and gets us an automatic
reference for the character device operations.  We keep ctrl->device as a
pointer for now to avoid changing printks all over, but in the future we
could look into message printing helpers that take a controller structure,
similar to what other subsystems do.

Note that the delete_ctrl operation now always enters with a reference
already held (either through sysfs due to this change, or because every
open file on the /dev/nvme-fabrics node holds a reference), so we don't
need the unless_zero variant there.

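The embedded-device pattern used here looks roughly like this (a minimal
sketch against the driver core API; struct foo and the foo_* names are
placeholders, not nvme code):

#include <linux/device.h>
#include <linux/slab.h>

struct foo {
	struct device dev;	/* embedded: its kobject owns the lifetime */
};

static void foo_release(struct device *dev)
{
	/* runs exactly once, after the final put_device */
	kfree(container_of(dev, struct foo, dev));
}

static struct foo *foo_create(struct class *class, struct device *parent)
{
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;
	device_initialize(&f->dev);	/* refcount is 1 from here on */
	f->dev.class = class;
	f->dev.parent = parent;
	f->dev.release = foo_release;
	if (dev_set_name(&f->dev, "foo%d", 0) || device_add(&f->dev)) {
		put_device(&f->dev);	/* error paths end in foo_release */
		return NULL;
	}
	return f;
}

get_device/put_device then stand in for kref_get/kref_put, which is
exactly what the nvme_get_ctrl/nvme_put_ctrl helpers below wrap.
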
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c   | 43 ++++++++++++++++++++++---------------------
 drivers/nvme/host/fc.c     |  8 ++------
 drivers/nvme/host/nvme.h   | 12 +++++++++++-
 drivers/nvme/host/pci.c    |  2 +-
 drivers/nvme/host/rdma.c   |  5 ++---
 drivers/nvme/target/loop.c |  2 +-
 6 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9f8ae15c9fe8..3a97daa163f6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1915,7 +1915,7 @@ static int nvme_dev_open(struct inode *inode, struct file *file)
 			ret = -EWOULDBLOCK;
 			break;
 		}
-		if (!kref_get_unless_zero(&ctrl->kref))
+		if (!kobject_get_unless_zero(&ctrl->device->kobj))
 			break;
 		file->private_data = ctrl;
 		ret = 0;
@@ -2374,7 +2374,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	list_add_tail(&ns->list, &ctrl->namespaces);
 	mutex_unlock(&ctrl->namespaces_mutex);
 
-	kref_get(&ctrl->kref);
+	nvme_get_ctrl(ctrl);
 
 	kfree(id);
 
@@ -2703,7 +2703,7 @@ EXPORT_SYMBOL_GPL(nvme_start_ctrl);
 
 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
 {
-	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
+	device_del(ctrl->device);
 
 	spin_lock(&dev_list_lock);
 	list_del(&ctrl->node);
@@ -2711,23 +2711,17 @@ void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_uninit_ctrl);
 
-static void nvme_free_ctrl(struct kref *kref)
+static void nvme_free_ctrl(struct device *dev)
 {
-	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
+	struct nvme_ctrl *ctrl =
+		container_of(dev, struct nvme_ctrl, ctrl_device);
 
-	put_device(ctrl->device);
 	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
 	ida_destroy(&ctrl->ns_ida);
 
 	ctrl->ops->free_ctrl(ctrl);
 }
 
-void nvme_put_ctrl(struct nvme_ctrl *ctrl)
-{
-	kref_put(&ctrl->kref, nvme_free_ctrl);
-}
-EXPORT_SYMBOL_GPL(nvme_put_ctrl);
-
 /*
  * Initialize a NVMe controller structures.  This needs to be called during
  * earliest initialization so that we have the initialized structured around
@@ -2742,7 +2736,6 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	spin_lock_init(&ctrl->lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
 	mutex_init(&ctrl->namespaces_mutex);
-	kref_init(&ctrl->kref);
 	ctrl->dev = dev;
 	ctrl->ops = ops;
 	ctrl->quirks = quirks;
@@ -2755,15 +2748,21 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 		goto out;
 	ctrl->instance = ret;
 
-	ctrl->device = device_create_with_groups(nvme_class, ctrl->dev,
-				MKDEV(nvme_char_major, ctrl->instance),
-				ctrl, nvme_dev_attr_groups,
-				"nvme%d", ctrl->instance);
-	if (IS_ERR(ctrl->device)) {
-		ret = PTR_ERR(ctrl->device);
+	device_initialize(&ctrl->ctrl_device);
+	ctrl->device = &ctrl->ctrl_device;
+	ctrl->device->devt = MKDEV(nvme_char_major, ctrl->instance);
+	ctrl->device->class = nvme_class;
+	ctrl->device->parent = ctrl->dev;
+	ctrl->device->groups = nvme_dev_attr_groups;
+	ctrl->device->release = nvme_free_ctrl;
+	dev_set_drvdata(ctrl->device, ctrl);
+	ret = dev_set_name(ctrl->device, "nvme%d", ctrl->instance);
+	if (ret)
 		goto out_release_instance;
-	}
-	get_device(ctrl->device);
+	ret = device_add(ctrl->device);
+	if (ret)
+		goto out_free_name;
+
 	ida_init(&ctrl->ns_ida);
 
 	spin_lock(&dev_list_lock);
@@ -2779,6 +2778,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 		min(default_ps_max_latency_us, (unsigned long)S32_MAX));
 
 	return 0;
+out_free_name:
+	kfree_const(ctrl->device->kobj.name);
 out_release_instance:
 	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
 out:
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index c6c903f1b172..aa9aec6923bb 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2692,14 +2692,10 @@ nvme_fc_del_nvme_ctrl(struct nvme_ctrl *nctrl)
 	struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);
 	int ret;
 
-	if (!kref_get_unless_zero(&ctrl->ctrl.kref))
-		return -EBUSY;
-
+	nvme_get_ctrl(&ctrl->ctrl);
 	ret = __nvme_fc_del_ctrl(ctrl);
-
 	if (!ret)
 		flush_workqueue(nvme_wq);
-
 	nvme_put_ctrl(&ctrl->ctrl);
 
 	return ret;
@@ -2918,7 +2914,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 		return ERR_PTR(ret);
 	}
 
-	kref_get(&ctrl->ctrl.kref);
+	nvme_get_ctrl(&ctrl->ctrl);
 
 	dev_info(ctrl->ctrl.device,
 		"NVME-FC{%d}: new ctrl: NQN \"%s\"\n",
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cb9d93048f3d..ae60d8342e60 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -127,12 +127,12 @@ struct nvme_ctrl {
 	struct request_queue *admin_q;
 	struct request_queue *connect_q;
 	struct device *dev;
-	struct kref kref;
 	int instance;
 	struct blk_mq_tag_set *tagset;
 	struct blk_mq_tag_set *admin_tagset;
 	struct list_head namespaces;
 	struct mutex namespaces_mutex;
+	struct device ctrl_device;
 	struct device *device;	/* char device */
 	struct list_head node;
 	struct ida ns_ida;
@@ -279,6 +279,16 @@ static inline void nvme_end_request(struct request *req, __le16 status,
 	blk_mq_complete_request(req);
 }
 
+static inline void nvme_get_ctrl(struct nvme_ctrl *ctrl)
+{
+	get_device(ctrl->device);
+}
+
+static inline void nvme_put_ctrl(struct nvme_ctrl *ctrl)
+{
+	put_device(ctrl->device);
+}
+
 void nvme_complete_rq(struct request *req);
 void nvme_cancel_request(struct request *req, void *data, bool reserved);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 11e6fd9d0ba4..7735571ffc9a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2292,7 +2292,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 {
 	dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);
 
-	kref_get(&dev->ctrl.kref);
+	nvme_get_ctrl(&dev->ctrl);
 	nvme_dev_disable(dev, false);
 	if (!schedule_work(&dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4552744eff45..62b58f9b7d00 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1793,8 +1793,7 @@ static int nvme_rdma_del_ctrl(struct nvme_ctrl *nctrl)
 	 * Keep a reference until all work is flushed since
 	 * __nvme_rdma_del_ctrl can free the ctrl mem
 	 */
-	if (!kref_get_unless_zero(&ctrl->ctrl.kref))
-		return -EBUSY;
+	nvme_get_ctrl(&ctrl->ctrl);
 	ret = __nvme_rdma_del_ctrl(ctrl);
 	if (!ret)
 		flush_work(&ctrl->delete_work);
@@ -1955,7 +1954,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISpcs\n",
 		ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
 
-	kref_get(&ctrl->ctrl.kref);
+	nvme_get_ctrl(&ctrl->ctrl);
 
 	mutex_lock(&nvme_rdma_ctrl_mutex);
 	list_add_tail(&ctrl->list, &nvme_rdma_ctrl_list);
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index c56354e1e4c6..f83e925fe64a 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -642,7 +642,7 @@ static struct nvme_ctrl *nvme_loop_create_ctrl(struct device *dev,
 	dev_info(ctrl->ctrl.device,
 		 "new ctrl: \"%s\"\n", ctrl->ctrl.opts->subsysnqn);
 
-	kref_get(&ctrl->ctrl.kref);
+	nvme_get_ctrl(&ctrl->ctrl);
 
 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
 	WARN_ON_ONCE(!changed);
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 11/17] nvme: get rid of nvme_ctrl_list
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

Use the core chrdev code to set up the link between the character device
and the nvme controller.  This allows us to get rid of the global list
of all controllers, and also ensures that we have both a reference to
the controller and the transport module before the open method of the
character device is called.

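Condensed, the pattern is a cdev embedded next to the struct device and
registered in one step (foo_* names are placeholders; error handling
trimmed):

#include <linux/cdev.h>
#include <linux/fs.h>

struct foo {
	struct device dev;
	struct cdev cdev;	/* embedded next to the device */
};

static int foo_open(struct inode *inode, struct file *file)
{
	/* no global list walk: go straight from the cdev to its owner */
	struct foo *f = container_of(inode->i_cdev, struct foo, cdev);

	file->private_data = f;
	return 0;
}

static const struct file_operations foo_fops = {
	.owner	= THIS_MODULE,
	.open	= foo_open,
};

static int foo_register(struct foo *f)
{
	cdev_init(&f->cdev, &foo_fops);
	f->cdev.owner = THIS_MODULE;	/* the patch uses ops->module here */
	/* link the cdev to the device and add both in one step */
	return cdev_device_add(&f->cdev, &f->dev);
}
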
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/nvme/host/core.c | 76 ++++++++++--------------------------------------
 drivers/nvme/host/nvme.h |  3 +-
 2 files changed, 18 insertions(+), 61 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3a97daa163f6..a56a1e0432e7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -52,9 +52,6 @@ static u8 nvme_max_retries = 5;
 module_param_named(max_retries, nvme_max_retries, byte, 0644);
 MODULE_PARM_DESC(max_retries, "max number of retries a command may have");
 
-static int nvme_char_major;
-module_param(nvme_char_major, int, 0);
-
 static unsigned long default_ps_max_latency_us = 100000;
 module_param(default_ps_max_latency_us, ulong, 0644);
 MODULE_PARM_DESC(default_ps_max_latency_us,
@@ -71,11 +68,8 @@ MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
 struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
-static LIST_HEAD(nvme_ctrl_list);
-static DEFINE_SPINLOCK(dev_list_lock);
-
 static DEFINE_IDA(nvme_instance_ida);
-
+static dev_t nvme_chr_devt;
 static struct class *nvme_class;
 
 static __le32 nvme_get_log_dw10(u8 lid, size_t size)
@@ -1031,20 +1025,12 @@ static int nvme_open(struct block_device *bdev, fmode_t mode)
 
 	if (!kref_get_unless_zero(&ns->kref))
 		return -ENXIO;
-	if (!try_module_get(ns->ctrl->ops->module)) {
-		kref_put(&ns->kref, nvme_free_ns);
-		return -ENXIO;
-	}
-
 	return 0;
 }
 
 static void nvme_release(struct gendisk *disk, fmode_t mode)
 {
-	struct nvme_ns *ns = disk->private_data;
-
-	module_put(ns->ctrl->ops->module);
-	nvme_put_ns(ns);
+	nvme_put_ns(disk->private_data);
 }
 
 static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
@@ -1902,33 +1888,12 @@ EXPORT_SYMBOL_GPL(nvme_init_identify);
 
 static int nvme_dev_open(struct inode *inode, struct file *file)
 {
-	struct nvme_ctrl *ctrl;
-	int instance = iminor(inode);
-	int ret = -ENODEV;
-
-	spin_lock(&dev_list_lock);
-	list_for_each_entry(ctrl, &nvme_ctrl_list, node) {
-		if (ctrl->instance != instance)
-			continue;
-
-		if (!ctrl->admin_q) {
-			ret = -EWOULDBLOCK;
-			break;
-		}
-		if (!kobject_get_unless_zero(&ctrl->device->kobj))
-			break;
-		file->private_data = ctrl;
-		ret = 0;
-		break;
-	}
-	spin_unlock(&dev_list_lock);
-
-	return ret;
-}
+	struct nvme_ctrl *ctrl =
+		container_of(inode->i_cdev, struct nvme_ctrl, cdev);
 
-static int nvme_dev_release(struct inode *inode, struct file *file)
-{
-	nvme_put_ctrl(file->private_data);
+	if (!ctrl->admin_q)
+		return -EWOULDBLOCK;
+	file->private_data = ctrl;
 	return 0;
 }
 
@@ -1992,7 +1957,6 @@ static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
 static const struct file_operations nvme_dev_fops = {
 	.owner		= THIS_MODULE,
 	.open		= nvme_dev_open,
-	.release	= nvme_dev_release,
 	.unlocked_ioctl	= nvme_dev_ioctl,
 	.compat_ioctl	= nvme_dev_ioctl,
 };
@@ -2703,11 +2667,7 @@ EXPORT_SYMBOL_GPL(nvme_start_ctrl);
 
 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
 {
-	device_del(ctrl->device);
-
-	spin_lock(&dev_list_lock);
-	list_del(&ctrl->node);
-	spin_unlock(&dev_list_lock);
+	cdev_device_del(&ctrl->cdev, ctrl->device);
 }
 EXPORT_SYMBOL_GPL(nvme_uninit_ctrl);
 
@@ -2750,7 +2710,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 
 	device_initialize(&ctrl->ctrl_device);
 	ctrl->device = &ctrl->ctrl_device;
-	ctrl->device->devt = MKDEV(nvme_char_major, ctrl->instance);
+	ctrl->device->devt = MKDEV(MAJOR(nvme_chr_devt), ctrl->instance);
 	ctrl->device->class = nvme_class;
 	ctrl->device->parent = ctrl->dev;
 	ctrl->device->groups = nvme_dev_attr_groups;
@@ -2759,16 +2719,15 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	ret = dev_set_name(ctrl->device, "nvme%d", ctrl->instance);
 	if (ret)
 		goto out_release_instance;
-	ret = device_add(ctrl->device);
+
+	cdev_init(&ctrl->cdev, &nvme_dev_fops);
+	ctrl->cdev.owner = ops->module;
+	ret = cdev_device_add(&ctrl->cdev, ctrl->device);
 	if (ret)
 		goto out_free_name;
 
 	ida_init(&ctrl->ns_ida);
 
-	spin_lock(&dev_list_lock);
-	list_add_tail(&ctrl->node, &nvme_ctrl_list);
-	spin_unlock(&dev_list_lock);
-
 	/*
 	 * Initialize latency tolerance controls.  The sysfs files won't
 	 * be visible to userspace unless the device actually supports APST.
@@ -2909,12 +2868,9 @@ int __init nvme_core_init(void)
 	if (!nvme_wq)
 		return -ENOMEM;
 
-	result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
-							&nvme_dev_fops);
+	result = alloc_chrdev_region(&nvme_chr_devt, 0, NVME_MINORS, "nvme");
 	if (result < 0)
 		goto destroy_wq;
-	else if (result > 0)
-		nvme_char_major = result;
 
 	nvme_class = class_create(THIS_MODULE, "nvme");
 	if (IS_ERR(nvme_class)) {
@@ -2925,7 +2881,7 @@ int __init nvme_core_init(void)
 	return 0;
 
 unregister_chrdev:
-	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
+	unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
 destroy_wq:
 	destroy_workqueue(nvme_wq);
 	return result;
@@ -2934,7 +2890,7 @@ int __init nvme_core_init(void)
 void nvme_core_exit(void)
 {
 	class_destroy(nvme_class);
-	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
+	unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
 	destroy_workqueue(nvme_wq);
 }
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ae60d8342e60..1bb2bc165e54 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -15,6 +15,7 @@
 #define _NVME_H
 
 #include <linux/nvme.h>
+#include <linux/cdev.h>
 #include <linux/pci.h>
 #include <linux/kref.h>
 #include <linux/blk-mq.h>
@@ -134,7 +135,7 @@ struct nvme_ctrl {
 	struct mutex namespaces_mutex;
 	struct device ctrl_device;
 	struct device *device;	/* char device */
-	struct list_head node;
+	struct cdev cdev;
 	struct ida ns_ida;
 	struct work_struct reset_work;
 
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 12/17] nvme: check for a live controller in nvme_dev_open
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

Checking the controller state is a much more sensible gate than just
checking whether the admin queue has been set up.

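Annotated, the change is only the gate itself (comment editorial):

	/* the admin queue stays allocated across resets, so !ctrl->admin_q
	 * only catches controllers that were never initialized; checking
	 * the state also rejects resetting/deleting/dead controllers */
	if (ctrl->state != NVME_CTRL_LIVE)
		return -EWOULDBLOCK;
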
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a56a1e0432e7..df525ab42fcd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1891,7 +1891,7 @@ static int nvme_dev_open(struct inode *inode, struct file *file)
 	struct nvme_ctrl *ctrl =
 		container_of(inode->i_cdev, struct nvme_ctrl, cdev);
 
-	if (!ctrl->admin_q)
+	if (ctrl->state != NVME_CTRL_LIVE)
 		return -EWOULDBLOCK;
 	file->private_data = ctrl;
 	return 0;
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 13/17] nvme: track subsystems
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

This adds a new nvme_subsystem structure so that we can track multiple
controllers that belong to a single subsystem.  For now we only use it
to store the NQN, and to check that we don't have duplicate NQNs unless
the involved subsystems support multiple controllers.

Includes code originally from Hannes Reinecke to expose the subsystems
in sysfs.

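The duplicate-NQN handling is the usual optimistic find-or-create shape,
condensed from the hunk below (comments editorial, error paths trimmed):

	mutex_lock(&nvme_subsystems_lock);
	found = __nvme_find_get_subsystem(subsys->subnqn);
	if (found) {
		/* Duplicate NQN: only legal when the Identify Controller
		 * CMIC field advertises multi-controller support (bit 1). */
		if (!(id->cmic & (1 << 1)))
			goto reject;
		__nvme_release_subsystem(subsys);	/* drop the speculative copy */
		subsys = found;				/* adopt the existing one */
	} else {
		device_add(&subsys->dev);		/* publish the new subsystem */
		list_add_tail(&subsys->entry, &nvme_subsystems);
	}
	ctrl->subsys = subsys;
	mutex_unlock(&nvme_subsystems_lock);
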
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c    | 200 +++++++++++++++++++++++++++++++++++++-------
 drivers/nvme/host/fabrics.c |   4 +-
 drivers/nvme/host/nvme.h    |  26 ++++--
 3 files changed, 194 insertions(+), 36 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index df525ab42fcd..b3d468c77684 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -68,9 +68,14 @@ MODULE_PARM_DESC(streams, "turn on support for Streams write directives");
 struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
+static DEFINE_IDA(nvme_subsystems_ida);
+static LIST_HEAD(nvme_subsystems);
+static DEFINE_MUTEX(nvme_subsystems_lock);
+
 static DEFINE_IDA(nvme_instance_ida);
 static dev_t nvme_chr_devt;
 static struct class *nvme_class;
+static struct class *nvme_subsys_class;
 
 static __le32 nvme_get_log_dw10(u8 lid, size_t size)
 {
@@ -1694,14 +1699,15 @@ static bool quirk_matches(const struct nvme_id_ctrl *id,
 		string_matches(id->fr, q->fr, sizeof(id->fr));
 }
 
-static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+static void nvme_init_subnqn(struct nvme_subsystem *subsys, struct nvme_ctrl *ctrl,
+		struct nvme_id_ctrl *id)
 {
 	size_t nqnlen;
 	int off;
 
 	nqnlen = strnlen(id->subnqn, NVMF_NQN_SIZE);
 	if (nqnlen > 0 && nqnlen < NVMF_NQN_SIZE) {
-		strcpy(ctrl->subnqn, id->subnqn);
+		strcpy(subsys->subnqn, id->subnqn);
 		return;
 	}
 
@@ -1709,14 +1715,131 @@ static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		dev_warn(ctrl->device, "missing or invalid SUBNQN field.\n");
 
 	/* Generate a "fake" NQN per Figure 254 in NVMe 1.3 + ECN 001 */
-	off = snprintf(ctrl->subnqn, NVMF_NQN_SIZE,
+	off = snprintf(subsys->subnqn, NVMF_NQN_SIZE,
 			"nqn.2014.08.org.nvmexpress:%4x%4x",
 			le16_to_cpu(id->vid), le16_to_cpu(id->ssvid));
-	memcpy(ctrl->subnqn + off, id->sn, sizeof(id->sn));
+	memcpy(subsys->subnqn + off, id->sn, sizeof(id->sn));
 	off += sizeof(id->sn);
-	memcpy(ctrl->subnqn + off, id->mn, sizeof(id->mn));
+	memcpy(subsys->subnqn + off, id->mn, sizeof(id->mn));
 	off += sizeof(id->mn);
-	memset(ctrl->subnqn + off, 0, sizeof(ctrl->subnqn) - off);
+	memset(subsys->subnqn + off, 0, sizeof(subsys->subnqn) - off);
+}
+
+static void __nvme_release_subsystem(struct nvme_subsystem *subsys)
+{
+	ida_simple_remove(&nvme_subsystems_ida, subsys->instance);
+	kfree(subsys);
+}
+
+static void nvme_release_subsystem(struct device *dev)
+{
+	__nvme_release_subsystem(container_of(dev, struct nvme_subsystem, dev));
+}
+
+static void nvme_destroy_subsystem(struct kref *ref)
+{
+	struct nvme_subsystem *subsys =
+			container_of(ref, struct nvme_subsystem, ref);
+
+	mutex_lock(&nvme_subsystems_lock);
+	list_del(&subsys->entry);
+	mutex_unlock(&nvme_subsystems_lock);
+
+	device_del(&subsys->dev);
+	put_device(&subsys->dev);
+}
+
+static void nvme_put_subsystem(struct nvme_subsystem *subsys)
+{
+	kref_put(&subsys->ref, nvme_destroy_subsystem);
+}
+
+static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
+{
+	struct nvme_subsystem *subsys;
+
+	lockdep_assert_held(&nvme_subsystems_lock);
+
+	list_for_each_entry(subsys, &nvme_subsystems, entry) {
+		if (strcmp(subsys->subnqn, subsysnqn))
+			continue;
+		if (!kref_get_unless_zero(&subsys->ref))
+			continue;
+		return subsys;
+	}
+
+	return NULL;
+}
+
+static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+{
+	struct nvme_subsystem *subsys, *found;
+	int ret;
+
+	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
+	if (!subsys)
+		return -ENOMEM;
+	ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL);
+	if (ret < 0) {
+		kfree(subsys);
+		return ret;
+	}
+	subsys->instance = ret;
+	kref_init(&subsys->ref);
+	INIT_LIST_HEAD(&subsys->ctrls);
+	nvme_init_subnqn(subsys, ctrl, id);
+	memcpy(subsys->serial, id->sn, sizeof(subsys->serial));
+	memcpy(subsys->model, id->mn, sizeof(subsys->model));
+	memcpy(subsys->firmware_rev, id->fr, sizeof(subsys->firmware_rev));
+	subsys->vendor_id = le16_to_cpu(id->vid);
+	mutex_init(&subsys->lock);
+
+	subsys->dev.class = nvme_subsys_class;
+	subsys->dev.release = nvme_release_subsystem;
+	dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance);
+	device_initialize(&subsys->dev);
+
+	mutex_lock(&nvme_subsystems_lock);
+	found = __nvme_find_get_subsystem(subsys->subnqn);
+	if (found) {
+		/*
+		 * Verify that the subsystem actually supports multiple
+		 * controllers, else bail out.
+		 */
+		if (!(id->cmic & (1 << 1))) {
+			dev_err(ctrl->device,
+				"ignoring ctrl due to duplicate subnqn (%s).\n",
+				found->subnqn);
+			nvme_put_subsystem(found);
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		__nvme_release_subsystem(subsys);
+		subsys = found;
+	} else {
+		ret = device_add(&subsys->dev);
+		if (ret) {
+			dev_err(ctrl->device,
+				"failed to register subsystem device.\n");
+			goto out_unlock;
+		}
+		list_add_tail(&subsys->entry, &nvme_subsystems);
+	}
+
+	ctrl->subsys = subsys;
+	mutex_unlock(&nvme_subsystems_lock);
+
+	mutex_lock(&subsys->lock);
+	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
+	mutex_unlock(&subsys->lock);
+
+	return 0;
+
+out_unlock:
+	mutex_unlock(&nvme_subsystems_lock);
+	put_device(&subsys->dev);
+	return ret;
 }
 
 /*
@@ -1754,9 +1877,13 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		return -EIO;
 	}
 
-	nvme_init_subnqn(ctrl, id);
-
 	if (!ctrl->identified) {
+		int i;
+
+		ret = nvme_init_subsystem(ctrl, id);
+		if (ret)
+			goto out_free;
+
 		/*
 		 * Check for quirks.  Quirk can depend on firmware version,
 		 * so, in principle, the set of quirks present can change
@@ -1765,9 +1892,6 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		 * the device, but we'd have to make sure that the driver
 		 * behaves intelligently if the quirks change.
 		 */
-
-		int i;
-
 		for (i = 0; i < ARRAY_SIZE(core_quirks); i++) {
 			if (quirk_matches(id, &core_quirks[i]))
 				ctrl->quirks |= core_quirks[i].quirks;
@@ -1780,14 +1904,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	}
 
 	ctrl->oacs = le16_to_cpu(id->oacs);
-	ctrl->vid = le16_to_cpu(id->vid);
 	ctrl->oncs = le16_to_cpup(&id->oncs);
 	atomic_set(&ctrl->abort_limit, id->acl + 1);
 	ctrl->vwc = id->vwc;
 	ctrl->cntlid = le16_to_cpup(&id->cntlid);
-	memcpy(ctrl->serial, id->sn, sizeof(id->sn));
-	memcpy(ctrl->model, id->mn, sizeof(id->mn));
-	memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
 	if (id->mdts)
 		max_hw_sectors = 1 << (id->mdts + page_shift - 9);
 	else
@@ -1990,9 +2110,9 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ctrl *ctrl = ns->ctrl;
-	int serial_len = sizeof(ctrl->serial);
-	int model_len = sizeof(ctrl->model);
+	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+	int serial_len = sizeof(subsys->serial);
+	int model_len = sizeof(subsys->model);
 
 	if (!uuid_is_null(&ns->uuid))
 		return sprintf(buf, "uuid.%pU\n", &ns->uuid);
@@ -2003,15 +2123,16 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 	if (memchr_inv(ns->eui, 0, sizeof(ns->eui)))
 		return sprintf(buf, "eui.%8phN\n", ns->eui);
 
-	while (serial_len > 0 && (ctrl->serial[serial_len - 1] == ' ' ||
-				  ctrl->serial[serial_len - 1] == '\0'))
+	while (serial_len > 0 && (subsys->serial[serial_len - 1] == ' ' ||
+				  subsys->serial[serial_len - 1] == '\0'))
 		serial_len--;
-	while (model_len > 0 && (ctrl->model[model_len - 1] == ' ' ||
-				 ctrl->model[model_len - 1] == '\0'))
+	while (model_len > 0 && (subsys->model[model_len - 1] == ' ' ||
+				 subsys->model[model_len - 1] == '\0'))
 		model_len--;
 
-	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", ctrl->vid,
-		serial_len, ctrl->serial, model_len, ctrl->model, ns->ns_id);
+	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
+		serial_len, subsys->serial, model_len, subsys->model,
+		ns->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
@@ -2097,10 +2218,15 @@ static ssize_t  field##_show(struct device *dev,				\
 			    struct device_attribute *attr, char *buf)		\
 {										\
         struct nvme_ctrl *ctrl = dev_get_drvdata(dev);				\
-        return sprintf(buf, "%.*s\n", (int)sizeof(ctrl->field), ctrl->field);	\
+        return sprintf(buf, "%.*s\n",						\
+		(int)sizeof(ctrl->subsys->field), ctrl->subsys->field);		\
 }										\
 static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
 
+nvme_show_str_function(model);
+nvme_show_str_function(serial);
+nvme_show_str_function(firmware_rev);
+
 #define nvme_show_int_function(field)						\
 static ssize_t  field##_show(struct device *dev,				\
 			    struct device_attribute *attr, char *buf)		\
@@ -2110,9 +2236,6 @@ static ssize_t  field##_show(struct device *dev,				\
 }										\
 static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
 
-nvme_show_str_function(model);
-nvme_show_str_function(serial);
-nvme_show_str_function(firmware_rev);
 nvme_show_int_function(cntlid);
 
 static ssize_t nvme_sysfs_delete(struct device *dev,
@@ -2166,7 +2289,7 @@ static ssize_t nvme_sysfs_show_subsysnqn(struct device *dev,
 {
 	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
 
-	return snprintf(buf, PAGE_SIZE, "%s\n", ctrl->subnqn);
+	return snprintf(buf, PAGE_SIZE, "%s\n", ctrl->subsys->subnqn);
 }
 static DEVICE_ATTR(subsysnqn, S_IRUGO, nvme_sysfs_show_subsysnqn, NULL);
 
@@ -2675,11 +2798,21 @@ static void nvme_free_ctrl(struct device *dev)
 {
 	struct nvme_ctrl *ctrl =
 		container_of(dev, struct nvme_ctrl, ctrl_device);
+	struct nvme_subsystem *subsys = ctrl->subsys;
 
 	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
 	ida_destroy(&ctrl->ns_ida);
 
+	if (subsys) {
+		mutex_lock(&subsys->lock);
+		list_del(&ctrl->subsys_entry);
+		mutex_unlock(&subsys->lock);
+	}
+
 	ctrl->ops->free_ctrl(ctrl);
+
+	if (subsys)
+		nvme_put_subsystem(subsys);
 }
 
 /*
@@ -2878,8 +3011,15 @@ int __init nvme_core_init(void)
 		goto unregister_chrdev;
 	}
 
+	nvme_subsys_class = class_create(THIS_MODULE, "nvme-subsystem");
+	if (IS_ERR(nvme_subsys_class)) {
+		result = PTR_ERR(nvme_subsys_class);
+		goto destroy_class;
+	}
 	return 0;
 
+destroy_class:
+	class_destroy(nvme_class);
 unregister_chrdev:
 	unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
 destroy_wq:
@@ -2889,6 +3029,8 @@ int __init nvme_core_init(void)
 
 void nvme_core_exit(void)
 {
+	ida_destroy(&nvme_subsystems_ida);
+	class_destroy(nvme_subsys_class);
 	class_destroy(nvme_class);
 	unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
 	destroy_workqueue(nvme_wq);
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 8bca36a46924..dcad66e37a8f 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -877,10 +877,10 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
 		goto out_unlock;
 	}
 
-	if (strcmp(ctrl->subnqn, opts->subsysnqn)) {
+	if (strcmp(ctrl->subsys->subnqn, opts->subsysnqn)) {
 		dev_warn(ctrl->device,
 			"controller returned incorrect NQN: \"%s\".\n",
-			ctrl->subnqn);
+			ctrl->subsys->subnqn);
 		up_read(&nvmf_transports_rwsem);
 		ctrl->ops->delete_ctrl(ctrl);
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1bb2bc165e54..6a5702072048 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -139,13 +139,12 @@ struct nvme_ctrl {
 	struct ida ns_ida;
 	struct work_struct reset_work;
 
+	struct nvme_subsystem *subsys;
+	struct list_head subsys_entry;
+
 	struct opal_dev *opal_dev;
 
 	char name[12];
-	char serial[20];
-	char model[40];
-	char firmware_rev[8];
-	char subnqn[NVMF_NQN_SIZE];
 	u16 cntlid;
 
 	u32 ctrl_config;
@@ -156,7 +155,6 @@ struct nvme_ctrl {
 	u32 page_size;
 	u32 max_hw_sectors;
 	u16 oncs;
-	u16 vid;
 	u16 oacs;
 	u16 nssa;
 	u16 nr_streams;
@@ -198,6 +196,24 @@ struct nvme_ctrl {
 	struct nvmf_ctrl_options *opts;
 };
 
+struct nvme_subsystem {
+	int			instance;
+	struct device		dev;
+	/*
+	 * Because we unregister the device on the last put we need
+	 * a separate refcount.
+	 */
+	struct kref		ref;
+	struct list_head	entry;
+	struct mutex		lock;
+	struct list_head	ctrls;
+	char			subnqn[NVMF_NQN_SIZE];
+	char			serial[20];
+	char			model[40];
+	char			firmware_rev[8];
+	u16			vendor_id;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 116+ messages in thread

@@ -139,13 +139,12 @@ struct nvme_ctrl {
 	struct ida ns_ida;
 	struct work_struct reset_work;
 
+	struct nvme_subsystem *subsys;
+	struct list_head subsys_entry;
+
 	struct opal_dev *opal_dev;
 
 	char name[12];
-	char serial[20];
-	char model[40];
-	char firmware_rev[8];
-	char subnqn[NVMF_NQN_SIZE];
 	u16 cntlid;
 
 	u32 ctrl_config;
@@ -156,7 +155,6 @@ struct nvme_ctrl {
 	u32 page_size;
 	u32 max_hw_sectors;
 	u16 oncs;
-	u16 vid;
 	u16 oacs;
 	u16 nssa;
 	u16 nr_streams;
@@ -198,6 +196,24 @@ struct nvme_ctrl {
 	struct nvmf_ctrl_options *opts;
 };
 
+struct nvme_subsystem {
+	int			instance;
+	struct device		dev;
+	/*
+	 * Because we unregister the device on the last put we need
+	 * a separate refcount.
+	 */
+	struct kref		ref;
+	struct list_head	entry;
+	struct mutex		lock;
+	struct list_head	ctrls;
+	char			subnqn[NVMF_NQN_SIZE];
+	char			serial[20];
+	char			model[40];
+	char			firmware_rev[8];
+	u16			vendor_id;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 14/17] nvme: introduce a nvme_ns_ids structure
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

This allows us to manage the various unique namespace identifiers
together instead of needing various variables and arguments.
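
To illustrate the effect (both prototypes below are taken from the diff),
the change turns

	static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
			u8 *eui64, u8 *nguid, uuid_t *uuid);

into

	static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
			struct nvme_ns_ids *ids);

so a new identifier type only needs a new member in struct nvme_ns_ids
rather than another parameter at every prototype and call site.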

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 69 +++++++++++++++++++++++++++---------------------
 drivers/nvme/host/nvme.h | 14 +++++++---
 2 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b3d468c77684..ab1a8022ead3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -750,7 +750,7 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 }
 
 static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
-		u8 *eui64, u8 *nguid, uuid_t *uuid)
+		struct nvme_ns_ids *ids)
 {
 	struct nvme_command c = { };
 	int status;
@@ -786,7 +786,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 				goto free_data;
 			}
 			len = NVME_NIDT_EUI64_LEN;
-			memcpy(eui64, data + pos + sizeof(*cur), len);
+			memcpy(ids->eui64, data + pos + sizeof(*cur), len);
 			break;
 		case NVME_NIDT_NGUID:
 			if (cur->nidl != NVME_NIDT_NGUID_LEN) {
@@ -796,7 +796,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 				goto free_data;
 			}
 			len = NVME_NIDT_NGUID_LEN;
-			memcpy(nguid, data + pos + sizeof(*cur), len);
+			memcpy(ids->nguid, data + pos + sizeof(*cur), len);
 			break;
 		case NVME_NIDT_UUID:
 			if (cur->nidl != NVME_NIDT_UUID_LEN) {
@@ -806,7 +806,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 				goto free_data;
 			}
 			len = NVME_NIDT_UUID_LEN;
-			uuid_copy(uuid, data + pos + sizeof(*cur));
+			uuid_copy(&ids->uuid, data + pos + sizeof(*cur));
 			break;
 		default:
 			/* Skip unnkown types */
@@ -1138,22 +1138,31 @@ static void nvme_config_discard(struct nvme_ns *ns)
 }
 
 static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
-		struct nvme_id_ns *id, u8 *eui64, u8 *nguid, uuid_t *uuid)
+		struct nvme_id_ns *id, struct nvme_ns_ids *ids)
 {
+	memset(ids, 0, sizeof(*ids));
+
 	if (ctrl->vs >= NVME_VS(1, 1, 0))
-		memcpy(eui64, id->eui64, sizeof(id->eui64));
+		memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
 	if (ctrl->vs >= NVME_VS(1, 2, 0))
-		memcpy(nguid, id->nguid, sizeof(id->nguid));
+		memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
 	if (ctrl->vs >= NVME_VS(1, 3, 0)) {
 		 /* Don't treat error as fatal we potentially
 		  * already have a NGUID or EUI-64
 		  */
-		if (nvme_identify_ns_descs(ctrl, nsid, eui64, nguid, uuid))
+		if (nvme_identify_ns_descs(ctrl, nsid, ids))
 			dev_warn(ctrl->device,
 				 "%s: Identify Descriptors failed\n", __func__);
 	}
 }
 
+static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
+{
+	return uuid_equal(&a->uuid, &b->uuid) &&
+		memcmp(&a->nguid, &b->nguid, sizeof(a->nguid)) == 0 &&
+		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0;
+}
+
 static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 {
 	struct nvme_ns *ns = disk->private_data;
@@ -1194,8 +1203,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	struct nvme_ns *ns = disk->private_data;
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	struct nvme_id_ns *id;
-	u8 eui64[8] = { 0 }, nguid[16] = { 0 };
-	uuid_t uuid = uuid_null;
+	struct nvme_ns_ids ids;
 	int ret = 0;
 
 	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
@@ -1212,10 +1220,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto out;
 	}
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
-	if (!uuid_equal(&ns->uuid, &uuid) ||
-	    memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
-	    memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
+	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ids);
+	if (!nvme_ns_ids_equal(&ns->ids, &ids)) {
 		dev_err(ctrl->device,
 			"identifiers changed for nsid %d\n", ns->ns_id);
 		ret = -ENODEV;
@@ -2110,18 +2116,19 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+	struct nvme_ns_ids *ids = &ns->ids;
 	struct nvme_subsystem *subsys = ns->ctrl->subsys;
 	int serial_len = sizeof(subsys->serial);
 	int model_len = sizeof(subsys->model);
 
-	if (!uuid_is_null(&ns->uuid))
-		return sprintf(buf, "uuid.%pU\n", &ns->uuid);
+	if (!uuid_is_null(&ids->uuid))
+		return sprintf(buf, "uuid.%pU\n", &ids->uuid);
 
-	if (memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
-		return sprintf(buf, "eui.%16phN\n", ns->nguid);
+	if (memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
+		return sprintf(buf, "eui.%16phN\n", ids->nguid);
 
-	if (memchr_inv(ns->eui, 0, sizeof(ns->eui)))
-		return sprintf(buf, "eui.%8phN\n", ns->eui);
+	if (memchr_inv(ids->eui64, 0, sizeof(ids->eui64)))
+		return sprintf(buf, "eui.%8phN\n", ids->eui64);
 
 	while (serial_len > 0 && (subsys->serial[serial_len - 1] == ' ' ||
 				  subsys->serial[serial_len - 1] == '\0'))
@@ -2140,7 +2147,7 @@ static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
 			  char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->nguid);
+	return sprintf(buf, "%pU\n", ns->ids.nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
@@ -2148,16 +2155,17 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+	struct nvme_ns_ids *ids = &ns->ids;
 
 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
 	 */
-	if (uuid_is_null(&ns->uuid)) {
+	if (uuid_is_null(&ids->uuid)) {
 		printk_ratelimited(KERN_WARNING
 				   "No UUID available providing old NGUID\n");
-		return sprintf(buf, "%pU\n", ns->nguid);
+		return sprintf(buf, "%pU\n", ids->nguid);
 	}
-	return sprintf(buf, "%pU\n", &ns->uuid);
+	return sprintf(buf, "%pU\n", &ids->uuid);
 }
 static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
 
@@ -2165,7 +2173,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->eui);
+	return sprintf(buf, "%8phd\n", ns->ids.eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
@@ -2191,18 +2199,19 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+	struct nvme_ns_ids *ids = &ns->ids;
 
 	if (a == &dev_attr_uuid.attr) {
-		if (uuid_is_null(&ns->uuid) ||
-		    !memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
+		if (uuid_is_null(&ids->uuid) ||
+		    !memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
 			return 0;
 	}
 	if (a == &dev_attr_nguid.attr) {
-		if (!memchr_inv(ns->nguid, 0, sizeof(ns->nguid)))
+		if (!memchr_inv(ids->nguid, 0, sizeof(ids->nguid)))
 			return 0;
 	}
 	if (a == &dev_attr_eui.attr) {
-		if (!memchr_inv(ns->eui, 0, sizeof(ns->eui)))
+		if (!memchr_inv(ids->eui64, 0, sizeof(ids->eui64)))
 			return 0;
 	}
 	return a->mode;
@@ -2435,7 +2444,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, ns->eui, ns->nguid, &ns->uuid);
+	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ns->ids);
 
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
 		if (nvme_nvm_register(ns, disk_name, node)) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 6a5702072048..efbf4dde6c87 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -214,6 +214,15 @@ struct nvme_subsystem {
 	u16			vendor_id;
 };
 
+/*
+ * Container structure for unique namespace identifiers.
+ */
+struct nvme_ns_ids {
+	u8	eui64[8];
+	u8	nguid[16];
+	uuid_t	uuid;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
@@ -224,11 +233,8 @@ struct nvme_ns {
 	struct kref kref;
 	int instance;
 
-	u8 eui[8];
-	u8 nguid[16];
-	uuid_t uuid;
-
 	unsigned ns_id;
+	struct nvme_ns_ids ids;
 	int lba_shift;
 	u16 ms;
 	u16 sgs;
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 15/17] nvme: track shared namespaces
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

Introduce a new struct nvme_ns_head that holds information about an actual
namespace, unlike struct nvme_ns, which only holds the per-controller
namespace information.  For private namespaces there is a 1:1 relation between
the two, but for shared namespaces this lets us discover all the paths to
it.  For now only the identifiers are moved to the new structure, but most
of the information in struct nvme_ns should eventually move over.

To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.
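
For reference, a reader of the SRCU-protected list would follow the usual
pattern sketched below; this is only an illustration (the actual path lookup
arrives in a later patch), and nvme_ns_is_usable() is a made-up predicate:

	struct nvme_ns *ns;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	list_for_each_entry_rcu(ns, &head->list, siblings) {
		/* ns cannot be freed before the matching srcu_read_unlock() */
		if (nvme_ns_is_usable(ns))
			break;
	}
	srcu_read_unlock(&head->srcu, srcu_idx);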

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/nvme/host/core.c     | 190 +++++++++++++++++++++++++++++++++++++------
 drivers/nvme/host/lightnvm.c |  14 ++--
 drivers/nvme/host/nvme.h     |  21 ++++-
 3 files changed, 190 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ab1a8022ead3..1db26729bd89 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -245,6 +245,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
 
+static void nvme_free_ns_head(struct kref *ref)
+{
+	struct nvme_ns_head *head =
+		container_of(ref, struct nvme_ns_head, ref);
+
+	list_del_init(&head->entry);
+	cleanup_srcu_struct(&head->srcu);
+	kfree(head);
+}
+
+static void nvme_put_ns_head(struct nvme_ns_head *head)
+{
+	kref_put(&head->ref, nvme_free_ns_head);
+}
+
 static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
@@ -254,6 +269,7 @@ static void nvme_free_ns(struct kref *kref)
 
 	put_disk(ns->disk);
 	ida_simple_remove(&ns->ctrl->ns_ida, ns->instance);
+	nvme_put_ns_head(ns->head);
 	nvme_put_ctrl(ns->ctrl);
 	kfree(ns);
 }
@@ -389,7 +405,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 {
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->common.opcode = nvme_cmd_flush;
-	cmnd->common.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
@@ -420,7 +436,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->dsm.opcode = nvme_cmd_dsm;
-	cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->dsm.nr = cpu_to_le32(segments - 1);
 	cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
@@ -459,7 +475,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
-	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
@@ -940,7 +956,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	memset(&c, 0, sizeof(c));
 	c.rw.opcode = io.opcode;
 	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.rw.slba = cpu_to_le64(io.slba);
 	c.rw.length = cpu_to_le16(io.nblocks);
 	c.rw.control = cpu_to_le16(io.control);
@@ -1005,7 +1021,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	switch (cmd) {
 	case NVME_IOCTL_ID:
 		force_successful_syscall_return();
-		return ns->ns_id;
+		return ns->head->ns_id;
 	case NVME_IOCTL_ADMIN_CMD:
 		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
 	case NVME_IOCTL_IO_CMD:
@@ -1156,6 +1172,13 @@ static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
 	}
 }
 
+static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
+{
+	return !uuid_is_null(&ids->uuid) ||
+		memchr_inv(ids->nguid, 0, sizeof(ids->nguid)) ||
+		memchr_inv(ids->eui64, 0, sizeof(ids->eui64));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1211,7 +1234,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		return -ENODEV;
 	}
 
-	id = nvme_identify_ns(ctrl, ns->ns_id);
+	id = nvme_identify_ns(ctrl, ns->head->ns_id);
 	if (!id)
 		return -ENODEV;
 
@@ -1220,10 +1243,10 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto out;
 	}
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ids);
-	if (!nvme_ns_ids_equal(&ns->ids, &ids)) {
+	nvme_report_ns_ids(ctrl, ns->head->ns_id, id, &ids);
+	if (!nvme_ns_ids_equal(&ns->head->ids, &ids)) {
 		dev_err(ctrl->device,
-			"identifiers changed for nsid %d\n", ns->ns_id);
+			"identifiers changed for nsid %d\n", ns->head->ns_id);
 		ret = -ENODEV;
 	}
 
@@ -1264,7 +1287,7 @@ static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = op;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw10[0] = cpu_to_le32(cdw10);
 
 	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
@@ -1793,6 +1816,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	subsys->instance = ret;
 	kref_init(&subsys->ref);
 	INIT_LIST_HEAD(&subsys->ctrls);
+	INIT_LIST_HEAD(&subsys->nsheads);
 	nvme_init_subnqn(subsys, ctrl, id);
 	memcpy(subsys->serial, id->sn, sizeof(subsys->serial));
 	memcpy(subsys->model, id->mn, sizeof(subsys->model));
@@ -2116,7 +2140,7 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->ids;
+	struct nvme_ns_ids *ids = &ns->head->ids;
 	struct nvme_subsystem *subsys = ns->ctrl->subsys;
 	int serial_len = sizeof(subsys->serial);
 	int model_len = sizeof(subsys->model);
@@ -2139,7 +2163,7 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 
 	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
 		serial_len, subsys->serial, model_len, subsys->model,
-		ns->ns_id);
+		ns->head->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
@@ -2147,7 +2171,7 @@ static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
 			  char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->ids.nguid);
+	return sprintf(buf, "%pU\n", ns->head->ids.nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
@@ -2155,7 +2179,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->ids;
+	struct nvme_ns_ids *ids = &ns->head->ids;
 
 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
@@ -2173,7 +2197,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->ids.eui64);
+	return sprintf(buf, "%8phd\n", ns->head->ids.eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
@@ -2181,7 +2205,7 @@ static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%d\n", ns->ns_id);
+	return sprintf(buf, "%d\n", ns->head->ns_id);
 }
 static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
 
@@ -2199,7 +2223,7 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->ids;
+	struct nvme_ns_ids *ids = &ns->head->ids;
 
 	if (a == &dev_attr_uuid.attr) {
 		if (uuid_is_null(&ids->uuid) ||
@@ -2351,12 +2375,114 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
 	NULL,
 };
 
+static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
+		unsigned nsid)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if (h->ns_id == nsid && kref_get_unless_zero(&h->ref))
+			return h;
+	}
+
+	return NULL;
+}
+
+static int __nvme_check_ids(struct nvme_subsystem *subsys,
+		struct nvme_ns_head *new)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if (nvme_ns_ids_valid(&new->ids) &&
+		    nvme_ns_ids_equal(&new->ids, &h->ids))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
+		unsigned nsid, struct nvme_id_ns *id)
+{
+	struct nvme_ns_head *head;
+	int ret = -ENOMEM;
+
+	head = kzalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		goto out;
+
+	INIT_LIST_HEAD(&head->list);
+	head->ns_id = nsid;
+	init_srcu_struct(&head->srcu);
+	kref_init(&head->ref);
+
+	nvme_report_ns_ids(ctrl, nsid, id, &head->ids);
+
+	ret = __nvme_check_ids(ctrl->subsys, head);
+	if (ret) {
+		dev_err(ctrl->device,
+			"duplicate IDs for nsid %d\n", nsid);
+		goto out_free_head;
+	}
+
+	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
+	return head;
+out_free_head:
+	cleanup_srcu_struct(&head->srcu);
+	kfree(head);
+out:
+	return ERR_PTR(ret);
+}
+
+static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
+		struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	bool is_shared = id->nmic & (1 << 0);
+	struct nvme_ns_head *head = NULL;
+	int ret = 0;
+
+	mutex_lock(&ctrl->subsys->lock);
+	if (is_shared)
+		head = __nvme_find_ns_head(ctrl->subsys, nsid);
+	if (!head) {
+		head = nvme_alloc_ns_head(ctrl, nsid, id);
+		if (IS_ERR(head)) {
+			ret = PTR_ERR(head);
+			goto out_unlock;
+		}
+	} else {
+		struct nvme_ns_ids ids;
+
+		nvme_report_ns_ids(ctrl, nsid, id, &ids);
+		if (!nvme_ns_ids_equal(&head->ids, &ids)) {
+			dev_err(ctrl->device,
+				"IDs don't match for shared namespace %d\n",
+					nsid);
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
+	list_add_tail(&ns->siblings, &head->list);
+	ns->head = head;
+
+out_unlock:
+	mutex_unlock(&ctrl->subsys->lock);
+	return ret;
+}
+
 static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
 {
 	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
 	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
 
-	return nsa->ns_id - nsb->ns_id;
+	return nsa->head->ns_id - nsb->head->ns_id;
 }
 
 static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
@@ -2365,13 +2491,13 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->ns_id == nsid) {
+		if (ns->head->ns_id == nsid) {
 			if (!kref_get_unless_zero(&ns->kref))
 				continue;
 			ret = ns;
 			break;
 		}
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			break;
 	}
 	mutex_unlock(&ctrl->namespaces_mutex);
@@ -2386,7 +2512,7 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
 	if (!ctrl->nr_streams)
 		return 0;
 
-	ret = nvme_get_stream_params(ctrl, &s, ns->ns_id);
+	ret = nvme_get_stream_params(ctrl, &s, ns->head->ns_id);
 	if (ret)
 		return ret;
 
@@ -2428,7 +2554,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	ns->ctrl = ctrl;
 
 	kref_init(&ns->kref);
-	ns->ns_id = nsid;
 	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
 
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
@@ -2444,18 +2569,19 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ns->ids);
+	if (nvme_init_ns_head(ns, nsid, id))
+		goto out_free_id;
 
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
 		if (nvme_nvm_register(ns, disk_name, node)) {
 			dev_warn(ctrl->device, "LightNVM init failure\n");
-			goto out_free_id;
+			goto out_unlink_ns;
 		}
 	}
 
 	disk = alloc_disk_node(0, node);
 	if (!disk)
-		goto out_free_id;
+		goto out_unlink_ns;
 
 	disk->fops = &nvme_fops;
 	disk->private_data = ns;
@@ -2483,6 +2609,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
 	return;
+ out_unlink_ns:
+	mutex_lock(&ctrl->subsys->lock);
+	list_del_rcu(&ns->siblings);
+	mutex_unlock(&ctrl->subsys->lock);
  out_free_id:
 	kfree(id);
  out_free_queue:
@@ -2495,6 +2625,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
+	struct nvme_ns_head *head = ns->head;
+
 	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
 		return;
 
@@ -2509,10 +2641,16 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 		blk_cleanup_queue(ns->queue);
 	}
 
+	mutex_lock(&ns->ctrl->subsys->lock);
+	if (head)
+		list_del_rcu(&ns->siblings);
+	mutex_unlock(&ns->ctrl->subsys->lock);
+
 	mutex_lock(&ns->ctrl->namespaces_mutex);
 	list_del_init(&ns->list);
 	mutex_unlock(&ns->ctrl->namespaces_mutex);
 
+	synchronize_srcu(&head->srcu);
 	nvme_put_ns(ns);
 }
 
@@ -2535,7 +2673,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 	struct nvme_ns *ns, *next;
 
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			nvme_ns_remove(ns);
 	}
 }
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 1f79e3f141e6..44e46276319c 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -305,7 +305,7 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct nvm_id *nvm_id)
 	int ret;
 
 	c.identity.opcode = nvme_nvm_admin_identity;
-	c.identity.nsid = cpu_to_le32(ns->ns_id);
+	c.identity.nsid = cpu_to_le32(ns->head->ns_id);
 	c.identity.chnl_off = 0;
 
 	nvme_nvm_id = kmalloc(sizeof(struct nvme_nvm_id), GFP_KERNEL);
@@ -344,7 +344,7 @@ static int nvme_nvm_get_l2p_tbl(struct nvm_dev *nvmdev, u64 slba, u32 nlb,
 	int ret = 0;
 
 	c.l2p.opcode = nvme_nvm_admin_get_l2p_tbl;
-	c.l2p.nsid = cpu_to_le32(ns->ns_id);
+	c.l2p.nsid = cpu_to_le32(ns->head->ns_id);
 	entries = kmalloc(len, GFP_KERNEL);
 	if (!entries)
 		return -ENOMEM;
@@ -402,7 +402,7 @@ static int nvme_nvm_get_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr ppa,
 	int ret = 0;
 
 	c.get_bb.opcode = nvme_nvm_admin_get_bb_tbl;
-	c.get_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.get_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.get_bb.spba = cpu_to_le64(ppa.ppa);
 
 	bb_tbl = kzalloc(tblsz, GFP_KERNEL);
@@ -452,7 +452,7 @@ static int nvme_nvm_set_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr *ppas,
 	int ret = 0;
 
 	c.set_bb.opcode = nvme_nvm_admin_set_bb_tbl;
-	c.set_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.set_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.set_bb.spba = cpu_to_le64(ppas->ppa);
 	c.set_bb.nlb = cpu_to_le16(nr_ppas - 1);
 	c.set_bb.value = type;
@@ -469,7 +469,7 @@ static inline void nvme_nvm_rqtocmd(struct nvm_rq *rqd, struct nvme_ns *ns,
 				    struct nvme_nvm_command *c)
 {
 	c->ph_rw.opcode = rqd->opcode;
-	c->ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c->ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c->ph_rw.spba = cpu_to_le64(rqd->ppa_addr.ppa);
 	c->ph_rw.metadata = cpu_to_le64(rqd->dma_meta_list);
 	c->ph_rw.control = cpu_to_le16(rqd->flags);
@@ -691,7 +691,7 @@ static int nvme_nvm_submit_vio(struct nvme_ns *ns,
 
 	memset(&c, 0, sizeof(c));
 	c.ph_rw.opcode = vio.opcode;
-	c.ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c.ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.ph_rw.control = cpu_to_le16(vio.control);
 	c.ph_rw.length = cpu_to_le16(vio.nppas);
 
@@ -728,7 +728,7 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = vcmd.opcode;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw2[0] = cpu_to_le32(vcmd.cdw2);
 	c.common.cdw2[1] = cpu_to_le32(vcmd.cdw3);
 	/* cdw11-12 */
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index efbf4dde6c87..849413def126 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -207,6 +207,7 @@ struct nvme_subsystem {
 	struct list_head	entry;
 	struct mutex		lock;
 	struct list_head	ctrls;
+	struct list_head	nsheads;
 	char			subnqn[NVMF_NQN_SIZE];
 	char			serial[20];
 	char			model[40];
@@ -223,18 +224,34 @@ struct nvme_ns_ids {
 	uuid_t	uuid;
 };
 
+/*
+ * Anchor structure for namespaces.  There is one for each namespace in a
+ * NVMe subsystem that any of our controllers can see, and the namespace
+ * structure for each controller is chained off it.  For private namespaces
+ * there is a 1:1 relation to our namespace structures, that is ->list
+ * only ever has a single entry for private namespaces.
+ */
+struct nvme_ns_head {
+	struct list_head	list;
+	struct srcu_struct      srcu;
+	unsigned		ns_id;
+	struct nvme_ns_ids	ids;
+	struct list_head	entry;
+	struct kref		ref;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
 	struct nvme_ctrl *ctrl;
 	struct request_queue *queue;
 	struct gendisk *disk;
+	struct list_head siblings;
 	struct nvm_dev *ndev;
 	struct kref kref;
+	struct nvme_ns_head *head;
 	int instance;
 
-	unsigned ns_id;
-	struct nvme_ns_ids ids;
 	int lba_shift;
 	u16 ms;
 	u16 sgs;
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 15/17] nvme: track shared namespaces
@ 2017-10-23 14:51   ` Christoph Hellwig
  0 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)


Introduce a new struct nvme_ns_head that holds information about an actual
namespace, unlike struct nvme_ns, which only holds the per-controller
namespace information.  For private namespaces there is a 1:1 relation of
the two, but for shared namespaces this lets us discover all the paths to
it.  For now only the identifiers are moved to the new structure, but most
of the information in struct nvme_ns should eventually move over.

To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.

Signed-off-by: Christoph Hellwig <hch at lst.de>
Reviewed-by: Keith Busch <keith.busch at intel.com>
Reviewed-by: Javier Gonz?lez <javier at cnexlabs.com>
Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>
---
 drivers/nvme/host/core.c     | 190 +++++++++++++++++++++++++++++++++++++------
 drivers/nvme/host/lightnvm.c |  14 ++--
 drivers/nvme/host/nvme.h     |  21 ++++-
 3 files changed, 190 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ab1a8022ead3..1db26729bd89 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -245,6 +245,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
 
+static void nvme_free_ns_head(struct kref *ref)
+{
+	struct nvme_ns_head *head =
+		container_of(ref, struct nvme_ns_head, ref);
+
+	list_del_init(&head->entry);
+	cleanup_srcu_struct(&head->srcu);
+	kfree(head);
+}
+
+static void nvme_put_ns_head(struct nvme_ns_head *head)
+{
+	kref_put(&head->ref, nvme_free_ns_head);
+}
+
 static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
@@ -254,6 +269,7 @@ static void nvme_free_ns(struct kref *kref)
 
 	put_disk(ns->disk);
 	ida_simple_remove(&ns->ctrl->ns_ida, ns->instance);
+	nvme_put_ns_head(ns->head);
 	nvme_put_ctrl(ns->ctrl);
 	kfree(ns);
 }
@@ -389,7 +405,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 {
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->common.opcode = nvme_cmd_flush;
-	cmnd->common.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
@@ -420,7 +436,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->dsm.opcode = nvme_cmd_dsm;
-	cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->dsm.nr = cpu_to_le32(segments - 1);
 	cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
@@ -459,7 +475,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
-	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
@@ -940,7 +956,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	memset(&c, 0, sizeof(c));
 	c.rw.opcode = io.opcode;
 	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.rw.slba = cpu_to_le64(io.slba);
 	c.rw.length = cpu_to_le16(io.nblocks);
 	c.rw.control = cpu_to_le16(io.control);
@@ -1005,7 +1021,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	switch (cmd) {
 	case NVME_IOCTL_ID:
 		force_successful_syscall_return();
-		return ns->ns_id;
+		return ns->head->ns_id;
 	case NVME_IOCTL_ADMIN_CMD:
 		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
 	case NVME_IOCTL_IO_CMD:
@@ -1156,6 +1172,13 @@ static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
 	}
 }
 
+static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
+{
+	return !uuid_is_null(&ids->uuid) ||
+		memchr_inv(ids->nguid, 0, sizeof(ids->nguid)) ||
+		memchr_inv(ids->eui64, 0, sizeof(ids->eui64));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1211,7 +1234,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		return -ENODEV;
 	}
 
-	id = nvme_identify_ns(ctrl, ns->ns_id);
+	id = nvme_identify_ns(ctrl, ns->head->ns_id);
 	if (!id)
 		return -ENODEV;
 
@@ -1220,10 +1243,10 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 		goto out;
 	}
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ids);
-	if (!nvme_ns_ids_equal(&ns->ids, &ids)) {
+	nvme_report_ns_ids(ctrl, ns->head->ns_id, id, &ids);
+	if (!nvme_ns_ids_equal(&ns->head->ids, &ids)) {
 		dev_err(ctrl->device,
-			"identifiers changed for nsid %d\n", ns->ns_id);
+			"identifiers changed for nsid %d\n", ns->head->ns_id);
 		ret = -ENODEV;
 	}
 
@@ -1264,7 +1287,7 @@ static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = op;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw10[0] = cpu_to_le32(cdw10);
 
 	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
@@ -1793,6 +1816,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	subsys->instance = ret;
 	kref_init(&subsys->ref);
 	INIT_LIST_HEAD(&subsys->ctrls);
+	INIT_LIST_HEAD(&subsys->nsheads);
 	nvme_init_subnqn(subsys, ctrl, id);
 	memcpy(subsys->serial, id->sn, sizeof(subsys->serial));
 	memcpy(subsys->model, id->mn, sizeof(subsys->model));
@@ -2116,7 +2140,7 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->ids;
+	struct nvme_ns_ids *ids = &ns->head->ids;
 	struct nvme_subsystem *subsys = ns->ctrl->subsys;
 	int serial_len = sizeof(subsys->serial);
 	int model_len = sizeof(subsys->model);
@@ -2139,7 +2163,7 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 
 	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
 		serial_len, subsys->serial, model_len, subsys->model,
-		ns->ns_id);
+		ns->head->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
@@ -2147,7 +2171,7 @@ static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
 			  char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->ids.nguid);
+	return sprintf(buf, "%pU\n", ns->head->ids.nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
@@ -2155,7 +2179,7 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->ids;
+	struct nvme_ns_ids *ids = &ns->head->ids;
 
 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
@@ -2173,7 +2197,7 @@ static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->ids.eui64);
+	return sprintf(buf, "%8phd\n", ns->head->ids.eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
@@ -2181,7 +2205,7 @@ static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
 								char *buf)
 {
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%d\n", ns->ns_id);
+	return sprintf(buf, "%d\n", ns->head->ns_id);
 }
 static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
 
@@ -2199,7 +2223,7 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
 	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->ids;
+	struct nvme_ns_ids *ids = &ns->head->ids;
 
 	if (a == &dev_attr_uuid.attr) {
 		if (uuid_is_null(&ids->uuid) ||
@@ -2351,12 +2375,114 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
 	NULL,
 };
 
+static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
+		unsigned nsid)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if (h->ns_id == nsid && kref_get_unless_zero(&h->ref))
+			return h;
+	}
+
+	return NULL;
+}
+
+static int __nvme_check_ids(struct nvme_subsystem *subsys,
+		struct nvme_ns_head *new)
+{
+	struct nvme_ns_head *h;
+
+	lockdep_assert_held(&subsys->lock);
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		if (nvme_ns_ids_valid(&new->ids) &&
+		    nvme_ns_ids_equal(&new->ids, &h->ids))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
+		unsigned nsid, struct nvme_id_ns *id)
+{
+	struct nvme_ns_head *head;
+	int ret = -ENOMEM;
+
+	head = kzalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		goto out;
+
+	INIT_LIST_HEAD(&head->list);
+	head->ns_id = nsid;
+	init_srcu_struct(&head->srcu);
+	kref_init(&head->ref);
+
+	nvme_report_ns_ids(ctrl, nsid, id, &head->ids);
+
+	ret = __nvme_check_ids(ctrl->subsys, head);
+	if (ret) {
+		dev_err(ctrl->device,
+			"duplicate IDs for nsid %d\n", nsid);
+		goto out_free_head;
+	}
+
+	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
+	return head;
+out_free_head:
+	cleanup_srcu_struct(&head->srcu);
+	kfree(head);
+out:
+	return ERR_PTR(ret);
+}
+
+static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
+		struct nvme_id_ns *id)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	bool is_shared = id->nmic & (1 << 0);
+	struct nvme_ns_head *head = NULL;
+	int ret = 0;
+
+	mutex_lock(&ctrl->subsys->lock);
+	if (is_shared)
+		head = __nvme_find_ns_head(ctrl->subsys, nsid);
+	if (!head) {
+		head = nvme_alloc_ns_head(ctrl, nsid, id);
+		if (IS_ERR(head)) {
+			ret = PTR_ERR(head);
+			goto out_unlock;
+		}
+	} else {
+		struct nvme_ns_ids ids;
+
+		nvme_report_ns_ids(ctrl, nsid, id, &ids);
+		if (!nvme_ns_ids_equal(&head->ids, &ids)) {
+			dev_err(ctrl->device,
+				"IDs don't match for shared namespace %d\n",
+					nsid);
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
+	list_add_tail(&ns->siblings, &head->list);
+	ns->head = head;
+
+out_unlock:
+	mutex_unlock(&ctrl->subsys->lock);
+	return ret;
+}
+
 static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
 {
 	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
 	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
 
-	return nsa->ns_id - nsb->ns_id;
+	return nsa->head->ns_id - nsb->head->ns_id;
 }
 
 static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
@@ -2365,13 +2491,13 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->ns_id == nsid) {
+		if (ns->head->ns_id == nsid) {
 			if (!kref_get_unless_zero(&ns->kref))
 				continue;
 			ret = ns;
 			break;
 		}
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			break;
 	}
 	mutex_unlock(&ctrl->namespaces_mutex);
@@ -2386,7 +2512,7 @@ static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
 	if (!ctrl->nr_streams)
 		return 0;
 
-	ret = nvme_get_stream_params(ctrl, &s, ns->ns_id);
+	ret = nvme_get_stream_params(ctrl, &s, ns->head->ns_id);
 	if (ret)
 		return ret;
 
@@ -2428,7 +2554,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	ns->ctrl = ctrl;
 
 	kref_init(&ns->kref);
-	ns->ns_id = nsid;
 	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
 
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
@@ -2444,18 +2569,19 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	nvme_report_ns_ids(ctrl, ns->ns_id, id, &ns->ids);
+	if (nvme_init_ns_head(ns, nsid, id))
+		goto out_free_id;
 
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
 		if (nvme_nvm_register(ns, disk_name, node)) {
 			dev_warn(ctrl->device, "LightNVM init failure\n");
-			goto out_free_id;
+			goto out_unlink_ns;
 		}
 	}
 
 	disk = alloc_disk_node(0, node);
 	if (!disk)
-		goto out_free_id;
+		goto out_unlink_ns;
 
 	disk->fops = &nvme_fops;
 	disk->private_data = ns;
@@ -2483,6 +2609,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
 	return;
+ out_unlink_ns:
+	mutex_lock(&ctrl->subsys->lock);
+	list_del_rcu(&ns->siblings);
+	mutex_unlock(&ctrl->subsys->lock);
  out_free_id:
 	kfree(id);
  out_free_queue:
@@ -2495,6 +2625,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
+	struct nvme_ns_head *head = ns->head;
+
 	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
 		return;
 
@@ -2509,10 +2641,16 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 		blk_cleanup_queue(ns->queue);
 	}
 
+	mutex_lock(&ns->ctrl->subsys->lock);
+	if (head)
+		list_del_rcu(&ns->siblings);
+	mutex_unlock(&ns->ctrl->subsys->lock);
+
 	mutex_lock(&ns->ctrl->namespaces_mutex);
 	list_del_init(&ns->list);
 	mutex_unlock(&ns->ctrl->namespaces_mutex);
 
+	synchronize_srcu(&head->srcu);
 	nvme_put_ns(ns);
 }
 
@@ -2535,7 +2673,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 	struct nvme_ns *ns, *next;
 
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->ns_id > nsid)
+		if (ns->head->ns_id > nsid)
 			nvme_ns_remove(ns);
 	}
 }
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 1f79e3f141e6..44e46276319c 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -305,7 +305,7 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct nvm_id *nvm_id)
 	int ret;
 
 	c.identity.opcode = nvme_nvm_admin_identity;
-	c.identity.nsid = cpu_to_le32(ns->ns_id);
+	c.identity.nsid = cpu_to_le32(ns->head->ns_id);
 	c.identity.chnl_off = 0;
 
 	nvme_nvm_id = kmalloc(sizeof(struct nvme_nvm_id), GFP_KERNEL);
@@ -344,7 +344,7 @@ static int nvme_nvm_get_l2p_tbl(struct nvm_dev *nvmdev, u64 slba, u32 nlb,
 	int ret = 0;
 
 	c.l2p.opcode = nvme_nvm_admin_get_l2p_tbl;
-	c.l2p.nsid = cpu_to_le32(ns->ns_id);
+	c.l2p.nsid = cpu_to_le32(ns->head->ns_id);
 	entries = kmalloc(len, GFP_KERNEL);
 	if (!entries)
 		return -ENOMEM;
@@ -402,7 +402,7 @@ static int nvme_nvm_get_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr ppa,
 	int ret = 0;
 
 	c.get_bb.opcode = nvme_nvm_admin_get_bb_tbl;
-	c.get_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.get_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.get_bb.spba = cpu_to_le64(ppa.ppa);
 
 	bb_tbl = kzalloc(tblsz, GFP_KERNEL);
@@ -452,7 +452,7 @@ static int nvme_nvm_set_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr *ppas,
 	int ret = 0;
 
 	c.set_bb.opcode = nvme_nvm_admin_set_bb_tbl;
-	c.set_bb.nsid = cpu_to_le32(ns->ns_id);
+	c.set_bb.nsid = cpu_to_le32(ns->head->ns_id);
 	c.set_bb.spba = cpu_to_le64(ppas->ppa);
 	c.set_bb.nlb = cpu_to_le16(nr_ppas - 1);
 	c.set_bb.value = type;
@@ -469,7 +469,7 @@ static inline void nvme_nvm_rqtocmd(struct nvm_rq *rqd, struct nvme_ns *ns,
 				    struct nvme_nvm_command *c)
 {
 	c->ph_rw.opcode = rqd->opcode;
-	c->ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c->ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c->ph_rw.spba = cpu_to_le64(rqd->ppa_addr.ppa);
 	c->ph_rw.metadata = cpu_to_le64(rqd->dma_meta_list);
 	c->ph_rw.control = cpu_to_le16(rqd->flags);
@@ -691,7 +691,7 @@ static int nvme_nvm_submit_vio(struct nvme_ns *ns,
 
 	memset(&c, 0, sizeof(c));
 	c.ph_rw.opcode = vio.opcode;
-	c.ph_rw.nsid = cpu_to_le32(ns->ns_id);
+	c.ph_rw.nsid = cpu_to_le32(ns->head->ns_id);
 	c.ph_rw.control = cpu_to_le16(vio.control);
 	c.ph_rw.length = cpu_to_le16(vio.nppas);
 
@@ -728,7 +728,7 @@ static int nvme_nvm_user_vcmd(struct nvme_ns *ns, int admin,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = vcmd.opcode;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.nsid = cpu_to_le32(ns->head->ns_id);
 	c.common.cdw2[0] = cpu_to_le32(vcmd.cdw2);
 	c.common.cdw2[1] = cpu_to_le32(vcmd.cdw3);
 	/* cdw11-12 */
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index efbf4dde6c87..849413def126 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -207,6 +207,7 @@ struct nvme_subsystem {
 	struct list_head	entry;
 	struct mutex		lock;
 	struct list_head	ctrls;
+	struct list_head	nsheads;
 	char			subnqn[NVMF_NQN_SIZE];
 	char			serial[20];
 	char			model[40];
@@ -223,18 +224,34 @@ struct nvme_ns_ids {
 	uuid_t	uuid;
 };
 
+/*
+ * Anchor structure for namespaces.  There is one for each namespace in an
+ * NVMe subsystem that any of our controllers can see, and the namespace
+ * structure for each controller is chained off it.  For private namespaces
+ * there is a 1:1 relation to our namespace structures, that is ->list
+ * only ever has a single entry for private namespaces.
+ */
+struct nvme_ns_head {
+	struct list_head	list;
+	struct srcu_struct      srcu;
+	unsigned		ns_id;
+	struct nvme_ns_ids	ids;
+	struct list_head	entry;
+	struct kref		ref;
+};
+
 struct nvme_ns {
 	struct list_head list;
 
 	struct nvme_ctrl *ctrl;
 	struct request_queue *queue;
 	struct gendisk *disk;
+	struct list_head siblings;
 	struct nvm_dev *ndev;
 	struct kref kref;
+	struct nvme_ns_head *head;
 	int instance;
 
-	unsigned ns_id;
-	struct nvme_ns_ids ids;
 	int lba_shift;
 	u16 ms;
 	u16 sgs;
-- 
2.14.2

* [PATCH 16/17] nvme: implement multipath access to nvme subsystems
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

This patch adds native multipath support to the nvme driver.  For each
namespace we create only a single block device node, which can be used
to access that namespace through any of the controllers that refer to it.
The gendisk for each controller's path to the namespace still exists
inside the kernel, but is hidden from userspace.  The character device
nodes are still available on a per-controller basis.  A new link from
the sysfs directory for the subsystem makes it possible to find all
controllers for a given subsystem.
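
As an illustration (the instance numbers here are made up; the naming
scheme itself follows the sprintf calls in the patch below), with
subsystem instance 0, nsid 1 and two controllers with cntlid 0 and 1
the namespace would show up as:

   /dev/nvme0n1   multipath node for the namespace, used for all I/O
   nvme0c0n1      hidden gendisk for the path through controller 0
   nvme0c1n1      hidden gendisk for the path through controller 1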

Currently we will always send I/O to the first available path; this will
be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
ratified and implemented, at which point we will look at the ANA state
for each namespace.  Another possibility that was prototyped is to
use the path that is closest to the submitting NUMA node, which will be
mostly interesting for PCI, but might also be useful for RDMA or FC
transports in the future.  There is no plan to implement round robin
or I/O service time path selectors, as those are not scalable with
the performance rates provided by NVMe.

The multipath device will go away once all paths to it disappear; any
delay to keep it alive needs to be implemented at the controller
level.
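
The resulting failover flow, sketched from the code below rather than
being a separate implementation:

   nvme_complete_rq()
     -> nvme_req_needs_failover()   classify the error as a path failure
     -> nvme_failover_req()         steal the bios, end the request,
                                    reset the controller
   nvme_requeue_work()
     -> bio->bi_disk = head->disk   point each bio back at the mpath node
     -> generic_make_request(bio)   resubmit so a new path is selected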

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 398 ++++++++++++++++++++++++++++++++++++++++-------
 drivers/nvme/host/nvme.h |  15 +-
 drivers/nvme/host/pci.c  |   2 +
 3 files changed, 355 insertions(+), 60 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1db26729bd89..22c06cd3bef0 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -102,6 +102,20 @@ static int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl)
 	return ret;
 }
 
+static void nvme_failover_req(struct request *req)
+{
+	struct nvme_ns *ns = req->q->queuedata;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ns->head->requeue_lock, flags);
+	blk_steal_bios(&ns->head->requeue_list, req);
+	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
+	blk_mq_end_request(req, 0);
+
+	nvme_reset_ctrl(ns->ctrl);
+	kblockd_schedule_work(&ns->head->requeue_work);
+}
+
 static blk_status_t nvme_error_status(struct request *req)
 {
 	switch (nvme_req(req)->status & 0x7ff) {
@@ -129,6 +143,53 @@ static blk_status_t nvme_error_status(struct request *req)
 	}
 }
 
+static bool nvme_req_needs_failover(struct request *req)
+{
+	if (!(req->cmd_flags & REQ_NVME_MPATH))
+		return false;
+
+	switch (nvme_req(req)->status & 0x7ff) {
+	/*
+	 * Generic command status:
+	 */
+	case NVME_SC_INVALID_OPCODE:
+	case NVME_SC_INVALID_FIELD:
+	case NVME_SC_INVALID_NS:
+	case NVME_SC_LBA_RANGE:
+	case NVME_SC_CAP_EXCEEDED:
+	case NVME_SC_RESERVATION_CONFLICT:
+		return false;
+
+	/*
+	 * I/O command set specific error.  Unfortunately these values are
+	 * reused for fabrics commands, but those should never get here.
+	 */
+	case NVME_SC_BAD_ATTRIBUTES:
+	case NVME_SC_INVALID_PI:
+	case NVME_SC_READ_ONLY:
+	case NVME_SC_ONCS_NOT_SUPPORTED:
+		WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
+			nvme_fabrics_command);
+		return false;
+
+	/*
+	 * Media and Data Integrity Errors:
+	 */
+	case NVME_SC_WRITE_FAULT:
+	case NVME_SC_READ_ERROR:
+	case NVME_SC_GUARD_CHECK:
+	case NVME_SC_APPTAG_CHECK:
+	case NVME_SC_REFTAG_CHECK:
+	case NVME_SC_COMPARE_FAILED:
+	case NVME_SC_ACCESS_DENIED:
+	case NVME_SC_UNWRITTEN_BLOCK:
+		return false;
+	}
+
+	/* Everything else could be a path failure, so should be retried */
+	return true;
+}
+
 static inline bool nvme_req_needs_retry(struct request *req)
 {
 	if (blk_noretry_request(req))
@@ -143,6 +204,11 @@ static inline bool nvme_req_needs_retry(struct request *req)
 void nvme_complete_rq(struct request *req)
 {
 	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
+		if (nvme_req_needs_failover(req)) {
+			nvme_failover_req(req);
+			return;
+		}
+
 		nvme_req(req)->retries++;
 		blk_mq_requeue_request(req, true);
 		return;
@@ -171,6 +237,18 @@ void nvme_cancel_request(struct request *req, void *data, bool reserved)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_request);
 
+static void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (ns->head)
+			kblockd_schedule_work(&ns->head->requeue_work);
+	}
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
@@ -238,9 +316,10 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 
 	if (changed)
 		ctrl->state = new_state;
-
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 
+	if (changed && ctrl->state == NVME_CTRL_LIVE)
+		nvme_kick_requeue_lists(ctrl);
 	return changed;
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -250,6 +329,15 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
+	del_gendisk(head->disk);
+	blk_set_queue_dying(head->disk->queue);
+	/* make sure all pending bios are cleaned up */
+	kblockd_schedule_work(&head->requeue_work);
+	flush_work(&head->requeue_work);
+	blk_cleanup_queue(head->disk->queue);
+	put_disk(head->disk);
+	ida_simple_remove(&head->subsys->ns_ida, head->instance);
+
 	list_del_init(&head->entry);
 	cleanup_srcu_struct(&head->srcu);
 	kfree(head);
@@ -266,9 +354,7 @@ static void nvme_free_ns(struct kref *kref)
 
 	if (ns->ndev)
 		nvme_nvm_unregister(ns);
-
 	put_disk(ns->disk);
-	ida_simple_remove(&ns->ctrl->ns_ida, ns->instance);
 	nvme_put_ns_head(ns->head);
 	nvme_put_ctrl(ns->ctrl);
 	kfree(ns);
@@ -1013,11 +1099,9 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	return status;
 }
 
-static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
-		unsigned int cmd, unsigned long arg)
+static int nvme_ns_ioctl(struct nvme_ns *ns, unsigned int cmd,
+		unsigned long arg)
 {
-	struct nvme_ns *ns = bdev->bd_disk->private_data;
-
 	switch (cmd) {
 	case NVME_IOCTL_ID:
 		force_successful_syscall_return();
@@ -1040,18 +1124,10 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 	}
 }
 
+/* should never be called due to GENHD_FL_HIDDEN */
 static int nvme_open(struct block_device *bdev, fmode_t mode)
 {
-	struct nvme_ns *ns = bdev->bd_disk->private_data;
-
-	if (!kref_get_unless_zero(&ns->kref))
-		return -ENXIO;
-	return 0;
-}
-
-static void nvme_release(struct gendisk *disk, fmode_t mode)
-{
-	nvme_put_ns(disk->private_data);
+	return WARN_ON_ONCE(-ENXIO);
 }
 
 static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
@@ -1081,8 +1157,10 @@ static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id,
 	if (blk_get_integrity(disk) &&
 	    (ns->pi_type != pi_type || ns->ms != old_ms ||
 	     bs != queue_logical_block_size(disk->queue) ||
-	     (ns->ms && ns->ext)))
+	     (ns->ms && ns->ext))) {
 		blk_integrity_unregister(disk);
+		blk_integrity_unregister(ns->head->disk);
+	}
 
 	ns->pi_type = pi_type;
 }
@@ -1110,7 +1188,9 @@ static void nvme_init_integrity(struct nvme_ns *ns)
 	}
 	integrity.tuple_size = ns->ms;
 	blk_integrity_register(ns->disk, &integrity);
+	blk_integrity_register(ns->head->disk, &integrity);
 	blk_queue_max_integrity_segments(ns->queue, 1);
+	blk_queue_max_integrity_segments(ns->head->disk->queue, 1);
 }
 #else
 static void nvme_prep_integrity(struct gendisk *disk, struct nvme_id_ns *id,
@@ -1128,7 +1208,7 @@ static void nvme_set_chunk_size(struct nvme_ns *ns)
 	blk_queue_chunk_sectors(ns->queue, rounddown_pow_of_two(chunk_size));
 }
 
-static void nvme_config_discard(struct nvme_ns *ns)
+static void nvme_config_discard(struct nvme_ns *ns, struct request_queue *queue)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	u32 logical_block_size = queue_logical_block_size(ns->queue);
@@ -1139,18 +1219,18 @@ static void nvme_config_discard(struct nvme_ns *ns)
 	if (ctrl->nr_streams && ns->sws && ns->sgs) {
 		unsigned int sz = logical_block_size * ns->sws * ns->sgs;
 
-		ns->queue->limits.discard_alignment = sz;
-		ns->queue->limits.discard_granularity = sz;
+		queue->limits.discard_alignment = sz;
+		queue->limits.discard_granularity = sz;
 	} else {
 		ns->queue->limits.discard_alignment = logical_block_size;
 		ns->queue->limits.discard_granularity = logical_block_size;
 	}
-	blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
-	blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
-	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+	blk_queue_max_discard_sectors(queue, UINT_MAX);
+	blk_queue_max_discard_segments(queue, NVME_DSM_MAX_RANGES);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, queue);
 
 	if (ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
-		blk_queue_max_write_zeroes_sectors(ns->queue, UINT_MAX);
+		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
 static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
@@ -1207,17 +1287,25 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
 		nvme_prep_integrity(disk, id, bs);
 	blk_queue_logical_block_size(ns->queue, bs);
+	blk_queue_logical_block_size(ns->head->disk->queue, bs);
 	if (ns->noiob)
 		nvme_set_chunk_size(ns);
 	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
 		nvme_init_integrity(ns);
-	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
+	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk)) {
 		set_capacity(disk, 0);
-	else
+		if (ns->head)
+			set_capacity(ns->head->disk, 0);
+	} else {
 		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+		if (ns->head)
+			set_capacity(ns->head->disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+	}
 
-	if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
-		nvme_config_discard(ns);
+	if (ctrl->oncs & NVME_CTRL_ONCS_DSM) {
+		nvme_config_discard(ns, ns->queue);
+		nvme_config_discard(ns, ns->head->disk->queue);
+	}
 	blk_mq_unfreeze_queue(disk->queue);
 }
 
@@ -1255,6 +1343,29 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	return ret;
 }
 
+static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *ns;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (ns->ctrl->state == NVME_CTRL_LIVE) {
+			rcu_assign_pointer(head->current_path, ns);
+			return ns;
+		}
+	}
+
+	return NULL;
+}
+
+static inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *ns = srcu_dereference(head->current_path, &head->srcu);
+
+	if (unlikely(!ns || ns->ctrl->state != NVME_CTRL_LIVE))
+		ns = __nvme_find_path(head);
+	return ns;
+}
+
 static char nvme_pr_type(enum pr_type type)
 {
 	switch (type) {
@@ -1278,8 +1389,10 @@ static char nvme_pr_type(enum pr_type type)
 static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 				u64 key, u64 sa_key, u8 op)
 {
-	struct nvme_ns *ns = bdev->bd_disk->private_data;
+	struct nvme_ns_head *head = bdev->bd_disk->private_data;
+	struct nvme_ns *ns;
 	struct nvme_command c;
+	int srcu_idx, ret;
 	u8 data[16] = { 0, };
 
 	put_unaligned_le64(key, &data[0]);
@@ -1287,10 +1400,17 @@ static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = op;
-	c.common.nsid = cpu_to_le32(ns->head->ns_id);
+	c.common.nsid = cpu_to_le32(head->ns_id);
 	c.common.cdw10[0] = cpu_to_le32(cdw10);
 
-	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = nvme_find_path(head);
+	if (likely(ns))
+		ret = nvme_submit_sync_cmd(ns->queue, &c, data, 16);
+	else
+		ret = -EWOULDBLOCK;
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return ret;
 }
 
 static int nvme_pr_register(struct block_device *bdev, u64 old,
@@ -1369,15 +1489,16 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 EXPORT_SYMBOL_GPL(nvme_sec_submit);
 #endif /* CONFIG_BLK_SED_OPAL */
 
+/*
+ * While we don't expose the per-controller devices to userspace we still
+ * need valid file operations for them, for one because the block layer
+ * expects to use the owner field for module refcounting, and also because
+ * we call revalidate_disk internally.
+ */
 static const struct block_device_operations nvme_fops = {
 	.owner		= THIS_MODULE,
-	.ioctl		= nvme_ioctl,
-	.compat_ioctl	= nvme_ioctl,
 	.open		= nvme_open,
-	.release	= nvme_release,
-	.getgeo		= nvme_getgeo,
 	.revalidate_disk= nvme_revalidate_disk,
-	.pr_ops		= &nvme_pr_ops,
 };
 
 static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled)
@@ -1774,6 +1895,7 @@ static void nvme_destroy_subsystem(struct kref *ref)
 	list_del(&subsys->entry);
 	mutex_unlock(&nvme_subsystems_lock);
 
+	ida_destroy(&subsys->ns_ida);
 	device_del(&subsys->dev);
 	put_device(&subsys->dev);
 }
@@ -1803,7 +1925,7 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
 static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 {
 	struct nvme_subsystem *subsys, *found;
-	int ret;
+	int ret = -ENOMEM;
 
 	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
 	if (!subsys)
@@ -1854,12 +1976,21 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 				"failed to register subsystem device.\n");
 			goto out_unlock;
 		}
+		ida_init(&subsys->ns_ida);
 		list_add_tail(&subsys->entry, &nvme_subsystems);
 	}
 
 	ctrl->subsys = subsys;
 	mutex_unlock(&nvme_subsystems_lock);
 
+	if (sysfs_create_link(&subsys->dev.kobj, &ctrl->device->kobj,
+			dev_name(ctrl->device))) {
+		dev_err(ctrl->device,
+			"failed to create sysfs link from subsystem.\n");
+		/* the transport driver will eventually put the subsystem */
+		return -EINVAL;
+	}
+
 	mutex_lock(&subsys->lock);
 	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
 	mutex_unlock(&subsys->lock);
@@ -2375,6 +2506,121 @@ static const struct attribute_group *nvme_dev_attr_groups[] = {
 	NULL,
 };
 
+static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
+		struct bio *bio)
+{
+	struct nvme_ns_head *head = q->queuedata;
+	struct device *dev = disk_to_dev(head->disk);
+	struct nvme_ns *ns;
+	blk_qc_t ret = BLK_QC_T_NONE;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = nvme_find_path(head);
+	if (likely(ns)) {
+		bio->bi_disk = ns->disk;
+		bio->bi_opf |= REQ_NVME_MPATH;
+		ret = direct_make_request(bio);
+	} else if (!list_empty_careful(&head->list)) {
+		dev_warn_ratelimited(dev, "no path available - requeueing I/O\n");
+
+		spin_lock_irq(&head->requeue_lock);
+		bio_list_add(&head->requeue_list, bio);
+		spin_unlock_irq(&head->requeue_lock);
+	} else {
+		dev_warn_ratelimited(dev, "no path - failing I/O\n");
+
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+	}
+
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return ret;
+}
+
+static bool nvme_ns_head_poll(struct request_queue *q, blk_qc_t qc)
+{
+	struct nvme_ns_head *head = q->queuedata;
+	struct nvme_ns *ns;
+	bool found = false;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = srcu_dereference(head->current_path, &head->srcu);
+	if (likely(ns && ns->ctrl->state == NVME_CTRL_LIVE))
+		found = ns->queue->poll_fn(q, qc);
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return found;
+}
+
+static int nvme_ns_head_open(struct block_device *bdev, fmode_t mode)
+{
+	struct nvme_ns_head *head = bdev->bd_disk->private_data;
+
+	if (!kref_get_unless_zero(&head->ref))
+		return -ENXIO;
+	return 0;
+}
+
+static void nvme_ns_head_release(struct gendisk *disk, fmode_t mode)
+{
+	nvme_put_ns_head(disk->private_data);
+}
+
+/*
+ * Issue the ioctl on the first available path.  Note that unlike normal block
+ * layer requests we will not retry a failed request on another controller.
+ */
+static int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	struct nvme_ns_head *head = bdev->bd_disk->private_data;
+	struct nvme_ns *ns;
+	int srcu_idx, ret;
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = nvme_find_path(head);
+	if (likely(ns))
+		ret = nvme_ns_ioctl(ns, cmd, arg);
+	else
+		ret = -EWOULDBLOCK;
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return ret;
+}
+
+static const struct block_device_operations nvme_ns_head_ops = {
+	.owner		= THIS_MODULE,
+	.open		= nvme_ns_head_open,
+	.release	= nvme_ns_head_release,
+	.ioctl		= nvme_ns_head_ioctl,
+	.compat_ioctl	= nvme_ns_head_ioctl,
+	.getgeo		= nvme_getgeo,
+	.pr_ops		= &nvme_pr_ops,
+};
+
+static void nvme_requeue_work(struct work_struct *work)
+{
+	struct nvme_ns_head *head =
+		container_of(work, struct nvme_ns_head, requeue_work);
+	struct bio *bio, *next;
+
+	spin_lock_irq(&head->requeue_lock);
+	next = bio_list_get(&head->requeue_list);
+	spin_unlock_irq(&head->requeue_lock);
+
+	while ((bio = next) != NULL) {
+		next = bio->bi_next;
+		bio->bi_next = NULL;
+
+		/*
+		 * Reset disk to the mpath node and resubmit to select a new
+		 * path.
+		 */
+		bio->bi_disk = head->disk;
+		generic_make_request(bio);
+	}
+}
+
 static struct nvme_ns_head *__nvme_find_ns_head(struct nvme_subsystem *subsys,
 		unsigned nsid)
 {
@@ -2410,15 +2656,23 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 		unsigned nsid, struct nvme_id_ns *id)
 {
 	struct nvme_ns_head *head;
+	struct request_queue *q;
 	int ret = -ENOMEM;
 
 	head = kzalloc(sizeof(*head), GFP_KERNEL);
 	if (!head)
 		goto out;
-
+	ret = ida_simple_get(&ctrl->subsys->ns_ida, 1, 0, GFP_KERNEL);
+	if (ret < 0)
+		goto out_free_head;
+	head->instance = ret;
 	INIT_LIST_HEAD(&head->list);
 	head->ns_id = nsid;
+	bio_list_init(&head->requeue_list);
+	spin_lock_init(&head->requeue_lock);
+	INIT_WORK(&head->requeue_work, nvme_requeue_work);
 	init_srcu_struct(&head->srcu);
+	head->subsys = ctrl->subsys;
 	kref_init(&head->ref);
 
 	nvme_report_ns_ids(ctrl, nsid, id, &head->ids);
@@ -2427,20 +2681,46 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 	if (ret) {
 		dev_err(ctrl->device,
 			"duplicate IDs for nsid %d\n", nsid);
-		goto out_free_head;
+		goto out_release_instance;
 	}
 
+	ret = -ENOMEM;
+	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
+	if (!q)
+		goto out_free_head;
+	q->queuedata = head;
+	blk_queue_make_request(q, nvme_ns_head_make_request);
+	q->poll_fn = nvme_ns_head_poll;
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
+	/* set to a default value of 512 until the disk is validated */
+	blk_queue_logical_block_size(q, 512);
+	nvme_set_queue_limits(ctrl, q);
+
+	head->disk = alloc_disk(0);
+	if (!head->disk)
+		goto out_cleanup_queue;
+	head->disk->fops = &nvme_ns_head_ops;
+	head->disk->private_data = head;
+	head->disk->queue = q;
+	head->disk->flags = GENHD_FL_EXT_DEVT;
+	sprintf(head->disk->disk_name, "nvme%dn%d",
+			ctrl->subsys->instance, nsid);
 	list_add_tail(&head->entry, &ctrl->subsys->nsheads);
 	return head;
+
+out_cleanup_queue:
+	blk_cleanup_queue(q);
 out_free_head:
 	cleanup_srcu_struct(&head->srcu);
 	kfree(head);
+out_release_instance:
+	ida_simple_remove(&ctrl->subsys->ns_ida, head->instance);
 out:
 	return ERR_PTR(ret);
 }
 
 static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
-		struct nvme_id_ns *id)
+		struct nvme_id_ns *id, bool *new)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	bool is_shared = id->nmic & (1 << 0);
@@ -2456,6 +2736,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
 			ret = PTR_ERR(head);
 			goto out_unlock;
 		}
+
+		*new = true;
 	} else {
 		struct nvme_ns_ids ids;
 
@@ -2467,6 +2749,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
 			ret = -EINVAL;
 			goto out_unlock;
 		}
+
+		*new = false;
 	}
 
 	list_add_tail(&ns->siblings, &head->list);
@@ -2537,18 +2821,15 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	struct nvme_id_ns *id;
 	char disk_name[DISK_NAME_LEN];
 	int node = dev_to_node(ctrl->dev);
+	bool new = true;
 
 	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
 		return;
 
-	ns->instance = ida_simple_get(&ctrl->ns_ida, 1, 0, GFP_KERNEL);
-	if (ns->instance < 0)
-		goto out_free_ns;
-
 	ns->queue = blk_mq_init_queue(ctrl->tagset);
 	if (IS_ERR(ns->queue))
-		goto out_release_instance;
+		goto out_free_ns;
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
 	ns->queue->queuedata = ns;
 	ns->ctrl = ctrl;
@@ -2560,8 +2841,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	nvme_set_queue_limits(ctrl, ns->queue);
 	nvme_setup_streams_ns(ctrl, ns);
 
-	sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance);
-
 	id = nvme_identify_ns(ctrl, nsid);
 	if (!id)
 		goto out_free_queue;
@@ -2569,9 +2848,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (id->ncap == 0)
 		goto out_free_id;
 
-	if (nvme_init_ns_head(ns, nsid, id))
+	if (nvme_init_ns_head(ns, nsid, id, &new))
 		goto out_free_id;
 
+	sprintf(disk_name, "nvme%dc%dn%d", ctrl->subsys->instance,
+			ctrl->cntlid, ns->head->instance);
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
 		if (nvme_nvm_register(ns, disk_name, node)) {
 			dev_warn(ctrl->device, "LightNVM init failure\n");
@@ -2586,7 +2867,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	disk->fops = &nvme_fops;
 	disk->private_data = ns;
 	disk->queue = ns->queue;
-	disk->flags = GENHD_FL_EXT_DEVT;
+	disk->flags = GENHD_FL_HIDDEN;
 	memcpy(disk->disk_name, disk_name, DISK_NAME_LEN);
 	ns->disk = disk;
 
@@ -2608,6 +2889,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (ns->ndev && nvme_nvm_register_sysfs(ns))
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
+
+	if (new)
+		device_add_disk(&ns->head->subsys->dev, ns->head->disk);
+
 	return;
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
@@ -2617,8 +2902,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	kfree(id);
  out_free_queue:
 	blk_cleanup_queue(ns->queue);
- out_release_instance:
-	ida_simple_remove(&ctrl->ns_ida, ns->instance);
  out_free_ns:
 	kfree(ns);
 }
@@ -2633,8 +2916,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
 		if (blk_get_integrity(ns->disk))
 			blk_integrity_unregister(ns->disk);
-		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_attr_group);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
 		del_gendisk(ns->disk);
@@ -2642,8 +2923,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	}
 
 	mutex_lock(&ns->ctrl->subsys->lock);
-	if (head)
+	if (head) {
+		if (head->current_path == ns)
+			rcu_assign_pointer(head->current_path, NULL);
 		list_del_rcu(&ns->siblings);
+	}
 	mutex_unlock(&ns->ctrl->subsys->lock);
 
 	mutex_lock(&ns->ctrl->namespaces_mutex);
@@ -2948,12 +3232,12 @@ static void nvme_free_ctrl(struct device *dev)
 	struct nvme_subsystem *subsys = ctrl->subsys;
 
 	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
-	ida_destroy(&ctrl->ns_ida);
 
 	if (subsys) {
 		mutex_lock(&subsys->lock);
 		list_del(&ctrl->subsys_entry);
 		mutex_unlock(&subsys->lock);
+		sysfs_remove_link(&subsys->dev.kobj, dev_name(ctrl->device));
 	}
 
 	ctrl->ops->free_ctrl(ctrl);
@@ -3006,8 +3290,6 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	if (ret)
 		goto out_free_name;
 
-	ida_init(&ctrl->ns_ida);
-
 	/*
 	 * Initialize latency tolerance controls.  The sysfs files won't
 	 * be visible to userspace unless the device actually supports APST.
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 849413def126..87dd77e8fbdf 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -95,6 +95,11 @@ struct nvme_request {
 	u16			status;
 };
 
+/*
+ * Mark a bio as coming in through the mpath node.
+ */
+#define REQ_NVME_MPATH		REQ_DRV
+
 enum {
 	NVME_REQ_CANCELLED		= (1 << 0),
 };
@@ -136,7 +141,6 @@ struct nvme_ctrl {
 	struct device ctrl_device;
 	struct device *device;	/* char device */
 	struct cdev cdev;
-	struct ida ns_ida;
 	struct work_struct reset_work;
 
 	struct nvme_subsystem *subsys;
@@ -213,6 +217,7 @@ struct nvme_subsystem {
 	char			model[40];
 	char			firmware_rev[8];
 	u16			vendor_id;
+	struct ida		ns_ida;
 };
 
 /*
@@ -232,12 +237,19 @@ struct nvme_ns_ids {
  * only ever has a single entry for private namespaces.
  */
 struct nvme_ns_head {
+	struct nvme_ns __rcu	*current_path;
+	struct gendisk		*disk;
 	struct list_head	list;
 	struct srcu_struct      srcu;
+	struct nvme_subsystem	*subsys;
+	struct bio_list		requeue_list;
+	spinlock_t		requeue_lock;
+	struct work_struct	requeue_work;
 	unsigned		ns_id;
 	struct nvme_ns_ids	ids;
 	struct list_head	entry;
 	struct kref		ref;
+	int			instance;
 };
 
 struct nvme_ns {
@@ -250,7 +262,6 @@ struct nvme_ns {
 	struct nvm_dev *ndev;
 	struct kref kref;
 	struct nvme_ns_head *head;
-	int instance;
 
 	int lba_shift;
 	u16 ms;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7735571ffc9a..bbece5edabff 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1050,6 +1050,8 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 {
 	struct nvme_queue *nvmeq = hctx->driver_data;
 
+	printk_ratelimited("%s: called\n", __func__);
+
 	return __nvme_poll(nvmeq, tag);
 }
 
-- 
2.14.2

* [PATCH 17/17] nvme: also expose the namespace identification sysfs files for mpath nodes
  2017-10-23 14:51 ` Christoph Hellwig
@ 2017-10-23 14:51   ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

We do this by adding a helper that returns the ns_head for a device that
can be either a per-controller or a per-subsystem block device node, and
by otherwise reusing all the existing code.
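
For example (the path is illustrative, assuming a multipath node named
nvme0n1 as in the previous patch), the identification attributes now
appear under the subsystem-wide node as well:

   /sys/block/nvme0n1/wwid
   /sys/block/nvme0n1/uuid
   /sys/block/nvme0n1/nguid
   /sys/block/nvme0n1/eui
   /sys/block/nvme0n1/nsid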

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/nvme/host/core.c | 69 +++++++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 27 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 22c06cd3bef0..334735db90c8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -76,6 +76,7 @@ static DEFINE_IDA(nvme_instance_ida);
 static dev_t nvme_chr_devt;
 static struct class *nvme_class;
 static struct class *nvme_subsys_class;
+static const struct attribute_group nvme_ns_id_attr_group;
 
 static __le32 nvme_get_log_dw10(u8 lid, size_t size)
 {
@@ -329,6 +330,8 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
+	sysfs_remove_group(&disk_to_dev(head->disk)->kobj,
+			   &nvme_ns_id_attr_group);
 	del_gendisk(head->disk);
 	blk_set_queue_dying(head->disk->queue);
 	/* make sure all pending bios are cleaned up */
@@ -1925,7 +1928,7 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
 static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 {
 	struct nvme_subsystem *subsys, *found;
-	int ret = -ENOMEM;
+	int ret;
 
 	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
 	if (!subsys)
@@ -2267,12 +2270,22 @@ static ssize_t nvme_sysfs_rescan(struct device *dev,
 }
 static DEVICE_ATTR(rescan_controller, S_IWUSR, NULL, nvme_sysfs_rescan);
 
+static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+
+	if (disk->fops == &nvme_fops)
+		return nvme_get_ns_from_dev(dev)->head;
+	else
+		return disk->private_data;
+}
+
 static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
-								char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->head->ids;
-	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+	struct nvme_ns_head *head = dev_to_ns_head(dev);
+	struct nvme_ns_ids *ids = &head->ids;
+	struct nvme_subsystem *subsys = head->subsys;
 	int serial_len = sizeof(subsys->serial);
 	int model_len = sizeof(subsys->model);
 
@@ -2294,23 +2307,21 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
 
 	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
 		serial_len, subsys->serial, model_len, subsys->model,
-		ns->head->ns_id);
+		head->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
 static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
-			  char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%pU\n", ns->head->ids.nguid);
+	return sprintf(buf, "%pU\n", dev_to_ns_head(dev)->ids.nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
 static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
-								char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->head->ids;
+	struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
 
 	/* For backward compatibility expose the NGUID to userspace if
 	 * we have no UUID set
@@ -2325,22 +2336,20 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
 static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
 
 static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
-								char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%8phd\n", ns->head->ids.eui64);
+	return sprintf(buf, "%8phd\n", dev_to_ns_head(dev)->ids.eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
 static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
-								char *buf)
+		char *buf)
 {
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	return sprintf(buf, "%d\n", ns->head->ns_id);
+	return sprintf(buf, "%d\n", dev_to_ns_head(dev)->ns_id);
 }
 static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
 
-static struct attribute *nvme_ns_attrs[] = {
+static struct attribute *nvme_ns_id_attrs[] = {
 	&dev_attr_wwid.attr,
 	&dev_attr_uuid.attr,
 	&dev_attr_nguid.attr,
@@ -2349,12 +2358,11 @@ static struct attribute *nvme_ns_attrs[] = {
 	NULL,
 };
 
-static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
+static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj,
 		struct attribute *a, int n)
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
-	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-	struct nvme_ns_ids *ids = &ns->head->ids;
+	struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
 
 	if (a == &dev_attr_uuid.attr) {
 		if (uuid_is_null(&ids->uuid) ||
@@ -2372,9 +2380,9 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 	return a->mode;
 }
 
-static const struct attribute_group nvme_ns_attr_group = {
-	.attrs		= nvme_ns_attrs,
-	.is_visible	= nvme_ns_attrs_are_visible,
+static const struct attribute_group nvme_ns_id_attr_group = {
+	.attrs		= nvme_ns_id_attrs,
+	.is_visible	= nvme_ns_id_attrs_are_visible,
 };
 
 #define nvme_show_str_function(field)						\
@@ -2883,15 +2891,20 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	device_add_disk(ctrl->device, ns->disk);
 	if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_attr_group))
+					&nvme_ns_id_attr_group))
 		pr_warn("%s: failed to create sysfs group for identification\n",
 			ns->disk->disk_name);
 	if (ns->ndev && nvme_nvm_register_sysfs(ns))
 		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
 			ns->disk->disk_name);
 
-	if (new)
+	if (new) {
 		device_add_disk(&ns->head->subsys->dev, ns->head->disk);
+		if (sysfs_create_group(&disk_to_dev(ns->head->disk)->kobj,
+				&nvme_ns_id_attr_group))
+			pr_warn("%s: failed to create sysfs group for identification\n",
+				ns->head->disk->disk_name);
+	}
 
 	return;
  out_unlink_ns:
@@ -2916,6 +2929,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
 		if (blk_get_integrity(ns->disk))
 			blk_integrity_unregister(ns->disk);
+		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
+				   &nvme_ns_id_attr_group);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
 		del_gendisk(ns->disk);
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH 16/17] nvme: implement multipath access to nvme subsystems
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-23 15:32     ` Sagi Grimberg
  -1 siblings, 0 replies; 116+ messages in thread
From: Sagi Grimberg @ 2017-10-23 15:32 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Hannes Reinecke, Johannes Thumshirn, linux-nvme,
	linux-block


> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 7735571ffc9a..bbece5edabff 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1050,6 +1050,8 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
>   {
>   	struct nvme_queue *nvmeq = hctx->driver_data;
>   
> +	printk_ratelimited("%s: called\n", __func__);
> +

This must be a left-over...

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 16/17] nvme: implement multipath access to nvme subsystems
  2017-10-23 15:32     ` Sagi Grimberg
@ 2017-10-23 16:57       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-23 16:57 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Hannes Reinecke,
	Johannes Thumshirn, linux-nvme, linux-block

On Mon, Oct 23, 2017 at 06:32:47PM +0300, Sagi Grimberg wrote:
>>   	struct nvme_queue *nvmeq = hctx->driver_data;
>>   +	printk_ratelimited("%s: called\n", __func__);
>> +
>
> This must be a left-over...

Indeed, it is a left-over debug statement..

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 16/17] nvme: implement multipath access to nvme subsystems
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-23 17:43     ` Sagi Grimberg
  -1 siblings, 0 replies; 116+ messages in thread
From: Sagi Grimberg @ 2017-10-23 17:43 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Hannes Reinecke, Johannes Thumshirn, linux-nvme,
	linux-block


>   static inline bool nvme_req_needs_retry(struct request *req)
>   {
>   	if (blk_noretry_request(req))
> @@ -143,6 +204,11 @@ static inline bool nvme_req_needs_retry(struct request *req)
>   void nvme_complete_rq(struct request *req)
>   {
>   	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
> +		if (nvme_req_needs_failover(req)) {
> +			nvme_failover_req(req);
> +			return;
> +		}
> +
>   		nvme_req(req)->retries++;
>   		blk_mq_requeue_request(req, true);
>   		return;

Nit, consider having the !nvme_req_needs_failover() case in an else
clause, just a suggestion though.
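For concreteness, the restructured branch would look something like this
(a sketch, not compile-tested):

	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
		if (nvme_req_needs_failover(req)) {
			nvme_failover_req(req);
		} else {
			nvme_req(req)->retries++;
			blk_mq_requeue_request(req, true);
		}
		return;
	}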



> @@ -171,6 +237,18 @@ void nvme_cancel_request(struct request *req, void *data, bool reserved)
>   }
>   EXPORT_SYMBOL_GPL(nvme_cancel_request);

Question,

Does the statement in nvme_cancel_request:

         if (blk_queue_dying(req->q))
                 status |= NVME_SC_DNR;

still hold?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 03/17] block: provide a direct_make_request helper
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:05     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:05 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This helper allows reinserting a bio into a new queue without much
> overhead, but requires all queue limits to be the same for the upper
> and lower queues, and it does not provide any recursion prevention.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  block/blk-core.c       | 34 ++++++++++++++++++++++++++++++++++
>  include/linux/blkdev.h |  1 +
>  2 files changed, 35 insertions(+)
> 
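A usage sketch, assuming the signature this patch adds (bio and ns are
placeholder names here, not from the patch):

	/* resubmit a bio on another path's queue; both queues must have
	 * identical limits, and the caller must prevent unbounded recursion */
	bio->bi_disk = ns->disk;	/* redirect to the newly chosen path */
	direct_make_request(bio);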
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/17] block: add a blk_steal_bios helper
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:07     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:07 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This helper allows stealing the uncompleted bios from a request so
> that they can be reissued on another path.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  block/blk-core.c       | 20 ++++++++++++++++++++
>  include/linux/blkdev.h |  2 ++
>  2 files changed, 22 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>
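Patch 16 later in this series shows the intended caller: failover
detaches the unfinished bios before completing the request, roughly:

	/* sketch, following nvme_failover_req() from patch 16 */
	spin_lock_irqsave(&ns->head->requeue_lock, flags);
	blk_steal_bios(&ns->head->requeue_list, req);
	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
	blk_mq_end_request(req, 0);	/* the request ends, its bios live on */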

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 05/17] block: don't look at the struct device dev_t in disk_devt
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:08     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:08 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> The hidden gendisks introduced in the next patch need to keep the dev
> field in their struct device empty so that udev won't try to create
> block device nodes for them.  To support that, rewrite disk_devt to
> look at the major and first_minor fields in the gendisk itself instead
> of looking into the struct device.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/genhd.c         | 4 ----
>  include/linux/genhd.h | 2 +-
>  2 files changed, 1 insertion(+), 5 deletions(-)
> 
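The resulting helper is tiny; a sketch of what the description implies:

	static inline dev_t disk_devt(struct gendisk *disk)
	{
		return MKDEV(disk->major, disk->first_minor);
	}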
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:18     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:18 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> With this flag a driver can create a gendisk that can be used for I/O
> submission inside the kernel, but which is not registered as user
> facing block device.  This will be useful for the NVMe multipath
> implementation.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/genhd.c         | 57 +++++++++++++++++++++++++++++++++++----------------
>  include/linux/genhd.h |  1 +
>  2 files changed, 40 insertions(+), 18 deletions(-)
> 
Can we have some information in sysfs to figure out if a gendisk is
hidden? I'd hate having to do an inverse lookup in /proc/partitions;
it's always hard to prove that something is _not_ present.
And we already present various information (like disk_removable_show()),
so it's not without precedent.
And it would make integration with systemd/udev easier.
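Such an attribute would only be a few lines, following the
disk_removable_show() precedent (a sketch, not part of this series):

	static ssize_t disk_hidden_show(struct device *dev,
					struct device_attribute *attr, char *buf)
	{
		struct gendisk *disk = dev_to_disk(dev);

		return sprintf(buf, "%d\n", !!(disk->flags & GENHD_FL_HIDDEN));
	}
	static DEVICE_ATTR(hidden, S_IRUGO, disk_hidden_show, NULL);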

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/17] block: add a poll_fn callback to struct request_queue
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:20     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:20 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> That way we can also poll non-blk-mq queues.  Mostly needed for
> the NVMe multipath code, but could also be useful elsewhere.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-core.c             | 11 +++++++++++
>  block/blk-mq.c               | 14 +++++---------
>  drivers/nvme/target/io-cmd.c |  2 +-
>  fs/block_dev.c               |  4 ++--
>  fs/direct-io.c               |  2 +-
>  fs/iomap.c                   |  2 +-
>  include/linux/blkdev.h       |  4 +++-
>  mm/page_io.c                 |  2 +-
>  8 files changed, 25 insertions(+), 16 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>
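A sketch of the dispatch this enables (the actual patch presumably also
gates on QUEUE_FLAG_POLL):

	int blk_poll(struct request_queue *q, blk_qc_t cookie)
	{
		if (!q->poll_fn || !blk_qc_t_valid(cookie))
			return 0;
		return q->poll_fn(q, cookie);
	}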

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 08/17] nvme: use kref_get_unless_zero in nvme_find_get_ns
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:21     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:21 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> For kref_get_unless_zero to protect against lookup vs free races we need
> to use it in all places where we aren't guaranteed to already hold a
> reference.  There is no such guarantee in nvme_find_get_ns, so switch to
> kref_get_unless_zero in this function.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/nvme/host/core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
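The pattern being switched to looks roughly like this (a sketch; field
names approximate for this kernel):

	struct nvme_ns *ns, *ret = NULL;

	mutex_lock(&ctrl->namespaces_mutex);
	list_for_each_entry(ns, &ctrl->namespaces, list) {
		if (ns->ns_id == nsid) {
			if (!kref_get_unless_zero(&ns->kref))
				continue;	/* lost the race with the final put */
			ret = ns;
			break;
		}
	}
	mutex_unlock(&ctrl->namespaces_mutex);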
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 09/17] nvme: simplify nvme_open
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:21     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:21 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> Now that we are protected against lookup vs free races for the namespace
> by using kref_get_unless_zero we don't need the hack of NULLing out the
> disk private data during removal.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/nvme/host/core.c | 40 ++++++++++------------------------------
>  1 file changed, 10 insertions(+), 30 deletions(-)
> 
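With the lookup taking its reference atomically, open reduces to roughly
the following (a sketch; per the V5 changelog the series also takes a
ns_head reference in ->open):

	static int nvme_open(struct block_device *bdev, fmode_t mode)
	{
		struct nvme_ns *ns = bdev->bd_disk->private_data;

		/* private_data stays valid as long as the gendisk exists;
		 * the kref decides whether the namespace is still alive */
		if (!kref_get_unless_zero(&ns->kref))
			return -ENXIO;
		return 0;
	}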
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/17] nvme: switch controller refcounting to use struct device
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:23     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:23 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> Instead of allocating a separate struct device for the character device
> handle, embed it into struct nvme_ctrl and use it for the main controller
> refcounting.  This removes double refcounting and gets us an automatic
> reference for the character device operations.  We keep ctrl->device as a
> pointer for now to avoid changing printks all over, but in the future we
> could look into message printing helpers that take a controller structure
> similar to what other subsystems do.
> 
> Note the delete_ctrl operation always already has a reference (either
> through sysfs due to this change, or because every open file on the
> /dev/nvme-fabrics node has a reference) when it is entered now, so we
> don't need to do the unless_zero variant there.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/core.c   | 43 ++++++++++++++++++++++---------------------
>  drivers/nvme/host/fc.c     |  8 ++------
>  drivers/nvme/host/nvme.h   | 12 +++++++++++-
>  drivers/nvme/host/pci.c    |  2 +-
>  drivers/nvme/host/rdma.c   |  5 ++---
>  drivers/nvme/target/loop.c |  2 +-
>  6 files changed, 39 insertions(+), 33 deletions(-)
> 
Round of applause for this :-)
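The shape of the change, roughly (field names assumed from this series):

	struct nvme_ctrl {
		struct device ctrl_device;	/* embedded, owns the refcount */
		struct device *device;		/* points at &ctrl_device, for printks */
		struct cdev cdev;
		/* remaining fields unchanged */
	};

	/* controller refcounting now rides on the embedded device */
	static void nvme_get_ctrl(struct nvme_ctrl *ctrl)
	{
		get_device(ctrl->device);
	}

	static void nvme_put_ctrl(struct nvme_ctrl *ctrl)
	{
		put_device(ctrl->device);
	}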

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 11/17] nvme: get rid of nvme_ctrl_list
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:24     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:24 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> Use the core chrdev code to set up the link between the character device
> and the nvme controller.  This allows us to get rid of the global list
> of all controllers, and also ensures that we have both a reference to
> the controller and the transport module before the open method of the
> character device is called.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/nvme/host/core.c | 76 ++++++++++--------------------------------------
>  drivers/nvme/host/nvme.h |  3 +-
>  2 files changed, 18 insertions(+), 61 deletions(-)
> 
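The core API that replaces the global list is cdev_device_add(), which
ties the cdev to the embedded device (a sketch of the registration):

	cdev_init(&ctrl->cdev, &nvme_dev_fops);
	ctrl->cdev.owner = ops->module;		/* pin the transport module */
	ret = cdev_device_add(&ctrl->cdev, ctrl->device);
	if (ret)
		goto out_free;	/* error label assumed */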
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 12/17] nvme: check for a live controller in nvme_dev_open
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:25     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:25 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This is a much more sensible check than just the admin queue.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/nvme/host/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a56a1e0432e7..df525ab42fcd 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1891,7 +1891,7 @@ static int nvme_dev_open(struct inode *inode, struct file *file)
>  	struct nvme_ctrl *ctrl =
>  		container_of(inode->i_cdev, struct nvme_ctrl, cdev);
>  
> -	if (!ctrl->admin_q)
> +	if (ctrl->state != NVME_CTRL_LIVE)
>  		return -EWOULDBLOCK;
>  	file->private_data = ctrl;
>  	return 0;
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 14/17] nvme: introduce a nvme_ns_ids structure
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:27     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:27 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This allows us to manage the various unique namespace identifiers
> together instead of needing various variables and arguments.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Keith Busch <keith.busch@intel.com>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  drivers/nvme/host/core.c | 69 +++++++++++++++++++++++++++---------------------
>  drivers/nvme/host/nvme.h | 14 +++++++---
>  2 files changed, 49 insertions(+), 34 deletions(-)
> 
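The structure groups the identifier formats the Identify data can report
(a sketch matching this description):

	struct nvme_ns_ids {
		u8	eui64[8];	/* IEEE Extended Unique Identifier */
		u8	nguid[16];	/* Namespace Globally Unique Identifier */
		uuid_t	uuid;		/* Namespace UUID (NVMe 1.3) */
	};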
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 13/17] nvme: track subsystems
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:33     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:33 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This adds a new nvme_subsystem structure so that we can track multiple
> controllers that belong to a single subsystem.  For now we only use it
> to store the NQN, and to check that we don't have duplicate NQNs unless
> the involved subsystems support multiple controllers.
> 
> Includes code originally from Hannes Reinecke to expose the subsystems
> in sysfs.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/core.c    | 200 +++++++++++++++++++++++++++++++++++++-------
>  drivers/nvme/host/fabrics.c |   4 +-
>  drivers/nvme/host/nvme.h    |  26 ++++--
>  3 files changed, 194 insertions(+), 36 deletions(-)
> 
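Roughly, one object per NQN that all of its controllers hang off (a
sketch; the exact field set is assumed):

	struct nvme_subsystem {
		int			instance;
		struct device		dev;	/* exposed in sysfs */
		struct list_head	entry;	/* on the global subsystems list */
		struct list_head	ctrls;	/* all nvme_ctrls in this subsystem */
		char			subnqn[NVMF_NQN_SIZE];
	};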
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/17] nvme: track shared namespaces
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:34     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:34 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> Introduce a new struct nvme_ns_head that holds information about an actual
> namespace, unlike struct nvme_ns, which only holds the per-controller
> namespace information.  For private namespaces there is a 1:1 relation of
> the two, but for shared namespaces this lets us discover all the paths to
> it.  For now only the identifiers are moved to the new structure, but most
> of the information in struct nvme_ns should eventually move over.
> 
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Keith Busch <keith.busch@intel.com>
> Reviewed-by: Javier González <javier@cnexlabs.com>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/nvme/host/core.c     | 190 +++++++++++++++++++++++++++++++++++++------
>  drivers/nvme/host/lightnvm.c |  14 ++--
>  drivers/nvme/host/nvme.h     |  21 ++++-
>  3 files changed, 190 insertions(+), 35 deletions(-)
> 
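A sketch of the shared object sitting behind all paths (field set
assumed from this description):

	struct nvme_ns_head {
		struct list_head	list;		/* nvme_ns paths, SRCU-traversed */
		struct srcu_struct	srcu;		/* protects lockless path lookup */
		struct nvme_subsystem	*subsys;
		unsigned		ns_id;
		struct nvme_ns_ids	ids;
		struct kref		ref;
	};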
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 16/17] nvme: implement multipath access to nvme subsystems
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:43     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:43 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This patch adds native multipath support to the nvme driver.  For each
> namespace we create only a single block device node, which can be used
> to access that namespace through any of the controllers that refer to it.
> The gendisk for each controller's path to the namespace still exists
> inside the kernel, but is hidden from userspace.  The character device
> nodes are still available on a per-controller basis.  A new link from
> the sysfs directory for the subsystem allows finding all controllers
> for a given subsystem.
> 
> Currently we will always send I/O to the first available path; this will
> be changed once the NVMe Asymmetric Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
> 
> The multipath device will go away once all paths to it disappear,
> any delay to keep it alive needs to be implemented at the controller
> level.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/core.c | 398 ++++++++++++++++++++++++++++++++++++++++-------
>  drivers/nvme/host/nvme.h |  15 +-
>  drivers/nvme/host/pci.c  |   2 +
>  3 files changed, 355 insertions(+), 60 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1db26729bd89..22c06cd3bef0 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -102,6 +102,20 @@ static int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl)
>  	return ret;
>  }
>  
> +static void nvme_failover_req(struct request *req)
> +{
> +	struct nvme_ns *ns = req->q->queuedata;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&ns->head->requeue_lock, flags);
> +	blk_steal_bios(&ns->head->requeue_list, req);
> +	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> +	blk_mq_end_request(req, 0);
> +
> +	nvme_reset_ctrl(ns->ctrl);
> +	kblockd_schedule_work(&ns->head->requeue_work);
> +}
> +
>  static blk_status_t nvme_error_status(struct request *req)
>  {
>  	switch (nvme_req(req)->status & 0x7ff) {
> @@ -129,6 +143,53 @@ static blk_status_t nvme_error_status(struct request *req)
>  	}
>  }
>  
> +static bool nvme_req_needs_failover(struct request *req)
> +{
> +	if (!(req->cmd_flags & REQ_NVME_MPATH))
> +		return false;
> +
> +	switch (nvme_req(req)->status & 0x7ff) {
> +	/*
> +	 * Generic command status:
> +	 */
> +	case NVME_SC_INVALID_OPCODE:
> +	case NVME_SC_INVALID_FIELD:
> +	case NVME_SC_INVALID_NS:
> +	case NVME_SC_LBA_RANGE:
> +	case NVME_SC_CAP_EXCEEDED:
> +	case NVME_SC_RESERVATION_CONFLICT:
> +		return false;
> +
> +	/*
> +	 * I/O command set specific error.  Unfortunately these values are
> +	 * reused for fabrics commands, but those should never get here.
> +	 */
> +	case NVME_SC_BAD_ATTRIBUTES:
> +	case NVME_SC_INVALID_PI:
> +	case NVME_SC_READ_ONLY:
> +	case NVME_SC_ONCS_NOT_SUPPORTED:
> +		WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
> +			nvme_fabrics_command);
> +		return false;
> +
> +	/*
> +	 * Media and Data Integrity Errors:
> +	 */
> +	case NVME_SC_WRITE_FAULT:
> +	case NVME_SC_READ_ERROR:
> +	case NVME_SC_GUARD_CHECK:
> +	case NVME_SC_APPTAG_CHECK:
> +	case NVME_SC_REFTAG_CHECK:
> +	case NVME_SC_COMPARE_FAILED:
> +	case NVME_SC_ACCESS_DENIED:
> +	case NVME_SC_UNWRITTEN_BLOCK:
> +		return false;
> +	}
> +
> +	/* Everything else could be a path failure, so should be retried */
> +	return true;
> +}
> +
>  static inline bool nvme_req_needs_retry(struct request *req)
>  {
>  	if (blk_noretry_request(req))
> @@ -143,6 +204,11 @@ static inline bool nvme_req_needs_retry(struct request *req)
>  void nvme_complete_rq(struct request *req)
>  {
>  	if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
> +		if (nvme_req_needs_failover(req)) {
> +			nvme_failover_req(req);
> +			return;
> +		}
> +
>  		nvme_req(req)->retries++;
>  		blk_mq_requeue_request(req, true);
>  		return;
Sure this works? nvme_req_needs_retry() checks blk_noretry_request():

static inline bool nvme_req_needs_retry(struct request *req)
{
	if (blk_noretry_request(req))
		return false;

which has:
#define blk_noretry_request(rq) \
	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
			     REQ_FAILFAST_DRIVER))

The original idea here was to _set_ these bits on multipath path devices
so that they won't attempt any retry, but rather forward the I/O error
to the multipath device itself for failover.
So if these bits are set (as they should be for multipathed devices)
we'll never attempt any failover...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 17/17] nvme: also expose the namespace identification sysfs files for mpath nodes
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  7:45     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-24  7:45 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Johannes Thumshirn, linux-nvme, linux-block

On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> We do this by adding a helper that returns the ns_head for a device that
> can belong to either the per-controller or the per-subsystem block device
> nodes, and by otherwise reusing all the existing code.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Keith Busch <keith.busch@intel.com>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/nvme/host/core.c | 69 +++++++++++++++++++++++++++++-------------------
>  1 file changed, 42 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 22c06cd3bef0..334735db90c8 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -76,6 +76,7 @@ static DEFINE_IDA(nvme_instance_ida);
>  static dev_t nvme_chr_devt;
>  static struct class *nvme_class;
>  static struct class *nvme_subsys_class;
> +static const struct attribute_group nvme_ns_id_attr_group;
>  
>  static __le32 nvme_get_log_dw10(u8 lid, size_t size)
>  {
> @@ -329,6 +330,8 @@ static void nvme_free_ns_head(struct kref *ref)
>  	struct nvme_ns_head *head =
>  		container_of(ref, struct nvme_ns_head, ref);
>  
> +	sysfs_remove_group(&disk_to_dev(head->disk)->kobj,
> +			   &nvme_ns_id_attr_group);
>  	del_gendisk(head->disk);
>  	blk_set_queue_dying(head->disk->queue);
>  	/* make sure all pending bios are cleaned up */
> @@ -1925,7 +1928,7 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
>  static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
>  {
>  	struct nvme_subsystem *subsys, *found;
> -	int ret = -ENOMEM;
> +	int ret;
>  
>  	subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
>  	if (!subsys)
> @@ -2267,12 +2270,22 @@ static ssize_t nvme_sysfs_rescan(struct device *dev,
>  }
>  static DEVICE_ATTR(rescan_controller, S_IWUSR, NULL, nvme_sysfs_rescan);
>  
> +static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev)
> +{
> +	struct gendisk *disk = dev_to_disk(dev);
> +
> +	if (disk->fops == &nvme_fops)
> +		return nvme_get_ns_from_dev(dev)->head;
> +	else
> +		return disk->private_data;
> +}
> +
>  static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
> -								char *buf)
> +		char *buf)
>  {
> -	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> -	struct nvme_ns_ids *ids = &ns->head->ids;
> -	struct nvme_subsystem *subsys = ns->ctrl->subsys;
> +	struct nvme_ns_head *head = dev_to_ns_head(dev);
> +	struct nvme_ns_ids *ids = &head->ids;
> +	struct nvme_subsystem *subsys = head->subsys;
>  	int serial_len = sizeof(subsys->serial);
>  	int model_len = sizeof(subsys->model);
>  
> @@ -2294,23 +2307,21 @@ static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
>  
>  	return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
>  		serial_len, subsys->serial, model_len, subsys->model,
> -		ns->head->ns_id);
> +		head->ns_id);
>  }
>  static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
>  
>  static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
> -			  char *buf)
> +		char *buf)
>  {
> -	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> -	return sprintf(buf, "%pU\n", ns->head->ids.nguid);
> +	return sprintf(buf, "%pU\n", dev_to_ns_head(dev)->ids.nguid);
>  }
>  static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
>  
>  static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> -								char *buf)
> +		char *buf)
>  {
> -	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> -	struct nvme_ns_ids *ids = &ns->head->ids;
> +	struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
>  
>  	/* For backward compatibility expose the NGUID to userspace if
>  	 * we have no UUID set
> @@ -2325,22 +2336,20 @@ static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
>  static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
>  
>  static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
> -								char *buf)
> +		char *buf)
>  {
> -	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> -	return sprintf(buf, "%8phd\n", ns->head->ids.eui64);
> +	return sprintf(buf, "%8phd\n", dev_to_ns_head(dev)->ids.eui64);
>  }
>  static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
>  
>  static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
> -								char *buf)
> +		char *buf)
>  {
> -	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> -	return sprintf(buf, "%d\n", ns->head->ns_id);
> +	return sprintf(buf, "%d\n", dev_to_ns_head(dev)->ns_id);
>  }
>  static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
>  
> -static struct attribute *nvme_ns_attrs[] = {
> +static struct attribute *nvme_ns_id_attrs[] = {
>  	&dev_attr_wwid.attr,
>  	&dev_attr_uuid.attr,
>  	&dev_attr_nguid.attr,
> @@ -2349,12 +2358,11 @@ static struct attribute *nvme_ns_attrs[] = {
>  	NULL,
>  };
>  
> -static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
> +static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj,
>  		struct attribute *a, int n)
>  {
>  	struct device *dev = container_of(kobj, struct device, kobj);
> -	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> -	struct nvme_ns_ids *ids = &ns->head->ids;
> +	struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
>  
>  	if (a == &dev_attr_uuid.attr) {
>  		if (uuid_is_null(&ids->uuid) ||
> @@ -2372,9 +2380,9 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
>  	return a->mode;
>  }
>  
> -static const struct attribute_group nvme_ns_attr_group = {
> -	.attrs		= nvme_ns_attrs,
> -	.is_visible	= nvme_ns_attrs_are_visible,
> +static const struct attribute_group nvme_ns_id_attr_group = {
> +	.attrs		= nvme_ns_id_attrs,
> +	.is_visible	= nvme_ns_id_attrs_are_visible,
>  };
>  
>  #define nvme_show_str_function(field)						\
> @@ -2883,15 +2891,20 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>  
>  	device_add_disk(ctrl->device, ns->disk);
>  	if (sysfs_create_group(&disk_to_dev(ns->disk)->kobj,
> -					&nvme_ns_attr_group))
> +					&nvme_ns_id_attr_group))
>  		pr_warn("%s: failed to create sysfs group for identification\n",
>  			ns->disk->disk_name);
>  	if (ns->ndev && nvme_nvm_register_sysfs(ns))
>  		pr_warn("%s: failed to register lightnvm sysfs group for identification\n",
>  			ns->disk->disk_name);
>  
> -	if (new)
> +	if (new) {
>  		device_add_disk(&ns->head->subsys->dev, ns->head->disk);
> +		if (sysfs_create_group(&disk_to_dev(ns->head->disk)->kobj,
> +				&nvme_ns_id_attr_group))
> +			pr_warn("%s: failed to create sysfs group for identification\n",
> +				ns->head->disk->disk_name);
> +	}
>  
>  	return;
>   out_unlink_ns:
device_add_disk_with_groups()?
What happened to that proposal?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 116+ messages in thread
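
For context, the proposal referenced above would, roughly, let a driver
register a disk and its sysfs attribute groups in one step, so that the
attributes exist before userspace is notified of the new disk.  A
hypothetical signature (this had not landed in any tree at the time):

	/*
	 * Hypothetical: like device_add_disk(), but also registers the given
	 * sysfs attribute groups before the uevent is sent, avoiding the
	 * window in which udev can see the disk without its attributes.
	 */
	int device_add_disk_with_groups(struct device *parent,
					struct gendisk *disk,
					const struct attribute_group **groups);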

* Re: [PATCH 04/17] block: add a blk_steal_bios helper
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24  8:44     ` Max Gurtovoy
  -1 siblings, 0 replies; 116+ messages in thread
From: Max Gurtovoy @ 2017-10-24  8:44 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block



On 10/23/2017 5:51 PM, Christoph Hellwig wrote:
> This helper allows stealing the uncompleted bios from a request so
> that they can be reissued on another path.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>   block/blk-core.c       | 20 ++++++++++++++++++++
>   include/linux/blkdev.h |  2 ++
>   2 files changed, 22 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b8c80f39f5fe..e804529e65a5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2767,6 +2767,26 @@ struct request *blk_fetch_request(struct request_queue *q)
>   }
>   EXPORT_SYMBOL(blk_fetch_request);
>   
> +/*
> + * Steal bios from a request.  The request must not have been partially
> + * completed before.
> + */

Maybe we can add to the comment that "list" is the destination for the 
stolen bio.

> +void blk_steal_bios(struct bio_list *list, struct request *rq)
> +{
> +	if (rq->bio) {
> +		if (list->tail)
> +			list->tail->bi_next = rq->bio;
> +		else
> +			list->head = rq->bio;
> +		list->tail = rq->biotail;

if list->tail != NULL don't we lose the "list->tail->bi_next = rq->bio;" 
assignment after assigning "list->tail = rq->biotail;" ?

> +	}
> +
> +	rq->bio = NULL;

we can add this NULL assignment inside the big "if", but I'm not sure 
regarding the next 2 assignments.
Anyway not a big deal.

> +	rq->biotail = NULL;
> +	rq->__data_len = 0;
> +}
> +EXPORT_SYMBOL_GPL(blk_steal_bios);
> +

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 03/17] block: provide a direct_make_request helper
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24 17:57     ` Javier González
  -1 siblings, 0 replies; 116+ messages in thread
From: Javier González @ 2017-10-24 17:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Keith Busch,
	Hannes Reinecke, Johannes Thumshirn

> On 23 Oct 2017, at 16.51, Christoph Hellwig <hch@lst.de> wrote:
> 
> This helper allows reinserting a bio into a new queue without much
> overhead, but requires all queue limits to be the same for the upper
> and lower queues, and it does not provide any recursion prevention.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> ---
> block/blk-core.c       | 34 ++++++++++++++++++++++++++++++++++
> include/linux/blkdev.h |  1 +
> 2 files changed, 35 insertions(+)
> 


Reviewed-by: Javier González <javier@cnexlabs.com>



^ permalink raw reply	[flat|nested] 116+ messages in thread
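
For illustration, a rough kernel-style sketch of the intended caller from
later in this series (abbreviated from the actual patch 16: the path
selection and requeue handling are simplified, and the fragment is not
meant to compile stand-alone):

	static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
			struct bio *bio)
	{
		struct nvme_ns_head *head = q->queuedata;
		struct nvme_ns *ns = nvme_find_path(head);	/* pick a usable path */
		blk_qc_t ret = BLK_QC_T_NONE;

		if (likely(ns)) {
			/*
			 * Redirect the bio to the per-controller queue.  This
			 * is only valid because all paths share the same queue
			 * limits, which is exactly what direct_make_request
			 * relies on.
			 */
			bio->bi_disk = ns->disk;
			ret = direct_make_request(bio);
		} else {
			/* the real code queues the bio for later requeueing */
			bio_io_error(bio);
		}
		return ret;
	}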

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-24 21:40     ` Mike Snitzer
  -1 siblings, 0 replies; 116+ messages in thread
From: Mike Snitzer @ 2017-10-24 21:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Keith Busch,
	Hannes Reinecke, Johannes Thumshirn

On Mon, Oct 23 2017 at 10:51am -0400,
Christoph Hellwig <hch@lst.de> wrote:

> With this flag a driver can create a gendisk that can be used for I/O
> submission inside the kernel, but which is not registered as user
> facing block device.  This will be useful for the NVMe multipath
> implementation.

Having the NVMe driver go to such lengths to hide its resources from
upper layers is certainly the work of an evil genius experiencing some
serious territorial issues.  Not sugar-coating it.. you wouldn't.

I kept meaning to reply to your earlier iterations on this series to
ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
that the NVMe driver doesn't implicitly consume (and hide) all
per-controller devices?

Ah well.  There is only one correct way to do NVMe multipathing after
all right?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/17] block: add a blk_steal_bios helper
  2017-10-24  8:44     ` Max Gurtovoy
@ 2017-10-28  6:13       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-28  6:13 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Hannes Reinecke, Johannes Thumshirn, linux-nvme, linux-block

On Tue, Oct 24, 2017 at 11:44:26AM +0300, Max Gurtovoy wrote:
>> + * Steal bios from a request.  The request must not have been partially
>> + * completed before.
>> + */
>
> Maybe we can add to the comment that "list" is the destination for the 
> stolen bio.

Sure.

>> +void blk_steal_bios(struct bio_list *list, struct request *rq)
>> +{
>> +	if (rq->bio) {
>> +		if (list->tail)
>> +			list->tail->bi_next = rq->bio;
>> +		else
>> +			list->head = rq->bio;
>> +		list->tail = rq->biotail;
>
> if list->tail != NULL don't we lose the "list->tail->bi_next = rq->bio;" 
> assignment after assigning "list->tail = rq->biotail;" ?

The bio lists are a little weird: they are a singly linked list chained
through bi_next, plus a tail pointer.

So if the list is empty (->tail == NULL) we assign the bio list
to ->head and point ->tail to the end of the list in the request.

But if the list is not empty we let ->bi_next of the last entry
(as found in ->tail) point to the list we splice on, and still update
->tail to the end of the list we spliced on.  So I think this looks all ok.

>> +	}
>> +
>> +	rq->bio = NULL;
>
> we can add this NULL assignment inside the big "if", but I'm not sure 
> regarding the next 2 assignments.
> Anyway not a big deal.

We can move both the ->bio and ->biotail assignments, and I've done it.

^ permalink raw reply	[flat|nested] 116+ messages in thread
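
The splice order can be checked with a small stand-alone mock (user-space
stand-ins for the kernel types; it also folds the NULL assignments into
the if-block, as agreed above):

	#include <stdio.h>

	struct bio { int id; struct bio *bi_next; };
	struct bio_list { struct bio *head, *tail; };
	struct request { struct bio *bio, *biotail; };

	static void blk_steal_bios(struct bio_list *list, struct request *rq)
	{
		if (rq->bio) {
			if (list->tail)
				list->tail->bi_next = rq->bio;	/* old tail linked first... */
			else
				list->head = rq->bio;		/* ...or the list was empty */
			list->tail = rq->biotail;		/* only then move the tail */
			rq->bio = NULL;
			rq->biotail = NULL;
		}
	}

	int main(void)
	{
		struct bio a = { 1, NULL }, b = { 2, NULL };
		struct bio_list list = { NULL, NULL };
		struct request rq1 = { &a, &a }, rq2 = { &b, &b };

		blk_steal_bios(&list, &rq1);	/* empty case: head = tail = a */
		blk_steal_bios(&list, &rq2);	/* splice case: a->bi_next = b */

		for (struct bio *bio = list.head; bio; bio = bio->bi_next)
			printf("bio %d\n", bio->id);	/* prints 1, then 2 */
		return 0;
	}

Because the old tail's bi_next is written before list->tail is
overwritten, nothing is lost -- which is the point of the explanation
above.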

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-24  7:18     ` Hannes Reinecke
@ 2017-10-28  6:15       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-28  6:15 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Johannes Thumshirn, linux-nvme, linux-block

On Tue, Oct 24, 2017 at 09:18:47AM +0200, Hannes Reinecke wrote:
> Can we have some information in sysfs to figure out if a gendisk is
> hidden? I'd hate having to do an inverse lookup in /proc/partitions;
> it's always hard to prove that something is _not_ present.
> And we already present various information (like disk_removable_show()),
> so it's not without precedent.
> And it would make integration with systemd/udev easier.

Sure, I'll add a hidden flag sysfs file.

^ permalink raw reply	[flat|nested] 116+ messages in thread
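
One plausible shape for such an attribute, following the style of the
existing genhd attributes (an assumption -- the name and exact form are
up to the eventual patch):

	static ssize_t disk_hidden_show(struct device *dev,
			struct device_attribute *attr, char *buf)
	{
		struct gendisk *disk = dev_to_disk(dev);

		return sprintf(buf, "%d\n",
			       (disk->flags & GENHD_FL_HIDDEN) != 0);
	}
	static DEVICE_ATTR(hidden, S_IRUGO, disk_hidden_show, NULL);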

* Re: [PATCH 17/17] nvme: also expose the namespace identification sysfs files for mpath nodes
  2017-10-24  7:45     ` Hannes Reinecke
@ 2017-10-28  6:20       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-28  6:20 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Johannes Thumshirn, linux-nvme, linux-block

On Tue, Oct 24, 2017 at 09:45:44AM +0200, Hannes Reinecke wrote:
> device_add_disk_with_groups()?
> What happened to that proposal?

Doesn't look like it got anywhere near a tree I could base this work
on.

And what happened to not full-quoting a patch for a two-line comment?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 16/17] nvme: implement multipath access to nvme subsystems
  2017-10-24  7:43     ` Hannes Reinecke
@ 2017-10-28  6:32       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-28  6:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Johannes Thumshirn, linux-nvme, linux-block

> Sure this works?

It does.

> nvme_req_needs_retry() checks blk_noretry_request():

> The original idea here was to _set_ these bits on multipath path devices
> so that they won't attempt any retry, but rather forward the I/O error
> to the multipath device itself for failover.
> So if these bits are set (as they should be for multipathed devices)
> we'll never attempt any failover...

While that might have been the "original" idea, it isn't what this code
does.  We never set any of the REQ_FAILFAST_* bits in
nvme_ns_head_make_request.  In NVMe there aren't really any device
equivalents of REQ_FAILFAST_ that make sense for multipath.  The only
one that we map to is the limited retry bit, and that is media-centric,
so a failover would not help.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-24 21:40     ` Mike Snitzer
@ 2017-10-28  6:38       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-28  6:38 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Sagi Grimberg,
	linux-nvme, Keith Busch, Hannes Reinecke, Johannes Thumshirn

On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
> Having the NVMe driver go to such lengths to hide its resources from
> upper layers is certainly the work of an evil genius experiencing some
> serious territorial issues.  Not sugar-coating it.. you wouldn't.

I'm pretty sure Hannes will appreciate being called an evil genius :)

> I kept meaning to reply to your earlier iterations on this series to
> ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
> that the NVMe driver doesn't implicitly consume (and hide) all
> per-controller devices?

I thought about adding it, but mostly for a different reason: it's
quite a bit of code, and we now start to see NVMe in deeply embedded
contexts, e.g. the latest Compact Flash spec is based on NVMe, so it
might be a good idea to give people a chance to avoid the overhead.

> Ah well.  There is only one correct way to do NVMe multipathing after
> all right?

I don't think you'll get very useful results, even if you try.  But I
guess we'll just have to tell people to use SuSE if they want NVMe
multipathing to work then :)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-28  6:38       ` Christoph Hellwig
@ 2017-10-28  7:20         ` Guan Junxiong
  -1 siblings, 0 replies; 116+ messages in thread
From: Guan Junxiong @ 2017-10-28  7:20 UTC (permalink / raw)
  To: Christoph Hellwig, Mike Snitzer
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, linux-block,
	Hannes Reinecke, Johannes Thumshirn, niuhaoxin, Shenhong (C)

Hi Christoph,

On 2017/10/28 14:38, Christoph Hellwig wrote:
> On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
>> Having the NVMe driver go to such lengths to hide its resources from
>> upper layers is certainly the work of an evil genius experiencing some
>> serious territorial issues.  Not sugar-coating it.. you wouldn't.
> 
> I'm pretty surre Hannes will appreciate being called an evil genius :)
> 
>> I kept meaning to reply to your earlier iterations on this series to
>> ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
>> that the NVMe driver doesn't implicitly consume (and hide) all
>> per-controller devices?
> 
> I thought about adding it, but mostly for a different reason: it's
> quite a bit of code, and we now start to see NVMe in deeply embedded
> contexts, e.g. the latest Compact Flash spec is based on NVMe, so it
> might be a good idea to give people a chance to avoid the overhead.
> 

Think of some of the current advanced features of DM-Multipath combined with
multipath-tools, such as path-latency priority grouping, intermittent I/O error
accounting for path degradation, and delayed, immediate, or follow-over failback.
Those features, which are significant in some scenarios, need to use the
per-controller block devices.
Therefore, I think it is worth adding a CONFIG_NVME_MULTIPATHING knob to
hide or show the per-controller block devices.

How about letting me add this CONFIG_NVME_MULTIPATHING knob?

Regards
Guan



>> Ah well.  There is only one correct way to do NVMe multipathing after
>> all right?
> 
> I don't think you'll get very useful results, even if you try.  But I
> guess we'll just have to tell people to use SuSE if they want NVMe
> multipathing to work then :)
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 
> 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-28  7:20         ` Guan Junxiong
@ 2017-10-28  7:42           ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-28  7:42 UTC (permalink / raw)
  To: Guan Junxiong
  Cc: Christoph Hellwig, Mike Snitzer, Jens Axboe, Keith Busch,
	Sagi Grimberg, linux-nvme, linux-block, Hannes Reinecke,
	Johannes Thumshirn, niuhaoxin, Shenhong (C)

On Sat, Oct 28, 2017 at 03:20:07PM +0800, Guan Junxiong wrote:
> Think of some of the current advanced features of DM-Multipath combined with
> multipath-tools, such as path-latency priority grouping, intermittent I/O error
> accounting for path degradation, and delayed, immediate, or follow-over failback.
> Those features, which are significant in some scenarios, need to use the
> per-controller block devices.
> Therefore, I think it is worth adding a CONFIG_NVME_MULTIPATHING knob to
> hide or show the per-controller block devices.
> 
> How about letting me add this CONFIG_NVME_MULTIPATHING knob?

There is definitely not going to be a run-time knob, sorry.  You've
spent this work on dm-multipath after the nvme group pretty clearly
said that this is not what we are going to support, and we have no
interest in supporting it.  Especially as proper path discovery,
asymmetric namespace states and the like can only be supported properly
with the in-kernel code.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-28  7:42           ` Christoph Hellwig
@ 2017-10-28 10:09             ` Guan Junxiong
  -1 siblings, 0 replies; 116+ messages in thread
From: Guan Junxiong @ 2017-10-28 10:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme,
	linux-block, Hannes Reinecke, Johannes Thumshirn, niuhaoxin,
	Shenhong (C)



On 2017/10/28 15:42, Christoph Hellwig wrote:
> On Sat, Oct 28, 2017 at 03:20:07PM +0800, Guan Junxiong wrote:
>> Think of some of the current advanced features of DM-Multipath combined with
>> multipath-tools, such as path-latency priority grouping, intermittent I/O error
>> accounting for path degradation, and delayed, immediate, or follow-over failback.
>> Those features, which are significant in some scenarios, need to use the
>> per-controller block devices.
>> Therefore, I think it is worth adding a CONFIG_NVME_MULTIPATHING knob to
>> hide or show the per-controller block devices.
>>
>> How about letting me add this CONFIG_NVME_MULTIPATHING knob?
> 
> There is definitely not going to be a run-time knob, sorry.  You've
> spent this work on dm-multipath after the nvme group pretty clearly
> said that this is not what we are going to support, and we have no
> interest in supporting it.  Especially as proper path discovery,
> asymmetric namespace states and the like can only be supported properly
> with the in-kernel code.
> 
Does it mean some extra work such as:
1) showing the path topology of nvme multipath device
2) daemon to implement immediate and delayed failback
3) detecting sub-healthy path due to shaky link
4) grouping paths besides ANA
...
and so on,
needs to be integrated into a user-space tool such as nvme-cli,
or a newly invented user-space tool named "nvme-mpath"?
Which do you prefer?

And the kernel also needs to export more interfaces for setting and getting state.

Regards
Guan
.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-28  6:38       ` Christoph Hellwig
@ 2017-10-28 14:17         ` Mike Snitzer
  -1 siblings, 0 replies; 116+ messages in thread
From: Mike Snitzer @ 2017-10-28 14:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Keith Busch,
	Hannes Reinecke, Johannes Thumshirn

On Sat, Oct 28 2017 at  2:38am -0400,
Christoph Hellwig <hch@lst.de> wrote:

> On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
> > Having the NVme driver go to such lengths to hide its resources from
> > upper layers is certainly the work of an evil genius experiencing some
> > serious territorial issues.  Not sugar-coating it.. you wouldn't.
> 
> I'm pretty surre Hannes will appreciate being called an evil genius :)

Well, to be fair... it doesn't take that much brain power to arrive at
isolationism.

And pretty sure Hannes had a better idea with using the blkdev_get
(holders/claim) interface but you quickly dismissed that.  Despite it
being the best _existing_ way to ensure mutual exclusion with the rest
of the block layer.

> > I kept meaning to reply to your earlier iterations on this series to
> > ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
> > that the NVMe driver doesn't implicitly consume (and hide) all
> > per-controler devices?
> 
> I thought about adding it, but mostly for a different reason: it's
> quite a bit of code, and we now start to see NVMe in deeply embedded
> contexts, e.g. the latest Compact Flash spec is based on NVMe, so it
> might be a good idea to give people a chance to avoid the overhead.

OK, so please add it.

> > Ah well.  There is only one correct way to do NVMe multipathing after
> > all right?
> 
> I don't think you'll get very useful results, even if you try.  But I
> guess we'll just have to tell people to use SuSE if they want NVMe
> multipathing to work then :)

Don't do that.  Don't assume you know it all.  Don't fabricate vendor
wars in your head and then project it out to the rest of the community.
We're all in this together.

Fact is Hannes and I have exchanged private mail (in response to this
very thread) and we agree that your approach is currently not suitable
for enterprise deployment.  Hannes needs to also deal with the coming
duality of Linux multipathing and you aren't making it easy.  Just
because you're able to be myopic doesn't mean the rest of us with way
more direct customers behind us can be.

You're finding your way with this new multipath model and I'm happy to
see that happen.  But what I'm not happy about is the steps you're
taking to be actively disruptive.  There were so many ways this all
could've gone and sadly you've elected to stand up an architecture that
doesn't even allow the prospect of reuse.  And for what?  Because doing
so takes 10% more work?  Well, we can backfill that work if you'll grant
us an open mind.

I'm really not against you.  I just need very basic controls put into
the NVMe multipathing code that allow it to be disabled (yet reused).
Not that I have immediate plans to actually _use_ it.  My hope is I can
delegate NVMe multipathing to NVMe!  But I need the latitude to find my
way with the requirements I am, or will be, needing to consider.

Please don't paint me and others in my position into a corner.

Mike

^ permalink raw reply	[flat|nested] 116+ messages in thread

* RE: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-28 10:09             ` Guan Junxiong
@ 2017-10-29  8:00               ` Anish Jhaveri
  -1 siblings, 0 replies; 116+ messages in thread
From: Anish Jhaveri @ 2017-10-29  8:00 UTC (permalink / raw)
  To: Guan Junxiong, Christoph Hellwig
  Cc: Jens Axboe, linux-block, Shenhong (C),
	Sagi Grimberg, Mike Snitzer, linux-nvme, Keith Busch,
	Hannes Reinecke, Johannes Thumshirn, niuhaoxin

On Saturday, October 28, 2017 3:10 AM, Guan Junxiong wrote:

>Does it mean some extra work such as:
>1) showing the path topology of nvme multipath device
Isn't the connection to the target supposed to guide the host to the
shortest and fastest available path, the so-called most optimized path?
A sysfs entry can easily expose that, since we already store
path-related info there.

>2) daemon to implement immediate and delayed failback
This is also up to the target: whenever the target is ready for an
immediate or delayed failback, it will let the connect from the host
succeed.  Until then the host stays in a reconnecting state across all
paths until it finds an optimized or non-optimized path.  Why is this
extra daemon needed?

>4) grouping paths besides ANA
Why can't we use NS Groups here?

It would be a good idea to move away from the legacy way of doing
things for NVMe devices.  The NVMe multipath implementation by
Christoph is very simple; how about not making it super complicated?

Best regards,
Anish

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-28 10:09             ` Guan Junxiong
@ 2017-10-29  8:57               ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-29  8:57 UTC (permalink / raw)
  To: Guan Junxiong
  Cc: Christoph Hellwig, Mike Snitzer, Jens Axboe, Keith Busch,
	Sagi Grimberg, linux-nvme, linux-block, Hannes Reinecke,
	Johannes Thumshirn, niuhaoxin, Shenhong (C)

On Sat, Oct 28, 2017 at 06:09:46PM +0800, Guan Junxiong wrote:
> Does it mean some extra work such as:
> 1) showing the path topology of nvme multipath device

It's in sysfs.  And Johannes volunteered to also add nvme-cli
support.

> 2) daemon to implement immediate and delayed failback

The whole point is to not have a daemon.

> 3) detecting sub-healthy path due to shaky link

We can do this in kernel space.  It just needs someone to implement it.

> 4) grouping paths besides ANA

We don't want to do non-standard grouping.  Please work with the
NVMe working group for your grouping ideas.

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-28 14:17         ` Mike Snitzer
@ 2017-10-29 10:01           ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-29 10:01 UTC (permalink / raw)
  To: Mike Snitzer, Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, linux-block,
	Johannes Thumshirn

On 10/28/2017 04:17 PM, Mike Snitzer wrote:
> On Sat, Oct 28 2017 at  2:38am -0400,
> Christoph Hellwig <hch@lst.de> wrote:
> 
>> On Tue, Oct 24, 2017 at 05:40:00PM -0400, Mike Snitzer wrote:
[ .. ]
>>> Ah well.  There is only one correct way to do NVMe multipathing after
>>> all right?
>>
>> I don't think you'll get very useful results, even if you try.  But I
>> guess we'll just have to tell people to use SuSE if they want NVMe
>> multipathing to work then :)
> 
> Don't do that.  Don't assume you know it all.  Don't fabricate vendor
> wars in your head and then project it out to the rest of the community.
> We're all in this together.
> 
> Fact is Hannes and I have exchanged private mail (in response to this
> very thread) and we agree that your approach is currently not suitable
> for enterprise deployment.  Hannes needs to also deal with the coming
> duality of Linux multipathing and you aren't making it easy.  Just
> because you're able to be myopic doesn't mean the rest of us with way
> more direct customers behind us can be.
> 
> You're finding your way with this new multipath model and I'm happy to
> see that happen.  But what I'm not happy about is the steps you're
> taking to be actively disruptive.  There were so many ways this all
> could've gone and sadly you've elected to stand up an architecture that
> doesn't even allow the prospect of reuse.  And for what?  Because doing
> so takes 10% more work?  Well we can backfill that work if you'll grant
> us an open-mind.
> 
> I'm really not against you.  I just need very basic controls put into
> the NVMe multipathing code that allows it to be disabled (yet reused).
> Not that I have immediate plans to actually _use_ it.  My hope is I can
> delegate NVMe multipathing to NVMe!  But I need the latitude to find my
> way with the requirements I am, or will be, needing to consider.
> 
> Please don't paint me and others in my position into a corner.
> 
To add my two cents from the SUSE perspective:
We have _quite_ some deployments on dm-multipathing, and most of our
customers are quite accustomed to setting up deployments with that.
It will be impossible to ensure that all customers and installation
scripts will _not_ try starting up multipathing, and some scenarios
will even prefer dm-multipath simply because their tooling is geared
up for that.
So we absolutely need peaceful coexistence between dm-multipathing and
nvme multipathing. The precise level can be discussed (whether it be a
global on/off switch or something more fine-grained), but we simply
cannot declare nvme multipathing to be the one true way for NVMe.
The support calls alone will kill us.

Note: this has _nothing_ to do with performance. I'm perfectly willing
to accept that dm-multipath has sub-optimal performance.
But that should not imply that it cannot be used for NVMe.

After all, Linux is about choice, not about forcing users to do things
in one way only.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: [PATCH 16/17] nvme: implement multipath access to nvme subsystems
  2017-10-23 14:51   ` Christoph Hellwig
@ 2017-10-30  3:37     ` Guan Junxiong
  -1 siblings, 0 replies; 116+ messages in thread
From: Guan Junxiong @ 2017-10-30  3:37 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: linux-block, Sagi Grimberg, linux-nvme, Keith Busch,
	Hannes Reinecke, Johannes Thumshirn, niuhaoxin, Shenhong (C)



On 2017/10/23 22:51, Christoph Hellwig wrote:
> @@ -2427,20 +2681,46 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
>  	if (ret) {
>  		dev_err(ctrl->device,
>  			"duplicate IDs for nsid %d\n", nsid);
> -		goto out_free_head;
> +		goto out_release_instance;
>  	}
>  
> +	ret = -ENOMEM;
> +	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
> +	if (!q)
> +		goto out_free_head;
> +	q->queuedata = head;
> +	blk_queue_make_request(q, nvme_ns_head_make_request);
> +	q->poll_fn = nvme_ns_head_poll;
> +	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
> +	/* set to a default value for 512 until disk is validated */
> +	blk_queue_logical_block_size(q, 512);
> +	nvme_set_queue_limits(ctrl, q);
> +
> +	head->disk = alloc_disk(0);
> +	if (!head->disk)
> +		goto out_cleanup_queue;
> +	head->disk->fops = &nvme_ns_head_ops;
> +	head->disk->private_data = head;
> +	head->disk->queue = q;
> +	head->disk->flags = GENHD_FL_EXT_DEVT;
> +	sprintf(head->disk->disk_name, "nvme%dn%d",
> +			ctrl->subsys->instance, nsid);

Is it okay to use head->instance instead of nsid for the nvme#n# disk name?
Because _nsid_ sets are sometimes not contiguous, so the disk name gets ugly in that case.

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-29 10:01           ` Hannes Reinecke
@ 2017-10-30  4:09             ` Guan Junxiong
  -1 siblings, 0 replies; 116+ messages in thread
From: Guan Junxiong @ 2017-10-30  4:09 UTC (permalink / raw)
  To: Hannes Reinecke, Mike Snitzer, Christoph Hellwig
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Keith Busch,
	Johannes Thumshirn, niuhaoxin, Shenhong (C)

Hi Christoph, Mike and Hannes

On 2017/10/29 18:01, Hannes Reinecke wrote:
> After all, Linux is about choice, not about forcing users to do things
> in one way only.


I have added an option CONFIG_NVME_SHOW_CTRL_BLK_DEV to expose the
per-controller block device nodes.
A quick test shows that dm-multipath and nvme-mpath can coexist with it.

The patch, based on Christoph's V5 nvme-mpath series, follows below.

Does it look good to you?

Regards
Guan

--
From de3f446af6591d68ef84333138e744f12db4d695 Mon Sep 17 00:00:00 2001
From: Junxiong Guan <guanjunxiong@huawei.com>
Date: Mon, 30 Oct 2017 04:59:20 -0400
Subject: [PATCH] nvme: add an option to show per-controller block devices
 nodes

Signed-off-by: Junxiong Guan <guanjunxiong@huawei.com>
---
 drivers/nvme/host/Kconfig |  9 ++++++
 drivers/nvme/host/core.c  | 80 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 46d6cb1e03bd..725bff035f38 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -13,6 +13,15 @@ config BLK_DEV_NVME
 	  To compile this driver as a module, choose M here: the
 	  module will be called nvme.

+config NVME_SHOW_CTRL_BLK_DEV
+	bool "Show per-controller block devices of NVMe"
+	depends on NVME_CORE
+	---help---
+	  This adds support for exposing the per-controller block device
+	  nodes to user space.
+
+	  If unsure, say N.
+
 config NVME_FABRICS
 	tristate

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 334735db90c8..ae37e274108c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -72,6 +72,9 @@ static DEFINE_IDA(nvme_subsystems_ida);
 static LIST_HEAD(nvme_subsystems);
 static DEFINE_MUTEX(nvme_subsystems_lock);

+#ifdef CONFIG_NVME_SHOW_CTRL_BLK_DEV
+static DEFINE_SPINLOCK(dev_list_lock);
+#endif
 static DEFINE_IDA(nvme_instance_ida);
 static dev_t nvme_chr_devt;
 static struct class *nvme_class;
@@ -357,6 +360,14 @@ static void nvme_free_ns(struct kref *kref)

 	if (ns->ndev)
 		nvme_nvm_unregister(ns);
+
+#ifdef CONFIG_NVME_SHOW_CTRL_BLK_DEV
+	if (ns->disk) {
+		spin_lock(&dev_list_lock);
+		ns->disk->private_data = NULL;
+		spin_unlock(&dev_list_lock);
+	}
+#endif
 	put_disk(ns->disk);
 	nvme_put_ns_head(ns->head);
 	nvme_put_ctrl(ns->ctrl);
@@ -1127,10 +1138,32 @@ static int nvme_ns_ioctl(struct nvme_ns *ns, unsigned int cmd,
 	}
 }

-/* should never be called due to GENHD_FL_HIDDEN */
 static int nvme_open(struct block_device *bdev, fmode_t mode)
 {
+#ifdef CONFIG_NVME_SHOW_CTRL_BLK_DEV
+	struct nvme_ns *ns;
+
+	spin_lock(&dev_list_lock);
+	ns = bdev->bd_disk->private_data;
+	if (ns) {
+		if (!kref_get_unless_zero(&ns->kref))
+			goto fail;
+		if (!try_module_get(ns->ctrl->ops->module))
+			goto fail_put_ns;
+	}
+	spin_unlock(&dev_list_lock);
+
+	return ns ? 0 : -ENXIO;
+
+fail_put_ns:
+	kref_put(&ns->kref, nvme_free_ns);
+fail:
+	spin_unlock(&dev_list_lock);
+	return -ENXIO;
+#else
+	/* should never be called due to GENHD_FL_HIDDEN */
 	return WARN_ON_ONCE(-ENXIO);
+#endif /* CONFIG_NVME_SHOW_CTRL_BLK_DEV */
 }

 static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
@@ -1392,10 +1425,15 @@ static char nvme_pr_type(enum pr_type type)
 static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 				u64 key, u64 sa_key, u8 op)
 {
+#ifdef CONFIG_NVME_SHOW_CTRL_BLK_DEV
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+	struct nvme_ns_head *head = ns->head;
+#else
 	struct nvme_ns_head *head = bdev->bd_disk->private_data;
 	struct nvme_ns *ns;
-	struct nvme_command c;
 	int srcu_idx, ret;
+#endif
+	struct nvme_command c;
 	u8 data[16] = { 0, };

 	put_unaligned_le64(key, &data[0]);
@@ -1406,6 +1444,9 @@ static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 	c.common.nsid = cpu_to_le32(head->ns_id);
 	c.common.cdw10[0] = cpu_to_le32(cdw10);

+#ifdef CONFIG_NVME_SHOW_CTRL_BLK_DEV
+	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
+#else
 	srcu_idx = srcu_read_lock(&head->srcu);
 	ns = nvme_find_path(head);
 	if (likely(ns))
@@ -1413,7 +1454,9 @@ static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
 	else
 		ret = -EWOULDBLOCK;
 	srcu_read_unlock(&head->srcu, srcu_idx);
+
 	return ret;
+#endif
 }

 static int nvme_pr_register(struct block_device *bdev, u64 old,
@@ -1492,6 +1535,34 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
 EXPORT_SYMBOL_GPL(nvme_sec_submit);
 #endif /* CONFIG_BLK_SED_OPAL */

+#ifdef CONFIG_NVME_SHOW_CTRL_BLK_DEV
+static void nvme_release(struct gendisk *disk, fmode_t mode)
+{
+	struct nvme_ns *ns = disk->private_data;
+
+	module_put(ns->ctrl->ops->module);
+	nvme_put_ns(ns);
+}
+
+static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+
+	return nvme_ns_ioctl(ns, cmd, arg);
+}
+
+static const struct block_device_operations nvme_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= nvme_ioctl,
+	.compat_ioctl	= nvme_ioctl,
+	.open		= nvme_open,
+	.release	= nvme_release,
+	.getgeo		= nvme_getgeo,
+	.revalidate_disk= nvme_revalidate_disk,
+	.pr_ops		=&nvme_pr_ops,
+};
+#else
 /*
  * While we don't expose the per-controller devices to userspace we still
  * need valid file operations for them, for one because the block layer
@@ -1503,6 +1574,7 @@ static const struct block_device_operations nvme_fops = {
 	.open		= nvme_open,
 	.revalidate_disk= nvme_revalidate_disk,
 };
+#endif /* CONFIG_NVME_SHOW_CTRL_BLK_DEV */

 static int nvme_wait_ready(struct nvme_ctrl *ctrl, u64 cap, bool enabled)
 {
@@ -2875,7 +2947,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	disk->fops = &nvme_fops;
 	disk->private_data = ns;
 	disk->queue = ns->queue;
+#ifdef CONFIG_NVME_SHOW_CTRL_BLK_DEV
+	disk->flags = GENHD_FL_EXT_DEVT;
+#else
 	disk->flags = GENHD_FL_HIDDEN;
+#endif
 	memcpy(disk->disk_name, disk_name, DISK_NAME_LEN);
 	ns->disk = disk;

-- 
2.11.1
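
With a patch like this, opting in would be a single Kconfig selection;
a minimal config fragment (the symbol name is taken from the patch
above):

	CONFIG_NVME_SHOW_CTRL_BLK_DEV=y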

* Re: [PATCH 16/17] nvme: implement multipath access to nvme subsystems
  2017-10-30  3:37     ` Guan Junxiong
@ 2017-11-02 18:22       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-11-02 18:22 UTC (permalink / raw)
  To: Guan Junxiong
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Shenhong (C),
	Sagi Grimberg, linux-nvme, linux-block, Hannes Reinecke,
	Johannes Thumshirn, niuhaoxin

On Mon, Oct 30, 2017 at 11:37:55AM +0800, Guan Junxiong wrote:
> > +	head->disk->flags = GENHD_FL_EXT_DEVT;
> > +	sprintf(head->disk->disk_name, "nvme%dn%d",
> > +			ctrl->subsys->instance, nsid);
> 
> Is it okay to use head->instance instead of nsid for disk name nvme#n# ?
> Becuase _nsid_ sets are not continuous sometimes, so disk name is ugly in that case.

This actually was supposed to be the ns_head instance; that's why the
ns_ida moved to the subsystem.  I've fixed it up.
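
For reference, the fixed-up allocation would look roughly like this (a
sketch assuming the per-subsystem ns_ida and the head->instance field
from the posted series, not the literal committed code):

	/* in nvme_alloc_ns_head(): take the instance from the
	 * subsystem-wide IDA so the nvme%dn%d names stay dense even
	 * when namespace IDs are sparse */
	ret = ida_simple_get(&ctrl->subsys->ns_ida, 1, 0, GFP_KERNEL);
	if (ret < 0)
		goto out_free_head;
	head->instance = ret;

	/* ... and name the subsystem-wide node after it */
	sprintf(head->disk->disk_name, "nvme%dn%d",
			ctrl->subsys->instance, head->instance);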

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-19 12:45     ` Hannes Reinecke
@ 2017-10-19 13:15       ` Christoph Hellwig
  -1 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-19 13:15 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Sagi Grimberg,
	linux-nvme, Keith Busch, Johannes Thumshirn

On Thu, Oct 19, 2017 at 02:45:03PM +0200, Hannes Reinecke wrote:
> > No way in hell this would get through.  Preemptive NAK right here.
> >
> > That whole idea is just amazingly stupid, and no one has even
> > explained a reason for it.
> 
> I wonder what has changed ...

The nvme-controller devices aren't accessible to anyone outside the
nvme driver itself, so they aren't just hidden, as far as the rest
of the kernel is concerned they don't exist.  They just allow reusing
block layer code in the nvme driver instead of reimplementing it.
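
To illustrate the intended use, here is a hypothetical driver sketch
(not code from this series; my_fops, q and parent are assumed to be
driver-local):

	struct gendisk *disk = alloc_disk(0);

	if (!disk)
		return -ENOMEM;
	disk->fops = &my_fops;		/* driver-local block_device_operations */
	disk->queue = q;		/* request queue set up earlier */
	disk->flags = GENHD_FL_HIDDEN;	/* in-kernel only: no /dev node,
					 * no partition scan, no bdev/BDI */
	sprintf(disk->disk_name, "internal0");
	device_add_disk(parent, disk);	/* registers the queue, but skips
					 * blk_register_region for hidden disks */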

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-18 16:52   ` Christoph Hellwig
@ 2017-10-19 12:45     ` Hannes Reinecke
  -1 siblings, 0 replies; 116+ messages in thread
From: Hannes Reinecke @ 2017-10-19 12:45 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: linux-block, Sagi Grimberg, linux-nvme, Keith Busch, Johannes Thumshirn

On 10/18/2017 06:52 PM, Christoph Hellwig wrote:
> With this flag a driver can create a gendisk that can be used for I/O
> submission inside the kernel, but which is not registered as user
> facing block device.  This will be useful for the NVMe multipath
> implementation.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/genhd.c         | 53 ++++++++++++++++++++++++++++++++++-----------------
>  include/linux/genhd.h |  1 +
>  2 files changed, 36 insertions(+), 18 deletions(-)
> 
To quote:

On Fri, Oct 06, 2017 at 09:13 AM, Christoph Hellwig wrote:
> On Fri, Oct 06, 2017 at 10:39:50AM +1100, NeilBrown wrote:
[ .. ]
>>
>> There is some precedent for hiding things from /proc/partitions.
>> removable devices like CDROMs are hidden, and you can easily hide
>> individual devices by setting GENHD_FL_SUPPRESS_PARTITION_INFO.
>> We might be able to get that through.  It is certainly worth writing
>> a patch and letting people experiment with it.
>
> No way in hell this would get through.  Preemptive NAK right here.
>
> That whole idea is just amazingly stupid, and no one has even
> explained a reason for it.

I wonder what has changed ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-18 16:52   ` Christoph Hellwig
@ 2017-10-19  8:31     ` Johannes Thumshirn
  -1 siblings, 0 replies; 116+ messages in thread
From: Johannes Thumshirn @ 2017-10-19  8:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Sagi Grimberg, linux-nvme, Keith Busch,
	Hannes Reinecke

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

* [PATCH 06/17] block: introduce GENHD_FL_HIDDEN
  2017-10-18 16:52 nvme multipath support V4 Christoph Hellwig
@ 2017-10-18 16:52   ` Christoph Hellwig
  0 siblings, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2017-10-18 16:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Hannes Reinecke, Johannes Thumshirn,
	linux-nvme, linux-block

With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as user
facing block device.  This will be useful for the NVMe multipath
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/genhd.c         | 53 ++++++++++++++++++++++++++++++++++-----------------
 include/linux/genhd.h |  1 +
 2 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 1174d24e405e..0b28cd491b1d 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -585,6 +585,11 @@ static void register_disk(struct device *parent, struct gendisk *disk)
 	 */
 	pm_runtime_set_memalloc_noio(ddev, true);
 
+	if (disk->flags & GENHD_FL_HIDDEN) {
+		dev_set_uevent_suppress(ddev, 0);
+		return;
+	}
+
 	disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
 	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
 
@@ -616,6 +621,11 @@ static void register_disk(struct device *parent, struct gendisk *disk)
 	while ((part = disk_part_iter_next(&piter)))
 		kobject_uevent(&part_to_dev(part)->kobj, KOBJ_ADD);
 	disk_part_iter_exit(&piter);
+
+	err = sysfs_create_link(&ddev->kobj,
+				&disk->queue->backing_dev_info->dev->kobj,
+				"bdi");
+	WARN_ON(err);
 }
 
 /**
@@ -630,7 +640,6 @@ static void register_disk(struct device *parent, struct gendisk *disk)
  */
 void device_add_disk(struct device *parent, struct gendisk *disk)
 {
-	struct backing_dev_info *bdi;
 	dev_t devt;
 	int retval;
 
@@ -639,7 +648,8 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 	 * parameters make sense.
 	 */
 	WARN_ON(disk->minors && !(disk->major || disk->first_minor));
-	WARN_ON(!disk->minors && !(disk->flags & GENHD_FL_EXT_DEVT));
+	WARN_ON(!disk->minors &&
+		!(disk->flags & (GENHD_FL_EXT_DEVT | GENHD_FL_HIDDEN)));
 
 	disk->flags |= GENHD_FL_UP;
 
@@ -648,18 +658,26 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 		WARN_ON(1);
 		return;
 	}
-	disk_to_dev(disk)->devt = devt;
 	disk->major = MAJOR(devt);
 	disk->first_minor = MINOR(devt);
 
 	disk_alloc_events(disk);
 
-	/* Register BDI before referencing it from bdev */
-	bdi = disk->queue->backing_dev_info;
-	bdi_register_owner(bdi, disk_to_dev(disk));
-
-	blk_register_region(disk_devt(disk), disk->minors, NULL,
-			    exact_match, exact_lock, disk);
+	if (disk->flags & GENHD_FL_HIDDEN) {
+		/*
+		 * Don't let hidden disks show up in /proc/partitions,
+		 * and don't bother scanning for partitions either.
+		 */
+		disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
+		disk->flags |= GENHD_FL_NO_PART_SCAN;
+	} else {
+		/* Register BDI before referencing it from bdev */
+		disk_to_dev(disk)->devt = devt;
+		bdi_register_owner(disk->queue->backing_dev_info,
+				disk_to_dev(disk));
+		blk_register_region(disk_devt(disk), disk->minors, NULL,
+				    exact_match, exact_lock, disk);
+	}
 	register_disk(parent, disk);
 	blk_register_queue(disk);
 
@@ -669,10 +687,6 @@ void device_add_disk(struct device *parent, struct gendisk *disk)
 	 */
 	WARN_ON_ONCE(!blk_get_queue(disk->queue));
 
-	retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
-				   "bdi");
-	WARN_ON(retval);
-
 	disk_add_events(disk);
 	blk_integrity_add(disk);
 }
@@ -701,7 +715,8 @@ void del_gendisk(struct gendisk *disk)
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
-	sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
+	if (!(disk->flags & GENHD_FL_HIDDEN))
+		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
 	if (disk->queue) {
 		/*
 		 * Unregister bdi before releasing device numbers (as they can
@@ -712,13 +727,15 @@ void del_gendisk(struct gendisk *disk)
 	} else {
 		WARN_ON(1);
 	}
-	blk_unregister_region(disk_devt(disk), disk->minors);
+
+	if (!(disk->flags & GENHD_FL_HIDDEN)) {
+		blk_unregister_region(disk_devt(disk), disk->minors);
+		kobject_put(disk->part0.holder_dir);
+		kobject_put(disk->slave_dir);
+	}
 
 	part_stat_set_all(&disk->part0, 0);
 	disk->part0.stamp = 0;
-
-	kobject_put(disk->part0.holder_dir);
-	kobject_put(disk->slave_dir);
 	if (!sysfs_deprecated)
 		sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5c0ed5db33c2..93aae3476f58 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -140,6 +140,7 @@ struct hd_struct {
 #define GENHD_FL_NATIVE_CAPACITY		128
 #define GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE	256
 #define GENHD_FL_NO_PART_SCAN			512
+#define GENHD_FL_HIDDEN				1024
 
 enum {
 	DISK_EVENT_MEDIA_CHANGE			= 1 << 0, /* media changed */
-- 
2.14.1
