Linux-NVME Archive on lore.kernel.org
* NVMe mega patchbomb for Linux 4.5-rc
@ 2015-11-20 16:34 Christoph Hellwig
  2015-11-20 16:34 ` [PATCH 01/47] nvme: add missing unmaps in nvme_queue_rq Christoph Hellwig
                   ` (34 more replies)
  0 siblings, 35 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:34 UTC


This series is a respin of all my pending NVMe patches that didn't make
it into 4.4-rc, as well as a few patches from Keith that logically sit
in between them.  It includes:

 - a split of the NVMe driver into a core and a PCIe-specific part
 - lots of abort / reset fixes
 - optimizations of the completion path by better relying on blk-mq

Since the last posting of the series, the major change is another set of
fixes to the abort path that should hopefully make it rock solid now.
To get there we had to add another core block request flag to avoid
struct request reuse just before a reset.  Note that various patches
have been squashed into the original patches to avoid fixups.

A git tree is also available at:

	http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-req.9


* [PATCH 01/47] nvme: add missing unmaps in nvme_queue_rq
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
@ 2015-11-20 16:34 ` Christoph Hellwig
  2015-11-20 16:34 ` [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers Christoph Hellwig
                   ` (33 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:34 UTC


When we fail various metadata-related operations in nvme_queue_rq, we
need to unmap the data SGL.
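
In other words: nvme_queue_rq maps the data SGL before the integrity
setup runs, so every later failure path has to pair that mapping with an
unmap before jumping to the error label.  A minimal sketch of the idiom
(the later failure condition is hypothetical; dev, iod and dma_dir are
the locals in scope in nvme_queue_rq):

	/* data phase: the data SGL gets mapped early in the function */
	sg_mapped = dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir);

	/* metadata phase: any failure from here on must unmap first */
	if (metadata_setup_failed) {		/* hypothetical condition */
		/* undo the data mapping, or it is leaked */
		dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
		goto error_cmd;
	}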

Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 394fd16..9444884 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -896,19 +896,28 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 			goto retry_cmd;
 		}
 		if (blk_integrity_rq(req)) {
-			if (blk_rq_count_integrity_sg(req->q, req->bio) != 1)
+			if (blk_rq_count_integrity_sg(req->q, req->bio) != 1) {
+				dma_unmap_sg(dev->dev, iod->sg, iod->nents,
+						dma_dir);
 				goto error_cmd;
+			}
 
 			sg_init_table(iod->meta_sg, 1);
 			if (blk_rq_map_integrity_sg(
-					req->q, req->bio, iod->meta_sg) != 1)
+					req->q, req->bio, iod->meta_sg) != 1) {
+				dma_unmap_sg(dev->dev, iod->sg, iod->nents,
+						dma_dir);
 				goto error_cmd;
+			}
 
 			if (rq_data_dir(req))
 				nvme_dif_remap(req, nvme_dif_prep);
 
-			if (!dma_map_sg(nvmeq->q_dmadev, iod->meta_sg, 1, dma_dir))
+			if (!dma_map_sg(nvmeq->q_dmadev, iod->meta_sg, 1, dma_dir)) {
+				dma_unmap_sg(dev->dev, iod->sg, iod->nents,
+						dma_dir);
 				goto error_cmd;
+			}
 		}
 	}
 
-- 
1.9.1


* [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
  2015-11-20 16:34 ` [PATCH 01/47] nvme: add missing unmaps in nvme_queue_rq Christoph Hellwig
@ 2015-11-20 16:34 ` Christoph Hellwig
  2015-11-20 21:43   ` Jeff Moyer
  2015-11-20 22:20   ` Jeff Moyer
  2015-11-20 16:34 ` [PATCH 03/47] block: defer timeouts to a workqueue Christoph Hellwig
                   ` (32 subsequent siblings)
  34 siblings, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:34 UTC


We only add the request to the timeout list in the !blk-mq case, so we
should only delete it from that list in that case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-timeout.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 246dfb1..aa40aa9 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -158,11 +158,13 @@ void blk_abort_request(struct request *req)
 {
 	if (blk_mark_rq_complete(req))
 		return;
-	blk_delete_timer(req);
-	if (req->q->mq_ops)
+
+	if (req->q->mq_ops) {
 		blk_mq_rq_timed_out(req, false);
-	else
+	} else {
+		blk_delete_timer(req);
 		blk_rq_timed_out(req);
+	}
 }
 EXPORT_SYMBOL_GPL(blk_abort_request);
 
-- 
1.9.1


* [PATCH 03/47] block: defer timeouts to a workqueue
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
  2015-11-20 16:34 ` [PATCH 01/47] nvme: add missing unmaps in nvme_queue_rq Christoph Hellwig
  2015-11-20 16:34 ` [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers Christoph Hellwig
@ 2015-11-20 16:34 ` Christoph Hellwig
  2015-11-23 20:31   ` Jeff Moyer
  2015-11-20 16:34 ` [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value Christoph Hellwig
                   ` (31 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:34 UTC


Timer context is not a very useful context for drivers to perform any
meaningful abort action from.  So instead of calling the driver from this
useless context, defer the timeout handling to a workqueue as soon as
possible.

Note that while a delayed_work item would seem the right thing here, I
didn't dare to use it due to the magic in blk_add_timer that pokes deep
into timer internals.  But maybe this will encourage Tejun to add a
sensible API for that to the workqueue layer, and we'll all be fine in
the end :)
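
Distilled from the hunks below, the new flow looks roughly like this:

	/* the per-queue timer now only kicks a work item ... */
	static void blk_rq_timed_out_timer(unsigned long data)
	{
		struct request_queue *q = (struct request_queue *)data;

		kblockd_schedule_work(&q->timeout_work);
	}

	/* ... and the actual timeout scan runs from the workqueue, in
	 * process context, where the driver can do meaningful work */
	static void blk_mq_timeout_work(struct work_struct *work)
	{
		struct request_queue *q =
			container_of(work, struct request_queue, timeout_work);

		/* walk the expired requests and invoke the driver's
		 * timeout handler for each of them */
	}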

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c       | 8 ++++++++
 block/blk-mq.c         | 8 +++++---
 block/blk-timeout.c    | 5 +++--
 block/blk.h            | 2 +-
 include/linux/blkdev.h | 1 +
 5 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5131993b..1de0974 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -664,6 +664,13 @@ static void blk_queue_usage_counter_release(struct percpu_ref *ref)
 	wake_up_all(&q->mq_freeze_wq);
 }
 
+static void blk_rq_timed_out_timer(unsigned long data)
+{
+	struct request_queue *q = (struct request_queue *)data;
+
+	kblockd_schedule_work(&q->timeout_work);
+}
+
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
 	struct request_queue *q;
@@ -825,6 +832,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 	if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
 		goto fail;
 
+	INIT_WORK(&q->timeout_work, blk_timeout_work);
 	q->request_fn		= rfn;
 	q->prep_rq_fn		= NULL;
 	q->unprep_rq_fn		= NULL;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ae09de..8354601 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -85,6 +85,7 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
+		cancel_work_sync(&q->timeout_work);
 		blk_mq_run_hw_queues(q, false);
 	}
 }
@@ -617,9 +618,10 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 	}
 }
 
-static void blk_mq_rq_timer(unsigned long priv)
+static void blk_mq_timeout_work(struct work_struct *work)
 {
-	struct request_queue *q = (struct request_queue *)priv;
+	struct request_queue *q =
+		container_of(work, struct request_queue, timeout_work);
 	struct blk_mq_timeout_data data = {
 		.next		= 0,
 		.next_set	= 0,
@@ -2015,7 +2017,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		hctxs[i]->queue_num = i;
 	}
 
-	setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
+	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
 	q->nr_queues = nr_cpu_ids;
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index aa40aa9..aedd128 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -127,9 +127,10 @@ static void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout
 	}
 }
 
-void blk_rq_timed_out_timer(unsigned long data)
+void blk_timeout_work(struct work_struct *work)
 {
-	struct request_queue *q = (struct request_queue *) data;
+	struct request_queue *q =
+		container_of(work, struct request_queue, timeout_work);
 	unsigned long flags, next = 0;
 	struct request *rq, *tmp;
 	int next_set = 0;
diff --git a/block/blk.h b/block/blk.h
index da722eb..37b9165 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -95,7 +95,7 @@ static inline void blk_flush_integrity(void)
 }
 #endif
 
-void blk_rq_timed_out_timer(unsigned long data);
+void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
 void blk_delete_timer(struct request *);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3fe27f8..9a8424a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -407,6 +407,7 @@ struct request_queue {
 
 	unsigned int		rq_timeout;
 	struct timer_list	timeout;
+	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
 
 	struct list_head	icq_list;
-- 
1.9.1


* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (2 preceding siblings ...)
  2015-11-20 16:34 ` [PATCH 03/47] block: defer timeouts to a workqueue Christoph Hellwig
@ 2015-11-20 16:34 ` Christoph Hellwig
  2015-11-24 15:16   ` Jeff Moyer
  2015-11-20 16:35 ` [PATCH 05/47] block: remove REQ_ATOM_COMPLETE wrappers Christoph Hellwig
                   ` (30 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:34 UTC


This marks the request as one that's not actually completed yet, but
should be reaped the next time blk_mq_complete_request comes in.  This is
useful if the abort handler kicked off a reset that will complete all
pending requests.
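
Concretely, there are two sides to this (both taken from the hunks
below): the timeout path marks the request instead of completing it, and
the next completion, e.g. the one issued by the reset, reaps it even
though REQ_ATOM_COMPLETE is already set:

	/* timeout side: the driver's handler returned BLK_EH_QUIESCED */
	case BLK_EH_QUIESCED:
		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
		break;

	/* completion side: a quiesced request gets completed anyway */
	if (!blk_mark_rq_complete(rq) ||
	    test_and_clear_bit(REQ_ATOM_QUIESCED, &rq->atomic_flags)) {
		rq->errors = error;
		__blk_mq_complete_request(rq);
	}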

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c         | 6 +++++-
 block/blk-softirq.c    | 3 ++-
 block/blk-timeout.c    | 3 +++
 block/blk.h            | 1 +
 include/linux/blkdev.h | 1 +
 5 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8354601..76773dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -383,7 +383,8 @@ void blk_mq_complete_request(struct request *rq, int error)
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-	if (!blk_mark_rq_complete(rq)) {
+	if (!blk_mark_rq_complete(rq) ||
+	    test_and_clear_bit(REQ_ATOM_QUIESCED, &rq->atomic_flags)) {
 		rq->errors = error;
 		__blk_mq_complete_request(rq);
 	}
@@ -586,6 +587,9 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
+	case BLK_EH_QUIESCED:
+		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
+		break;
 	default:
 		printk(KERN_ERR "block: bad eh return: %d\n", ret);
 		break;
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 53b1737..9d47fbc 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -167,7 +167,8 @@ void blk_complete_request(struct request *req)
 {
 	if (unlikely(blk_should_fake_timeout(req->q)))
 		return;
-	if (!blk_mark_rq_complete(req))
+	if (!blk_mark_rq_complete(req) ||
+	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))
 		__blk_complete_request(req);
 }
 EXPORT_SYMBOL(blk_complete_request);
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index aedd128..b3a7f20 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -96,6 +96,9 @@ static void blk_rq_timed_out(struct request *req)
 		blk_add_timer(req);
 		blk_clear_rq_complete(req);
 		break;
+	case BLK_EH_QUIESCED:
+		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
+		break;
 	case BLK_EH_NOT_HANDLED:
 		/*
 		 * LLD handles this for now but in the future
diff --git a/block/blk.h b/block/blk.h
index 37b9165..f4c98f8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -120,6 +120,7 @@ void blk_account_io_done(struct request *req);
 enum rq_atomic_flags {
 	REQ_ATOM_COMPLETE = 0,
 	REQ_ATOM_STARTED,
+	REQ_ATOM_QUIESCED,
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9a8424a..5df5fb13 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -223,6 +223,7 @@ enum blk_eh_timer_return {
 	BLK_EH_NOT_HANDLED,
 	BLK_EH_HANDLED,
 	BLK_EH_RESET_TIMER,
+	BLK_EH_QUIESCED,
 };
 
 typedef enum blk_eh_timer_return (rq_timed_out_fn)(struct request *);
-- 
1.9.1


* [PATCH 05/47] block: remove REQ_ATOM_COMPLETE wrappers
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (3 preceding siblings ...)
  2015-11-20 16:34 ` [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-23 21:23   ` Jeff Moyer
  2015-11-24 21:22   ` Jens Axboe
  2015-11-20 16:35 ` [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request Christoph Hellwig
                   ` (29 subsequent siblings)
  34 siblings, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC


We only use them inconsistently, and the other flags don't use wrappers
like this either.
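
The conversion is mechanical; each call site goes from the wrapper to the
open-coded bit operation, for example:

	/* before */
	if (!blk_mark_rq_complete(rq))
		blk_mq_rq_timed_out(rq, reserved);

	/* after */
	if (!test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
		blk_mq_rq_timed_out(rq, reserved);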

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c    |  2 +-
 block/blk-mq.c      |  6 +++---
 block/blk-softirq.c |  2 +-
 block/blk-timeout.c |  6 +++---
 block/blk.h         | 18 ++++--------------
 5 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1de0974..af9c315 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1375,7 +1375,7 @@ EXPORT_SYMBOL(blk_rq_set_block_pc);
 void blk_requeue_request(struct request_queue *q, struct request *rq)
 {
 	blk_delete_timer(rq);
-	blk_clear_rq_complete(rq);
+	clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
 	trace_block_rq_requeue(q, rq);
 
 	if (rq->cmd_flags & REQ_QUEUED)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 76773dc..c932605 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -383,7 +383,7 @@ void blk_mq_complete_request(struct request *rq, int error)
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-	if (!blk_mark_rq_complete(rq) ||
+	if (!test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags) ||
 	    test_and_clear_bit(REQ_ATOM_QUIESCED, &rq->atomic_flags)) {
 		rq->errors = error;
 		__blk_mq_complete_request(rq);
@@ -583,7 +583,7 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		break;
 	case BLK_EH_RESET_TIMER:
 		blk_add_timer(req);
-		blk_clear_rq_complete(req);
+		clear_bit(REQ_ATOM_COMPLETE, &req->atomic_flags);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -614,7 +614,7 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		return;
 
 	if (time_after_eq(jiffies, rq->deadline)) {
-		if (!blk_mark_rq_complete(rq))
+		if (!test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 			blk_mq_rq_timed_out(rq, reserved);
 	} else if (!data->next_set || time_after(data->next, rq->deadline)) {
 		data->next = rq->deadline;
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 9d47fbc..b89f655 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -167,7 +167,7 @@ void blk_complete_request(struct request *req)
 {
 	if (unlikely(blk_should_fake_timeout(req->q)))
 		return;
-	if (!blk_mark_rq_complete(req) ||
+	if (!test_and_set_bit(REQ_ATOM_COMPLETE, &req->atomic_flags) ||
 	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))
 		__blk_complete_request(req);
 }
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index b3a7f20..cc10db2 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -94,7 +94,7 @@ static void blk_rq_timed_out(struct request *req)
 		break;
 	case BLK_EH_RESET_TIMER:
 		blk_add_timer(req);
-		blk_clear_rq_complete(req);
+		clear_bit(REQ_ATOM_COMPLETE, &req->atomic_flags);
 		break;
 	case BLK_EH_QUIESCED:
 		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
@@ -122,7 +122,7 @@ static void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout
 		/*
 		 * Check if we raced with end io completion
 		 */
-		if (!blk_mark_rq_complete(rq))
+		if (!test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 			blk_rq_timed_out(rq);
 	} else if (!*next_set || time_after(*next_timeout, rq->deadline)) {
 		*next_timeout = rq->deadline;
@@ -160,7 +160,7 @@ void blk_timeout_work(struct work_struct *work)
  */
 void blk_abort_request(struct request *req)
 {
-	if (blk_mark_rq_complete(req))
+	if (test_and_set_bit(REQ_ATOM_COMPLETE, &req->atomic_flags))
 		return;
 
 	if (req->q->mq_ops) {
diff --git a/block/blk.h b/block/blk.h
index f4c98f8..1d95107 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -118,26 +118,16 @@ void blk_account_io_done(struct request *req);
  * Internal atomic flags for request handling
  */
 enum rq_atomic_flags {
+	/*
+	 * EH timer and IO completion will both attempt to 'grab' the request,
+	 * make sure that only one of them succeeds by setting this flag.
+	 */
 	REQ_ATOM_COMPLETE = 0,
 	REQ_ATOM_STARTED,
 	REQ_ATOM_QUIESCED,
 };
 
 /*
- * EH timer and IO completion will both attempt to 'grab' the request, make
- * sure that only one of them succeeds
- */
-static inline int blk_mark_rq_complete(struct request *rq)
-{
-	return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-}
-
-static inline void blk_clear_rq_complete(struct request *rq)
-{
-	clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-}
-
-/*
  * Internal elevator interface
  */
 #define ELV_ON_HASH(rq) ((rq)->cmd_flags & REQ_HASHED)
-- 
1.9.1


* [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (4 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 05/47] block: remove REQ_ATOM_COMPLETE wrappers Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-24 15:19   ` Jeff Moyer
  2015-11-24 21:21   ` Jens Axboe
  2015-11-20 16:35 ` [PATCH 07/47] nvme: move struct nvme_iod to pci.c Christoph Hellwig
                   ` (28 subsequent siblings)
  34 siblings, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC


We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t.  Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.
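
For callers the conversion looks like this (examples taken from the
hunks below):

	/* before: reserved, non-waiting allocation encoded as gfp_t + bool */
	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC, true);

	/* after: explicit, extensible flags */
	req = blk_mq_alloc_request(dev->admin_q, WRITE,
			BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED);

	/* a plain blocking allocation simply passes 0 */
	req = blk_mq_alloc_request(q, write, 0);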

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c                  |   11 +-
 block/blk-mq-tag.c                |   11 +-
 block/blk-mq.c                    |   20 +-
 block/blk-mq.h                    |   11 +-
 block/blk.h                       |    2 +-
 drivers/block/mtip32xx/mtip32xx.c |    2 +-
 drivers/nvme/host/core.c          | 1172 +++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/pci.c           |   11 +-
 include/linux/blk-mq.h            |    8 +-
 9 files changed, 1210 insertions(+), 38 deletions(-)
 create mode 100644 drivers/nvme/host/core.c

diff --git a/block/blk-core.c b/block/blk-core.c
index af9c315..d2100aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -630,7 +630,7 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, gfp_t gfp)
+int blk_queue_enter(struct request_queue *q, bool nowait)
 {
 	while (true) {
 		int ret;
@@ -638,7 +638,7 @@ int blk_queue_enter(struct request_queue *q, gfp_t gfp)
 		if (percpu_ref_tryget_live(&q->q_usage_counter))
 			return 0;
 
-		if (!gfpflags_allow_blocking(gfp))
+		if (nowait)
 			return -EBUSY;
 
 		ret = wait_event_interruptible(q->mq_freeze_wq,
@@ -1284,7 +1284,9 @@ static struct request *blk_old_get_request(struct request_queue *q, int rw,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	if (q->mq_ops)
-		return blk_mq_alloc_request(q, rw, gfp_mask, false);
+		return blk_mq_alloc_request(q, rw,
+			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
+				0 : BLK_MQ_REQ_NOWAIT);
 	else
 		return blk_old_get_request(q, rw, gfp_mask);
 }
@@ -2052,8 +2054,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
-
+		if (likely(blk_queue_enter(q, false) == 0)) {
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a07ca34..abdbb47 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -268,7 +268,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
 	if (tag != -1)
 		return tag;
 
-	if (!gfpflags_allow_blocking(data->gfp))
+	if (data->flags & BLK_MQ_REQ_NOWAIT)
 		return -1;
 
 	bs = bt_wait_ptr(bt, hctx);
@@ -303,7 +303,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
 		data->ctx = blk_mq_get_ctx(data->q);
 		data->hctx = data->q->mq_ops->map_queue(data->q,
 				data->ctx->cpu);
-		if (data->reserved) {
+		if (data->flags & BLK_MQ_REQ_RESERVED) {
 			bt = &data->hctx->tags->breserved_tags;
 		} else {
 			last_tag = &data->ctx->last_tag;
@@ -349,10 +349,9 @@ static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_alloc_data *data)
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 {
-	if (!data->reserved)
-		return __blk_mq_get_tag(data);
-
-	return __blk_mq_get_reserved_tag(data);
+	if (data->flags & BLK_MQ_REQ_RESERVED)
+		return __blk_mq_get_reserved_tag(data);
+	return __blk_mq_get_tag(data);
 }
 
 static struct bt_wait_state *bt_wake_ptr(struct blk_mq_bitmap_tags *bt)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c932605..6da03f1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -230,8 +230,8 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, int rw)
 	return NULL;
 }
 
-struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
-		bool reserved)
+struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
+		unsigned int flags)
 {
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_hw_ctx *hctx;
@@ -239,24 +239,22 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	struct blk_mq_alloc_data alloc_data;
 	int ret;
 
-	ret = blk_queue_enter(q, gfp);
+	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
 	if (ret)
 		return ERR_PTR(ret);
 
 	ctx = blk_mq_get_ctx(q);
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
-	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
-			reserved, ctx, hctx);
+	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
 
 	rq = __blk_mq_alloc_request(&alloc_data, rw);
-	if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
+	if (!rq && !(flags & BLK_MQ_REQ_NOWAIT)) {
 		__blk_mq_run_hw_queue(hctx);
 		blk_mq_put_ctx(ctx);
 
 		ctx = blk_mq_get_ctx(q);
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
-		blk_mq_set_alloc_data(&alloc_data, q, gfp, reserved, ctx,
-				hctx);
+		blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
 		rq =  __blk_mq_alloc_request(&alloc_data, rw);
 		ctx = alloc_data.ctx;
 	}
@@ -1181,8 +1179,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 		rw |= REQ_SYNC;
 
 	trace_block_getrq(q, bio, rw);
-	blk_mq_set_alloc_data(&alloc_data, q, GFP_ATOMIC, false, ctx,
-			hctx);
+	blk_mq_set_alloc_data(&alloc_data, q, BLK_MQ_REQ_NOWAIT, ctx, hctx);
 	rq = __blk_mq_alloc_request(&alloc_data, rw);
 	if (unlikely(!rq)) {
 		__blk_mq_run_hw_queue(hctx);
@@ -1191,8 +1188,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 
 		ctx = blk_mq_get_ctx(q);
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
-		blk_mq_set_alloc_data(&alloc_data, q,
-				__GFP_RECLAIM|__GFP_HIGH, false, ctx, hctx);
+		blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
 		rq = __blk_mq_alloc_request(&alloc_data, rw);
 		ctx = alloc_data.ctx;
 		hctx = alloc_data.hctx;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 713820b..eaede8e 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -96,8 +96,7 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
 struct blk_mq_alloc_data {
 	/* input parameter */
 	struct request_queue *q;
-	gfp_t gfp;
-	bool reserved;
+	unsigned int flags;
 
 	/* input & output parameter */
 	struct blk_mq_ctx *ctx;
@@ -105,13 +104,11 @@ struct blk_mq_alloc_data {
 };
 
 static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
-		struct request_queue *q, gfp_t gfp, bool reserved,
-		struct blk_mq_ctx *ctx,
-		struct blk_mq_hw_ctx *hctx)
+		struct request_queue *q, unsigned int flags,
+		struct blk_mq_ctx *ctx, struct blk_mq_hw_ctx *hctx)
 {
 	data->q = q;
-	data->gfp = gfp;
-	data->reserved = reserved;
+	data->flags = flags;
 	data->ctx = ctx;
 	data->hctx = hctx;
 }
diff --git a/block/blk.h b/block/blk.h
index 1d95107..38bf997 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,7 +72,7 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+int blk_queue_enter(struct request_queue *q, bool nowait);
 void blk_queue_exit(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index a28a562..cf3b51a 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -173,7 +173,7 @@ static struct mtip_cmd *mtip_get_int_command(struct driver_data *dd)
 {
 	struct request *rq;
 
-	rq = blk_mq_alloc_request(dd->queue, 0, __GFP_RECLAIM, true);
+	rq = blk_mq_alloc_request(dd->queue, 0, BLK_MQ_REQ_RESERVED);
 	return blk_mq_rq_to_pdu(rq);
 }
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
new file mode 100644
index 0000000..53cf507
--- /dev/null
+++ b/drivers/nvme/host/core.c
@@ -0,0 +1,1172 @@
+/*
+ * NVM Express device driver
+ * Copyright (c) 2011-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/errno.h>
+#include <linux/hdreg.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list_sort.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/pr.h>
+#include <linux/ptrace.h>
+#include <linux/nvme_ioctl.h>
+#include <linux/t10-pi.h>
+#include <scsi/sg.h>
+#include <asm/unaligned.h>
+
+#include "nvme.h"
+
+#define NVME_MINORS		(1U << MINORBITS)
+
+static int nvme_major;
+module_param(nvme_major, int, 0);
+
+static int nvme_char_major;
+module_param(nvme_char_major, int, 0);
+
+static LIST_HEAD(nvme_ctrl_list);
+DEFINE_SPINLOCK(dev_list_lock);
+
+static struct class *nvme_class;
+
+static void nvme_free_ns(struct kref *kref)
+{
+	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
+
+	if (ns->type == NVME_NS_LIGHTNVM)
+		nvme_nvm_unregister(ns->queue, ns->disk->disk_name);
+
+	spin_lock(&dev_list_lock);
+	ns->disk->private_data = NULL;
+	spin_unlock(&dev_list_lock);
+
+	nvme_put_ctrl(ns->ctrl);
+	put_disk(ns->disk);
+	kfree(ns);
+}
+
+static void nvme_put_ns(struct nvme_ns *ns)
+{
+	kref_put(&ns->kref, nvme_free_ns);
+}
+
+static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
+{
+	struct nvme_ns *ns;
+
+	spin_lock(&dev_list_lock);
+	ns = disk->private_data;
+	if (ns && !kref_get_unless_zero(&ns->kref))
+		ns = NULL;
+	spin_unlock(&dev_list_lock);
+
+	return ns;
+}
+
+static struct request *nvme_alloc_request(struct request_queue *q,
+		struct nvme_command *cmd)
+{
+	bool write = cmd->common.opcode & 1;
+	struct request *req;
+
+	req = blk_mq_alloc_request(q, write, 0);
+	if (IS_ERR(req))
+		return req;
+
+	req->cmd_type = REQ_TYPE_DRV_PRIV;
+	req->cmd_flags |= REQ_FAILFAST_DRIVER;
+	req->__data_len = 0;
+	req->__sector = (sector_t) -1;
+	req->bio = req->biotail = NULL;
+
+	req->cmd = (unsigned char *)cmd;
+	req->cmd_len = sizeof(struct nvme_command);
+	req->special = (void *)0;
+
+	return req;
+}
+
+/*
+ * Returns 0 on success.  If the result is negative, it's a Linux error code;
+ * if the result is positive, it's an NVM Express status code
+ */
+int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void *buffer, unsigned bufflen, u32 *result, unsigned timeout)
+{
+	struct request *req;
+	int ret;
+
+	req = nvme_alloc_request(q, cmd);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
+
+	if (buffer && bufflen) {
+		ret = blk_rq_map_kern(q, req, buffer, bufflen, GFP_KERNEL);
+		if (ret)
+			goto out;
+	}
+
+	blk_execute_rq(req->q, NULL, req, 0);
+	if (result)
+		*result = (u32)(uintptr_t)req->special;
+	ret = req->errors;
+ out:
+	blk_mq_free_request(req);
+	return ret;
+}
+
+int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void *buffer, unsigned bufflen)
+{
+	return __nvme_submit_sync_cmd(q, cmd, buffer, bufflen, NULL, 0);
+}
+
+int __nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void __user *ubuffer, unsigned bufflen,
+		void __user *meta_buffer, unsigned meta_len, u32 meta_seed,
+		u32 *result, unsigned timeout)
+{
+	bool write = cmd->common.opcode & 1;
+	struct nvme_ns *ns = q->queuedata;
+	struct gendisk *disk = ns ? ns->disk : NULL;
+	struct request *req;
+	struct bio *bio = NULL;
+	void *meta = NULL;
+	int ret;
+
+	req = nvme_alloc_request(q, cmd);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
+
+	if (ubuffer && bufflen) {
+		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
+				GFP_KERNEL);
+		if (ret)
+			goto out;
+		bio = req->bio;
+
+		if (!disk)
+			goto submit;
+		bio->bi_bdev = bdget_disk(disk, 0);
+		if (!bio->bi_bdev) {
+			ret = -ENODEV;
+			goto out_unmap;
+		}
+
+		if (meta_buffer) {
+			struct bio_integrity_payload *bip;
+
+			meta = kmalloc(meta_len, GFP_KERNEL);
+			if (!meta) {
+				ret = -ENOMEM;
+				goto out_unmap;
+			}
+
+			if (write) {
+				if (copy_from_user(meta, meta_buffer,
+						meta_len)) {
+					ret = -EFAULT;
+					goto out_free_meta;
+				}
+			}
+
+			bip = bio_integrity_alloc(bio, GFP_KERNEL, 1);
+			if (!bip) {
+				ret = -ENOMEM;
+				goto out_free_meta;
+			}
+
+			bip->bip_iter.bi_size = meta_len;
+			bip->bip_iter.bi_sector = meta_seed;
+
+			ret = bio_integrity_add_page(bio, virt_to_page(meta),
+					meta_len, offset_in_page(meta));
+			if (ret != meta_len) {
+				ret = -ENOMEM;
+				goto out_free_meta;
+			}
+		}
+	}
+ submit:
+	blk_execute_rq(req->q, disk, req, 0);
+	ret = req->errors;
+	if (result)
+		*result = (u32)(uintptr_t)req->special;
+	if (meta && !ret && !write) {
+		if (copy_to_user(meta_buffer, meta, meta_len))
+			ret = -EFAULT;
+	}
+ out_free_meta:
+	kfree(meta);
+ out_unmap:
+	if (bio) {
+		if (disk && bio->bi_bdev)
+			bdput(bio->bi_bdev);
+		blk_rq_unmap_user(bio);
+	}
+ out:
+	blk_mq_free_request(req);
+	return ret;
+}
+
+int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void __user *ubuffer, unsigned bufflen, u32 *result,
+		unsigned timeout)
+{
+	return __nvme_submit_user_cmd(q, cmd, ubuffer, bufflen, NULL, 0, 0,
+			result, timeout);
+}
+
+int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
+{
+	struct nvme_command c = { };
+	int error;
+
+	/* gcc-4.4.4 (at least) has issues with initializers and anon unions */
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = cpu_to_le32(1);
+
+	*id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
+	if (!*id)
+		return -ENOMEM;
+
+	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
+			sizeof(struct nvme_id_ctrl));
+	if (error)
+		kfree(*id);
+	return error;
+}
+
+static int nvme_identify_ns_list(struct nvme_ctrl *dev, unsigned nsid, __le32 *ns_list)
+{
+	struct nvme_command c = { };
+
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = cpu_to_le32(2);
+	c.identify.nsid = cpu_to_le32(nsid);
+	return nvme_submit_sync_cmd(dev->admin_q, &c, ns_list, 0x1000);
+}
+
+int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
+		struct nvme_id_ns **id)
+{
+	struct nvme_command c = { };
+	int error;
+
+	/* gcc-4.4.4 (at least) has issues with initializers and anon unions */
+	c.identify.opcode = nvme_admin_identify,
+	c.identify.nsid = cpu_to_le32(nsid),
+
+	*id = kmalloc(sizeof(struct nvme_id_ns), GFP_KERNEL);
+	if (!*id)
+		return -ENOMEM;
+
+	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
+			sizeof(struct nvme_id_ns));
+	if (error)
+		kfree(*id);
+	return error;
+}
+
+int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
+					dma_addr_t dma_addr, u32 *result)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.features.opcode = nvme_admin_get_features;
+	c.features.nsid = cpu_to_le32(nsid);
+	c.features.prp1 = cpu_to_le64(dma_addr);
+	c.features.fid = cpu_to_le32(fid);
+
+	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
+}
+
+int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
+					dma_addr_t dma_addr, u32 *result)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.features.opcode = nvme_admin_set_features;
+	c.features.prp1 = cpu_to_le64(dma_addr);
+	c.features.fid = cpu_to_le32(fid);
+	c.features.dword11 = cpu_to_le32(dword11);
+
+	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
+}
+
+int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log)
+{
+	struct nvme_command c = { };
+	int error;
+
+	c.common.opcode = nvme_admin_get_log_page,
+	c.common.nsid = cpu_to_le32(0xFFFFFFFF),
+	c.common.cdw10[0] = cpu_to_le32(
+			(((sizeof(struct nvme_smart_log) / 4) - 1) << 16) |
+			 NVME_LOG_SMART),
+
+	*log = kmalloc(sizeof(struct nvme_smart_log), GFP_KERNEL);
+	if (!*log)
+		return -ENOMEM;
+
+	error = nvme_submit_sync_cmd(dev->admin_q, &c, *log,
+			sizeof(struct nvme_smart_log));
+	if (error)
+		kfree(*log);
+	return error;
+}
+
+static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
+{
+	struct nvme_user_io io;
+	struct nvme_command c;
+	unsigned length, meta_len;
+	void __user *metadata;
+
+	if (copy_from_user(&io, uio, sizeof(io)))
+		return -EFAULT;
+
+	switch (io.opcode) {
+	case nvme_cmd_write:
+	case nvme_cmd_read:
+	case nvme_cmd_compare:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	length = (io.nblocks + 1) << ns->lba_shift;
+	meta_len = (io.nblocks + 1) * ns->ms;
+	metadata = (void __user *)(uintptr_t)io.metadata;
+
+	if (ns->ext) {
+		length += meta_len;
+		meta_len = 0;
+	} else if (meta_len) {
+		if ((io.metadata & 3) || !io.metadata)
+			return -EINVAL;
+	}
+
+	memset(&c, 0, sizeof(c));
+	c.rw.opcode = io.opcode;
+	c.rw.flags = io.flags;
+	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.slba = cpu_to_le64(io.slba);
+	c.rw.length = cpu_to_le16(io.nblocks);
+	c.rw.control = cpu_to_le16(io.control);
+	c.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
+	c.rw.reftag = cpu_to_le32(io.reftag);
+	c.rw.apptag = cpu_to_le16(io.apptag);
+	c.rw.appmask = cpu_to_le16(io.appmask);
+
+	return __nvme_submit_user_cmd(ns->queue, &c,
+			(void __user *)(uintptr_t)io.addr, length,
+			metadata, meta_len, io.slba, NULL, 0);
+}
+
+static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+			struct nvme_passthru_cmd __user *ucmd)
+{
+	struct nvme_passthru_cmd cmd;
+	struct nvme_command c;
+	unsigned timeout = 0;
+	int status;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+	if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+		return -EFAULT;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = cmd.opcode;
+	c.common.flags = cmd.flags;
+	c.common.nsid = cpu_to_le32(cmd.nsid);
+	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
+	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+	c.common.cdw10[0] = cpu_to_le32(cmd.cdw10);
+	c.common.cdw10[1] = cpu_to_le32(cmd.cdw11);
+	c.common.cdw10[2] = cpu_to_le32(cmd.cdw12);
+	c.common.cdw10[3] = cpu_to_le32(cmd.cdw13);
+	c.common.cdw10[4] = cpu_to_le32(cmd.cdw14);
+	c.common.cdw10[5] = cpu_to_le32(cmd.cdw15);
+
+	if (cmd.timeout_ms)
+		timeout = msecs_to_jiffies(cmd.timeout_ms);
+
+	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
+			(void __user *)cmd.addr, cmd.data_len,
+			&cmd.result, timeout);
+	if (status >= 0) {
+		if (put_user(cmd.result, &ucmd->result))
+			return -EFAULT;
+	}
+
+	return status;
+}
+
+static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+
+	switch (cmd) {
+	case NVME_IOCTL_ID:
+		force_successful_syscall_return();
+		return ns->ns_id;
+	case NVME_IOCTL_ADMIN_CMD:
+		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
+	case NVME_IOCTL_IO_CMD:
+		return nvme_user_cmd(ns->ctrl, ns, (void __user *)arg);
+	case NVME_IOCTL_SUBMIT_IO:
+		return nvme_submit_io(ns, (void __user *)arg);
+	case SG_GET_VERSION_NUM:
+		return nvme_sg_get_version_num((void __user *)arg);
+	case SG_IO:
+		return nvme_sg_io(ns, (void __user *)arg);
+	default:
+		return -ENOTTY;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
+			unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case SG_IO:
+		return -ENOIOCTLCMD;
+	}
+	return nvme_ioctl(bdev, mode, cmd, arg);
+}
+#else
+#define nvme_compat_ioctl	NULL
+#endif
+
+static int nvme_open(struct block_device *bdev, fmode_t mode)
+{
+	return nvme_get_ns_from_disk(bdev->bd_disk) ? 0 : -ENXIO;
+}
+
+static void nvme_release(struct gendisk *disk, fmode_t mode)
+{
+	nvme_put_ns(disk->private_data);
+}
+
+static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bdev->bd_disk) >> 11;
+	return 0;
+}
+
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static void nvme_init_integrity(struct nvme_ns *ns)
+{
+	struct blk_integrity integrity;
+
+	switch (ns->pi_type) {
+	case NVME_NS_DPS_PI_TYPE3:
+		integrity.profile = &t10_pi_type3_crc;
+		break;
+	case NVME_NS_DPS_PI_TYPE1:
+	case NVME_NS_DPS_PI_TYPE2:
+		integrity.profile = &t10_pi_type1_crc;
+		break;
+	default:
+		integrity.profile = NULL;
+		break;
+	}
+	integrity.tuple_size = ns->ms;
+	blk_integrity_register(ns->disk, &integrity);
+	blk_queue_max_integrity_segments(ns->queue, 1);
+}
+#else
+static void nvme_init_integrity(struct nvme_ns *ns)
+{
+}
+#endif /* CONFIG_BLK_DEV_INTEGRITY */
+
+static void nvme_config_discard(struct nvme_ns *ns)
+{
+	u32 logical_block_size = queue_logical_block_size(ns->queue);
+	ns->queue->limits.discard_zeroes_data = 0;
+	ns->queue->limits.discard_alignment = logical_block_size;
+	ns->queue->limits.discard_granularity = logical_block_size;
+	blk_queue_max_discard_sectors(ns->queue, 0xffffffff);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+}
+
+static int nvme_revalidate_disk(struct gendisk *disk)
+{
+	struct nvme_ns *ns = disk->private_data;
+	struct nvme_id_ns *id;
+	u8 lbaf, pi_type;
+	u16 old_ms;
+	unsigned short bs;
+
+	if (nvme_identify_ns(ns->ctrl, ns->ns_id, &id)) {
+		dev_warn(ns->ctrl->dev, "%s: Identify failure nvme%dn%d\n",
+				__func__, ns->ctrl->instance, ns->ns_id);
+		return -ENODEV;
+	}
+	if (id->ncap == 0) {
+		kfree(id);
+		return -ENODEV;
+	}
+
+	if (nvme_nvm_ns_supported(ns, id) && ns->type != NVME_NS_LIGHTNVM) {
+		if (nvme_nvm_register(ns->queue, disk->disk_name)) {
+			dev_warn(ns->ctrl->dev,
+				"%s: LightNVM init failure\n", __func__);
+			kfree(id);
+			return -ENODEV;
+		}
+		ns->type = NVME_NS_LIGHTNVM;
+	}
+
+	old_ms = ns->ms;
+	lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
+	ns->lba_shift = id->lbaf[lbaf].ds;
+	ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
+	ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);
+
+	/*
+	 * If identify namespace failed, use default 512 byte block size so
+	 * block layer can use before failing read/write for 0 capacity.
+	 */
+	if (ns->lba_shift == 0)
+		ns->lba_shift = 9;
+	bs = 1 << ns->lba_shift;
+	/* XXX: PI implementation requires metadata equal t10 pi tuple size */
+	pi_type = ns->ms == sizeof(struct t10_pi_tuple) ?
+					id->dps & NVME_NS_DPS_PI_MASK : 0;
+
+	blk_mq_freeze_queue(disk->queue);
+	if (blk_get_integrity(disk) && (ns->pi_type != pi_type ||
+				ns->ms != old_ms ||
+				bs != queue_logical_block_size(disk->queue) ||
+				(ns->ms && ns->ext)))
+		blk_integrity_unregister(disk);
+
+	ns->pi_type = pi_type;
+	blk_queue_logical_block_size(ns->queue, bs);
+
+	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
+		nvme_init_integrity(ns);
+	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
+		set_capacity(disk, 0);
+	else
+		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+
+	if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM)
+		nvme_config_discard(ns);
+	blk_mq_unfreeze_queue(disk->queue);
+
+	kfree(id);
+	return 0;
+}
+
+static char nvme_pr_type(enum pr_type type)
+{
+	switch (type) {
+	case PR_WRITE_EXCLUSIVE:
+		return 1;
+	case PR_EXCLUSIVE_ACCESS:
+		return 2;
+	case PR_WRITE_EXCLUSIVE_REG_ONLY:
+		return 3;
+	case PR_EXCLUSIVE_ACCESS_REG_ONLY:
+		return 4;
+	case PR_WRITE_EXCLUSIVE_ALL_REGS:
+		return 5;
+	case PR_EXCLUSIVE_ACCESS_ALL_REGS:
+		return 6;
+	default:
+		return 0;
+	}
+};
+
+static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
+				u64 key, u64 sa_key, u8 op)
+{
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+	struct nvme_command c;
+	u8 data[16] = { 0, };
+
+	put_unaligned_le64(key, &data[0]);
+	put_unaligned_le64(sa_key, &data[8]);
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = op;
+	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.cdw10[0] = cpu_to_le32(cdw10);
+
+	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
+}
+
+static int nvme_pr_register(struct block_device *bdev, u64 old,
+		u64 new, unsigned flags)
+{
+	u32 cdw10;
+
+	if (flags & ~PR_FL_IGNORE_KEY)
+		return -EOPNOTSUPP;
+
+	cdw10 = old ? 2 : 0;
+	cdw10 |= (flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0;
+	cdw10 |= (1 << 30) | (1 << 31); /* PTPL=1 */
+	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_register);
+}
+
+static int nvme_pr_reserve(struct block_device *bdev, u64 key,
+		enum pr_type type, unsigned flags)
+{
+	u32 cdw10;
+
+	if (flags & ~PR_FL_IGNORE_KEY)
+		return -EOPNOTSUPP;
+
+	cdw10 = nvme_pr_type(type) << 8;
+	cdw10 |= ((flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0);
+	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_acquire);
+}
+
+static int nvme_pr_preempt(struct block_device *bdev, u64 old, u64 new,
+		enum pr_type type, bool abort)
+{
+	u32 cdw10 = nvme_pr_type(type) << 8 | abort ? 2 : 1;
+	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_acquire);
+}
+
+static int nvme_pr_clear(struct block_device *bdev, u64 key)
+{
+	u32 cdw10 = 1 | key ? 1 << 3 : 0;
+	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_register);
+}
+
+static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type)
+{
+	u32 cdw10 = nvme_pr_type(type) << 8 | key ? 1 << 3 : 0;
+	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
+}
+
+static const struct pr_ops nvme_pr_ops = {
+	.pr_register	= nvme_pr_register,
+	.pr_reserve	= nvme_pr_reserve,
+	.pr_release	= nvme_pr_release,
+	.pr_preempt	= nvme_pr_preempt,
+	.pr_clear	= nvme_pr_clear,
+};
+
+static const struct block_device_operations nvme_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= nvme_ioctl,
+	.compat_ioctl	= nvme_compat_ioctl,
+	.open		= nvme_open,
+	.release	= nvme_release,
+	.getgeo		= nvme_getgeo,
+	.revalidate_disk= nvme_revalidate_disk,
+	.pr_ops		= &nvme_pr_ops,
+};
+
+/*
+ * Initialize the cached copies of the Identify data and various controller
+ * registers in our nvme_ctrl structure.  This should be called as soon as
+ * the admin queue is fully up and running.
+ */
+int nvme_init_identify(struct nvme_ctrl *ctrl)
+{
+	struct nvme_id_ctrl *id;
+	u64 cap;
+	int ret, page_shift;
+
+	ret = ctrl->ops->reg_read32(ctrl, NVME_REG_VS, &ctrl->vs);
+	if (ret) {
+		dev_err(ctrl->dev, "Reading VS failed (%d)\n", ret);
+		return ret;
+	}
+
+	ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap);
+	if (ret) {
+		dev_err(ctrl->dev, "Reading CAP failed (%d)\n", ret);
+		return ret;
+	}
+	page_shift = NVME_CAP_MPSMIN(cap) + 12;
+	ctrl->page_size = 1 << page_shift;
+
+	if (ctrl->vs >= NVME_VS(1, 1))
+		ctrl->subsystem = NVME_CAP_NSSRC(cap);
+
+	ret = nvme_identify_ctrl(ctrl, &id);
+	if (ret) {
+		dev_err(ctrl->dev, "Identify Controller failed (%d)\n", ret);
+		return -EIO;
+	}
+
+	ctrl->oncs = le16_to_cpup(&id->oncs);
+	atomic_set(&ctrl->abort_limit, id->acl + 1);
+	ctrl->vwc = id->vwc;
+	memcpy(ctrl->serial, id->sn, sizeof(id->sn));
+	memcpy(ctrl->model, id->mn, sizeof(id->mn));
+	memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
+	if (id->mdts)
+		ctrl->max_hw_sectors = 1 << (id->mdts + page_shift - 9);
+	else
+		ctrl->max_hw_sectors = UINT_MAX;
+
+	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && id->vs[3]) {
+		unsigned int max_hw_sectors;
+
+		ctrl->stripe_size = 1 << (id->vs[3] + page_shift);
+		max_hw_sectors = ctrl->stripe_size >> (page_shift - 9);
+		if (ctrl->max_hw_sectors) {
+			ctrl->max_hw_sectors = min(max_hw_sectors,
+							ctrl->max_hw_sectors);
+		} else {
+			ctrl->max_hw_sectors = max_hw_sectors;
+		}
+	}
+
+	kfree(id);
+	return 0;
+}
+
+static int nvme_dev_open(struct inode *inode, struct file *file)
+{
+	struct nvme_ctrl *ctrl;
+	int instance = iminor(inode);
+	int ret = -ENODEV;
+
+	spin_lock(&dev_list_lock);
+	list_for_each_entry(ctrl, &nvme_ctrl_list, node) {
+		if (ctrl->instance != instance)
+			continue;
+
+		if (!ctrl->admin_q) {
+			ret = -EWOULDBLOCK;
+			break;
+		}
+		if (!kref_get_unless_zero(&ctrl->kref))
+			break;
+		file->private_data = ctrl;
+		ret = 0;
+		break;
+	}
+	spin_unlock(&dev_list_lock);
+
+	return ret;
+}
+
+static int nvme_dev_release(struct inode *inode, struct file *file)
+{
+	nvme_put_ctrl(file->private_data);
+	return 0;
+}
+
+static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct nvme_ctrl *ctrl = file->private_data;
+	void __user *argp = (void __user *)arg;
+	struct nvme_ns *ns;
+
+	switch (cmd) {
+	case NVME_IOCTL_ADMIN_CMD:
+		return nvme_user_cmd(ctrl, NULL, argp);
+	case NVME_IOCTL_IO_CMD:
+		if (list_empty(&ctrl->namespaces))
+			return -ENOTTY;
+		ns = list_first_entry(&ctrl->namespaces, struct nvme_ns, list);
+		return nvme_user_cmd(ctrl, ns, argp);
+	case NVME_IOCTL_RESET:
+		dev_warn(ctrl->dev, "resetting controller\n");
+		return ctrl->ops->reset_ctrl(ctrl);
+	case NVME_IOCTL_SUBSYS_RESET:
+		return nvme_reset_subsystem(ctrl);
+	default:
+		return -ENOTTY;
+	}
+}
+
+static const struct file_operations nvme_dev_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nvme_dev_open,
+	.release	= nvme_dev_release,
+	.unlocked_ioctl	= nvme_dev_ioctl,
+	.compat_ioctl	= nvme_dev_ioctl,
+};
+
+static ssize_t nvme_sysfs_reset(struct device *dev,
+				struct device_attribute *attr, const char *buf,
+				size_t count)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+	int ret;
+
+	ret = ctrl->ops->reset_ctrl(ctrl);
+	if (ret < 0)
+		return ret;
+	return count;
+}
+static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
+
+static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
+	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
+
+	return nsa->ns_id - nsb->ns_id;
+}
+
+static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+{
+	struct nvme_ns *ns;
+
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (ns->ns_id == nsid)
+			return ns;
+		if (ns->ns_id > nsid)
+			break;
+	}
+	return NULL;
+}
+
+static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+{
+	struct nvme_ns *ns;
+	struct gendisk *disk;
+	int node = dev_to_node(ctrl->dev);
+
+	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
+	if (!ns)
+		return;
+
+	ns->queue = blk_mq_init_queue(ctrl->tagset);
+	if (IS_ERR(ns->queue))
+		goto out_free_ns;
+	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
+	ns->queue->queuedata = ns;
+	ns->ctrl = ctrl;
+
+	disk = alloc_disk_node(0, node);
+	if (!disk)
+		goto out_free_queue;
+
+	kref_init(&ns->kref);
+	ns->ns_id = nsid;
+	ns->disk = disk;
+	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
+
+	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
+	if (ctrl->max_hw_sectors) {
+		blk_queue_max_hw_sectors(ns->queue, ctrl->max_hw_sectors);
+		blk_queue_max_segments(ns->queue,
+			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
+	}
+	if (ctrl->stripe_size)
+		blk_queue_chunk_sectors(ns->queue, ctrl->stripe_size >> 9);
+	if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
+		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
+	blk_queue_virt_boundary(ns->queue, ctrl->page_size - 1);
+
+	disk->major = nvme_major;
+	disk->first_minor = 0;
+	disk->fops = &nvme_fops;
+	disk->private_data = ns;
+	disk->queue = ns->queue;
+	disk->driverfs_dev = ctrl->device;
+	disk->flags = GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance, nsid);
+
+	if (nvme_revalidate_disk(ns->disk))
+		goto out_free_disk;
+
+	list_add_tail(&ns->list, &ctrl->namespaces);
+	kref_get(&ctrl->kref);
+	if (ns->type != NVME_NS_LIGHTNVM)
+		add_disk(ns->disk);
+
+	return;
+ out_free_disk:
+	kfree(disk);
+ out_free_queue:
+	blk_cleanup_queue(ns->queue);
+ out_free_ns:
+	kfree(ns);
+}
+
+static void nvme_ns_remove(struct nvme_ns *ns)
+{
+	bool kill = nvme_io_incapable(ns->ctrl) &&
+			!blk_queue_dying(ns->queue);
+
+	if (kill)
+		blk_set_queue_dying(ns->queue);
+	if (ns->disk->flags & GENHD_FL_UP) {
+		if (blk_get_integrity(ns->disk))
+			blk_integrity_unregister(ns->disk);
+		del_gendisk(ns->disk);
+	}
+	if (kill || !blk_queue_dying(ns->queue)) {
+		blk_mq_abort_requeue_list(ns->queue);
+		blk_cleanup_queue(ns->queue);
+	}
+	list_del_init(&ns->list);
+	nvme_put_ns(ns);
+}
+
+static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+{
+	struct nvme_ns *ns;
+
+	ns = nvme_find_ns(ctrl, nsid);
+	if (ns) {
+		if (revalidate_disk(ns->disk))
+			nvme_ns_remove(ns);
+	} else
+		nvme_alloc_ns(ctrl, nsid);
+}
+
+static int nvme_scan_ns_list(struct nvme_ctrl *ctrl, unsigned nn)
+{
+	struct nvme_ns *ns;
+	__le32 *ns_list;
+	unsigned i, j, nsid, prev = 0, num_lists = DIV_ROUND_UP(nn, 1024);
+	int ret = 0;
+
+	ns_list = kzalloc(0x1000, GFP_KERNEL);
+	if (!ns_list)
+		return -ENOMEM;
+
+	for (i = 0; i < num_lists; i++) {
+		ret = nvme_identify_ns_list(ctrl, prev, ns_list);
+		if (ret)
+			goto out;
+
+		for (j = 0; j < min(nn, 1024U); j++) {
+			nsid = le32_to_cpu(ns_list[j]);
+			if (!nsid)
+				goto out;
+
+			nvme_validate_ns(ctrl, nsid);
+
+			while (++prev < nsid) {
+				ns = nvme_find_ns(ctrl, prev);
+				if (ns)
+					nvme_ns_remove(ns);
+			}
+		}
+		nn -= j;
+	}
+ out:
+	kfree(ns_list);
+	return ret;
+}
+
+static void __nvme_scan_namespaces(struct nvme_ctrl *ctrl, unsigned nn)
+{
+	struct nvme_ns *ns, *next;
+	unsigned i;
+
+	for (i = 1; i <= nn; i++)
+		nvme_validate_ns(ctrl, i);
+
+	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
+		if (ns->ns_id > nn)
+			nvme_ns_remove(ns);
+	}
+}
+
+void nvme_scan_namespaces(struct nvme_ctrl *ctrl)
+{
+	struct nvme_id_ctrl *id;
+	unsigned nn;
+
+	if (nvme_identify_ctrl(ctrl, &id))
+		return;
+
+	nn = le32_to_cpu(id->nn);
+	if (ctrl->vs >= NVME_VS(1, 1)) {
+		if (!nvme_scan_ns_list(ctrl, nn))
+			goto done;
+	}
+	__nvme_scan_namespaces(ctrl, le32_to_cpup(&id->nn));
+ done:
+	list_sort(NULL, &ctrl->namespaces, ns_cmp);
+	kfree(id);
+}
+
+void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns, *next;
+
+	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list)
+		nvme_ns_remove(ns);
+}
+
+static DEFINE_IDA(nvme_instance_ida);
+
+static int nvme_set_instance(struct nvme_ctrl *ctrl)
+{
+	int instance, error;
+
+	do {
+		if (!ida_pre_get(&nvme_instance_ida, GFP_KERNEL))
+			return -ENODEV;
+
+		spin_lock(&dev_list_lock);
+		error = ida_get_new(&nvme_instance_ida, &instance);
+		spin_unlock(&dev_list_lock);
+	} while (error == -EAGAIN);
+
+	if (error)
+		return -ENODEV;
+
+	ctrl->instance = instance;
+	return 0;
+}
+
+static void nvme_release_instance(struct nvme_ctrl *ctrl)
+{
+	spin_lock(&dev_list_lock);
+	ida_remove(&nvme_instance_ida, ctrl->instance);
+	spin_unlock(&dev_list_lock);
+}
+
+void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
+ {
+	device_remove_file(ctrl->device, &dev_attr_reset_controller);
+	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
+
+	spin_lock(&dev_list_lock);
+	list_del(&ctrl->node);
+	spin_unlock(&dev_list_lock);
+}
+
+static void nvme_free_ctrl(struct kref *kref)
+{
+	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
+
+	put_device(ctrl->device);
+	nvme_release_instance(ctrl);
+
+	ctrl->ops->free_ctrl(ctrl);
+}
+
+void nvme_put_ctrl(struct nvme_ctrl *ctrl)
+{
+	kref_put(&ctrl->kref, nvme_free_ctrl);
+}
+
+/*
+ * Initialize an NVMe controller structure.  This needs to be called during
+ * the earliest initialization so that we have the initialized structure
+ * around during probing.
+ */
+int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
+		const struct nvme_ctrl_ops *ops, u16 vendor,
+		unsigned long quirks)
+{
+	int ret;
+
+	INIT_LIST_HEAD(&ctrl->namespaces);
+	kref_init(&ctrl->kref);
+	ctrl->dev = dev;
+	ctrl->ops = ops;
+	ctrl->vendor = vendor;
+	ctrl->quirks = quirks;
+
+	ret = nvme_set_instance(ctrl);
+	if (ret)
+		goto out;
+
+	ctrl->device = device_create(nvme_class, ctrl->dev,
+				MKDEV(nvme_char_major, ctrl->instance),
+				dev, "nvme%d", ctrl->instance);
+	if (IS_ERR(ctrl->device)) {
+		ret = PTR_ERR(ctrl->device);
+		goto out_release_instance;
+	}
+	get_device(ctrl->device);
+	dev_set_drvdata(ctrl->device, ctrl);
+
+	ret = device_create_file(ctrl->device, &dev_attr_reset_controller);
+	if (ret)
+		goto out_put_device;
+
+	spin_lock(&dev_list_lock);
+	list_add_tail(&ctrl->node, &nvme_ctrl_list);
+	spin_unlock(&dev_list_lock);
+
+	return 0;
+
+out_put_device:
+	put_device(ctrl->device);
+	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
+out_release_instance:
+	nvme_release_instance(ctrl);
+out:
+	return ret;
+}
+
+int __init nvme_core_init(void)
+{
+	int result;
+
+	result = register_blkdev(nvme_major, "nvme");
+	if (result < 0)
+		return result;
+	else if (result > 0)
+		nvme_major = result;
+
+	result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
+							&nvme_dev_fops);
+	if (result < 0)
+		goto unregister_blkdev;
+	else if (result > 0)
+		nvme_char_major = result;
+
+	nvme_class = class_create(THIS_MODULE, "nvme");
+	if (IS_ERR(nvme_class)) {
+		result = PTR_ERR(nvme_class);
+		goto unregister_chrdev;
+	}
+
+	return 0;
+
+ unregister_chrdev:
+	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
+ unregister_blkdev:
+	unregister_blkdev(nvme_major, "nvme");
+	return result;
+}
+
+void nvme_core_exit(void)
+{
+	unregister_blkdev(nvme_major, "nvme");
+	class_destroy(nvme_class);
+	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
+}
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9444884..5c5f455 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1040,7 +1040,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	struct request *req;
 	int ret;
 
-	req = blk_mq_alloc_request(q, write, GFP_KERNEL, false);
+	req = blk_mq_alloc_request(q, write, 0);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
@@ -1093,7 +1093,8 @@ static int nvme_submit_async_admin_req(struct nvme_dev *dev)
 	struct nvme_cmd_info *cmd_info;
 	struct request *req;
 
-	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC, true);
+	req = blk_mq_alloc_request(dev->admin_q, WRITE,
+			BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
@@ -1118,7 +1119,7 @@ static int nvme_submit_admin_async_cmd(struct nvme_dev *dev,
 	struct request *req;
 	struct nvme_cmd_info *cmd_rq;
 
-	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+	req = blk_mq_alloc_request(dev->admin_q, WRITE, 0);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
@@ -1319,8 +1320,8 @@ static void nvme_abort_req(struct request *req)
 	if (!dev->abort_limit)
 		return;
 
-	abort_req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC,
-									false);
+	abort_req = blk_mq_alloc_request(dev->admin_q, WRITE,
+			BLK_MQ_REQ_NOWAIT);
 	if (IS_ERR(abort_req))
 		return;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index daf17d7..7fc9296 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -188,8 +188,14 @@ void blk_mq_insert_request(struct request *, bool, bool, bool);
 void blk_mq_free_request(struct request *rq);
 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
+
+enum {
+	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
+	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
+};
+
 struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
-		gfp_t gfp, bool reserved);
+		unsigned int flags);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag);
 struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags);
 
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 07/47] nvme: move struct nvme_iod to pci.c
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (5 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 08/47] nvme: split command submission helpers out of pci.c Christoph Hellwig
                   ` (27 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


This structure is specific to the PCIe driver internals and should be moved
to pci.c.
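
For illustration, the "C doesn't let me express that" remark in the
comment refers to the PRP list and scatterlist entries that live past
the end of the struct; the allocation has to size them explicitly,
roughly like this (a minimal sketch, not the driver's actual
nvme_alloc_iod; nseg and prp_bytes are hypothetical inputs):

	struct nvme_iod *iod;

	/*
	 * sg[] is a zero-length array, so the scatterlist entries (and
	 * the PRP list stored behind them) must be counted into the
	 * allocation size up front.
	 */
	iod = kmalloc(sizeof(*iod) + nseg * sizeof(struct scatterlist) +
			prp_bytes, GFP_KERNEL);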

Signed-off-by: Christoph Hellwig <hch at lst.de>
Acked-by: Keith Busch <keith.busch at intel.com>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/nvme.h | 17 -----------------
 drivers/nvme/host/pci.c  | 17 +++++++++++++++++
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index fdb4e5b..2cead2c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -94,23 +94,6 @@ struct nvme_ns {
 	u32 mode_select_block_len;
 };
 
-/*
- * The nvme_iod describes the data in an I/O, including the list of PRP
- * entries.  You can't see it in this data structure because C doesn't let
- * me express that.  Use nvme_alloc_iod to ensure there's enough space
- * allocated to store the PRP list.
- */
-struct nvme_iod {
-	unsigned long private;	/* For the use of the submitter of the I/O */
-	int npages;		/* In the PRP list. 0 means small pool in use */
-	int offset;		/* Of PRP list */
-	int nents;		/* Used in scatterlist */
-	int length;		/* Of data, in bytes */
-	dma_addr_t first_dma;
-	struct scatterlist meta_sg[1]; /* metadata requires single contiguous buffer */
-	struct scatterlist sg[0];
-};
-
 static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
 {
 	return (sector >> (ns->lba_shift - 9));
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5c5f455..3c2fba7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -130,6 +130,23 @@ struct nvme_queue {
 };
 
 /*
+ * The nvme_iod describes the data in an I/O, including the list of PRP
+ * entries.  You can't see it in this data structure because C doesn't let
+ * me express that.  Use nvme_alloc_iod to ensure there's enough space
+ * allocated to store the PRP list.
+ */
+struct nvme_iod {
+	unsigned long private;	/* For the use of the submitter of the I/O */
+	int npages;		/* In the PRP list. 0 means small pool in use */
+	int offset;		/* Of PRP list */
+	int nents;		/* Used in scatterlist */
+	int length;		/* Of data, in bytes */
+	dma_addr_t first_dma;
+	struct scatterlist meta_sg[1]; /* metadata requires single contiguous buffer */
+	struct scatterlist sg[0];
+};
+
+/*
 * Check we didn't inadvertently grow the command struct
  */
 static inline void _nvme_check_size(void)
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 08/47] nvme: split command submission helpers out of pci.c
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (6 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 07/47] nvme: move struct nvme_iod to pci.c Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 09/47] nvme: add a vendor field to struct nvme_dev Christoph Hellwig
                   ` (26 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


Create a new core.c and start by adding the command submission helpers
to it, which are already abstracted away from the actual hardware queues
by the block layer.
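
The key point is that these helpers only need a struct request_queue,
so any transport that registers one gets them for free.  A caller looks
roughly like this (a minimal sketch modeled on the nvme_identify_ctrl
code in the diff below):

	struct nvme_id_ctrl *id;
	struct nvme_command c = { };
	int error;

	c.identify.opcode = nvme_admin_identify;
	c.identify.cns = cpu_to_le32(1);

	id = kmalloc(sizeof(*id), GFP_KERNEL);
	if (!id)
		return -ENOMEM;

	/* Allocates a request via blk-mq and waits in blk_execute_rq(). */
	error = nvme_submit_sync_cmd(dev->admin_q, &c, id, sizeof(*id));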

Signed-off-by: Christoph Hellwig <hch at lst.de>
Acked-by: Keith Busch <keith.busch at intel.com>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/Makefile |    2 +-
 drivers/nvme/host/core.c   | 1059 ++------------------------------------------
 drivers/nvme/host/nvme.h   |    3 +
 drivers/nvme/host/pci.c    |  155 +------
 4 files changed, 35 insertions(+), 1184 deletions(-)

diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index 219dc206..3e26dc9 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -1,4 +1,4 @@
 
 obj-$(CONFIG_BLK_DEV_NVME)     += nvme.o
 
-nvme-y		+= pci.o scsi.o lightnvm.o
+nvme-y		+= core.o pci.o scsi.o lightnvm.o
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 53cf507..ce938a4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -15,77 +15,28 @@
 #include <linux/blkdev.h>
 #include <linux/blk-mq.h>
 #include <linux/errno.h>
-#include <linux/hdreg.h>
 #include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/list_sort.h>
 #include <linux/slab.h>
 #include <linux/types.h>
-#include <linux/pr.h>
-#include <linux/ptrace.h>
-#include <linux/nvme_ioctl.h>
-#include <linux/t10-pi.h>
-#include <scsi/sg.h>
-#include <asm/unaligned.h>
 
 #include "nvme.h"
 
-#define NVME_MINORS		(1U << MINORBITS)
-
-static int nvme_major;
-module_param(nvme_major, int, 0);
-
-static int nvme_char_major;
-module_param(nvme_char_major, int, 0);
-
-static LIST_HEAD(nvme_ctrl_list);
-DEFINE_SPINLOCK(dev_list_lock);
-
-static struct class *nvme_class;
-
-static void nvme_free_ns(struct kref *kref)
-{
-	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
-
-	if (ns->type == NVME_NS_LIGHTNVM)
-		nvme_nvm_unregister(ns->queue, ns->disk->disk_name);
-
-	spin_lock(&dev_list_lock);
-	ns->disk->private_data = NULL;
-	spin_unlock(&dev_list_lock);
-
-	nvme_put_ctrl(ns->ctrl);
-	put_disk(ns->disk);
-	kfree(ns);
-}
-
-static void nvme_put_ns(struct nvme_ns *ns)
-{
-	kref_put(&ns->kref, nvme_free_ns);
-}
-
-static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
-{
-	struct nvme_ns *ns;
-
-	spin_lock(&dev_list_lock);
-	ns = disk->private_data;
-	if (ns && !kref_get_unless_zero(&ns->kref))
-		ns = NULL;
-	spin_unlock(&dev_list_lock);
-
-	return ns;
-}
-
-static struct request *nvme_alloc_request(struct request_queue *q,
-		struct nvme_command *cmd)
+/*
+ * Returns 0 on success.  If the result is negative, it's a Linux error code;
+ * if the result is positive, it's an NVM Express status code
+ */
+int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void *buffer, void __user *ubuffer, unsigned bufflen,
+		u32 *result, unsigned timeout)
 {
 	bool write = cmd->common.opcode & 1;
+	struct bio *bio = NULL;
 	struct request *req;
+	int ret;
 
 	req = blk_mq_alloc_request(q, write, 0);
 	if (IS_ERR(req))
-		return req;
+		return PTR_ERR(req);
 
 	req->cmd_type = REQ_TYPE_DRV_PRIV;
 	req->cmd_flags |= REQ_FAILFAST_DRIVER;
@@ -93,149 +44,42 @@ static struct request *nvme_alloc_request(struct request_queue *q,
 	req->__sector = (sector_t) -1;
 	req->bio = req->biotail = NULL;
 
+	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
+
 	req->cmd = (unsigned char *)cmd;
 	req->cmd_len = sizeof(struct nvme_command);
 	req->special = (void *)0;
 
-	return req;
-}
-
-/*
- * Returns 0 on success.  If the result is negative, it's a Linux error code;
- * if the result is positive, it's an NVM Express status code
- */
-int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void *buffer, unsigned bufflen, u32 *result, unsigned timeout)
-{
-	struct request *req;
-	int ret;
-
-	req = nvme_alloc_request(q, cmd);
-	if (IS_ERR(req))
-		return PTR_ERR(req);
-
-	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
-
 	if (buffer && bufflen) {
 		ret = blk_rq_map_kern(q, req, buffer, bufflen, GFP_KERNEL);
 		if (ret)
 			goto out;
-	}
-
-	blk_execute_rq(req->q, NULL, req, 0);
-	if (result)
-		*result = (u32)(uintptr_t)req->special;
-	ret = req->errors;
- out:
-	blk_mq_free_request(req);
-	return ret;
-}
-
-int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void *buffer, unsigned bufflen)
-{
-	return __nvme_submit_sync_cmd(q, cmd, buffer, bufflen, NULL, 0);
-}
-
-int __nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void __user *ubuffer, unsigned bufflen,
-		void __user *meta_buffer, unsigned meta_len, u32 meta_seed,
-		u32 *result, unsigned timeout)
-{
-	bool write = cmd->common.opcode & 1;
-	struct nvme_ns *ns = q->queuedata;
-	struct gendisk *disk = ns ? ns->disk : NULL;
-	struct request *req;
-	struct bio *bio = NULL;
-	void *meta = NULL;
-	int ret;
-
-	req = nvme_alloc_request(q, cmd);
-	if (IS_ERR(req))
-		return PTR_ERR(req);
-
-	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
-
-	if (ubuffer && bufflen) {
+	} else if (ubuffer && bufflen) {
 		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
 				GFP_KERNEL);
 		if (ret)
 			goto out;
 		bio = req->bio;
-
-		if (!disk)
-			goto submit;
-		bio->bi_bdev = bdget_disk(disk, 0);
-		if (!bio->bi_bdev) {
-			ret = -ENODEV;
-			goto out_unmap;
-		}
-
-		if (meta_buffer) {
-			struct bio_integrity_payload *bip;
-
-			meta = kmalloc(meta_len, GFP_KERNEL);
-			if (!meta) {
-				ret = -ENOMEM;
-				goto out_unmap;
-			}
-
-			if (write) {
-				if (copy_from_user(meta, meta_buffer,
-						meta_len)) {
-					ret = -EFAULT;
-					goto out_free_meta;
-				}
-			}
-
-			bip = bio_integrity_alloc(bio, GFP_KERNEL, 1);
-			if (!bip) {
-				ret = -ENOMEM;
-				goto out_free_meta;
-			}
-
-			bip->bip_iter.bi_size = meta_len;
-			bip->bip_iter.bi_sector = meta_seed;
-
-			ret = bio_integrity_add_page(bio, virt_to_page(meta),
-					meta_len, offset_in_page(meta));
-			if (ret != meta_len) {
-				ret = -ENOMEM;
-				goto out_free_meta;
-			}
-		}
 	}
- submit:
-	blk_execute_rq(req->q, disk, req, 0);
-	ret = req->errors;
+
+	blk_execute_rq(req->q, NULL, req, 0);
+	if (bio)
+		blk_rq_unmap_user(bio);
 	if (result)
 		*result = (u32)(uintptr_t)req->special;
-	if (meta && !ret && !write) {
-		if (copy_to_user(meta_buffer, meta, meta_len))
-			ret = -EFAULT;
-	}
- out_free_meta:
-	kfree(meta);
- out_unmap:
-	if (bio) {
-		if (disk && bio->bi_bdev)
-			bdput(bio->bi_bdev);
-		blk_rq_unmap_user(bio);
-	}
+	ret = req->errors;
  out:
 	blk_mq_free_request(req);
 	return ret;
 }
 
-int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void __user *ubuffer, unsigned bufflen, u32 *result,
-		unsigned timeout)
+int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void *buffer, unsigned bufflen)
 {
-	return __nvme_submit_user_cmd(q, cmd, ubuffer, bufflen, NULL, 0, 0,
-			result, timeout);
+	return __nvme_submit_sync_cmd(q, cmd, buffer, NULL, bufflen, NULL, 0);
 }
 
-int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
+int nvme_identify_ctrl(struct nvme_dev *dev, struct nvme_id_ctrl **id)
 {
 	struct nvme_command c = { };
 	int error;
@@ -255,17 +99,7 @@ int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 	return error;
 }
 
-static int nvme_identify_ns_list(struct nvme_ctrl *dev, unsigned nsid, __le32 *ns_list)
-{
-	struct nvme_command c = { };
-
-	c.identify.opcode = nvme_admin_identify;
-	c.identify.cns = cpu_to_le32(2);
-	c.identify.nsid = cpu_to_le32(nsid);
-	return nvme_submit_sync_cmd(dev->admin_q, &c, ns_list, 0x1000);
-}
-
-int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
+int nvme_identify_ns(struct nvme_dev *dev, unsigned nsid,
 		struct nvme_id_ns **id)
 {
 	struct nvme_command c = { };
@@ -286,7 +120,7 @@ int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 	return error;
 }
 
-int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
+int nvme_get_features(struct nvme_dev *dev, unsigned fid, unsigned nsid,
 					dma_addr_t dma_addr, u32 *result)
 {
 	struct nvme_command c;
@@ -297,10 +131,11 @@ int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 	c.features.prp1 = cpu_to_le64(dma_addr);
 	c.features.fid = cpu_to_le32(fid);
 
-	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
+	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, NULL, 0,
+			result, 0);
 }
 
-int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
+int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
 					dma_addr_t dma_addr, u32 *result)
 {
 	struct nvme_command c;
@@ -311,10 +146,11 @@ int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
 	c.features.fid = cpu_to_le32(fid);
 	c.features.dword11 = cpu_to_le32(dword11);
 
-	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
+	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, NULL, 0,
+			result, 0);
 }
 
-int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log)
+int nvme_get_log_page(struct nvme_dev *dev, struct nvme_smart_log **log)
 {
 	struct nvme_command c = { };
 	int error;
@@ -335,838 +171,3 @@ int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log)
 		kfree(*log);
 	return error;
 }
-
-static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
-{
-	struct nvme_user_io io;
-	struct nvme_command c;
-	unsigned length, meta_len;
-	void __user *metadata;
-
-	if (copy_from_user(&io, uio, sizeof(io)))
-		return -EFAULT;
-
-	switch (io.opcode) {
-	case nvme_cmd_write:
-	case nvme_cmd_read:
-	case nvme_cmd_compare:
-		break;
-	default:
-		return -EINVAL;
-	}
-
-	length = (io.nblocks + 1) << ns->lba_shift;
-	meta_len = (io.nblocks + 1) * ns->ms;
-	metadata = (void __user *)(uintptr_t)io.metadata;
-
-	if (ns->ext) {
-		length += meta_len;
-		meta_len = 0;
-	} else if (meta_len) {
-		if ((io.metadata & 3) || !io.metadata)
-			return -EINVAL;
-	}
-
-	memset(&c, 0, sizeof(c));
-	c.rw.opcode = io.opcode;
-	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->ns_id);
-	c.rw.slba = cpu_to_le64(io.slba);
-	c.rw.length = cpu_to_le16(io.nblocks);
-	c.rw.control = cpu_to_le16(io.control);
-	c.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
-	c.rw.reftag = cpu_to_le32(io.reftag);
-	c.rw.apptag = cpu_to_le16(io.apptag);
-	c.rw.appmask = cpu_to_le16(io.appmask);
-
-	return __nvme_submit_user_cmd(ns->queue, &c,
-			(void __user *)(uintptr_t)io.addr, length,
-			metadata, meta_len, io.slba, NULL, 0);
-}
-
-static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
-			struct nvme_passthru_cmd __user *ucmd)
-{
-	struct nvme_passthru_cmd cmd;
-	struct nvme_command c;
-	unsigned timeout = 0;
-	int status;
-
-	if (!capable(CAP_SYS_ADMIN))
-		return -EACCES;
-	if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
-		return -EFAULT;
-
-	memset(&c, 0, sizeof(c));
-	c.common.opcode = cmd.opcode;
-	c.common.flags = cmd.flags;
-	c.common.nsid = cpu_to_le32(cmd.nsid);
-	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
-	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
-	c.common.cdw10[0] = cpu_to_le32(cmd.cdw10);
-	c.common.cdw10[1] = cpu_to_le32(cmd.cdw11);
-	c.common.cdw10[2] = cpu_to_le32(cmd.cdw12);
-	c.common.cdw10[3] = cpu_to_le32(cmd.cdw13);
-	c.common.cdw10[4] = cpu_to_le32(cmd.cdw14);
-	c.common.cdw10[5] = cpu_to_le32(cmd.cdw15);
-
-	if (cmd.timeout_ms)
-		timeout = msecs_to_jiffies(cmd.timeout_ms);
-
-	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
-			(void __user *)cmd.addr, cmd.data_len,
-			&cmd.result, timeout);
-	if (status >= 0) {
-		if (put_user(cmd.result, &ucmd->result))
-			return -EFAULT;
-	}
-
-	return status;
-}
-
-static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
-		unsigned int cmd, unsigned long arg)
-{
-	struct nvme_ns *ns = bdev->bd_disk->private_data;
-
-	switch (cmd) {
-	case NVME_IOCTL_ID:
-		force_successful_syscall_return();
-		return ns->ns_id;
-	case NVME_IOCTL_ADMIN_CMD:
-		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
-	case NVME_IOCTL_IO_CMD:
-		return nvme_user_cmd(ns->ctrl, ns, (void __user *)arg);
-	case NVME_IOCTL_SUBMIT_IO:
-		return nvme_submit_io(ns, (void __user *)arg);
-	case SG_GET_VERSION_NUM:
-		return nvme_sg_get_version_num((void __user *)arg);
-	case SG_IO:
-		return nvme_sg_io(ns, (void __user *)arg);
-	default:
-		return -ENOTTY;
-	}
-}
-
-#ifdef CONFIG_COMPAT
-static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
-			unsigned int cmd, unsigned long arg)
-{
-	switch (cmd) {
-	case SG_IO:
-		return -ENOIOCTLCMD;
-	}
-	return nvme_ioctl(bdev, mode, cmd, arg);
-}
-#else
-#define nvme_compat_ioctl	NULL
-#endif
-
-static int nvme_open(struct block_device *bdev, fmode_t mode)
-{
-	return nvme_get_ns_from_disk(bdev->bd_disk) ? 0 : -ENXIO;
-}
-
-static void nvme_release(struct gendisk *disk, fmode_t mode)
-{
-	nvme_put_ns(disk->private_data);
-}
-
-static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
-{
-	/* some standard values */
-	geo->heads = 1 << 6;
-	geo->sectors = 1 << 5;
-	geo->cylinders = get_capacity(bdev->bd_disk) >> 11;
-	return 0;
-}
-
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-static void nvme_init_integrity(struct nvme_ns *ns)
-{
-	struct blk_integrity integrity;
-
-	switch (ns->pi_type) {
-	case NVME_NS_DPS_PI_TYPE3:
-		integrity.profile = &t10_pi_type3_crc;
-		break;
-	case NVME_NS_DPS_PI_TYPE1:
-	case NVME_NS_DPS_PI_TYPE2:
-		integrity.profile = &t10_pi_type1_crc;
-		break;
-	default:
-		integrity.profile = NULL;
-		break;
-	}
-	integrity.tuple_size = ns->ms;
-	blk_integrity_register(ns->disk, &integrity);
-	blk_queue_max_integrity_segments(ns->queue, 1);
-}
-#else
-static void nvme_init_integrity(struct nvme_ns *ns)
-{
-}
-#endif /* CONFIG_BLK_DEV_INTEGRITY */
-
-static void nvme_config_discard(struct nvme_ns *ns)
-{
-	u32 logical_block_size = queue_logical_block_size(ns->queue);
-	ns->queue->limits.discard_zeroes_data = 0;
-	ns->queue->limits.discard_alignment = logical_block_size;
-	ns->queue->limits.discard_granularity = logical_block_size;
-	blk_queue_max_discard_sectors(ns->queue, 0xffffffff);
-	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
-}
-
-static int nvme_revalidate_disk(struct gendisk *disk)
-{
-	struct nvme_ns *ns = disk->private_data;
-	struct nvme_id_ns *id;
-	u8 lbaf, pi_type;
-	u16 old_ms;
-	unsigned short bs;
-
-	if (nvme_identify_ns(ns->ctrl, ns->ns_id, &id)) {
-		dev_warn(ns->ctrl->dev, "%s: Identify failure nvme%dn%d\n",
-				__func__, ns->ctrl->instance, ns->ns_id);
-		return -ENODEV;
-	}
-	if (id->ncap == 0) {
-		kfree(id);
-		return -ENODEV;
-	}
-
-	if (nvme_nvm_ns_supported(ns, id) && ns->type != NVME_NS_LIGHTNVM) {
-		if (nvme_nvm_register(ns->queue, disk->disk_name)) {
-			dev_warn(ns->ctrl->dev,
-				"%s: LightNVM init failure\n", __func__);
-			kfree(id);
-			return -ENODEV;
-		}
-		ns->type = NVME_NS_LIGHTNVM;
-	}
-
-	old_ms = ns->ms;
-	lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
-	ns->lba_shift = id->lbaf[lbaf].ds;
-	ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
-	ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);
-
-	/*
-	 * If identify namespace failed, use default 512 byte block size so
-	 * block layer can use before failing read/write for 0 capacity.
-	 */
-	if (ns->lba_shift == 0)
-		ns->lba_shift = 9;
-	bs = 1 << ns->lba_shift;
-	/* XXX: PI implementation requires metadata equal t10 pi tuple size */
-	pi_type = ns->ms == sizeof(struct t10_pi_tuple) ?
-					id->dps & NVME_NS_DPS_PI_MASK : 0;
-
-	blk_mq_freeze_queue(disk->queue);
-	if (blk_get_integrity(disk) && (ns->pi_type != pi_type ||
-				ns->ms != old_ms ||
-				bs != queue_logical_block_size(disk->queue) ||
-				(ns->ms && ns->ext)))
-		blk_integrity_unregister(disk);
-
-	ns->pi_type = pi_type;
-	blk_queue_logical_block_size(ns->queue, bs);
-
-	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
-		nvme_init_integrity(ns);
-	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
-		set_capacity(disk, 0);
-	else
-		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
-
-	if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM)
-		nvme_config_discard(ns);
-	blk_mq_unfreeze_queue(disk->queue);
-
-	kfree(id);
-	return 0;
-}
-
-static char nvme_pr_type(enum pr_type type)
-{
-	switch (type) {
-	case PR_WRITE_EXCLUSIVE:
-		return 1;
-	case PR_EXCLUSIVE_ACCESS:
-		return 2;
-	case PR_WRITE_EXCLUSIVE_REG_ONLY:
-		return 3;
-	case PR_EXCLUSIVE_ACCESS_REG_ONLY:
-		return 4;
-	case PR_WRITE_EXCLUSIVE_ALL_REGS:
-		return 5;
-	case PR_EXCLUSIVE_ACCESS_ALL_REGS:
-		return 6;
-	default:
-		return 0;
-	}
-};
-
-static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
-				u64 key, u64 sa_key, u8 op)
-{
-	struct nvme_ns *ns = bdev->bd_disk->private_data;
-	struct nvme_command c;
-	u8 data[16] = { 0, };
-
-	put_unaligned_le64(key, &data[0]);
-	put_unaligned_le64(sa_key, &data[8]);
-
-	memset(&c, 0, sizeof(c));
-	c.common.opcode = op;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
-	c.common.cdw10[0] = cpu_to_le32(cdw10);
-
-	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
-}
-
-static int nvme_pr_register(struct block_device *bdev, u64 old,
-		u64 new, unsigned flags)
-{
-	u32 cdw10;
-
-	if (flags & ~PR_FL_IGNORE_KEY)
-		return -EOPNOTSUPP;
-
-	cdw10 = old ? 2 : 0;
-	cdw10 |= (flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0;
-	cdw10 |= (1 << 30) | (1 << 31); /* PTPL=1 */
-	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_register);
-}
-
-static int nvme_pr_reserve(struct block_device *bdev, u64 key,
-		enum pr_type type, unsigned flags)
-{
-	u32 cdw10;
-
-	if (flags & ~PR_FL_IGNORE_KEY)
-		return -EOPNOTSUPP;
-
-	cdw10 = nvme_pr_type(type) << 8;
-	cdw10 |= ((flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0);
-	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_acquire);
-}
-
-static int nvme_pr_preempt(struct block_device *bdev, u64 old, u64 new,
-		enum pr_type type, bool abort)
-{
-	u32 cdw10 = nvme_pr_type(type) << 8 | abort ? 2 : 1;
-	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_acquire);
-}
-
-static int nvme_pr_clear(struct block_device *bdev, u64 key)
-{
-	u32 cdw10 = 1 | key ? 1 << 3 : 0;
-	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_register);
-}
-
-static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type)
-{
-	u32 cdw10 = nvme_pr_type(type) << 8 | key ? 1 << 3 : 0;
-	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
-}
-
-static const struct pr_ops nvme_pr_ops = {
-	.pr_register	= nvme_pr_register,
-	.pr_reserve	= nvme_pr_reserve,
-	.pr_release	= nvme_pr_release,
-	.pr_preempt	= nvme_pr_preempt,
-	.pr_clear	= nvme_pr_clear,
-};
-
-static const struct block_device_operations nvme_fops = {
-	.owner		= THIS_MODULE,
-	.ioctl		= nvme_ioctl,
-	.compat_ioctl	= nvme_compat_ioctl,
-	.open		= nvme_open,
-	.release	= nvme_release,
-	.getgeo		= nvme_getgeo,
-	.revalidate_disk= nvme_revalidate_disk,
-	.pr_ops		= &nvme_pr_ops,
-};
-
-/*
- * Initialize the cached copies of the Identify data and various controller
- * register in our nvme_ctrl structure.  This should be called as soon as
- * the admin queue is fully up and running.
- */
-int nvme_init_identify(struct nvme_ctrl *ctrl)
-{
-	struct nvme_id_ctrl *id;
-	u64 cap;
-	int ret, page_shift;
-
-	ret = ctrl->ops->reg_read32(ctrl, NVME_REG_VS, &ctrl->vs);
-	if (ret) {
-		dev_err(ctrl->dev, "Reading VS failed (%d)\n", ret);
-		return ret;
-	}
-
-	ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap);
-	if (ret) {
-		dev_err(ctrl->dev, "Reading CAP failed (%d)\n", ret);
-		return ret;
-	}
-	page_shift = NVME_CAP_MPSMIN(cap) + 12;
-	ctrl->page_size = 1 << page_shift;
-
-	if (ctrl->vs >= NVME_VS(1, 1))
-		ctrl->subsystem = NVME_CAP_NSSRC(cap);
-
-	ret = nvme_identify_ctrl(ctrl, &id);
-	if (ret) {
-		dev_err(ctrl->dev, "Identify Controller failed (%d)\n", ret);
-		return -EIO;
-	}
-
-	ctrl->oncs = le16_to_cpup(&id->oncs);
-	atomic_set(&ctrl->abort_limit, id->acl + 1);
-	ctrl->vwc = id->vwc;
-	memcpy(ctrl->serial, id->sn, sizeof(id->sn));
-	memcpy(ctrl->model, id->mn, sizeof(id->mn));
-	memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
-	if (id->mdts)
-		ctrl->max_hw_sectors = 1 << (id->mdts + page_shift - 9);
-	else
-		ctrl->max_hw_sectors = UINT_MAX;
-
-	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && id->vs[3]) {
-		unsigned int max_hw_sectors;
-
-		ctrl->stripe_size = 1 << (id->vs[3] + page_shift);
-		max_hw_sectors = ctrl->stripe_size >> (page_shift - 9);
-		if (ctrl->max_hw_sectors) {
-			ctrl->max_hw_sectors = min(max_hw_sectors,
-							ctrl->max_hw_sectors);
-		} else {
-			ctrl->max_hw_sectors = max_hw_sectors;
-		}
-	}
-
-	kfree(id);
-	return 0;
-}
-
-static int nvme_dev_open(struct inode *inode, struct file *file)
-{
-	struct nvme_ctrl *ctrl;
-	int instance = iminor(inode);
-	int ret = -ENODEV;
-
-	spin_lock(&dev_list_lock);
-	list_for_each_entry(ctrl, &nvme_ctrl_list, node) {
-		if (ctrl->instance != instance)
-			continue;
-
-		if (!ctrl->admin_q) {
-			ret = -EWOULDBLOCK;
-			break;
-		}
-		if (!kref_get_unless_zero(&ctrl->kref))
-			break;
-		file->private_data = ctrl;
-		ret = 0;
-		break;
-	}
-	spin_unlock(&dev_list_lock);
-
-	return ret;
-}
-
-static int nvme_dev_release(struct inode *inode, struct file *file)
-{
-	nvme_put_ctrl(file->private_data);
-	return 0;
-}
-
-static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
-		unsigned long arg)
-{
-	struct nvme_ctrl *ctrl = file->private_data;
-	void __user *argp = (void __user *)arg;
-	struct nvme_ns *ns;
-
-	switch (cmd) {
-	case NVME_IOCTL_ADMIN_CMD:
-		return nvme_user_cmd(ctrl, NULL, argp);
-	case NVME_IOCTL_IO_CMD:
-		if (list_empty(&ctrl->namespaces))
-			return -ENOTTY;
-		ns = list_first_entry(&ctrl->namespaces, struct nvme_ns, list);
-		return nvme_user_cmd(ctrl, ns, argp);
-	case NVME_IOCTL_RESET:
-		dev_warn(ctrl->dev, "resetting controller\n");
-		return ctrl->ops->reset_ctrl(ctrl);
-	case NVME_IOCTL_SUBSYS_RESET:
-		return nvme_reset_subsystem(ctrl);
-	default:
-		return -ENOTTY;
-	}
-}
-
-static const struct file_operations nvme_dev_fops = {
-	.owner		= THIS_MODULE,
-	.open		= nvme_dev_open,
-	.release	= nvme_dev_release,
-	.unlocked_ioctl	= nvme_dev_ioctl,
-	.compat_ioctl	= nvme_dev_ioctl,
-};
-
-static ssize_t nvme_sysfs_reset(struct device *dev,
-				struct device_attribute *attr, const char *buf,
-				size_t count)
-{
-	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
-	int ret;
-
-	ret = ctrl->ops->reset_ctrl(ctrl);
-	if (ret < 0)
-		return ret;
-	return count;
-}
-static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
-
-static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
-{
-	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
-	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
-
-	return nsa->ns_id - nsb->ns_id;
-}
-
-static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
-{
-	struct nvme_ns *ns;
-
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->ns_id == nsid)
-			return ns;
-		if (ns->ns_id > nsid)
-			break;
-	}
-	return NULL;
-}
-
-static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
-{
-	struct nvme_ns *ns;
-	struct gendisk *disk;
-	int node = dev_to_node(ctrl->dev);
-
-	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
-	if (!ns)
-		return;
-
-	ns->queue = blk_mq_init_queue(ctrl->tagset);
-	if (IS_ERR(ns->queue))
-		goto out_free_ns;
-	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
-	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
-	ns->queue->queuedata = ns;
-	ns->ctrl = ctrl;
-
-	disk = alloc_disk_node(0, node);
-	if (!disk)
-		goto out_free_queue;
-
-	kref_init(&ns->kref);
-	ns->ns_id = nsid;
-	ns->disk = disk;
-	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
-
-	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
-	if (ctrl->max_hw_sectors) {
-		blk_queue_max_hw_sectors(ns->queue, ctrl->max_hw_sectors);
-		blk_queue_max_segments(ns->queue,
-			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
-	}
-	if (ctrl->stripe_size)
-		blk_queue_chunk_sectors(ns->queue, ctrl->stripe_size >> 9);
-	if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
-		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
-	blk_queue_virt_boundary(ns->queue, ctrl->page_size - 1);
-
-	disk->major = nvme_major;
-	disk->first_minor = 0;
-	disk->fops = &nvme_fops;
-	disk->private_data = ns;
-	disk->queue = ns->queue;
-	disk->driverfs_dev = ctrl->device;
-	disk->flags = GENHD_FL_EXT_DEVT;
-	sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance, nsid);
-
-	if (nvme_revalidate_disk(ns->disk))
-		goto out_free_disk;
-
-	list_add_tail(&ns->list, &ctrl->namespaces);
-	kref_get(&ctrl->kref);
-	if (ns->type != NVME_NS_LIGHTNVM)
-		add_disk(ns->disk);
-
-	return;
- out_free_disk:
-	kfree(disk);
- out_free_queue:
-	blk_cleanup_queue(ns->queue);
- out_free_ns:
-	kfree(ns);
-}
-
-static void nvme_ns_remove(struct nvme_ns *ns)
-{
-	bool kill = nvme_io_incapable(ns->ctrl) &&
-			!blk_queue_dying(ns->queue);
-
-	if (kill)
-		blk_set_queue_dying(ns->queue);
-	if (ns->disk->flags & GENHD_FL_UP) {
-		if (blk_get_integrity(ns->disk))
-			blk_integrity_unregister(ns->disk);
-		del_gendisk(ns->disk);
-	}
-	if (kill || !blk_queue_dying(ns->queue)) {
-		blk_mq_abort_requeue_list(ns->queue);
-		blk_cleanup_queue(ns->queue);
-	}
-	list_del_init(&ns->list);
-	nvme_put_ns(ns);
-}
-
-static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
-{
-	struct nvme_ns *ns;
-
-	ns = nvme_find_ns(ctrl, nsid);
-	if (ns) {
-		if (revalidate_disk(ns->disk))
-			nvme_ns_remove(ns);
-	} else
-		nvme_alloc_ns(ctrl, nsid);
-}
-
-static int nvme_scan_ns_list(struct nvme_ctrl *ctrl, unsigned nn)
-{
-	struct nvme_ns *ns;
-	__le32 *ns_list;
-	unsigned i, j, nsid, prev = 0, num_lists = DIV_ROUND_UP(nn, 1024);
-	int ret = 0;
-
-	ns_list = kzalloc(0x1000, GFP_KERNEL);
-	if (!ns_list)
-		return -ENOMEM;
-
-	for (i = 0; i < num_lists; i++) {
-		ret = nvme_identify_ns_list(ctrl, prev, ns_list);
-		if (ret)
-			goto out;
-
-		for (j = 0; j < min(nn, 1024U); j++) {
-			nsid = le32_to_cpu(ns_list[j]);
-			if (!nsid)
-				goto out;
-
-			nvme_validate_ns(ctrl, nsid);
-
-			while (++prev < nsid) {
-				ns = nvme_find_ns(ctrl, prev);
-				if (ns)
-					nvme_ns_remove(ns);
-			}
-		}
-		nn -= j;
-	}
- out:
-	kfree(ns_list);
-	return ret;
-}
-
-static void __nvme_scan_namespaces(struct nvme_ctrl *ctrl, unsigned nn)
-{
-	struct nvme_ns *ns, *next;
-	unsigned i;
-
-	for (i = 1; i <= nn; i++)
-		nvme_validate_ns(ctrl, i);
-
-	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->ns_id > nn)
-			nvme_ns_remove(ns);
-	}
-}
-
-void nvme_scan_namespaces(struct nvme_ctrl *ctrl)
-{
-	struct nvme_id_ctrl *id;
-	unsigned nn;
-
-	if (nvme_identify_ctrl(ctrl, &id))
-		return;
-
-	nn = le32_to_cpu(id->nn);
-	if (ctrl->vs >= NVME_VS(1, 1)) {
-		if (!nvme_scan_ns_list(ctrl, nn))
-			goto done;
-	}
-	__nvme_scan_namespaces(ctrl, le32_to_cpup(&id->nn));
- done:
-	list_sort(NULL, &ctrl->namespaces, ns_cmp);
-	kfree(id);
-}
-
-void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
-{
-	struct nvme_ns *ns, *next;
-
-	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list)
-		nvme_ns_remove(ns);
-}
-
-static DEFINE_IDA(nvme_instance_ida);
-
-static int nvme_set_instance(struct nvme_ctrl *ctrl)
-{
-	int instance, error;
-
-	do {
-		if (!ida_pre_get(&nvme_instance_ida, GFP_KERNEL))
-			return -ENODEV;
-
-		spin_lock(&dev_list_lock);
-		error = ida_get_new(&nvme_instance_ida, &instance);
-		spin_unlock(&dev_list_lock);
-	} while (error == -EAGAIN);
-
-	if (error)
-		return -ENODEV;
-
-	ctrl->instance = instance;
-	return 0;
-}
-
-static void nvme_release_instance(struct nvme_ctrl *ctrl)
-{
-	spin_lock(&dev_list_lock);
-	ida_remove(&nvme_instance_ida, ctrl->instance);
-	spin_unlock(&dev_list_lock);
-}
-
-void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
- {
-	device_remove_file(ctrl->device, &dev_attr_reset_controller);
-	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
-
-	spin_lock(&dev_list_lock);
-	list_del(&ctrl->node);
-	spin_unlock(&dev_list_lock);
-}
-
-static void nvme_free_ctrl(struct kref *kref)
-{
-	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
-
-	put_device(ctrl->device);
-	nvme_release_instance(ctrl);
-
-	ctrl->ops->free_ctrl(ctrl);
-}
-
-void nvme_put_ctrl(struct nvme_ctrl *ctrl)
-{
-	kref_put(&ctrl->kref, nvme_free_ctrl);
-}
-
-/*
- * Initialize an NVMe controller structure.  This needs to be called during
- * the earliest initialization so that we have the initialized structure
- * around during probing.
- */
-int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
-		const struct nvme_ctrl_ops *ops, u16 vendor,
-		unsigned long quirks)
-{
-	int ret;
-
-	INIT_LIST_HEAD(&ctrl->namespaces);
-	kref_init(&ctrl->kref);
-	ctrl->dev = dev;
-	ctrl->ops = ops;
-	ctrl->vendor = vendor;
-	ctrl->quirks = quirks;
-
-	ret = nvme_set_instance(ctrl);
-	if (ret)
-		goto out;
-
-	ctrl->device = device_create(nvme_class, ctrl->dev,
-				MKDEV(nvme_char_major, ctrl->instance),
-				dev, "nvme%d", ctrl->instance);
-	if (IS_ERR(ctrl->device)) {
-		ret = PTR_ERR(ctrl->device);
-		goto out_release_instance;
-	}
-	get_device(ctrl->device);
-	dev_set_drvdata(ctrl->device, ctrl);
-
-	ret = device_create_file(ctrl->device, &dev_attr_reset_controller);
-	if (ret)
-		goto out_put_device;
-
-	spin_lock(&dev_list_lock);
-	list_add_tail(&ctrl->node, &nvme_ctrl_list);
-	spin_unlock(&dev_list_lock);
-
-	return 0;
-
-out_put_device:
-	put_device(ctrl->device);
-	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
-out_release_instance:
-	nvme_release_instance(ctrl);
-out:
-	return ret;
-}
-
-int __init nvme_core_init(void)
-{
-	int result;
-
-	result = register_blkdev(nvme_major, "nvme");
-	if (result < 0)
-		return result;
-	else if (result > 0)
-		nvme_major = result;
-
-	result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
-							&nvme_dev_fops);
-	if (result < 0)
-		goto unregister_blkdev;
-	else if (result > 0)
-		nvme_char_major = result;
-
-	nvme_class = class_create(THIS_MODULE, "nvme");
-	if (IS_ERR(nvme_class)) {
-		result = PTR_ERR(nvme_class);
-		goto unregister_chrdev;
-	}
-
-	return 0;
-
- unregister_chrdev:
-	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
- unregister_blkdev:
-	unregister_blkdev(nvme_major, "nvme");
-	return result;
-}
-
-void nvme_core_exit(void)
-{
-	unregister_blkdev(nvme_major, "nvme");
-	class_destroy(nvme_class);
-	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
-}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2cead2c..a53977c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -22,6 +22,9 @@
 extern unsigned char nvme_io_timeout;
 #define NVME_IO_TIMEOUT	(nvme_io_timeout * HZ)
 
+extern unsigned char admin_timeout;
+#define ADMIN_TIMEOUT	(admin_timeout * HZ)
+
 enum {
 	NVME_NS_LBA		= 0,
 	NVME_NS_LIGHTNVM	= 1,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3c2fba7..094c355 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -52,10 +52,9 @@
 #define NVME_AQ_DEPTH		256
 #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
 #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
-#define ADMIN_TIMEOUT		(admin_timeout * HZ)
 #define SHUTDOWN_TIMEOUT	(shutdown_timeout * HZ)
 
-static unsigned char admin_timeout = 60;
+unsigned char admin_timeout = 60;
 module_param(admin_timeout, byte, 0644);
 MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
 
@@ -1044,65 +1043,6 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 	return 0;
 }
 
-/*
- * Returns 0 on success.  If the result is negative, it's a Linux error code;
- * if the result is positive, it's an NVM Express status code
- */
-int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void *buffer, void __user *ubuffer, unsigned bufflen,
-		u32 *result, unsigned timeout)
-{
-	bool write = cmd->common.opcode & 1;
-	struct bio *bio = NULL;
-	struct request *req;
-	int ret;
-
-	req = blk_mq_alloc_request(q, write, 0);
-	if (IS_ERR(req))
-		return PTR_ERR(req);
-
-	req->cmd_type = REQ_TYPE_DRV_PRIV;
-	req->cmd_flags |= REQ_FAILFAST_DRIVER;
-	req->__data_len = 0;
-	req->__sector = (sector_t) -1;
-	req->bio = req->biotail = NULL;
-
-	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
-
-	req->cmd = (unsigned char *)cmd;
-	req->cmd_len = sizeof(struct nvme_command);
-	req->special = (void *)0;
-
-	if (buffer && bufflen) {
-		ret = blk_rq_map_kern(q, req, buffer, bufflen,
-				      __GFP_DIRECT_RECLAIM);
-		if (ret)
-			goto out;
-	} else if (ubuffer && bufflen) {
-		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
-				      __GFP_DIRECT_RECLAIM);
-		if (ret)
-			goto out;
-		bio = req->bio;
-	}
-
-	blk_execute_rq(req->q, NULL, req, 0);
-	if (bio)
-		blk_rq_unmap_user(bio);
-	if (result)
-		*result = (u32)(uintptr_t)req->special;
-	ret = req->errors;
- out:
-	blk_mq_free_request(req);
-	return ret;
-}
-
-int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void *buffer, unsigned bufflen)
-{
-	return __nvme_submit_sync_cmd(q, cmd, buffer, NULL, bufflen, NULL, 0);
-}
-
 static int nvme_submit_async_admin_req(struct nvme_dev *dev)
 {
 	struct nvme_queue *nvmeq = dev->queues[0];
@@ -1215,99 +1155,6 @@ static int adapter_delete_sq(struct nvme_dev *dev, u16 sqid)
 	return adapter_delete_queue(dev, nvme_admin_delete_sq, sqid);
 }
 
-int nvme_identify_ctrl(struct nvme_dev *dev, struct nvme_id_ctrl **id)
-{
-	struct nvme_command c = { };
-	int error;
-
-	/* gcc-4.4.4 (at least) has issues with initializers and anon unions */
-	c.identify.opcode = nvme_admin_identify;
-	c.identify.cns = cpu_to_le32(1);
-
-	*id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
-	if (!*id)
-		return -ENOMEM;
-
-	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
-			sizeof(struct nvme_id_ctrl));
-	if (error)
-		kfree(*id);
-	return error;
-}
-
-int nvme_identify_ns(struct nvme_dev *dev, unsigned nsid,
-		struct nvme_id_ns **id)
-{
-	struct nvme_command c = { };
-	int error;
-
-	/* gcc-4.4.4 (at least) has issues with initializers and anon unions */
-	c.identify.opcode = nvme_admin_identify,
-	c.identify.nsid = cpu_to_le32(nsid),
-
-	*id = kmalloc(sizeof(struct nvme_id_ns), GFP_KERNEL);
-	if (!*id)
-		return -ENOMEM;
-
-	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
-			sizeof(struct nvme_id_ns));
-	if (error)
-		kfree(*id);
-	return error;
-}
-
-int nvme_get_features(struct nvme_dev *dev, unsigned fid, unsigned nsid,
-					dma_addr_t dma_addr, u32 *result)
-{
-	struct nvme_command c;
-
-	memset(&c, 0, sizeof(c));
-	c.features.opcode = nvme_admin_get_features;
-	c.features.nsid = cpu_to_le32(nsid);
-	c.features.prp1 = cpu_to_le64(dma_addr);
-	c.features.fid = cpu_to_le32(fid);
-
-	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, NULL, 0,
-			result, 0);
-}
-
-int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
-					dma_addr_t dma_addr, u32 *result)
-{
-	struct nvme_command c;
-
-	memset(&c, 0, sizeof(c));
-	c.features.opcode = nvme_admin_set_features;
-	c.features.prp1 = cpu_to_le64(dma_addr);
-	c.features.fid = cpu_to_le32(fid);
-	c.features.dword11 = cpu_to_le32(dword11);
-
-	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, NULL, 0,
-			result, 0);
-}
-
-int nvme_get_log_page(struct nvme_dev *dev, struct nvme_smart_log **log)
-{
-	struct nvme_command c = { };
-	int error;
-
-	c.common.opcode = nvme_admin_get_log_page,
-	c.common.nsid = cpu_to_le32(0xFFFFFFFF),
-	c.common.cdw10[0] = cpu_to_le32(
-			(((sizeof(struct nvme_smart_log) / 4) - 1) << 16) |
-			 NVME_LOG_SMART),
-
-	*log = kmalloc(sizeof(struct nvme_smart_log), GFP_KERNEL);
-	if (!*log)
-		return -ENOMEM;
-
-	error = nvme_submit_sync_cmd(dev->admin_q, &c, *log,
-			sizeof(struct nvme_smart_log));
-	if (error)
-		kfree(*log);
-	return error;
-}
-
 /**
  * nvme_abort_req - Attempt aborting a request
  *
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 09/47] nvme: add a vendor field to struct nvme_dev
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (7 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 08/47] nvme: split command submission helpers out of pci.c Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 10/47] nvme: use offset instead of a struct for registers Christoph Hellwig
                   ` (25 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


The SCSI translation layer currently has to poke into the PCI device
structure to find a vendor ID for the device identification fallback.
We won't necessarily have a PCI device behind the device structure in
the future, so add a new vendor field that can be filled out by the
PCIe driver instead.
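
A transport without a PCI parent could then fill the field from its own
identity information instead, for example from the VID field of the
Identify Controller data (a hypothetical sketch, not part of this
patch):

	/* id is a struct nvme_id_ctrl returned by Identify Controller */
	dev->vendor = le16_to_cpu(id->vid);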

Signed-off-by: Christoph Hellwig <hch at lst.de>
Acked-by: Keith Busch <keith.busch at intel.com>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/nvme.h | 1 +
 drivers/nvme/host/pci.c  | 3 +++
 drivers/nvme/host/scsi.c | 2 +-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a53977c..02fca73 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -74,6 +74,7 @@ struct nvme_dev {
 	u16 abort_limit;
 	u8 event_limit;
 	u8 vwc;
+	u16 vendor;
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 094c355..79bca52 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3150,6 +3150,9 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	INIT_WORK(&dev->reset_work, nvme_reset_work);
 	dev->dev = get_device(&pdev->dev);
 	pci_set_drvdata(pdev, dev);
+
+	dev->vendor = pdev->vendor;
+
 	result = nvme_set_instance(dev);
 	if (result)
 		goto put_pci;
diff --git a/drivers/nvme/host/scsi.c b/drivers/nvme/host/scsi.c
index c3d8d38..8f2d2c5 100644
--- a/drivers/nvme/host/scsi.c
+++ b/drivers/nvme/host/scsi.c
@@ -657,7 +657,7 @@ static int nvme_trans_device_id_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 		inq_response[6] = 0x00;    /* Rsvd */
 		inq_response[7] = 0x44;    /* Designator Length */
 
-		sprintf(&inq_response[8], "%04x", to_pci_dev(dev->dev)->vendor);
+		sprintf(&inq_response[8], "%04x", dev->vendor);
 		memcpy(&inq_response[12], dev->model, sizeof(dev->model));
 		sprintf(&inq_response[52], "%04x", tmp_id);
 		memcpy(&inq_response[56], dev->serial, sizeof(dev->serial));
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 10/47] nvme: use offset instead of a struct for registers
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (8 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 09/47] nvme: add a vendor field to struct nvme_dev Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 11/47] nvme: split a new struct nvme_ctrl out of struct nvme_dev Christoph Hellwig
                   ` (24 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


This makes life easier for future non-PCI drivers where access to the
registers might be more complicated.  Note that Linux drivers are
pretty evenly split between the two styles of register access (struct
overlay versus plain offsets), and in fact the NVMe
driver already uses offsets for the doorbells.
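
The conversion itself is mechanical, as the hunks below show; only the
way the register address is formed changes:

	/* before: requires the struct nvme_bar layout */
	csts = readl(&dev->bar->csts);

	/* after: bar is a plain void __iomem *, indexed by spec offset */
	csts = readl(dev->bar + NVME_REG_CSTS);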

Signed-off-by: Christoph Hellwig <hch at lst.de>
Acked-by: Keith Busch <keith.busch at intel.com>
[Fixed CMBSZ offset]
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 60 ++++++++++++++++++++++++++----------------------
 drivers/nvme/host/scsi.c |  6 ++---
 include/linux/nvme.h     | 27 +++++++++++-----------
 4 files changed, 49 insertions(+), 46 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 02fca73..93e268c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -51,7 +51,7 @@ struct nvme_dev {
 	u32 db_stride;
 	u32 ctrl_config;
 	struct msix_entry *entry;
-	struct nvme_bar __iomem *bar;
+	void __iomem *bar;
 	struct list_head namespaces;
 	struct kref kref;
 	struct device *device;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 79bca52..e9de117 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1321,7 +1321,7 @@ static void nvme_disable_queue(struct nvme_dev *dev, int qid)
 
 	/* Don't tell the adapter to delete the admin queue.
 	 * Don't tell a removed adapter to delete IO queues. */
-	if (qid && readl(&dev->bar->csts) != -1) {
+	if (qid && readl(dev->bar + NVME_REG_CSTS) != -1) {
 		adapter_delete_sq(dev, qid);
 		adapter_delete_cq(dev, qid);
 	}
@@ -1474,7 +1474,7 @@ static int nvme_wait_ready(struct nvme_dev *dev, u64 cap, bool enabled)
 
 	timeout = ((NVME_CAP_TIMEOUT(cap) + 1) * HZ / 2) + jiffies;
 
-	while ((readl(&dev->bar->csts) & NVME_CSTS_RDY) != bit) {
+	while ((readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_RDY) != bit) {
 		msleep(100);
 		if (fatal_signal_pending(current))
 			return -EINTR;
@@ -1499,7 +1499,7 @@ static int nvme_disable_ctrl(struct nvme_dev *dev, u64 cap)
 {
 	dev->ctrl_config &= ~NVME_CC_SHN_MASK;
 	dev->ctrl_config &= ~NVME_CC_ENABLE;
-	writel(dev->ctrl_config, &dev->bar->cc);
+	writel(dev->ctrl_config, dev->bar + NVME_REG_CC);
 
 	return nvme_wait_ready(dev, cap, false);
 }
@@ -1508,7 +1508,7 @@ static int nvme_enable_ctrl(struct nvme_dev *dev, u64 cap)
 {
 	dev->ctrl_config &= ~NVME_CC_SHN_MASK;
 	dev->ctrl_config |= NVME_CC_ENABLE;
-	writel(dev->ctrl_config, &dev->bar->cc);
+	writel(dev->ctrl_config, dev->bar + NVME_REG_CC);
 
 	return nvme_wait_ready(dev, cap, true);
 }
@@ -1520,10 +1520,10 @@ static int nvme_shutdown_ctrl(struct nvme_dev *dev)
 	dev->ctrl_config &= ~NVME_CC_SHN_MASK;
 	dev->ctrl_config |= NVME_CC_SHN_NORMAL;
 
-	writel(dev->ctrl_config, &dev->bar->cc);
+	writel(dev->ctrl_config, dev->bar + NVME_REG_CC);
 
 	timeout = SHUTDOWN_TIMEOUT + jiffies;
-	while ((readl(&dev->bar->csts) & NVME_CSTS_SHST_MASK) !=
+	while ((readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_SHST_MASK) !=
 							NVME_CSTS_SHST_CMPLT) {
 		msleep(100);
 		if (fatal_signal_pending(current))
@@ -1599,7 +1599,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 {
 	int result;
 	u32 aqa;
-	u64 cap = lo_hi_readq(&dev->bar->cap);
+	u64 cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
 	struct nvme_queue *nvmeq;
 	unsigned page_shift = PAGE_SHIFT;
 	unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
@@ -1620,11 +1620,12 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 		page_shift = dev_page_max;
 	}
 
-	dev->subsystem = readl(&dev->bar->vs) >= NVME_VS(1, 1) ?
+	dev->subsystem = readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 1) ?
 						NVME_CAP_NSSRC(cap) : 0;
 
-	if (dev->subsystem && (readl(&dev->bar->csts) & NVME_CSTS_NSSRO))
-		writel(NVME_CSTS_NSSRO, &dev->bar->csts);
+	if (dev->subsystem &&
+	    (readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_NSSRO))
+		writel(NVME_CSTS_NSSRO, dev->bar + NVME_REG_CSTS);
 
 	result = nvme_disable_ctrl(dev, cap);
 	if (result < 0)
@@ -1647,9 +1648,9 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	dev->ctrl_config |= NVME_CC_ARB_RR | NVME_CC_SHN_NONE;
 	dev->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
 
-	writel(aqa, &dev->bar->aqa);
-	lo_hi_writeq(nvmeq->sq_dma_addr, &dev->bar->asq);
-	lo_hi_writeq(nvmeq->cq_dma_addr, &dev->bar->acq);
+	writel(aqa, dev->bar + NVME_REG_AQA);
+	lo_hi_writeq(nvmeq->sq_dma_addr, dev->bar + NVME_REG_ASQ);
+	lo_hi_writeq(nvmeq->cq_dma_addr, dev->bar + NVME_REG_ACQ);
 
 	result = nvme_enable_ctrl(dev, cap);
 	if (result)
@@ -1791,7 +1792,7 @@ static int nvme_subsys_reset(struct nvme_dev *dev)
 	if (!dev->subsystem)
 		return -ENOTTY;
 
-	writel(0x4E564D65, &dev->bar->nssr); /* "NVMe" */
+	writel(0x4E564D65, dev->bar + NVME_REG_NSSR); /* "NVMe" */
 	return 0;
 }
 
@@ -2078,14 +2079,14 @@ static int nvme_kthread(void *data)
 		spin_lock(&dev_list_lock);
 		list_for_each_entry_safe(dev, next, &dev_list, node) {
 			int i;
-			u32 csts = readl(&dev->bar->csts);
+			u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
 			if ((dev->subsystem && (csts & NVME_CSTS_NSSRO)) ||
 							csts & NVME_CSTS_CFS) {
 				if (!__nvme_reset(dev)) {
 					dev_warn(dev->dev,
 						"Failed status: %x, reset controller\n",
-						readl(&dev->bar->csts));
+						readl(dev->bar + NVME_REG_CSTS));
 				}
 				continue;
 			}
@@ -2245,11 +2246,11 @@ static void __iomem *nvme_map_cmb(struct nvme_dev *dev)
 	if (!use_cmb_sqes)
 		return NULL;
 
-	dev->cmbsz = readl(&dev->bar->cmbsz);
+	dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
 	if (!(NVME_CMB_SZ(dev->cmbsz)))
 		return NULL;
 
-	cmbloc = readl(&dev->bar->cmbloc);
+	cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
 
 	szu = (u64)1 << (12 + 4 * NVME_CMB_SZU(dev->cmbsz));
 	size = szu * NVME_CMB_SZ(dev->cmbsz);
@@ -2323,7 +2324,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 				return -ENOMEM;
 			size = db_bar_size(dev, nr_io_queues);
 		} while (1);
-		dev->dbs = ((void __iomem *)dev->bar) + 4096;
+		dev->dbs = dev->bar + 4096;
 		adminq->q_db = dev->dbs;
 	}
 
@@ -2399,8 +2400,9 @@ static struct nvme_ns *nvme_find_ns(struct nvme_dev *dev, unsigned nsid)
 
 static inline bool nvme_io_incapable(struct nvme_dev *dev)
 {
-	return (!dev->bar || readl(&dev->bar->csts) & NVME_CSTS_CFS ||
-							dev->online_queues < 2);
+	return (!dev->bar ||
+		readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_CFS ||
+		dev->online_queues < 2);
 }
 
 static void nvme_ns_remove(struct nvme_ns *ns)
@@ -2480,7 +2482,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int res;
 	struct nvme_id_ctrl *ctrl;
-	int shift = NVME_CAP_MPSMIN(lo_hi_readq(&dev->bar->cap)) + 12;
+	int shift = NVME_CAP_MPSMIN(lo_hi_readq(dev->bar + NVME_REG_CAP)) + 12;
 
 	res = nvme_identify_ctrl(dev, &ctrl);
 	if (res) {
@@ -2556,7 +2558,7 @@ static int nvme_dev_map(struct nvme_dev *dev)
 	if (!dev->bar)
 		goto disable;
 
-	if (readl(&dev->bar->csts) == -1) {
+	if (readl(dev->bar + NVME_REG_CSTS) == -1) {
 		result = -ENODEV;
 		goto unmap;
 	}
@@ -2571,11 +2573,12 @@ static int nvme_dev_map(struct nvme_dev *dev)
 			goto unmap;
 	}
 
-	cap = lo_hi_readq(&dev->bar->cap);
+	cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
+
 	dev->q_depth = min_t(int, NVME_CAP_MQES(cap) + 1, NVME_Q_DEPTH);
 	dev->db_stride = 1 << NVME_CAP_STRIDE(cap);
-	dev->dbs = ((void __iomem *)dev->bar) + 4096;
-	if (readl(&dev->bar->vs) >= NVME_VS(1, 2))
+	dev->dbs = dev->bar + 4096;
+	if (readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 2))
 		dev->cmb = nvme_map_cmb(dev);
 
 	return 0;
@@ -2634,7 +2637,8 @@ static void nvme_wait_dq(struct nvme_delq_ctx *dq, struct nvme_dev *dev)
 			 * queues than admin tags.
 			 */
 			set_current_state(TASK_RUNNING);
-			nvme_disable_ctrl(dev, lo_hi_readq(&dev->bar->cap));
+			nvme_disable_ctrl(dev,
+				lo_hi_readq(dev->bar + NVME_REG_CAP));
 			nvme_clear_queue(dev->queues[0]);
 			flush_kthread_worker(dq->worker);
 			nvme_disable_queue(dev, 0);
@@ -2806,7 +2810,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
 
 	if (dev->bar) {
 		nvme_freeze_queues(dev);
-		csts = readl(&dev->bar->csts);
+		csts = readl(dev->bar + NVME_REG_CSTS);
 	}
 	if (csts & NVME_CSTS_CFS || !(csts & NVME_CSTS_RDY)) {
 		for (i = dev->queue_count - 1; i >= 0; i--) {
diff --git a/drivers/nvme/host/scsi.c b/drivers/nvme/host/scsi.c
index 8f2d2c5..a5f6af1 100644
--- a/drivers/nvme/host/scsi.c
+++ b/drivers/nvme/host/scsi.c
@@ -611,7 +611,7 @@ static int nvme_trans_device_id_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 
 	memset(inq_response, 0, alloc_len);
 	inq_response[1] = INQ_DEVICE_IDENTIFICATION_PAGE;    /* Page Code */
-	if (readl(&dev->bar->vs) >= NVME_VS(1, 1)) {
+	if (readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 1)) {
 		struct nvme_id_ns *id_ns;
 		void *eui;
 		int len;
@@ -623,7 +623,7 @@ static int nvme_trans_device_id_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 
 		eui = id_ns->eui64;
 		len = sizeof(id_ns->eui64);
-		if (readl(&dev->bar->vs) >= NVME_VS(1, 2)) {
+		if (readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 2)) {
 			if (bitmap_empty(eui, len * 8)) {
 				eui = id_ns->nguid;
 				len = sizeof(id_ns->nguid);
@@ -2297,7 +2297,7 @@ static int nvme_trans_test_unit_ready(struct nvme_ns *ns,
 {
 	struct nvme_dev *dev = ns->dev;
 
-	if (!(readl(&dev->bar->csts) & NVME_CSTS_RDY))
+	if (!(readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_RDY))
 		return nvme_trans_completion(hdr, SAM_STAT_CHECK_CONDITION,
 					    NOT_READY, SCSI_ASC_LUN_NOT_READY,
 					    SCSI_ASCQ_CAUSE_NOT_REPORTABLE);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 3af5f45..a55986f 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -17,20 +17,19 @@
 
 #include <linux/types.h>
 
-struct nvme_bar {
-	__u64			cap;	/* Controller Capabilities */
-	__u32			vs;	/* Version */
-	__u32			intms;	/* Interrupt Mask Set */
-	__u32			intmc;	/* Interrupt Mask Clear */
-	__u32			cc;	/* Controller Configuration */
-	__u32			rsvd1;	/* Reserved */
-	__u32			csts;	/* Controller Status */
-	__u32			nssr;	/* Subsystem Reset */
-	__u32			aqa;	/* Admin Queue Attributes */
-	__u64			asq;	/* Admin SQ Base Address */
-	__u64			acq;	/* Admin CQ Base Address */
-	__u32			cmbloc; /* Controller Memory Buffer Location */
-	__u32			cmbsz;  /* Controller Memory Buffer Size */
+enum {
+	NVME_REG_CAP	= 0x0000,	/* Controller Capabilities */
+	NVME_REG_VS	= 0x0008,	/* Version */
+	NVME_REG_INTMS	= 0x000c,	/* Interrupt Mask Set */
+	NVME_REG_INTMC	= 0x0010,	/* Interrupt Mask Clear */
+	NVME_REG_CC	= 0x0014,	/* Controller Configuration */
+	NVME_REG_CSTS	= 0x001c,	/* Controller Status */
+	NVME_REG_NSSR	= 0x0020,	/* NVM Subsystem Reset */
+	NVME_REG_AQA	= 0x0024,	/* Admin Queue Attributes */
+	NVME_REG_ASQ	= 0x0028,	/* Admin SQ Base Address */
+	NVME_REG_ACQ	= 0x0030,	/* Admin CQ Base Address */
+	NVME_REG_CMBLOC = 0x0038,	/* Controller Memory Buffer Location */
+	NVME_REG_CMBSZ	= 0x003c,	/* Controller Memory Buffer Size */
 };
 
 #define NVME_CAP_MQES(cap)	((cap) & 0xffff)
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 11/47] nvme: split a new struct nvme_ctrl out of struct nvme_dev
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (9 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 10/47] nvme: use offset instead of a struct for registers Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 12/47] nvme: simplify nvme_setup_prps calling convention Christoph Hellwig
                   ` (23 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


The new struct nvme_ctrl will be used by the common NVMe code that sits
on top of struct request_queue and the new nvme_ctrl_ops abstraction.
For now it only contains the bare minimum required: values sampled
during controller probe, the admin queue pointer, and a second struct
device pointer; more will follow later.  Only
values that are not used in the I/O fast path should be moved to
struct nvme_ctrl so that drivers can optimize their cache line usage
easily.  That's also the reason why we keep two device pointers: the
original struct device is still needed for DMA mapping purposes.
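
With this split, a transport only has to provide trivial accessors
behind nvme_ctrl_ops; for PCIe, reg_read32 presumably reduces to a
readl() on the mapped BAR, roughly like this (a sketch; to_nvme_dev()
is the assumed container_of helper from the generic ctrl back to the
PCIe nvme_dev):

	static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off,
			u32 *val)
	{
		*val = readl(to_nvme_dev(ctrl)->bar + off);
		return 0;
	}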

Signed-off-by: Christoph Hellwig <hch at lst.de>
Acked-by: Keith Busch <keith.busch at intel.com>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c |  10 +--
 drivers/nvme/host/nvme.h |  61 ++++++---------
 drivers/nvme/host/pci.c  | 190 +++++++++++++++++++++++++++++++----------------
 drivers/nvme/host/scsi.c |  85 ++++++++++-----------
 4 files changed, 190 insertions(+), 156 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ce938a4..ca54a34 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -79,7 +79,7 @@ int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	return __nvme_submit_sync_cmd(q, cmd, buffer, NULL, bufflen, NULL, 0);
 }
 
-int nvme_identify_ctrl(struct nvme_dev *dev, struct nvme_id_ctrl **id)
+int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 {
 	struct nvme_command c = { };
 	int error;
@@ -99,7 +99,7 @@ int nvme_identify_ctrl(struct nvme_dev *dev, struct nvme_id_ctrl **id)
 	return error;
 }
 
-int nvme_identify_ns(struct nvme_dev *dev, unsigned nsid,
+int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 		struct nvme_id_ns **id)
 {
 	struct nvme_command c = { };
@@ -120,7 +120,7 @@ int nvme_identify_ns(struct nvme_dev *dev, unsigned nsid,
 	return error;
 }
 
-int nvme_get_features(struct nvme_dev *dev, unsigned fid, unsigned nsid,
+int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 					dma_addr_t dma_addr, u32 *result)
 {
 	struct nvme_command c;
@@ -135,7 +135,7 @@ int nvme_get_features(struct nvme_dev *dev, unsigned fid, unsigned nsid,
 			result, 0);
 }
 
-int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
+int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
 					dma_addr_t dma_addr, u32 *result)
 {
 	struct nvme_command c;
@@ -150,7 +150,7 @@ int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
 			result, 0);
 }
 
-int nvme_get_log_page(struct nvme_dev *dev, struct nvme_smart_log **log)
+int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log)
 {
 	struct nvme_command c = { };
 	int error;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 93e268c..71e81e5 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -30,46 +30,16 @@ enum {
 	NVME_NS_LIGHTNVM	= 1,
 };
 
-/*
- * Represents an NVM Express device.  Each nvme_dev is a PCI function.
- */
-struct nvme_dev {
-	struct list_head node;
-	struct nvme_queue **queues;
+struct nvme_ctrl {
+	const struct nvme_ctrl_ops *ops;
 	struct request_queue *admin_q;
-	struct blk_mq_tag_set tagset;
-	struct blk_mq_tag_set admin_tagset;
-	u32 __iomem *dbs;
 	struct device *dev;
-	struct dma_pool *prp_page_pool;
-	struct dma_pool *prp_small_pool;
 	int instance;
-	unsigned queue_count;
-	unsigned online_queues;
-	unsigned max_qid;
-	int q_depth;
-	u32 db_stride;
-	u32 ctrl_config;
-	struct msix_entry *entry;
-	void __iomem *bar;
-	struct list_head namespaces;
-	struct kref kref;
-	struct device *device;
-	struct work_struct reset_work;
-	struct work_struct probe_work;
-	struct work_struct scan_work;
+
 	char name[12];
 	char serial[20];
 	char model[40];
 	char firmware_rev[8];
-	bool subsystem;
-	u32 max_hw_sectors;
-	u32 stripe_size;
-	u32 page_size;
-	void __iomem *cmb;
-	dma_addr_t cmb_dma_addr;
-	u64 cmb_size;
-	u32 cmbsz;
 	u16 oncs;
 	u16 abort_limit;
 	u8 event_limit;
@@ -83,7 +53,7 @@ struct nvme_dev {
 struct nvme_ns {
 	struct list_head list;
 
-	struct nvme_dev *dev;
+	struct nvme_ctrl *ctrl;
 	struct request_queue *queue;
 	struct gendisk *disk;
 	struct kref kref;
@@ -98,6 +68,19 @@ struct nvme_ns {
 	u32 mode_select_block_len;
 };
 
+struct nvme_ctrl_ops {
+	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
+};
+
+static inline bool nvme_ctrl_ready(struct nvme_ctrl *ctrl)
+{
+	u32 val = 0;
+
+	if (ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &val))
+		return false;
+	return val & NVME_CSTS_RDY;
+}
+
 static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
 {
 	return (sector >> (ns->lba_shift - 9));
@@ -108,13 +91,13 @@ int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		void *buffer, void __user *ubuffer, unsigned bufflen,
 		u32 *result, unsigned timeout);
-int nvme_identify_ctrl(struct nvme_dev *dev, struct nvme_id_ctrl **id);
-int nvme_identify_ns(struct nvme_dev *dev, unsigned nsid,
+int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id);
+int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 		struct nvme_id_ns **id);
-int nvme_get_log_page(struct nvme_dev *dev, struct nvme_smart_log **log);
-int nvme_get_features(struct nvme_dev *dev, unsigned fid, unsigned nsid,
+int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log);
+int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 			dma_addr_t dma_addr, u32 *result);
-int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
+int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
 			dma_addr_t dma_addr, u32 *result);
 
 struct sg_io_hdr;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e9de117..ea0a446 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -87,6 +87,9 @@ static wait_queue_head_t nvme_kthread_wait;
 
 static struct class *nvme_class;
 
+struct nvme_dev;
+struct nvme_queue;
+
 static int __nvme_reset(struct nvme_dev *dev);
 static int nvme_reset(struct nvme_dev *dev);
 static void nvme_process_cq(struct nvme_queue *nvmeq);
@@ -102,6 +105,49 @@ struct async_cmd_info {
 };
 
 /*
+ * Represents an NVM Express device.  Each nvme_dev is a PCI function.
+ */
+struct nvme_dev {
+	struct list_head node;
+	struct nvme_queue **queues;
+	struct blk_mq_tag_set tagset;
+	struct blk_mq_tag_set admin_tagset;
+	u32 __iomem *dbs;
+	struct device *dev;
+	struct dma_pool *prp_page_pool;
+	struct dma_pool *prp_small_pool;
+	unsigned queue_count;
+	unsigned online_queues;
+	unsigned max_qid;
+	int q_depth;
+	u32 db_stride;
+	u32 ctrl_config;
+	struct msix_entry *entry;
+	void __iomem *bar;
+	struct list_head namespaces;
+	struct kref kref;
+	struct device *device;
+	struct work_struct reset_work;
+	struct work_struct probe_work;
+	struct work_struct scan_work;
+	bool subsystem;
+	u32 max_hw_sectors;
+	u32 stripe_size;
+	u32 page_size;
+	void __iomem *cmb;
+	dma_addr_t cmb_dma_addr;
+	u64 cmb_size;
+	u32 cmbsz;
+
+	struct nvme_ctrl ctrl;
+};
+
+static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
+{
+	return container_of(ctrl, struct nvme_dev, ctrl);
+}
+
+/*
  * An NVM Express queue.  Each device has at least two (one for admin
  * commands and one for I/O commands).
  */
@@ -333,7 +379,7 @@ static void async_req_completion(struct nvme_queue *nvmeq, void *ctx,
 	u16 status = le16_to_cpup(&cqe->status) >> 1;
 
 	if (status == NVME_SC_SUCCESS || status == NVME_SC_ABORT_REQ)
-		++nvmeq->dev->event_limit;
+		++nvmeq->dev->ctrl.event_limit;
 	if (status != NVME_SC_SUCCESS)
 		return;
 
@@ -357,7 +403,7 @@ static void abort_completion(struct nvme_queue *nvmeq, void *ctx,
 	blk_mq_free_request(req);
 
 	dev_warn(nvmeq->q_dmadev, "Abort status:%x result:%x", status, result);
-	++nvmeq->dev->abort_limit;
+	++nvmeq->dev->ctrl.abort_limit;
 }
 
 static void async_completion(struct nvme_queue *nvmeq, void *ctx,
@@ -1050,7 +1096,7 @@ static int nvme_submit_async_admin_req(struct nvme_dev *dev)
 	struct nvme_cmd_info *cmd_info;
 	struct request *req;
 
-	req = blk_mq_alloc_request(dev->admin_q, WRITE,
+	req = blk_mq_alloc_request(dev->ctrl.admin_q, WRITE,
 			BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
@@ -1076,7 +1122,7 @@ static int nvme_submit_admin_async_cmd(struct nvme_dev *dev,
 	struct request *req;
 	struct nvme_cmd_info *cmd_rq;
 
-	req = blk_mq_alloc_request(dev->admin_q, WRITE, 0);
+	req = blk_mq_alloc_request(dev->ctrl.admin_q, WRITE, 0);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
@@ -1100,7 +1146,7 @@ static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
 	c.delete_queue.opcode = opcode;
 	c.delete_queue.qid = cpu_to_le16(id);
 
-	return nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0);
+	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
 
 static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
@@ -1121,7 +1167,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cq_flags = cpu_to_le16(flags);
 	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
 
-	return nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0);
+	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
 
 static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
@@ -1142,7 +1188,7 @@ static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
 	c.create_sq.sq_flags = cpu_to_le16(flags);
 	c.create_sq.cqid = cpu_to_le16(qid);
 
-	return nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0);
+	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
 
 static int adapter_delete_cq(struct nvme_dev *dev, u16 cqid)
@@ -1181,10 +1227,10 @@ static void nvme_abort_req(struct request *req)
 		return;
 	}
 
-	if (!dev->abort_limit)
+	if (!dev->ctrl.abort_limit)
 		return;
 
-	abort_req = blk_mq_alloc_request(dev->admin_q, WRITE,
+	abort_req = blk_mq_alloc_request(dev->ctrl.admin_q, WRITE,
 			BLK_MQ_REQ_NOWAIT);
 	if (IS_ERR(abort_req))
 		return;
@@ -1198,7 +1244,7 @@ static void nvme_abort_req(struct request *req)
 	cmd.abort.sqid = cpu_to_le16(nvmeq->qid);
 	cmd.abort.command_id = abort_req->tag;
 
-	--dev->abort_limit;
+	--dev->ctrl.abort_limit;
 	cmd_rq->aborted = 1;
 
 	dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", req->tag,
@@ -1293,8 +1339,8 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
 
-	if (!nvmeq->qid && nvmeq->dev->admin_q)
-		blk_mq_freeze_queue_start(nvmeq->dev->admin_q);
+	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
+		blk_mq_freeze_queue_start(nvmeq->dev->ctrl.admin_q);
 
 	irq_set_affinity_hint(vector, NULL);
 	free_irq(vector, nvmeq);
@@ -1390,7 +1436,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	nvmeq->q_dmadev = dev->dev;
 	nvmeq->dev = dev;
 	snprintf(nvmeq->irqname, sizeof(nvmeq->irqname), "nvme%dq%d",
-			dev->instance, qid);
+			dev->ctrl.instance, qid);
 	spin_lock_init(&nvmeq->q_lock);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
@@ -1558,15 +1604,15 @@ static struct blk_mq_ops nvme_mq_ops = {
 
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
 {
-	if (dev->admin_q && !blk_queue_dying(dev->admin_q)) {
-		blk_cleanup_queue(dev->admin_q);
+	if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q)) {
+		blk_cleanup_queue(dev->ctrl.admin_q);
 		blk_mq_free_tag_set(&dev->admin_tagset);
 	}
 }
 
 static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 {
-	if (!dev->admin_q) {
+	if (!dev->ctrl.admin_q) {
 		dev->admin_tagset.ops = &nvme_mq_admin_ops;
 		dev->admin_tagset.nr_hw_queues = 1;
 		dev->admin_tagset.queue_depth = NVME_AQ_DEPTH - 1;
@@ -1579,18 +1625,18 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
 			return -ENOMEM;
 
-		dev->admin_q = blk_mq_init_queue(&dev->admin_tagset);
-		if (IS_ERR(dev->admin_q)) {
+		dev->ctrl.admin_q = blk_mq_init_queue(&dev->admin_tagset);
+		if (IS_ERR(dev->ctrl.admin_q)) {
 			blk_mq_free_tag_set(&dev->admin_tagset);
 			return -ENOMEM;
 		}
-		if (!blk_get_queue(dev->admin_q)) {
+		if (!blk_get_queue(dev->ctrl.admin_q)) {
 			nvme_dev_remove_admin(dev);
-			dev->admin_q = NULL;
+			dev->ctrl.admin_q = NULL;
 			return -ENODEV;
 		}
 	} else
-		blk_mq_unfreeze_queue(dev->admin_q);
+		blk_mq_unfreeze_queue(dev->ctrl.admin_q);
 
 	return 0;
 }
@@ -1672,7 +1718,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 
 static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 {
-	struct nvme_dev *dev = ns->dev;
+	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
 	struct nvme_user_io io;
 	struct nvme_command c;
 	unsigned length, meta_len;
@@ -1747,7 +1793,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	return status;
 }
 
-static int nvme_user_cmd(struct nvme_dev *dev, struct nvme_ns *ns,
+static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 			struct nvme_passthru_cmd __user *ucmd)
 {
 	struct nvme_passthru_cmd cmd;
@@ -1776,7 +1822,7 @@ static int nvme_user_cmd(struct nvme_dev *dev, struct nvme_ns *ns,
 	if (cmd.timeout_ms)
 		timeout = msecs_to_jiffies(cmd.timeout_ms);
 
-	status = __nvme_submit_sync_cmd(ns ? ns->queue : dev->admin_q, &c,
+	status = __nvme_submit_sync_cmd(ns ? ns->queue : ctrl->admin_q, &c,
 			NULL, (void __user *)(uintptr_t)cmd.addr, cmd.data_len,
 			&cmd.result, timeout);
 	if (status >= 0) {
@@ -1806,9 +1852,9 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
 		force_successful_syscall_return();
 		return ns->ns_id;
 	case NVME_IOCTL_ADMIN_CMD:
-		return nvme_user_cmd(ns->dev, NULL, (void __user *)arg);
+		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
 	case NVME_IOCTL_IO_CMD:
-		return nvme_user_cmd(ns->dev, ns, (void __user *)arg);
+		return nvme_user_cmd(ns->ctrl, ns, (void __user *)arg);
 	case NVME_IOCTL_SUBMIT_IO:
 		return nvme_submit_io(ns, (void __user *)arg);
 	case SG_GET_VERSION_NUM:
@@ -1838,6 +1884,7 @@ static void nvme_free_dev(struct kref *kref);
 static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
+	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
 
 	if (ns->type == NVME_NS_LIGHTNVM)
 		nvme_nvm_unregister(ns->queue, ns->disk->disk_name);
@@ -1846,7 +1893,7 @@ static void nvme_free_ns(struct kref *kref)
 	ns->disk->private_data = NULL;
 	spin_unlock(&dev_list_lock);
 
-	kref_put(&ns->dev->kref, nvme_free_dev);
+	kref_put(&dev->kref, nvme_free_dev);
 	put_disk(ns->disk);
 	kfree(ns);
 }
@@ -1895,15 +1942,15 @@ static void nvme_config_discard(struct nvme_ns *ns)
 static int nvme_revalidate_disk(struct gendisk *disk)
 {
 	struct nvme_ns *ns = disk->private_data;
-	struct nvme_dev *dev = ns->dev;
+	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
 	struct nvme_id_ns *id;
 	u8 lbaf, pi_type;
 	u16 old_ms;
 	unsigned short bs;
 
-	if (nvme_identify_ns(dev, ns->ns_id, &id)) {
+	if (nvme_identify_ns(&dev->ctrl, ns->ns_id, &id)) {
 		dev_warn(dev->dev, "%s: Identify failure nvme%dn%d\n", __func__,
-						dev->instance, ns->ns_id);
+						dev->ctrl.instance, ns->ns_id);
 		return -ENODEV;
 	}
 	if (id->ncap == 0) {
@@ -1959,7 +2006,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	else
 		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
 
-	if (dev->oncs & NVME_CTRL_ONCS_DSM)
+	if (dev->ctrl.oncs & NVME_CTRL_ONCS_DSM)
 		nvme_config_discard(ns);
 	blk_mq_unfreeze_queue(disk->queue);
 
@@ -2097,10 +2144,10 @@ static int nvme_kthread(void *data)
 				spin_lock_irq(&nvmeq->q_lock);
 				nvme_process_cq(nvmeq);
 
-				while ((i == 0) && (dev->event_limit > 0)) {
+				while (i == 0 && dev->ctrl.event_limit > 0) {
 					if (nvme_submit_async_admin_req(dev))
 						break;
-					dev->event_limit--;
+					dev->ctrl.event_limit--;
 				}
 				spin_unlock_irq(&nvmeq->q_lock);
 			}
@@ -2126,7 +2173,7 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
 		goto out_free_ns;
 	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
-	ns->dev = dev;
+	ns->ctrl = &dev->ctrl;
 	ns->queue->queuedata = ns;
 
 	disk = alloc_disk_node(0, node);
@@ -2147,7 +2194,7 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
 	}
 	if (dev->stripe_size)
 		blk_queue_chunk_sectors(ns->queue, dev->stripe_size >> 9);
-	if (dev->vwc & NVME_CTRL_VWC_PRESENT)
+	if (dev->ctrl.vwc & NVME_CTRL_VWC_PRESENT)
 		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
 	blk_queue_virt_boundary(ns->queue, dev->page_size - 1);
 
@@ -2158,7 +2205,7 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
 	disk->queue = ns->queue;
 	disk->driverfs_dev = dev->device;
 	disk->flags = GENHD_FL_EXT_DEVT;
-	sprintf(disk->disk_name, "nvme%dn%d", dev->instance, nsid);
+	sprintf(disk->disk_name, "nvme%dn%d", dev->ctrl.instance, nsid);
 
 	/*
 	 * Initialize capacity to 0 until we establish the namespace format and
@@ -2223,7 +2270,7 @@ static int set_queue_count(struct nvme_dev *dev, int count)
 	u32 result;
 	u32 q_count = (count - 1) | ((count - 1) << 16);
 
-	status = nvme_set_features(dev, NVME_FEAT_NUM_QUEUES, q_count, 0,
+	status = nvme_set_features(&dev->ctrl, NVME_FEAT_NUM_QUEUES, q_count, 0,
 								&result);
 	if (status < 0)
 		return status;
@@ -2407,7 +2454,8 @@ static inline bool nvme_io_incapable(struct nvme_dev *dev)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
-	bool kill = nvme_io_incapable(ns->dev) && !blk_queue_dying(ns->queue);
+	bool kill = nvme_io_incapable(to_nvme_dev(ns->ctrl)) &&
+			!blk_queue_dying(ns->queue);
 
 	if (kill)
 		blk_set_queue_dying(ns->queue);
@@ -2464,7 +2512,7 @@ static void nvme_dev_scan(struct work_struct *work)
 
 	if (!dev->tagset.tags)
 		return;
-	if (nvme_identify_ctrl(dev, &ctrl))
+	if (nvme_identify_ctrl(&dev->ctrl, &ctrl))
 		return;
 	nvme_scan_namespaces(dev, le32_to_cpup(&ctrl->nn));
 	kfree(ctrl);
@@ -2484,18 +2532,18 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	struct nvme_id_ctrl *ctrl;
 	int shift = NVME_CAP_MPSMIN(lo_hi_readq(dev->bar + NVME_REG_CAP)) + 12;
 
-	res = nvme_identify_ctrl(dev, &ctrl);
+	res = nvme_identify_ctrl(&dev->ctrl, &ctrl);
 	if (res) {
 		dev_err(dev->dev, "Identify Controller failed (%d)\n", res);
 		return -EIO;
 	}
 
-	dev->oncs = le16_to_cpup(&ctrl->oncs);
-	dev->abort_limit = ctrl->acl + 1;
-	dev->vwc = ctrl->vwc;
-	memcpy(dev->serial, ctrl->sn, sizeof(ctrl->sn));
-	memcpy(dev->model, ctrl->mn, sizeof(ctrl->mn));
-	memcpy(dev->firmware_rev, ctrl->fr, sizeof(ctrl->fr));
+	dev->ctrl.oncs = le16_to_cpup(&ctrl->oncs);
+	dev->ctrl.abort_limit = ctrl->acl + 1;
+	dev->ctrl.vwc = ctrl->vwc;
+	memcpy(dev->ctrl.serial, ctrl->sn, sizeof(ctrl->sn));
+	memcpy(dev->ctrl.model, ctrl->mn, sizeof(ctrl->mn));
+	memcpy(dev->ctrl.firmware_rev, ctrl->fr, sizeof(ctrl->fr));
 	if (ctrl->mdts)
 		dev->max_hw_sectors = 1 << (ctrl->mdts + shift - 9);
 	else
@@ -2726,7 +2774,7 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 	DEFINE_KTHREAD_WORKER_ONSTACK(worker);
 	struct nvme_delq_ctx dq;
 	struct task_struct *kworker_task = kthread_run(kthread_worker_fn,
-					&worker, "nvme%d", dev->instance);
+					&worker, "nvme%d", dev->ctrl.instance);
 
 	if (IS_ERR(kworker_task)) {
 		dev_err(dev->dev,
@@ -2877,14 +2925,14 @@ static int nvme_set_instance(struct nvme_dev *dev)
 	if (error)
 		return -ENODEV;
 
-	dev->instance = instance;
+	dev->ctrl.instance = instance;
 	return 0;
 }
 
 static void nvme_release_instance(struct nvme_dev *dev)
 {
 	spin_lock(&dev_list_lock);
-	ida_remove(&nvme_instance_ida, dev->instance);
+	ida_remove(&nvme_instance_ida, dev->ctrl.instance);
 	spin_unlock(&dev_list_lock);
 }
 
@@ -2897,8 +2945,8 @@ static void nvme_free_dev(struct kref *kref)
 	nvme_release_instance(dev);
 	if (dev->tagset.tags)
 		blk_mq_free_tag_set(&dev->tagset);
-	if (dev->admin_q)
-		blk_put_queue(dev->admin_q);
+	if (dev->ctrl.admin_q)
+		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
 	kfree(dev->entry);
 	kfree(dev);
@@ -2912,8 +2960,8 @@ static int nvme_dev_open(struct inode *inode, struct file *f)
 
 	spin_lock(&dev_list_lock);
 	list_for_each_entry(dev, &dev_list, node) {
-		if (dev->instance == instance) {
-			if (!dev->admin_q) {
+		if (dev->ctrl.instance == instance) {
+			if (!dev->ctrl.admin_q) {
 				ret = -EWOULDBLOCK;
 				break;
 			}
@@ -2943,12 +2991,12 @@ static long nvme_dev_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 
 	switch (cmd) {
 	case NVME_IOCTL_ADMIN_CMD:
-		return nvme_user_cmd(dev, NULL, (void __user *)arg);
+		return nvme_user_cmd(&dev->ctrl, NULL, (void __user *)arg);
 	case NVME_IOCTL_IO_CMD:
 		if (list_empty(&dev->namespaces))
 			return -ENOTTY;
 		ns = list_first_entry(&dev->namespaces, struct nvme_ns, list);
-		return nvme_user_cmd(dev, ns, (void __user *)arg);
+		return nvme_user_cmd(&dev->ctrl, ns, (void __user *)arg);
 	case NVME_IOCTL_RESET:
 		dev_warn(dev->dev, "resetting controller\n");
 		return nvme_reset(dev);
@@ -3009,7 +3057,7 @@ static void nvme_probe_work(struct work_struct *work)
 	if (result)
 		goto free_tags;
 
-	dev->event_limit = 1;
+	dev->ctrl.event_limit = 1;
 
 	/*
 	 * Keep the controller around but remove all namespaces if we don't have
@@ -3027,8 +3075,8 @@ static void nvme_probe_work(struct work_struct *work)
 
  free_tags:
 	nvme_dev_remove_admin(dev);
-	blk_put_queue(dev->admin_q);
-	dev->admin_q = NULL;
+	blk_put_queue(dev->ctrl.admin_q);
+	dev->ctrl.admin_q = NULL;
 	dev->queues[0]->tags = NULL;
  disable:
 	nvme_disable_queue(dev, 0);
@@ -3056,7 +3104,7 @@ static void nvme_dead_ctrl(struct nvme_dev *dev)
 	dev_warn(dev->dev, "Device failed to resume\n");
 	kref_get(&dev->kref);
 	if (IS_ERR(kthread_run(nvme_remove_dead_ctrl, dev, "nvme%d",
-						dev->instance))) {
+						dev->ctrl.instance))) {
 		dev_err(dev->dev,
 			"Failed to start controller remove task\n");
 		kref_put(&dev->kref, nvme_free_dev);
@@ -3098,7 +3146,7 @@ static int nvme_reset(struct nvme_dev *dev)
 {
 	int ret;
 
-	if (!dev->admin_q || blk_queue_dying(dev->admin_q))
+	if (!dev->ctrl.admin_q || blk_queue_dying(dev->ctrl.admin_q))
 		return -ENODEV;
 
 	spin_lock(&dev_list_lock);
@@ -3129,6 +3177,16 @@ static ssize_t nvme_sysfs_reset(struct device *dev,
 }
 static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
 
+static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
+{
+	*val = readl(to_nvme_dev(ctrl)->bar + off);
+	return 0;
+}
+
+static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
+	.reg_read32		= nvme_pci_reg_read32,
+};
+
 static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	int node, result = -ENOMEM;
@@ -3155,7 +3213,9 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	dev->dev = get_device(&pdev->dev);
 	pci_set_drvdata(pdev, dev);
 
-	dev->vendor = pdev->vendor;
+	dev->ctrl.vendor = pdev->vendor;
+	dev->ctrl.ops = &nvme_pci_ctrl_ops;
+	dev->ctrl.dev = dev->dev;
 
 	result = nvme_set_instance(dev);
 	if (result)
@@ -3167,8 +3227,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	kref_init(&dev->kref);
 	dev->device = device_create(nvme_class, &pdev->dev,
-				MKDEV(nvme_char_major, dev->instance),
-				dev, "nvme%d", dev->instance);
+				MKDEV(nvme_char_major, dev->ctrl.instance),
+				dev, "nvme%d", dev->ctrl.instance);
 	if (IS_ERR(dev->device)) {
 		result = PTR_ERR(dev->device);
 		goto release_pools;
@@ -3187,7 +3247,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	return 0;
 
  put_dev:
-	device_destroy(nvme_class, MKDEV(nvme_char_major, dev->instance));
+	device_destroy(nvme_class, MKDEV(nvme_char_major, dev->ctrl.instance));
 	put_device(dev->device);
  release_pools:
 	nvme_release_prp_pools(dev);
@@ -3234,7 +3294,7 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_dev_remove(dev);
 	nvme_dev_shutdown(dev);
 	nvme_dev_remove_admin(dev);
-	device_destroy(nvme_class, MKDEV(nvme_char_major, dev->instance));
+	device_destroy(nvme_class, MKDEV(nvme_char_major, dev->ctrl.instance));
 	nvme_free_queues(dev, 0);
 	nvme_release_cmb(dev);
 	nvme_release_prp_pools(dev);
diff --git a/drivers/nvme/host/scsi.c b/drivers/nvme/host/scsi.c
index a5f6af1..00d0bdd 100644
--- a/drivers/nvme/host/scsi.c
+++ b/drivers/nvme/host/scsi.c
@@ -524,7 +524,7 @@ static int nvme_trans_standard_inquiry_page(struct nvme_ns *ns,
 					struct sg_io_hdr *hdr, u8 *inq_response,
 					int alloc_len)
 {
-	struct nvme_dev *dev = ns->dev;
+	struct nvme_ctrl *ctrl = ns->ctrl;
 	struct nvme_id_ns *id_ns;
 	int res;
 	int nvme_sc;
@@ -532,10 +532,10 @@ static int nvme_trans_standard_inquiry_page(struct nvme_ns *ns,
 	u8 resp_data_format = 0x02;
 	u8 protect;
 	u8 cmdque = 0x01 << 1;
-	u8 fw_offset = sizeof(dev->firmware_rev);
+	u8 fw_offset = sizeof(ctrl->firmware_rev);
 
 	/* nvme ns identify - use DPS value for PROTECT field */
-	nvme_sc = nvme_identify_ns(dev, ns->ns_id, &id_ns);
+	nvme_sc = nvme_identify_ns(ctrl, ns->ns_id, &id_ns);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
 		return res;
@@ -553,12 +553,12 @@ static int nvme_trans_standard_inquiry_page(struct nvme_ns *ns,
 	inq_response[5] = protect;	/* sccs=0 | acc=0 | tpgs=0 | pc3=0 */
 	inq_response[7] = cmdque;	/* wbus16=0 | sync=0 | vs=0 */
 	strncpy(&inq_response[8], "NVMe    ", 8);
-	strncpy(&inq_response[16], dev->model, 16);
+	strncpy(&inq_response[16], ctrl->model, 16);
 
-	while (dev->firmware_rev[fw_offset - 1] == ' ' && fw_offset > 4)
+	while (ctrl->firmware_rev[fw_offset - 1] == ' ' && fw_offset > 4)
 		fw_offset--;
 	fw_offset -= 4;
-	strncpy(&inq_response[32], dev->firmware_rev + fw_offset, 4);
+	strncpy(&inq_response[32], ctrl->firmware_rev + fw_offset, 4);
 
 	xfer_len = min(alloc_len, STANDARD_INQUIRY_LENGTH);
 	return nvme_trans_copy_to_user(hdr, inq_response, xfer_len);
@@ -588,13 +588,12 @@ static int nvme_trans_unit_serial_page(struct nvme_ns *ns,
 					struct sg_io_hdr *hdr, u8 *inq_response,
 					int alloc_len)
 {
-	struct nvme_dev *dev = ns->dev;
 	int xfer_len;
 
 	memset(inq_response, 0, STANDARD_INQUIRY_LENGTH);
 	inq_response[1] = INQ_UNIT_SERIAL_NUMBER_PAGE; /* Page Code */
 	inq_response[3] = INQ_SERIAL_NUMBER_LENGTH;    /* Page Length */
-	strncpy(&inq_response[4], dev->serial, INQ_SERIAL_NUMBER_LENGTH);
+	strncpy(&inq_response[4], ns->ctrl->serial, INQ_SERIAL_NUMBER_LENGTH);
 
 	xfer_len = min(alloc_len, STANDARD_INQUIRY_LENGTH);
 	return nvme_trans_copy_to_user(hdr, inq_response, xfer_len);
@@ -603,27 +602,32 @@ static int nvme_trans_unit_serial_page(struct nvme_ns *ns,
 static int nvme_trans_device_id_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 					u8 *inq_response, int alloc_len)
 {
-	struct nvme_dev *dev = ns->dev;
+	struct nvme_ctrl *ctrl = ns->ctrl;
 	int res;
 	int nvme_sc;
 	int xfer_len;
+	u32 vs;
 	__be32 tmp_id = cpu_to_be32(ns->ns_id);
 
+	res = ctrl->ops->reg_read32(ctrl, NVME_REG_VS, &vs);
+	if (res)
+		return res;
+
 	memset(inq_response, 0, alloc_len);
 	inq_response[1] = INQ_DEVICE_IDENTIFICATION_PAGE;    /* Page Code */
-	if (readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 1)) {
+	if (vs >= NVME_VS(1, 1)) {
 		struct nvme_id_ns *id_ns;
 		void *eui;
 		int len;
 
-		nvme_sc = nvme_identify_ns(dev, ns->ns_id, &id_ns);
+		nvme_sc = nvme_identify_ns(ctrl, ns->ns_id, &id_ns);
 		res = nvme_trans_status_code(hdr, nvme_sc);
 		if (res)
 			return res;
 
 		eui = id_ns->eui64;
 		len = sizeof(id_ns->eui64);
-		if (readl(dev->bar + NVME_REG_VS) >= NVME_VS(1, 2)) {
+		if (vs >= NVME_VS(1, 2)) {
 			if (bitmap_empty(eui, len * 8)) {
 				eui = id_ns->nguid;
 				len = sizeof(id_ns->nguid);
@@ -657,10 +661,10 @@ static int nvme_trans_device_id_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 		inq_response[6] = 0x00;    /* Rsvd */
 		inq_response[7] = 0x44;    /* Designator Length */
 
-		sprintf(&inq_response[8], "%04x", dev->vendor);
-		memcpy(&inq_response[12], dev->model, sizeof(dev->model));
+		sprintf(&inq_response[8], "%04x", ctrl->vendor);
+		memcpy(&inq_response[12], ctrl->model, sizeof(ctrl->model));
 		sprintf(&inq_response[52], "%04x", tmp_id);
-		memcpy(&inq_response[56], dev->serial, sizeof(dev->serial));
+		memcpy(&inq_response[56], ctrl->serial, sizeof(ctrl->serial));
 	}
 	xfer_len = alloc_len;
 	return nvme_trans_copy_to_user(hdr, inq_response, xfer_len);
@@ -672,7 +676,7 @@ static int nvme_trans_ext_inq_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	u8 *inq_response;
 	int res;
 	int nvme_sc;
-	struct nvme_dev *dev = ns->dev;
+	struct nvme_ctrl *ctrl = ns->ctrl;
 	struct nvme_id_ctrl *id_ctrl;
 	struct nvme_id_ns *id_ns;
 	int xfer_len;
@@ -688,7 +692,7 @@ static int nvme_trans_ext_inq_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	if (inq_response == NULL)
 		return -ENOMEM;
 
-	nvme_sc = nvme_identify_ns(dev, ns->ns_id, &id_ns);
+	nvme_sc = nvme_identify_ns(ctrl, ns->ns_id, &id_ns);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
 		goto out_free_inq;
@@ -704,7 +708,7 @@ static int nvme_trans_ext_inq_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	app_chk = protect << 1;
 	ref_chk = protect;
 
-	nvme_sc = nvme_identify_ctrl(dev, &id_ctrl);
+	nvme_sc = nvme_identify_ctrl(ctrl, &id_ctrl);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
 		goto out_free_inq;
@@ -815,7 +819,6 @@ static int nvme_trans_log_info_exceptions(struct nvme_ns *ns,
 	int res;
 	int xfer_len;
 	u8 *log_response;
-	struct nvme_dev *dev = ns->dev;
 	struct nvme_smart_log *smart_log;
 	u8 temp_c;
 	u16 temp_k;
@@ -824,7 +827,7 @@ static int nvme_trans_log_info_exceptions(struct nvme_ns *ns,
 	if (log_response == NULL)
 		return -ENOMEM;
 
-	res = nvme_get_log_page(dev, &smart_log);
+	res = nvme_get_log_page(ns->ctrl, &smart_log);
 	if (res < 0)
 		goto out_free_response;
 
@@ -862,7 +865,6 @@ static int nvme_trans_log_temperature(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	int res;
 	int xfer_len;
 	u8 *log_response;
-	struct nvme_dev *dev = ns->dev;
 	struct nvme_smart_log *smart_log;
 	u32 feature_resp;
 	u8 temp_c_cur, temp_c_thresh;
@@ -872,7 +874,7 @@ static int nvme_trans_log_temperature(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	if (log_response == NULL)
 		return -ENOMEM;
 
-	res = nvme_get_log_page(dev, &smart_log);
+	res = nvme_get_log_page(ns->ctrl, &smart_log);
 	if (res < 0)
 		goto out_free_response;
 
@@ -886,7 +888,7 @@ static int nvme_trans_log_temperature(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	kfree(smart_log);
 
 	/* Get Features for Temp Threshold */
-	res = nvme_get_features(dev, NVME_FEAT_TEMP_THRESH, 0, 0,
+	res = nvme_get_features(ns->ctrl, NVME_FEAT_TEMP_THRESH, 0, 0,
 								&feature_resp);
 	if (res != NVME_SC_SUCCESS)
 		temp_c_thresh = LOG_TEMP_UNKNOWN;
@@ -948,7 +950,6 @@ static int nvme_trans_fill_blk_desc(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 {
 	int res;
 	int nvme_sc;
-	struct nvme_dev *dev = ns->dev;
 	struct nvme_id_ns *id_ns;
 	u8 flbas;
 	u32 lba_length;
@@ -958,7 +959,7 @@ static int nvme_trans_fill_blk_desc(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	else if (llbaa > 0 && len < MODE_PAGE_LLBAA_BLK_DES_LEN)
 		return -EINVAL;
 
-	nvme_sc = nvme_identify_ns(dev, ns->ns_id, &id_ns);
+	nvme_sc = nvme_identify_ns(ns->ctrl, ns->ns_id, &id_ns);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
 		return res;
@@ -1014,14 +1015,13 @@ static int nvme_trans_fill_caching_page(struct nvme_ns *ns,
 {
 	int res = 0;
 	int nvme_sc;
-	struct nvme_dev *dev = ns->dev;
 	u32 feature_resp;
 	u8 vwc;
 
 	if (len < MODE_PAGE_CACHING_LEN)
 		return -EINVAL;
 
-	nvme_sc = nvme_get_features(dev, NVME_FEAT_VOLATILE_WC, 0, 0,
+	nvme_sc = nvme_get_features(ns->ctrl, NVME_FEAT_VOLATILE_WC, 0, 0,
 								&feature_resp);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
@@ -1207,12 +1207,11 @@ static int nvme_trans_power_state(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 {
 	int res;
 	int nvme_sc;
-	struct nvme_dev *dev = ns->dev;
 	struct nvme_id_ctrl *id_ctrl;
 	int lowest_pow_st;	/* max npss = lowest power consumption */
 	unsigned ps_desired = 0;
 
-	nvme_sc = nvme_identify_ctrl(dev, &id_ctrl);
+	nvme_sc = nvme_identify_ctrl(ns->ctrl, &id_ctrl);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
 		return res;
@@ -1256,7 +1255,7 @@ static int nvme_trans_power_state(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 				SCSI_ASCQ_CAUSE_NOT_REPORTABLE);
 		break;
 	}
-	nvme_sc = nvme_set_features(dev, NVME_FEAT_POWER_MGMT, ps_desired, 0,
+	nvme_sc = nvme_set_features(ns->ctrl, NVME_FEAT_POWER_MGMT, ps_desired, 0,
 				    NULL);
 	return nvme_trans_status_code(hdr, nvme_sc);
 }
@@ -1280,7 +1279,6 @@ static int nvme_trans_send_download_fw_cmd(struct nvme_ns *ns, struct sg_io_hdr
 					u8 buffer_id)
 {
 	int nvme_sc;
-	struct nvme_dev *dev = ns->dev;
 	struct nvme_command c;
 
 	if (hdr->iovec_count > 0) {
@@ -1297,7 +1295,7 @@ static int nvme_trans_send_download_fw_cmd(struct nvme_ns *ns, struct sg_io_hdr
 	c.dlfw.numd = cpu_to_le32((tot_len/BYTES_TO_DWORDS) - 1);
 	c.dlfw.offset = cpu_to_le32(offset/BYTES_TO_DWORDS);
 
-	nvme_sc = __nvme_submit_sync_cmd(dev->admin_q, &c, NULL,
+	nvme_sc = __nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL,
 			hdr->dxferp, tot_len, NULL, 0);
 	return nvme_trans_status_code(hdr, nvme_sc);
 }
@@ -1364,14 +1362,13 @@ static int nvme_trans_modesel_get_mp(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 {
 	int res = 0;
 	int nvme_sc;
-	struct nvme_dev *dev = ns->dev;
 	unsigned dword11;
 
 	switch (page_code) {
 	case MODE_PAGE_CACHING:
 		dword11 = ((mode_page[2] & CACHING_MODE_PAGE_WCE_MASK) ? 1 : 0);
-		nvme_sc = nvme_set_features(dev, NVME_FEAT_VOLATILE_WC, dword11,
-					    0, NULL);
+		nvme_sc = nvme_set_features(ns->ctrl, NVME_FEAT_VOLATILE_WC,
+					    dword11, 0, NULL);
 		res = nvme_trans_status_code(hdr, nvme_sc);
 		break;
 	case MODE_PAGE_CONTROL:
@@ -1473,7 +1470,6 @@ static int nvme_trans_fmt_set_blk_size_count(struct nvme_ns *ns,
 {
 	int res = 0;
 	int nvme_sc;
-	struct nvme_dev *dev = ns->dev;
 	u8 flbas;
 
 	/*
@@ -1486,7 +1482,7 @@ static int nvme_trans_fmt_set_blk_size_count(struct nvme_ns *ns,
 	if (ns->mode_select_num_blocks == 0 || ns->mode_select_block_len == 0) {
 		struct nvme_id_ns *id_ns;
 
-		nvme_sc = nvme_identify_ns(dev, ns->ns_id, &id_ns);
+		nvme_sc = nvme_identify_ns(ns->ctrl, ns->ns_id, &id_ns);
 		res = nvme_trans_status_code(hdr, nvme_sc);
 		if (res)
 			return res;
@@ -1570,7 +1566,6 @@ static int nvme_trans_fmt_send_cmd(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 {
 	int res;
 	int nvme_sc;
-	struct nvme_dev *dev = ns->dev;
 	struct nvme_id_ns *id_ns;
 	u8 i;
 	u8 flbas, nlbaf;
@@ -1579,7 +1574,7 @@ static int nvme_trans_fmt_send_cmd(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	struct nvme_command c;
 
 	/* Loop thru LBAF's in id_ns to match reqd lbaf, put in cdw10 */
-	nvme_sc = nvme_identify_ns(dev, ns->ns_id, &id_ns);
+	nvme_sc = nvme_identify_ns(ns->ctrl, ns->ns_id, &id_ns);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
 		return res;
@@ -1611,7 +1606,7 @@ static int nvme_trans_fmt_send_cmd(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	c.format.nsid = cpu_to_le32(ns->ns_id);
 	c.format.cdw10 = cpu_to_le32(cdw10);
 
-	nvme_sc = nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0);
+	nvme_sc = nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 
 	kfree(id_ns);
@@ -2040,7 +2035,6 @@ static int nvme_trans_read_capacity(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	u32 alloc_len;
 	u32 resp_size;
 	u32 xfer_len;
-	struct nvme_dev *dev = ns->dev;
 	struct nvme_id_ns *id_ns;
 	u8 *response;
 
@@ -2052,7 +2046,7 @@ static int nvme_trans_read_capacity(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 		resp_size = READ_CAP_10_RESP_SIZE;
 	}
 
-	nvme_sc = nvme_identify_ns(dev, ns->ns_id, &id_ns);
+	nvme_sc = nvme_identify_ns(ns->ctrl, ns->ns_id, &id_ns);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
 		return res;	
@@ -2080,7 +2074,6 @@ static int nvme_trans_report_luns(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	int nvme_sc;
 	u32 alloc_len, xfer_len, resp_size;
 	u8 *response;
-	struct nvme_dev *dev = ns->dev;
 	struct nvme_id_ctrl *id_ctrl;
 	u32 ll_length, lun_id;
 	u8 lun_id_offset = REPORT_LUNS_FIRST_LUN_OFFSET;
@@ -2094,7 +2087,7 @@ static int nvme_trans_report_luns(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	case ALL_LUNS_RETURNED:
 	case ALL_WELL_KNOWN_LUNS_RETURNED:
 	case RESTRICTED_LUNS_RETURNED:
-		nvme_sc = nvme_identify_ctrl(dev, &id_ctrl);
+		nvme_sc = nvme_identify_ctrl(ns->ctrl, &id_ctrl);
 		res = nvme_trans_status_code(hdr, nvme_sc);
 		if (res)
 			return res;
@@ -2295,9 +2288,7 @@ static int nvme_trans_test_unit_ready(struct nvme_ns *ns,
 					struct sg_io_hdr *hdr,
 					u8 *cmd)
 {
-	struct nvme_dev *dev = ns->dev;
-
-	if (!(readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_RDY))
+	if (!nvme_ctrl_ready(ns->ctrl))
 		return nvme_trans_completion(hdr, SAM_STAT_CHECK_CONDITION,
 					    NOT_READY, SCSI_ASC_LUN_NOT_READY,
 					    SCSI_ASCQ_CAUSE_NOT_REPORTABLE);
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 12/47] nvme: simplify nvme_setup_prps calling convention
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (10 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 11/47] nvme: split a new struct nvme_ctrl out of struct nvme_dev Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 13/47] nvme: refactor nvme_queue_rq Christoph Hellwig
                   ` (22 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


Pass back a true/false value instead of a length that the caller then
has to compare with the number of bytes in the request, and drop the
pointless gfp_t argument.
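
For illustration, the caller in nvme_queue_rq (see the hunk below) goes
from inferring partial success out of the returned length:

	if (blk_rq_bytes(req) !=
	    nvme_setup_prps(dev, iod, blk_rq_bytes(req), GFP_ATOMIC)) {
		dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
		goto retry_cmd;
	}

to a simple success/failure predicate:

	if (!nvme_setup_prps(dev, iod, blk_rq_bytes(req))) {
		dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
		goto retry_cmd;
	}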

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ea0a446..39ae49b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -709,9 +709,8 @@ release_iod:
 		blk_mq_complete_request(req, error);
 }
 
-/* length is in bytes.  gfp flags indicates whether we may sleep. */
-static int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
-		int total_len, gfp_t gfp)
+static bool nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
+		int total_len)
 {
 	struct dma_pool *pool;
 	int length = total_len;
@@ -727,7 +726,7 @@ static int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
 
 	length -= (page_size - offset);
 	if (length <= 0)
-		return total_len;
+		return true;
 
 	dma_len -= (page_size - offset);
 	if (dma_len) {
@@ -740,7 +739,7 @@ static int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
 
 	if (length <= page_size) {
 		iod->first_dma = dma_addr;
-		return total_len;
+		return true;
 	}
 
 	nprps = DIV_ROUND_UP(length, page_size);
@@ -752,11 +751,11 @@ static int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
 		iod->npages = 1;
 	}
 
-	prp_list = dma_pool_alloc(pool, gfp, &prp_dma);
+	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
 	if (!prp_list) {
 		iod->first_dma = dma_addr;
 		iod->npages = -1;
-		return (total_len - length) + page_size;
+		return false;
 	}
 	list[0] = prp_list;
 	iod->first_dma = prp_dma;
@@ -764,9 +763,9 @@ static int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
 	for (;;) {
 		if (i == page_size >> 3) {
 			__le64 *old_prp_list = prp_list;
-			prp_list = dma_pool_alloc(pool, gfp, &prp_dma);
+			prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
 			if (!prp_list)
-				return total_len - length;
+				return false;
 			list[iod->npages++] = prp_list;
 			prp_list[0] = old_prp_list[i - 1];
 			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
@@ -786,7 +785,7 @@ static int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
 		dma_len = sg_dma_len(sg);
 	}
 
-	return total_len;
+	return true;
 }
 
 static void nvme_submit_priv(struct nvme_queue *nvmeq, struct request *req,
@@ -952,8 +951,7 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		if (!dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir))
 			goto retry_cmd;
 
-		if (blk_rq_bytes(req) !=
-                    nvme_setup_prps(dev, iod, blk_rq_bytes(req), GFP_ATOMIC)) {
+		if (!nvme_setup_prps(dev, iod, blk_rq_bytes(req))) {
 			dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
 			goto retry_cmd;
 		}
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 13/47] nvme: refactor nvme_queue_rq
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (11 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 12/47] nvme: simplify nvme_setup_prps calling convention Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 14/47] nvme: move nvme_error_status to common code Christoph Hellwig
                   ` (21 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


This "backports" the structure I've used for the fabrics driver.  It
mostly started out as a cleanup so that I could actually understand
the code, but I think it also qualifies as a micro-optimization due
to the reduced time we hold q_lock and disable interrupts.
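
Condensed to its control flow (a sketch of the hunk below, with iod
setup and error unwinding elided), command construction and DMA mapping
now happen outside the lock, and q_lock is only held for the actual
submission:

	struct nvme_command cmnd;
	int ret = BLK_MQ_RQ_QUEUE_OK;

	if (req->cmd_flags & REQ_DISCARD) {
		ret = nvme_setup_discard(nvmeq, ns, iod, &cmnd);
	} else {
		if (req->cmd_type == REQ_TYPE_DRV_PRIV)
			memcpy(&cmnd, req->cmd, sizeof(cmnd));
		else if (req->cmd_flags & REQ_FLUSH)
			nvme_setup_flush(ns, &cmnd);
		else
			nvme_setup_rw(ns, req, &cmnd);

		if (req->nr_phys_segments)
			ret = nvme_map_data(dev, iod, &cmnd);
	}

	if (ret)
		goto out;

	cmnd.common.command_id = req->tag;

	spin_lock_irq(&nvmeq->q_lock);	/* interrupts disabled only here */
	__nvme_submit_cmd(nvmeq, &cmnd);
	nvme_process_cq(nvmeq);
	spin_unlock_irq(&nvmeq->q_lock);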

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 219 +++++++++++++++++++++---------------------------
 1 file changed, 97 insertions(+), 122 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 39ae49b..74a829b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -788,19 +788,53 @@ static bool nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
 	return true;
 }
 
-static void nvme_submit_priv(struct nvme_queue *nvmeq, struct request *req,
-		struct nvme_iod *iod)
+static int nvme_map_data(struct nvme_dev *dev, struct nvme_iod *iod,
+		struct nvme_command *cmnd)
 {
-	struct nvme_command cmnd;
+	struct request *req = iod_get_private(iod);
+	struct request_queue *q = req->q;
+	enum dma_data_direction dma_dir = rq_data_dir(req) ?
+			DMA_TO_DEVICE : DMA_FROM_DEVICE;
+	int ret = BLK_MQ_RQ_QUEUE_ERROR;
+
+	sg_init_table(iod->sg, req->nr_phys_segments);
+	iod->nents = blk_rq_map_sg(q, req, iod->sg);
+	if (!iod->nents)
+		goto out;
+
+	ret = BLK_MQ_RQ_QUEUE_BUSY;
+	if (!dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir))
+		goto out;
+
+	if (!nvme_setup_prps(dev, iod, blk_rq_bytes(req)))
+		goto out_unmap;
+
+	ret = BLK_MQ_RQ_QUEUE_ERROR;
+	if (blk_integrity_rq(req)) {
+		if (blk_rq_count_integrity_sg(q, req->bio) != 1)
+			goto out_unmap;
+
+		sg_init_table(iod->meta_sg, 1);
+		if (blk_rq_map_integrity_sg(q, req->bio, iod->meta_sg) != 1)
+			goto out_unmap;
 
-	memcpy(&cmnd, req->cmd, sizeof(cmnd));
-	cmnd.rw.command_id = req->tag;
-	if (req->nr_phys_segments) {
-		cmnd.rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
-		cmnd.rw.prp2 = cpu_to_le64(iod->first_dma);
+		if (rq_data_dir(req))
+			nvme_dif_remap(req, nvme_dif_prep);
+
+		if (!dma_map_sg(dev->dev, iod->meta_sg, 1, dma_dir))
+			goto out_unmap;
 	}
 
-	__nvme_submit_cmd(nvmeq, &cmnd);
+	cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+	cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
+	if (blk_integrity_rq(req))
+		cmnd->rw.metadata = cpu_to_le64(sg_dma_address(iod->meta_sg));
+	return BLK_MQ_RQ_QUEUE_OK;
+
+out_unmap:
+	dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+out:
+	return ret;
 }
 
 /*
@@ -808,46 +842,42 @@ static void nvme_submit_priv(struct nvme_queue *nvmeq, struct request *req,
  * worth having a special pool for these or additional cases to handle freeing
  * the iod.
  */
-static void nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
-		struct request *req, struct nvme_iod *iod)
+static int nvme_setup_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+		struct nvme_iod *iod, struct nvme_command *cmnd)
 {
-	struct nvme_dsm_range *range =
-				(struct nvme_dsm_range *)iod_list(iod)[0];
-	struct nvme_command cmnd;
+	struct request *req = iod_get_private(iod);
+	struct nvme_dsm_range *range;
+
+	range = dma_pool_alloc(nvmeq->dev->prp_small_pool, GFP_ATOMIC,
+						&iod->first_dma);
+	if (!range)
+		return BLK_MQ_RQ_QUEUE_BUSY;
+	iod_list(iod)[0] = (__le64 *)range;
+	iod->npages = 0;
 
 	range->cattr = cpu_to_le32(0);
 	range->nlb = cpu_to_le32(blk_rq_bytes(req) >> ns->lba_shift);
 	range->slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 
-	memset(&cmnd, 0, sizeof(cmnd));
-	cmnd.dsm.opcode = nvme_cmd_dsm;
-	cmnd.dsm.command_id = req->tag;
-	cmnd.dsm.nsid = cpu_to_le32(ns->ns_id);
-	cmnd.dsm.prp1 = cpu_to_le64(iod->first_dma);
-	cmnd.dsm.nr = 0;
-	cmnd.dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
-
-	__nvme_submit_cmd(nvmeq, &cmnd);
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->dsm.opcode = nvme_cmd_dsm;
+	cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->dsm.prp1 = cpu_to_le64(iod->first_dma);
+	cmnd->dsm.nr = 0;
+	cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
+	return BLK_MQ_RQ_QUEUE_OK;
 }
 
-static void nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
-								int cmdid)
+static void nvme_setup_flush(struct nvme_ns *ns, struct nvme_command *cmnd)
 {
-	struct nvme_command cmnd;
-
-	memset(&cmnd, 0, sizeof(cmnd));
-	cmnd.common.opcode = nvme_cmd_flush;
-	cmnd.common.command_id = cmdid;
-	cmnd.common.nsid = cpu_to_le32(ns->ns_id);
-
-	__nvme_submit_cmd(nvmeq, &cmnd);
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->common.opcode = nvme_cmd_flush;
+	cmnd->common.nsid = cpu_to_le32(ns->ns_id);
 }
 
-static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
-							struct nvme_ns *ns)
+static void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
+		struct nvme_command *cmnd)
 {
-	struct request *req = iod_get_private(iod);
-	struct nvme_command cmnd;
 	u16 control = 0;
 	u32 dsmgmt = 0;
 
@@ -859,14 +889,12 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
 	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
-	memset(&cmnd, 0, sizeof(cmnd));
-	cmnd.rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
-	cmnd.rw.command_id = req->tag;
-	cmnd.rw.nsid = cpu_to_le32(ns->ns_id);
-	cmnd.rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
-	cmnd.rw.prp2 = cpu_to_le64(iod->first_dma);
-	cmnd.rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
-	cmnd.rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
+	cmnd->rw.command_id = req->tag;
+	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
 	if (ns->ms) {
 		switch (ns->pi_type) {
@@ -877,23 +905,16 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
 		case NVME_NS_DPS_PI_TYPE2:
 			control |= NVME_RW_PRINFO_PRCHK_GUARD |
 					NVME_RW_PRINFO_PRCHK_REF;
-			cmnd.rw.reftag = cpu_to_le32(
+			cmnd->rw.reftag = cpu_to_le32(
 					nvme_block_nr(ns, blk_rq_pos(req)));
 			break;
 		}
-		if (blk_integrity_rq(req))
-			cmnd.rw.metadata =
-				cpu_to_le64(sg_dma_address(iod->meta_sg));
-		else
+		if (!blk_integrity_rq(req))
 			control |= NVME_RW_PRINFO_PRACT;
 	}
 
-	cmnd.rw.control = cpu_to_le16(control);
-	cmnd.rw.dsmgmt = cpu_to_le32(dsmgmt);
-
-	__nvme_submit_cmd(nvmeq, &cmnd);
-
-	return 0;
+	cmnd->rw.control = cpu_to_le16(control);
+	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
 }
 
 /*
@@ -908,7 +929,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct request *req = bd->rq;
 	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
 	struct nvme_iod *iod;
-	enum dma_data_direction dma_dir;
+	struct nvme_command cmnd;
+	int ret = BLK_MQ_RQ_QUEUE_OK;
 
 	/*
 	 * If formated with metadata, require the block layer provide a buffer
@@ -928,80 +950,33 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		return BLK_MQ_RQ_QUEUE_BUSY;
 
 	if (req->cmd_flags & REQ_DISCARD) {
-		void *range;
-		/*
-		 * We reuse the small pool to allocate the 16-byte range here
-		 * as it is not worth having a special pool for these or
-		 * additional cases to handle freeing the iod.
-		 */
-		range = dma_pool_alloc(dev->prp_small_pool, GFP_ATOMIC,
-						&iod->first_dma);
-		if (!range)
-			goto retry_cmd;
-		iod_list(iod)[0] = (__le64 *)range;
-		iod->npages = 0;
-	} else if (req->nr_phys_segments) {
-		dma_dir = rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
-
-		sg_init_table(iod->sg, req->nr_phys_segments);
-		iod->nents = blk_rq_map_sg(req->q, req, iod->sg);
-		if (!iod->nents)
-			goto error_cmd;
-
-		if (!dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir))
-			goto retry_cmd;
-
-		if (!nvme_setup_prps(dev, iod, blk_rq_bytes(req))) {
-			dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
-			goto retry_cmd;
-		}
-		if (blk_integrity_rq(req)) {
-			if (blk_rq_count_integrity_sg(req->q, req->bio) != 1) {
-				dma_unmap_sg(dev->dev, iod->sg, iod->nents,
-						dma_dir);
-				goto error_cmd;
-			}
-
-			sg_init_table(iod->meta_sg, 1);
-			if (blk_rq_map_integrity_sg(
-					req->q, req->bio, iod->meta_sg) != 1) {
-				dma_unmap_sg(dev->dev, iod->sg, iod->nents,
-						dma_dir);
-				goto error_cmd;
-			}
-
-			if (rq_data_dir(req))
-				nvme_dif_remap(req, nvme_dif_prep);
+		ret = nvme_setup_discard(nvmeq, ns, iod, &cmnd);
+	} else {
+		if (req->cmd_type == REQ_TYPE_DRV_PRIV)
+			memcpy(&cmnd, req->cmd, sizeof(cmnd));
+		else if (req->cmd_flags & REQ_FLUSH)
+			nvme_setup_flush(ns, &cmnd);
+		else
+			nvme_setup_rw(ns, req, &cmnd);
 
-			if (!dma_map_sg(nvmeq->q_dmadev, iod->meta_sg, 1, dma_dir)) {
-				dma_unmap_sg(dev->dev, iod->sg, iod->nents,
-						dma_dir);
-				goto error_cmd;
-			}
-		}
+		if (req->nr_phys_segments)
+			ret = nvme_map_data(dev, iod, &cmnd);
 	}
 
+	if (ret)
+		goto out;
+
+	cmnd.common.command_id = req->tag;
 	nvme_set_info(cmd, iod, req_completion);
-	spin_lock_irq(&nvmeq->q_lock);
-	if (req->cmd_type == REQ_TYPE_DRV_PRIV)
-		nvme_submit_priv(nvmeq, req, iod);
-	else if (req->cmd_flags & REQ_DISCARD)
-		nvme_submit_discard(nvmeq, ns, req, iod);
-	else if (req->cmd_flags & REQ_FLUSH)
-		nvme_submit_flush(nvmeq, ns, req->tag);
-	else
-		nvme_submit_iod(nvmeq, iod, ns);
 
+	spin_lock_irq(&nvmeq->q_lock);
+	__nvme_submit_cmd(nvmeq, &cmnd);
 	nvme_process_cq(nvmeq);
 	spin_unlock_irq(&nvmeq->q_lock);
 	return BLK_MQ_RQ_QUEUE_OK;
-
- error_cmd:
-	nvme_free_iod(dev, iod);
-	return BLK_MQ_RQ_QUEUE_ERROR;
- retry_cmd:
+out:
 	nvme_free_iod(dev, iod);
-	return BLK_MQ_RQ_QUEUE_BUSY;
+	return ret;
 }
 
 static void __nvme_process_cq(struct nvme_queue *nvmeq, unsigned int *tag)
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 14/47] nvme: move nvme_error_status to common code
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (12 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 13/47] nvme: refactor nvme_queue_rq Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 15/47] nvme: move nvme_setup_flush and nvme_setup_rw " Christoph Hellwig
                   ` (20 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


And mark it inline so that we don't slow down the completion path with
a forced out-of-line call.
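
A typical use in a completion handler looks like this (a sketch modeled
on req_completion() in pci.c):

	u16 status = le16_to_cpup(&cqe->status) >> 1;	/* strip phase bit */
	int error = nvme_error_status(status);		/* 0, -ENOSPC or -EIO */

	blk_mq_complete_request(req, error);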

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/nvme.h | 12 ++++++++++++
 drivers/nvme/host/pci.c  | 12 ------------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 71e81e5..c06bb9c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -86,6 +86,18 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
 	return (sector >> (ns->lba_shift - 9));
 }
 
+static inline int nvme_error_status(u16 status)
+{
+	switch (status & 0x7ff) {
+	case NVME_SC_SUCCESS:
+		return 0;
+	case NVME_SC_CAP_EXCEEDED:
+		return -ENOSPC;
+	default:
+		return -EIO;
+	}
+}
+
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		void *buf, unsigned bufflen);
 int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 74a829b..92fb388 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -545,18 +545,6 @@ static void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod)
 		kfree(iod);
 }
 
-static int nvme_error_status(u16 status)
-{
-	switch (status & 0x7ff) {
-	case NVME_SC_SUCCESS:
-		return 0;
-	case NVME_SC_CAP_EXCEEDED:
-		return -ENOSPC;
-	default:
-		return -EIO;
-	}
-}
-
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static void nvme_dif_prep(u32 p, u32 v, struct t10_pi_tuple *pi)
 {
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 15/47] nvme: move nvme_setup_flush and nvme_setup_rw to common code
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (13 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 14/47] nvme: move nvme_error_status to common code Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 16/47] nvme: split __nvme_submit_sync_cmd Christoph Hellwig
                   ` (19 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


And mark them inline so that we don't slow down the I/O submission path
with forced out-of-line calls.

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/nvme.h | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/pci.c  | 49 ----------------------------------------------
 2 files changed, 51 insertions(+), 49 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c06bb9c..682fd35 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -86,6 +86,57 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
 	return (sector >> (ns->lba_shift - 9));
 }
 
+static inline void nvme_setup_flush(struct nvme_ns *ns,
+		struct nvme_command *cmnd)
+{
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->common.opcode = nvme_cmd_flush;
+	cmnd->common.nsid = cpu_to_le32(ns->ns_id);
+}
+
+static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
+		struct nvme_command *cmnd)
+{
+	u16 control = 0;
+	u32 dsmgmt = 0;
+
+	if (req->cmd_flags & REQ_FUA)
+		control |= NVME_RW_FUA;
+	if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
+		control |= NVME_RW_LR;
+
+	if (req->cmd_flags & REQ_RAHEAD)
+		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
+
+	memset(cmnd, 0, sizeof(*cmnd));
+	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
+	cmnd->rw.command_id = req->tag;
+	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+
+	if (ns->ms) {
+		switch (ns->pi_type) {
+		case NVME_NS_DPS_PI_TYPE3:
+			control |= NVME_RW_PRINFO_PRCHK_GUARD;
+			break;
+		case NVME_NS_DPS_PI_TYPE1:
+		case NVME_NS_DPS_PI_TYPE2:
+			control |= NVME_RW_PRINFO_PRCHK_GUARD |
+					NVME_RW_PRINFO_PRCHK_REF;
+			cmnd->rw.reftag = cpu_to_le32(
+					nvme_block_nr(ns, blk_rq_pos(req)));
+			break;
+		}
+		if (!blk_integrity_rq(req))
+			control |= NVME_RW_PRINFO_PRACT;
+	}
+
+	cmnd->rw.control = cpu_to_le16(control);
+	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
+}
+
+
 static inline int nvme_error_status(u16 status)
 {
 	switch (status & 0x7ff) {
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 92fb388..7b49b13 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -856,55 +856,6 @@ static int nvme_setup_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
-static void nvme_setup_flush(struct nvme_ns *ns, struct nvme_command *cmnd)
-{
-	memset(cmnd, 0, sizeof(*cmnd));
-	cmnd->common.opcode = nvme_cmd_flush;
-	cmnd->common.nsid = cpu_to_le32(ns->ns_id);
-}
-
-static void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
-		struct nvme_command *cmnd)
-{
-	u16 control = 0;
-	u32 dsmgmt = 0;
-
-	if (req->cmd_flags & REQ_FUA)
-		control |= NVME_RW_FUA;
-	if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
-		control |= NVME_RW_LR;
-
-	if (req->cmd_flags & REQ_RAHEAD)
-		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
-
-	memset(cmnd, 0, sizeof(*cmnd));
-	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
-	cmnd->rw.command_id = req->tag;
-	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
-	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
-	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
-
-	if (ns->ms) {
-		switch (ns->pi_type) {
-		case NVME_NS_DPS_PI_TYPE3:
-			control |= NVME_RW_PRINFO_PRCHK_GUARD;
-			break;
-		case NVME_NS_DPS_PI_TYPE1:
-		case NVME_NS_DPS_PI_TYPE2:
-			control |= NVME_RW_PRINFO_PRCHK_GUARD |
-					NVME_RW_PRINFO_PRCHK_REF;
-			cmnd->rw.reftag = cpu_to_le32(
-					nvme_block_nr(ns, blk_rq_pos(req)));
-			break;
-		}
-		if (!blk_integrity_rq(req))
-			control |= NVME_RW_PRINFO_PRACT;
-	}
-
-	cmnd->rw.control = cpu_to_le16(control);
-	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
-}
-
 /*
  * NOTE: ns is NULL when called on the admin queue.
  */
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread
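
The helpers above are deliberately self-contained: given a namespace and
(for reads and writes) a request, they fill out a struct nvme_command
without touching any PCIe state, which is what makes them reusable from
another transport.  As a minimal sketch of how a hypothetical
transport's queue_rq path might use them (the foo_* names are
assumptions for illustration, not part of this series):

	static int foo_queue_rq(struct nvme_ns *ns, struct request *req,
			struct foo_queue *fooq)
	{
		struct nvme_command cmnd;

		if (req->cmd_flags & REQ_FLUSH)
			nvme_setup_flush(ns, &cmnd);	/* opcode + nsid only */
		else
			nvme_setup_rw(ns, req, &cmnd);	/* SLBA, 0's based length,
							   FUA and PI control bits */

		return foo_submit(fooq, &cmnd);		/* transport-specific submission */
	}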

* [PATCH 16/47] nvme: split __nvme_submit_sync_cmd
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (14 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 15/47] nvme: move nvme_setup_flush and nvme_setup_rw " Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 17/47] nvme: use the block layer for userspace passthrough metadata Christoph Hellwig
                   ` (18 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


Add a separate nvme_submit_user_cmd for commands that DMA directly
to or from userspace.  We'll add metadata support to it soon, and
folding that into the common version would make it too messy.
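
To illustrate the resulting call shapes (a sketch only; declarations
and error handling elided), kernel-internal callers keep using the sync
variant with a kernel buffer, while the ioctl paths switch to the user
variant:

	struct nvme_command c = { };

	c.identify.opcode = nvme_admin_identify;
	c.identify.cns = cpu_to_le32(1);

	/* kernel buffer, default admin timeout: */
	ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id));

	/* userspace buffer, caller-supplied result and timeout: */
	ret = nvme_submit_user_cmd(ns->queue, &c, ubuf, len, &result, timeout);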

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 79 ++++++++++++++++++++++++++++++++++--------------
 drivers/nvme/host/nvme.h |  6 ++--
 drivers/nvme/host/pci.c  |  6 ++--
 drivers/nvme/host/scsi.c |  4 +--
 4 files changed, 65 insertions(+), 30 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ca54a34..84b7255 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -21,22 +21,15 @@
 
 #include "nvme.h"
 
-/*
- * Returns 0 on success.  If the result is negative, it's a Linux error code;
- * if the result is positive, it's an NVM Express status code
- */
-int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void *buffer, void __user *ubuffer, unsigned bufflen,
-		u32 *result, unsigned timeout)
+static struct request *nvme_alloc_request(struct request_queue *q,
+		struct nvme_command *cmd)
 {
 	bool write = cmd->common.opcode & 1;
-	struct bio *bio = NULL;
 	struct request *req;
-	int ret;
 
 	req = blk_mq_alloc_request(q, write, 0);
 	if (IS_ERR(req))
-		return PTR_ERR(req);
+		return req;
 
 	req->cmd_type = REQ_TYPE_DRV_PRIV;
 	req->cmd_flags |= REQ_FAILFAST_DRIVER;
@@ -44,17 +37,65 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	req->__sector = (sector_t) -1;
 	req->bio = req->biotail = NULL;
 
-	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
-
 	req->cmd = (unsigned char *)cmd;
 	req->cmd_len = sizeof(struct nvme_command);
 	req->special = (void *)0;
 
+	return req;
+}
+
+/*
+ * Returns 0 on success.  If the result is negative, it's a Linux error code;
+ * if the result is positive, it's an NVM Express status code
+ */
+int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void *buffer, unsigned bufflen, u32 *result, unsigned timeout)
+{
+	struct request *req;
+	int ret;
+
+	req = nvme_alloc_request(q, cmd);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
+
 	if (buffer && bufflen) {
 		ret = blk_rq_map_kern(q, req, buffer, bufflen, GFP_KERNEL);
 		if (ret)
 			goto out;
-	} else if (ubuffer && bufflen) {
+	}
+
+	blk_execute_rq(req->q, NULL, req, 0);
+	if (result)
+		*result = (u32)(uintptr_t)req->special;
+	ret = req->errors;
+ out:
+	blk_mq_free_request(req);
+	return ret;
+}
+
+int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void *buffer, unsigned bufflen)
+{
+	return __nvme_submit_sync_cmd(q, cmd, buffer, bufflen, NULL, 0);
+}
+
+int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void __user *ubuffer, unsigned bufflen, u32 *result,
+		unsigned timeout)
+{
+	struct bio *bio = NULL;
+	struct request *req;
+	int ret;
+
+	req = nvme_alloc_request(q, cmd);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
+
+	if (ubuffer && bufflen) {
 		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
 				GFP_KERNEL);
 		if (ret)
@@ -73,12 +114,6 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	return ret;
 }
 
-int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void *buffer, unsigned bufflen)
-{
-	return __nvme_submit_sync_cmd(q, cmd, buffer, NULL, bufflen, NULL, 0);
-}
-
 int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 {
 	struct nvme_command c = { };
@@ -131,8 +166,7 @@ int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 	c.features.prp1 = cpu_to_le64(dma_addr);
 	c.features.fid = cpu_to_le32(fid);
 
-	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, NULL, 0,
-			result, 0);
+	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
 }
 
 int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
@@ -146,8 +180,7 @@ int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
 	c.features.fid = cpu_to_le32(fid);
 	c.features.dword11 = cpu_to_le32(dword11);
 
-	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, NULL, 0,
-			result, 0);
+	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
 }
 
 int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 682fd35..aa2257b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -152,8 +152,10 @@ static inline int nvme_error_status(u16 status)
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		void *buf, unsigned bufflen);
 int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void *buffer, void __user *ubuffer, unsigned bufflen,
-		u32 *result, unsigned timeout);
+		void *buffer, unsigned bufflen,  u32 *result, unsigned timeout);
+int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void __user *ubuffer, unsigned bufflen, u32 *result,
+		unsigned timeout);
 int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id);
 int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 		struct nvme_id_ns **id);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7b49b13..bf3299b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1692,7 +1692,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	c.rw.appmask = cpu_to_le16(io.appmask);
 	c.rw.metadata = cpu_to_le64(meta_dma);
 
-	status = __nvme_submit_sync_cmd(ns->queue, &c, NULL,
+	status = nvme_submit_user_cmd(ns->queue, &c,
 			(void __user *)(uintptr_t)io.addr, length, NULL, 0);
  unmap:
 	if (meta) {
@@ -1734,8 +1734,8 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	if (cmd.timeout_ms)
 		timeout = msecs_to_jiffies(cmd.timeout_ms);
 
-	status = __nvme_submit_sync_cmd(ns ? ns->queue : ctrl->admin_q, &c,
-			NULL, (void __user *)(uintptr_t)cmd.addr, cmd.data_len,
+	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
+			(void __user *)(uintptr_t)cmd.addr, cmd.data_len,
 			&cmd.result, timeout);
 	if (status >= 0) {
 		if (put_user(cmd.result, &ucmd->result))
diff --git a/drivers/nvme/host/scsi.c b/drivers/nvme/host/scsi.c
index 00d0bdd..b673fe4 100644
--- a/drivers/nvme/host/scsi.c
+++ b/drivers/nvme/host/scsi.c
@@ -1295,7 +1295,7 @@ static int nvme_trans_send_download_fw_cmd(struct nvme_ns *ns, struct sg_io_hdr
 	c.dlfw.numd = cpu_to_le32((tot_len/BYTES_TO_DWORDS) - 1);
 	c.dlfw.offset = cpu_to_le32(offset/BYTES_TO_DWORDS);
 
-	nvme_sc = __nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL,
+	nvme_sc = nvme_submit_user_cmd(ns->ctrl->admin_q, &c,
 			hdr->dxferp, tot_len, NULL, 0);
 	return nvme_trans_status_code(hdr, nvme_sc);
 }
@@ -1699,7 +1699,7 @@ static int nvme_trans_do_nvme_io(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 			nvme_sc = NVME_SC_LBA_RANGE;
 			break;
 		}
-		nvme_sc = __nvme_submit_sync_cmd(ns->queue, &c, NULL,
+		nvme_sc = nvme_submit_user_cmd(ns->queue, &c,
 				next_mapping_addr, unit_len, NULL, 0);
 		if (nvme_sc)
 			break;
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 17/47] nvme: use the block layer for userspace passthrough metadata
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (15 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 16/47] nvme: split __nvme_submit_sync_cmd Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 18/47] nvme: move block_device_operations and ns/ctrl freeing to common code Christoph Hellwig
                   ` (17 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


From: Keith Busch <keith.busch@intel.com>

Use the integrity API to pass through metadata from userspace.  For
PI-enabled devices this means that we now validate the reftag, which
seems like an unintentional omission in the old code.

Thanks to Keith Busch for testing and fixes.
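
From userspace nothing changes: an application still passes a separate
metadata buffer through NVME_IOCTL_SUBMIT_IO.  A sketch of such a
caller (assumed values; note the kernel still rejects metadata pointers
that are not 4-byte aligned):

	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/nvme_ioctl.h>

	/* read one logical block plus its separate metadata buffer */
	static int read_block_with_meta(int fd, void *data, void *meta,
			uint64_t slba)
	{
		struct nvme_user_io io;

		memset(&io, 0, sizeof(io));
		io.opcode = 0x02;		/* nvme_cmd_read */
		io.nblocks = 0;			/* 0's based block count */
		io.slba = slba;
		io.addr = (uintptr_t)data;
		io.metadata = (uintptr_t)meta;	/* must be 4-byte aligned */

		return ioctl(fd, NVME_IOCTL_SUBMIT_IO, &io);
	}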

Signed-off-by: Christoph Hellwig <hch at lst.de>
[Skip metadata setup on admin commands]
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 83 ++++++++++++++++++++++++++++++++++++++++++------
 drivers/nvme/host/nvme.h |  4 +++
 drivers/nvme/host/pci.c  | 39 +++--------------------
 3 files changed, 83 insertions(+), 43 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84b7255..3ed9c90 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -81,12 +81,17 @@ int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	return __nvme_submit_sync_cmd(q, cmd, buffer, bufflen, NULL, 0);
 }
 
-int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
-		void __user *ubuffer, unsigned bufflen, u32 *result,
-		unsigned timeout)
+int __nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void __user *ubuffer, unsigned bufflen,
+		void __user *meta_buffer, unsigned meta_len, u32 meta_seed,
+		u32 *result, unsigned timeout)
 {
-	struct bio *bio = NULL;
+	bool write = cmd->common.opcode & 1;
+	struct nvme_ns *ns = q->queuedata;
+	struct gendisk *disk = ns ? ns->disk : NULL;
 	struct request *req;
+	struct bio *bio = NULL;
+	void *meta = NULL;
 	int ret;
 
 	req = nvme_alloc_request(q, cmd);
@@ -101,19 +106,79 @@ int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
 		if (ret)
 			goto out;
 		bio = req->bio;
-	}
 
-	blk_execute_rq(req->q, NULL, req, 0);
-	if (bio)
-		blk_rq_unmap_user(bio);
+		if (!disk)
+			goto submit;
+		bio->bi_bdev = bdget_disk(disk, 0);
+		if (!bio->bi_bdev) {
+			ret = -ENODEV;
+			goto out_unmap;
+		}
+
+		if (meta_buffer) {
+			struct bio_integrity_payload *bip;
+
+			meta = kmalloc(meta_len, GFP_KERNEL);
+			if (!meta) {
+				ret = -ENOMEM;
+				goto out_unmap;
+			}
+
+			if (write) {
+				if (copy_from_user(meta, meta_buffer,
+						meta_len)) {
+					ret = -EFAULT;
+					goto out_free_meta;
+				}
+			}
+
+			bip = bio_integrity_alloc(bio, GFP_KERNEL, 1);
+			if (!bip) {
+				ret = -ENOMEM;
+				goto out_free_meta;
+			}
+
+			bip->bip_iter.bi_size = meta_len;
+			bip->bip_iter.bi_sector = meta_seed;
+
+			ret = bio_integrity_add_page(bio, virt_to_page(meta),
+					meta_len, offset_in_page(meta));
+			if (ret != meta_len) {
+				ret = -ENOMEM;
+				goto out_free_meta;
+			}
+		}
+	}
+ submit:
+	blk_execute_rq(req->q, disk, req, 0);
+	ret = req->errors;
 	if (result)
 		*result = (u32)(uintptr_t)req->special;
-	ret = req->errors;
+	if (meta && !ret && !write) {
+		if (copy_to_user(meta_buffer, meta, meta_len))
+			ret = -EFAULT;
+	}
+ out_free_meta:
+	kfree(meta);
+ out_unmap:
+	if (bio) {
+		if (disk && bio->bi_bdev)
+			bdput(bio->bi_bdev);
+		blk_rq_unmap_user(bio);
+	}
  out:
 	blk_mq_free_request(req);
 	return ret;
 }
 
+int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void __user *ubuffer, unsigned bufflen, u32 *result,
+		unsigned timeout)
+{
+	return __nvme_submit_user_cmd(q, cmd, ubuffer, bufflen, NULL, 0, 0,
+			result, timeout);
+}
+
 int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 {
 	struct nvme_command c = { };
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index aa2257b..15bf2b8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -156,6 +156,10 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
 		void __user *ubuffer, unsigned bufflen, u32 *result,
 		unsigned timeout);
+int __nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void __user *ubuffer, unsigned bufflen,
+		void __user *meta_buffer, unsigned meta_len, u32 meta_seed,
+		u32 *result, unsigned timeout);
 int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id);
 int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 		struct nvme_id_ns **id);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index bf3299b..44b3b56 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1630,13 +1630,9 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 
 static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 {
-	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
 	struct nvme_user_io io;
 	struct nvme_command c;
 	unsigned length, meta_len;
-	int status, write;
-	dma_addr_t meta_dma = 0;
-	void *meta = NULL;
 	void __user *metadata;
 
 	if (copy_from_user(&io, uio, sizeof(io)))
@@ -1654,29 +1650,13 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	length = (io.nblocks + 1) << ns->lba_shift;
 	meta_len = (io.nblocks + 1) * ns->ms;
 	metadata = (void __user *)(uintptr_t)io.metadata;
-	write = io.opcode & 1;
 
 	if (ns->ext) {
 		length += meta_len;
 		meta_len = 0;
-	}
-	if (meta_len) {
-		if (((io.metadata & 3) || !io.metadata) && !ns->ext)
+	} else if (meta_len) {
+		if ((io.metadata & 3) || !io.metadata)
 			return -EINVAL;
-
-		meta = dma_alloc_coherent(dev->dev, meta_len,
-						&meta_dma, GFP_KERNEL);
-
-		if (!meta) {
-			status = -ENOMEM;
-			goto unmap;
-		}
-		if (write) {
-			if (copy_from_user(meta, metadata, meta_len)) {
-				status = -EFAULT;
-				goto unmap;
-			}
-		}
 	}
 
 	memset(&c, 0, sizeof(c));
@@ -1690,19 +1670,10 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	c.rw.reftag = cpu_to_le32(io.reftag);
 	c.rw.apptag = cpu_to_le16(io.apptag);
 	c.rw.appmask = cpu_to_le16(io.appmask);
-	c.rw.metadata = cpu_to_le64(meta_dma);
 
-	status = nvme_submit_user_cmd(ns->queue, &c,
-			(void __user *)(uintptr_t)io.addr, length, NULL, 0);
- unmap:
-	if (meta) {
-		if (status == NVME_SC_SUCCESS && !write) {
-			if (copy_to_user(metadata, meta, meta_len))
-				status = -EFAULT;
-		}
-		dma_free_coherent(dev->dev, meta_len, meta, meta_dma);
-	}
-	return status;
+	return __nvme_submit_user_cmd(ns->queue, &c,
+			(void __user *)(uintptr_t)io.addr, length,
+			metadata, meta_len, io.slba, NULL, 0);
 }
 
 static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 18/47] nvme: move block_device_operations and ns/ctrl freeing to common code
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (16 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 17/47] nvme: use the block layer for userspace passthrough metadata Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 19/47] nvme: add explicit quirk handling Christoph Hellwig
                   ` (16 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


This moves the block_device_operations over to common code mostly
as-is.  The only change is that the ns and ctrl refcounting gained
small wrappers around the kref_put operations.

A new free_ctrl operation is added to allow the PCI driver to free
its resources on the final drop.
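
The resulting pattern is the usual kref idiom, sketched here for
reference (illustrative fragment, not new code in this patch): lookups
take a reference with kref_get_unless_zero() and every user drops it
through the wrapper, so the free routine runs exactly once on the final
put.

	struct nvme_ns *ns = nvme_get_ns_from_disk(disk);	/* takes a ref */

	if (!ns)
		return -ENXIO;
	/* ... use ns ... */
	nvme_put_ns(ns);	/* nvme_free_ns() runs on the final put */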

Signed-off-by: Christoph Hellwig <hch at lst.de>
[Moved the integrity and pr changes due to merge conflict]
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 413 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  14 ++
 drivers/nvme/host/pci.c  | 412 ++--------------------------------------------
 3 files changed, 439 insertions(+), 400 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3ed9c90..b858df6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -15,12 +15,55 @@
 #include <linux/blkdev.h>
 #include <linux/blk-mq.h>
 #include <linux/errno.h>
+#include <linux/hdreg.h>
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/types.h>
+#include <linux/pr.h>
+#include <linux/ptrace.h>
+#include <linux/nvme_ioctl.h>
+#include <linux/t10-pi.h>
+#include <scsi/sg.h>
+#include <asm/unaligned.h>
 
 #include "nvme.h"
 
+DEFINE_SPINLOCK(dev_list_lock);
+
+static void nvme_free_ns(struct kref *kref)
+{
+	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
+
+	if (ns->type == NVME_NS_LIGHTNVM)
+		nvme_nvm_unregister(ns->queue, ns->disk->disk_name);
+
+	spin_lock(&dev_list_lock);
+	ns->disk->private_data = NULL;
+	spin_unlock(&dev_list_lock);
+
+	nvme_put_ctrl(ns->ctrl);
+	put_disk(ns->disk);
+	kfree(ns);
+}
+
+void nvme_put_ns(struct nvme_ns *ns)
+{
+	kref_put(&ns->kref, nvme_free_ns);
+}
+
+static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
+{
+	struct nvme_ns *ns;
+
+	spin_lock(&dev_list_lock);
+	ns = disk->private_data;
+	if (ns && !kref_get_unless_zero(&ns->kref))
+		ns = NULL;
+	spin_unlock(&dev_list_lock);
+
+	return ns;
+}
+
 static struct request *nvme_alloc_request(struct request_queue *q,
 		struct nvme_command *cmd)
 {
@@ -269,3 +312,373 @@ int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log)
 		kfree(*log);
 	return error;
 }
+
+static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
+{
+	struct nvme_user_io io;
+	struct nvme_command c;
+	unsigned length, meta_len;
+	void __user *metadata;
+
+	if (copy_from_user(&io, uio, sizeof(io)))
+		return -EFAULT;
+
+	switch (io.opcode) {
+	case nvme_cmd_write:
+	case nvme_cmd_read:
+	case nvme_cmd_compare:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	length = (io.nblocks + 1) << ns->lba_shift;
+	meta_len = (io.nblocks + 1) * ns->ms;
+	metadata = (void __user *)(uintptr_t)io.metadata;
+
+	if (ns->ext) {
+		length += meta_len;
+		meta_len = 0;
+	} else if (meta_len) {
+		if ((io.metadata & 3) || !io.metadata)
+			return -EINVAL;
+	}
+
+	memset(&c, 0, sizeof(c));
+	c.rw.opcode = io.opcode;
+	c.rw.flags = io.flags;
+	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.slba = cpu_to_le64(io.slba);
+	c.rw.length = cpu_to_le16(io.nblocks);
+	c.rw.control = cpu_to_le16(io.control);
+	c.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
+	c.rw.reftag = cpu_to_le32(io.reftag);
+	c.rw.apptag = cpu_to_le16(io.apptag);
+	c.rw.appmask = cpu_to_le16(io.appmask);
+
+	return __nvme_submit_user_cmd(ns->queue, &c,
+			(void __user *)(uintptr_t)io.addr, length,
+			metadata, meta_len, io.slba, NULL, 0);
+}
+
+int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+			struct nvme_passthru_cmd __user *ucmd)
+{
+	struct nvme_passthru_cmd cmd;
+	struct nvme_command c;
+	unsigned timeout = 0;
+	int status;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+	if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+		return -EFAULT;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = cmd.opcode;
+	c.common.flags = cmd.flags;
+	c.common.nsid = cpu_to_le32(cmd.nsid);
+	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
+	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+	c.common.cdw10[0] = cpu_to_le32(cmd.cdw10);
+	c.common.cdw10[1] = cpu_to_le32(cmd.cdw11);
+	c.common.cdw10[2] = cpu_to_le32(cmd.cdw12);
+	c.common.cdw10[3] = cpu_to_le32(cmd.cdw13);
+	c.common.cdw10[4] = cpu_to_le32(cmd.cdw14);
+	c.common.cdw10[5] = cpu_to_le32(cmd.cdw15);
+
+	if (cmd.timeout_ms)
+		timeout = msecs_to_jiffies(cmd.timeout_ms);
+
+	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
+			(void __user *)cmd.addr, cmd.data_len,
+			&cmd.result, timeout);
+	if (status >= 0) {
+		if (put_user(cmd.result, &ucmd->result))
+			return -EFAULT;
+	}
+
+	return status;
+}
+
+static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+
+	switch (cmd) {
+	case NVME_IOCTL_ID:
+		force_successful_syscall_return();
+		return ns->ns_id;
+	case NVME_IOCTL_ADMIN_CMD:
+		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
+	case NVME_IOCTL_IO_CMD:
+		return nvme_user_cmd(ns->ctrl, ns, (void __user *)arg);
+	case NVME_IOCTL_SUBMIT_IO:
+		return nvme_submit_io(ns, (void __user *)arg);
+	case SG_GET_VERSION_NUM:
+		return nvme_sg_get_version_num((void __user *)arg);
+	case SG_IO:
+		return nvme_sg_io(ns, (void __user *)arg);
+	default:
+		return -ENOTTY;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
+			unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case SG_IO:
+		return -ENOIOCTLCMD;
+	}
+	return nvme_ioctl(bdev, mode, cmd, arg);
+}
+#else
+#define nvme_compat_ioctl	NULL
+#endif
+
+static int nvme_open(struct block_device *bdev, fmode_t mode)
+{
+	return nvme_get_ns_from_disk(bdev->bd_disk) ? 0 : -ENXIO;
+}
+
+static void nvme_release(struct gendisk *disk, fmode_t mode)
+{
+	nvme_put_ns(disk->private_data);
+}
+
+static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bdev->bd_disk) >> 11;
+	return 0;
+}
+
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static void nvme_init_integrity(struct nvme_ns *ns)
+{
+	struct blk_integrity integrity;
+
+	switch (ns->pi_type) {
+	case NVME_NS_DPS_PI_TYPE3:
+		integrity.profile = &t10_pi_type3_crc;
+		break;
+	case NVME_NS_DPS_PI_TYPE1:
+	case NVME_NS_DPS_PI_TYPE2:
+		integrity.profile = &t10_pi_type1_crc;
+		break;
+	default:
+		integrity.profile = NULL;
+		break;
+	}
+	integrity.tuple_size = ns->ms;
+	blk_integrity_register(ns->disk, &integrity);
+	blk_queue_max_integrity_segments(ns->queue, 1);
+}
+#else
+static void nvme_init_integrity(struct nvme_ns *ns)
+{
+}
+#endif /* CONFIG_BLK_DEV_INTEGRITY */
+
+static void nvme_config_discard(struct nvme_ns *ns)
+{
+	u32 logical_block_size = queue_logical_block_size(ns->queue);
+	ns->queue->limits.discard_zeroes_data = 0;
+	ns->queue->limits.discard_alignment = logical_block_size;
+	ns->queue->limits.discard_granularity = logical_block_size;
+	blk_queue_max_discard_sectors(ns->queue, 0xffffffff);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+}
+
+int nvme_revalidate_disk(struct gendisk *disk)
+{
+	struct nvme_ns *ns = disk->private_data;
+	struct nvme_id_ns *id;
+	u8 lbaf, pi_type;
+	u16 old_ms;
+	unsigned short bs;
+
+	if (nvme_identify_ns(ns->ctrl, ns->ns_id, &id)) {
+		dev_warn(ns->ctrl->dev, "%s: Identify failure nvme%dn%d\n",
+				__func__, ns->ctrl->instance, ns->ns_id);
+		return -ENODEV;
+	}
+	if (id->ncap == 0) {
+		kfree(id);
+		return -ENODEV;
+	}
+
+	if (nvme_nvm_ns_supported(ns, id) && ns->type != NVME_NS_LIGHTNVM) {
+		if (nvme_nvm_register(ns->queue, disk->disk_name)) {
+			dev_warn(ns->ctrl->dev,
+				"%s: LightNVM init failure\n", __func__);
+			kfree(id);
+			return -ENODEV;
+		}
+		ns->type = NVME_NS_LIGHTNVM;
+	}
+
+	old_ms = ns->ms;
+	lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
+	ns->lba_shift = id->lbaf[lbaf].ds;
+	ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
+	ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);
+
+	/*
+	 * If identify namespace failed, use default 512 byte block size so
+	 * block layer can use before failing read/write for 0 capacity.
+	 */
+	if (ns->lba_shift == 0)
+		ns->lba_shift = 9;
+	bs = 1 << ns->lba_shift;
+
+	/* XXX: PI implementation requires metadata equal t10 pi tuple size */
+	pi_type = ns->ms == sizeof(struct t10_pi_tuple) ?
+					id->dps & NVME_NS_DPS_PI_MASK : 0;
+
+	blk_mq_freeze_queue(disk->queue);
+	if (blk_get_integrity(disk) && (ns->pi_type != pi_type ||
+				ns->ms != old_ms ||
+				bs != queue_logical_block_size(disk->queue) ||
+				(ns->ms && ns->ext)))
+		blk_integrity_unregister(disk);
+
+	ns->pi_type = pi_type;
+	blk_queue_logical_block_size(ns->queue, bs);
+
+	if (ns->ms && !ns->ext)
+		nvme_init_integrity(ns);
+
+	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
+		set_capacity(disk, 0);
+	else
+		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+
+	if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM)
+		nvme_config_discard(ns);
+	blk_mq_unfreeze_queue(disk->queue);
+
+	kfree(id);
+	return 0;
+}
+
+static char nvme_pr_type(enum pr_type type)
+{
+	switch (type) {
+	case PR_WRITE_EXCLUSIVE:
+		return 1;
+	case PR_EXCLUSIVE_ACCESS:
+		return 2;
+	case PR_WRITE_EXCLUSIVE_REG_ONLY:
+		return 3;
+	case PR_EXCLUSIVE_ACCESS_REG_ONLY:
+		return 4;
+	case PR_WRITE_EXCLUSIVE_ALL_REGS:
+		return 5;
+	case PR_EXCLUSIVE_ACCESS_ALL_REGS:
+		return 6;
+	default:
+		return 0;
+	}
+};
+
+static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
+				u64 key, u64 sa_key, u8 op)
+{
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+	struct nvme_command c;
+	u8 data[16] = { 0, };
+
+	put_unaligned_le64(key, &data[0]);
+	put_unaligned_le64(sa_key, &data[8]);
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = op;
+	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.cdw10[0] = cpu_to_le32(cdw10);
+
+	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
+}
+
+static int nvme_pr_register(struct block_device *bdev, u64 old,
+		u64 new, unsigned flags)
+{
+	u32 cdw10;
+
+	if (flags & ~PR_FL_IGNORE_KEY)
+		return -EOPNOTSUPP;
+
+	cdw10 = old ? 2 : 0;
+	cdw10 |= (flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0;
+	cdw10 |= (1 << 30) | (1 << 31); /* PTPL=1 */
+	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_register);
+}
+
+static int nvme_pr_reserve(struct block_device *bdev, u64 key,
+		enum pr_type type, unsigned flags)
+{
+	u32 cdw10;
+
+	if (flags & ~PR_FL_IGNORE_KEY)
+		return -EOPNOTSUPP;
+
+	cdw10 = nvme_pr_type(type) << 8;
+	cdw10 |= ((flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0);
+	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_acquire);
+}
+
+static int nvme_pr_preempt(struct block_device *bdev, u64 old, u64 new,
+		enum pr_type type, bool abort)
+{
+	u32 cdw10 = nvme_pr_type(type) << 8 | abort ? 2 : 1;
+	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_acquire);
+}
+
+static int nvme_pr_clear(struct block_device *bdev, u64 key)
+{
+	u32 cdw10 = 1 | (key ? 1 << 3 : 0);
+	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_register);
+}
+
+static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type)
+{
+	u32 cdw10 = nvme_pr_type(type) << 8 | key ? 1 << 3 : 0;
+	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
+}
+
+static const struct pr_ops nvme_pr_ops = {
+	.pr_register	= nvme_pr_register,
+	.pr_reserve	= nvme_pr_reserve,
+	.pr_release	= nvme_pr_release,
+	.pr_preempt	= nvme_pr_preempt,
+	.pr_clear	= nvme_pr_clear,
+};
+
+const struct block_device_operations nvme_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= nvme_ioctl,
+	.compat_ioctl	= nvme_compat_ioctl,
+	.open		= nvme_open,
+	.release	= nvme_release,
+	.getgeo		= nvme_getgeo,
+	.revalidate_disk= nvme_revalidate_disk,
+	.pr_ops		= &nvme_pr_ops,
+};
+
+static void nvme_free_ctrl(struct kref *kref)
+{
+	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
+
+	ctrl->ops->free_ctrl(ctrl);
+}
+
+void nvme_put_ctrl(struct nvme_ctrl *ctrl)
+{
+	kref_put(&ctrl->kref, nvme_free_ctrl);
+}
+
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 15bf2b8..0756633 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -19,6 +19,8 @@
 #include <linux/kref.h>
 #include <linux/blk-mq.h>
 
+struct nvme_passthru_cmd;
+
 extern unsigned char nvme_io_timeout;
 #define NVME_IO_TIMEOUT	(nvme_io_timeout * HZ)
 
@@ -34,6 +36,7 @@ struct nvme_ctrl {
 	const struct nvme_ctrl_ops *ops;
 	struct request_queue *admin_q;
 	struct device *dev;
+	struct kref kref;
 	int instance;
 
 	char name[12];
@@ -70,6 +73,7 @@ struct nvme_ns {
 
 struct nvme_ctrl_ops {
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
+	void (*free_ctrl)(struct nvme_ctrl *ctrl);
 };
 
 static inline bool nvme_ctrl_ready(struct nvme_ctrl *ctrl)
@@ -149,6 +153,9 @@ static inline int nvme_error_status(u16 status)
 	}
 }
 
+void nvme_put_ctrl(struct nvme_ctrl *ctrl);
+void nvme_put_ns(struct nvme_ns *ns);
+
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		void *buf, unsigned bufflen);
 int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
@@ -169,6 +176,13 @@ int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
 			dma_addr_t dma_addr, u32 *result);
 
+extern const struct block_device_operations nvme_fops;
+extern spinlock_t dev_list_lock;
+
+int nvme_revalidate_disk(struct gendisk *disk);
+int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+			struct nvme_passthru_cmd __user *ucmd);
+
 struct sg_io_hdr;
 
 int nvme_sg_io(struct nvme_ns *ns, struct sg_io_hdr __user *u_hdr);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 44b3b56..424ccea 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -79,7 +79,6 @@ static bool use_cmb_sqes = true;
 module_param(use_cmb_sqes, bool, 0644);
 MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for I/O SQes");
 
-static DEFINE_SPINLOCK(dev_list_lock);
 static LIST_HEAD(dev_list);
 static struct task_struct *nvme_thread;
 static struct workqueue_struct *nvme_workq;
@@ -125,7 +124,6 @@ struct nvme_dev {
 	struct msix_entry *entry;
 	void __iomem *bar;
 	struct list_head namespaces;
-	struct kref kref;
 	struct device *device;
 	struct work_struct reset_work;
 	struct work_struct probe_work;
@@ -599,27 +597,6 @@ static void nvme_dif_remap(struct request *req,
 	}
 	kunmap_atomic(pmap);
 }
-
-static void nvme_init_integrity(struct nvme_ns *ns)
-{
-	struct blk_integrity integrity;
-
-	switch (ns->pi_type) {
-	case NVME_NS_DPS_PI_TYPE3:
-		integrity.profile = &t10_pi_type3_crc;
-		break;
-	case NVME_NS_DPS_PI_TYPE1:
-	case NVME_NS_DPS_PI_TYPE2:
-		integrity.profile = &t10_pi_type1_crc;
-		break;
-	default:
-		integrity.profile = NULL;
-		break;
-	}
-	integrity.tuple_size = ns->ms;
-	blk_integrity_register(ns->disk, &integrity);
-	blk_queue_max_integrity_segments(ns->queue, 1);
-}
 #else /* CONFIG_BLK_DEV_INTEGRITY */
 static void nvme_dif_remap(struct request *req,
 			void (*dif_swap)(u32 p, u32 v, struct t10_pi_tuple *pi))
@@ -631,9 +608,6 @@ static void nvme_dif_prep(u32 p, u32 v, struct t10_pi_tuple *pi)
 static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
 {
 }
-static void nvme_init_integrity(struct nvme_ns *ns)
-{
-}
 #endif
 
 static void req_completion(struct nvme_queue *nvmeq, void *ctx,
@@ -1628,94 +1602,6 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	return result;
 }
 
-static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
-{
-	struct nvme_user_io io;
-	struct nvme_command c;
-	unsigned length, meta_len;
-	void __user *metadata;
-
-	if (copy_from_user(&io, uio, sizeof(io)))
-		return -EFAULT;
-
-	switch (io.opcode) {
-	case nvme_cmd_write:
-	case nvme_cmd_read:
-	case nvme_cmd_compare:
-		break;
-	default:
-		return -EINVAL;
-	}
-
-	length = (io.nblocks + 1) << ns->lba_shift;
-	meta_len = (io.nblocks + 1) * ns->ms;
-	metadata = (void __user *)(uintptr_t)io.metadata;
-
-	if (ns->ext) {
-		length += meta_len;
-		meta_len = 0;
-	} else if (meta_len) {
-		if ((io.metadata & 3) || !io.metadata)
-			return -EINVAL;
-	}
-
-	memset(&c, 0, sizeof(c));
-	c.rw.opcode = io.opcode;
-	c.rw.flags = io.flags;
-	c.rw.nsid = cpu_to_le32(ns->ns_id);
-	c.rw.slba = cpu_to_le64(io.slba);
-	c.rw.length = cpu_to_le16(io.nblocks);
-	c.rw.control = cpu_to_le16(io.control);
-	c.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
-	c.rw.reftag = cpu_to_le32(io.reftag);
-	c.rw.apptag = cpu_to_le16(io.apptag);
-	c.rw.appmask = cpu_to_le16(io.appmask);
-
-	return __nvme_submit_user_cmd(ns->queue, &c,
-			(void __user *)(uintptr_t)io.addr, length,
-			metadata, meta_len, io.slba, NULL, 0);
-}
-
-static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
-			struct nvme_passthru_cmd __user *ucmd)
-{
-	struct nvme_passthru_cmd cmd;
-	struct nvme_command c;
-	unsigned timeout = 0;
-	int status;
-
-	if (!capable(CAP_SYS_ADMIN))
-		return -EACCES;
-	if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
-		return -EFAULT;
-
-	memset(&c, 0, sizeof(c));
-	c.common.opcode = cmd.opcode;
-	c.common.flags = cmd.flags;
-	c.common.nsid = cpu_to_le32(cmd.nsid);
-	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
-	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
-	c.common.cdw10[0] = cpu_to_le32(cmd.cdw10);
-	c.common.cdw10[1] = cpu_to_le32(cmd.cdw11);
-	c.common.cdw10[2] = cpu_to_le32(cmd.cdw12);
-	c.common.cdw10[3] = cpu_to_le32(cmd.cdw13);
-	c.common.cdw10[4] = cpu_to_le32(cmd.cdw14);
-	c.common.cdw10[5] = cpu_to_le32(cmd.cdw15);
-
-	if (cmd.timeout_ms)
-		timeout = msecs_to_jiffies(cmd.timeout_ms);
-
-	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
-			(void __user *)(uintptr_t)cmd.addr, cmd.data_len,
-			&cmd.result, timeout);
-	if (status >= 0) {
-		if (put_user(cmd.result, &ucmd->result))
-			return -EFAULT;
-	}
-
-	return status;
-}
-
 static int nvme_subsys_reset(struct nvme_dev *dev)
 {
 	if (!dev->subsystem)
@@ -1725,281 +1611,6 @@ static int nvme_subsys_reset(struct nvme_dev *dev)
 	return 0;
 }
 
-static int nvme_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
-							unsigned long arg)
-{
-	struct nvme_ns *ns = bdev->bd_disk->private_data;
-
-	switch (cmd) {
-	case NVME_IOCTL_ID:
-		force_successful_syscall_return();
-		return ns->ns_id;
-	case NVME_IOCTL_ADMIN_CMD:
-		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
-	case NVME_IOCTL_IO_CMD:
-		return nvme_user_cmd(ns->ctrl, ns, (void __user *)arg);
-	case NVME_IOCTL_SUBMIT_IO:
-		return nvme_submit_io(ns, (void __user *)arg);
-	case SG_GET_VERSION_NUM:
-		return nvme_sg_get_version_num((void __user *)arg);
-	case SG_IO:
-		return nvme_sg_io(ns, (void __user *)arg);
-	default:
-		return -ENOTTY;
-	}
-}
-
-#ifdef CONFIG_COMPAT
-static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
-					unsigned int cmd, unsigned long arg)
-{
-	switch (cmd) {
-	case SG_IO:
-		return -ENOIOCTLCMD;
-	}
-	return nvme_ioctl(bdev, mode, cmd, arg);
-}
-#else
-#define nvme_compat_ioctl	NULL
-#endif
-
-static void nvme_free_dev(struct kref *kref);
-static void nvme_free_ns(struct kref *kref)
-{
-	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
-	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
-
-	if (ns->type == NVME_NS_LIGHTNVM)
-		nvme_nvm_unregister(ns->queue, ns->disk->disk_name);
-
-	spin_lock(&dev_list_lock);
-	ns->disk->private_data = NULL;
-	spin_unlock(&dev_list_lock);
-
-	kref_put(&dev->kref, nvme_free_dev);
-	put_disk(ns->disk);
-	kfree(ns);
-}
-
-static int nvme_open(struct block_device *bdev, fmode_t mode)
-{
-	int ret = 0;
-	struct nvme_ns *ns;
-
-	spin_lock(&dev_list_lock);
-	ns = bdev->bd_disk->private_data;
-	if (!ns)
-		ret = -ENXIO;
-	else if (!kref_get_unless_zero(&ns->kref))
-		ret = -ENXIO;
-	spin_unlock(&dev_list_lock);
-
-	return ret;
-}
-
-static void nvme_release(struct gendisk *disk, fmode_t mode)
-{
-	struct nvme_ns *ns = disk->private_data;
-	kref_put(&ns->kref, nvme_free_ns);
-}
-
-static int nvme_getgeo(struct block_device *bd, struct hd_geometry *geo)
-{
-	/* some standard values */
-	geo->heads = 1 << 6;
-	geo->sectors = 1 << 5;
-	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
-	return 0;
-}
-
-static void nvme_config_discard(struct nvme_ns *ns)
-{
-	u32 logical_block_size = queue_logical_block_size(ns->queue);
-	ns->queue->limits.discard_zeroes_data = 0;
-	ns->queue->limits.discard_alignment = logical_block_size;
-	ns->queue->limits.discard_granularity = logical_block_size;
-	blk_queue_max_discard_sectors(ns->queue, 0xffffffff);
-	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
-}
-
-static int nvme_revalidate_disk(struct gendisk *disk)
-{
-	struct nvme_ns *ns = disk->private_data;
-	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
-	struct nvme_id_ns *id;
-	u8 lbaf, pi_type;
-	u16 old_ms;
-	unsigned short bs;
-
-	if (nvme_identify_ns(&dev->ctrl, ns->ns_id, &id)) {
-		dev_warn(dev->dev, "%s: Identify failure nvme%dn%d\n", __func__,
-						dev->ctrl.instance, ns->ns_id);
-		return -ENODEV;
-	}
-	if (id->ncap == 0) {
-		kfree(id);
-		return -ENODEV;
-	}
-
-	if (nvme_nvm_ns_supported(ns, id) && ns->type != NVME_NS_LIGHTNVM) {
-		if (nvme_nvm_register(ns->queue, disk->disk_name)) {
-			dev_warn(dev->dev,
-				"%s: LightNVM init failure\n", __func__);
-			kfree(id);
-			return -ENODEV;
-		}
-		ns->type = NVME_NS_LIGHTNVM;
-	}
-
-	old_ms = ns->ms;
-	lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
-	ns->lba_shift = id->lbaf[lbaf].ds;
-	ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
-	ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);
-
-	/*
-	 * If identify namespace failed, use default 512 byte block size so
-	 * block layer can use before failing read/write for 0 capacity.
-	 */
-	if (ns->lba_shift == 0)
-		ns->lba_shift = 9;
-	bs = 1 << ns->lba_shift;
-
-	/* XXX: PI implementation requires metadata equal t10 pi tuple size */
-	pi_type = ns->ms == sizeof(struct t10_pi_tuple) ?
-					id->dps & NVME_NS_DPS_PI_MASK : 0;
-
-	blk_mq_freeze_queue(disk->queue);
-	if (blk_get_integrity(disk) && (ns->pi_type != pi_type ||
-				ns->ms != old_ms ||
-				bs != queue_logical_block_size(disk->queue) ||
-				(ns->ms && ns->ext)))
-		blk_integrity_unregister(disk);
-
-	ns->pi_type = pi_type;
-	blk_queue_logical_block_size(ns->queue, bs);
-
-	if (ns->ms && !ns->ext)
-		nvme_init_integrity(ns);
-
-	if ((ns->ms && !(ns->ms == 8 && ns->pi_type) &&
-						!blk_get_integrity(disk)) ||
-						ns->type == NVME_NS_LIGHTNVM)
-		set_capacity(disk, 0);
-	else
-		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
-
-	if (dev->ctrl.oncs & NVME_CTRL_ONCS_DSM)
-		nvme_config_discard(ns);
-	blk_mq_unfreeze_queue(disk->queue);
-
-	kfree(id);
-	return 0;
-}
-
-static char nvme_pr_type(enum pr_type type)
-{
-	switch (type) {
-	case PR_WRITE_EXCLUSIVE:
-		return 1;
-	case PR_EXCLUSIVE_ACCESS:
-		return 2;
-	case PR_WRITE_EXCLUSIVE_REG_ONLY:
-		return 3;
-	case PR_EXCLUSIVE_ACCESS_REG_ONLY:
-		return 4;
-	case PR_WRITE_EXCLUSIVE_ALL_REGS:
-		return 5;
-	case PR_EXCLUSIVE_ACCESS_ALL_REGS:
-		return 6;
-	default:
-		return 0;
-	}
-};
-
-static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
-				u64 key, u64 sa_key, u8 op)
-{
-	struct nvme_ns *ns = bdev->bd_disk->private_data;
-	struct nvme_command c;
-	u8 data[16] = { 0, };
-
-	put_unaligned_le64(key, &data[0]);
-	put_unaligned_le64(sa_key, &data[8]);
-
-	memset(&c, 0, sizeof(c));
-	c.common.opcode = op;
-	c.common.nsid = cpu_to_le32(ns->ns_id);
-	c.common.cdw10[0] = cpu_to_le32(cdw10);
-
-	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
-}
-
-static int nvme_pr_register(struct block_device *bdev, u64 old,
-		u64 new, unsigned flags)
-{
-	u32 cdw10;
-
-	if (flags & ~PR_FL_IGNORE_KEY)
-		return -EOPNOTSUPP;
-
-	cdw10 = old ? 2 : 0;
-	cdw10 |= (flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0;
-	cdw10 |= (1 << 30) | (1 << 31); /* PTPL=1 */
-	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_register);
-}
-
-static int nvme_pr_reserve(struct block_device *bdev, u64 key,
-		enum pr_type type, unsigned flags)
-{
-	u32 cdw10;
-
-	if (flags & ~PR_FL_IGNORE_KEY)
-		return -EOPNOTSUPP;
-
-	cdw10 = nvme_pr_type(type) << 8;
-	cdw10 |= ((flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0);
-	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_acquire);
-}
-
-static int nvme_pr_preempt(struct block_device *bdev, u64 old, u64 new,
-		enum pr_type type, bool abort)
-{
-	u32 cdw10 = nvme_pr_type(type) << 8 | abort ? 2 : 1;
-	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_acquire);
-}
-
-static int nvme_pr_clear(struct block_device *bdev, u64 key)
-{
-	u32 cdw10 = 1 | (key ? 1 << 3 : 0);
-	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_register);
-}
-
-static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type)
-{
-	u32 cdw10 = nvme_pr_type(type) << 8 | key ? 1 << 3 : 0;
-	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
-}
-
-static const struct pr_ops nvme_pr_ops = {
-	.pr_register	= nvme_pr_register,
-	.pr_reserve	= nvme_pr_reserve,
-	.pr_release	= nvme_pr_release,
-	.pr_preempt	= nvme_pr_preempt,
-	.pr_clear	= nvme_pr_clear,
-};
-
-static const struct block_device_operations nvme_fops = {
-	.owner		= THIS_MODULE,
-	.ioctl		= nvme_ioctl,
-	.compat_ioctl	= nvme_compat_ioctl,
-	.open		= nvme_open,
-	.release	= nvme_release,
-	.getgeo		= nvme_getgeo,
-	.revalidate_disk= nvme_revalidate_disk,
-	.pr_ops		= &nvme_pr_ops,
-};
-
 static int nvme_kthread(void *data)
 {
 	struct nvme_dev *dev, *next;
@@ -2100,7 +1711,7 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
 	if (nvme_revalidate_disk(ns->disk))
 		goto out_free_disk;
 
-	kref_get(&dev->kref);
+	kref_get(&dev->ctrl.kref);
 	if (ns->type != NVME_NS_LIGHTNVM) {
 		add_disk(ns->disk);
 		if (ns->ms) {
@@ -2349,7 +1960,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 		blk_cleanup_queue(ns->queue);
 	}
 	list_del_init(&ns->list);
-	kref_put(&ns->kref, nvme_free_ns);
+	nvme_put_ns(ns);
 }
 
 static void nvme_scan_namespaces(struct nvme_dev *dev, unsigned nn)
@@ -2819,9 +2430,9 @@ static void nvme_release_instance(struct nvme_dev *dev)
 	spin_unlock(&dev_list_lock);
 }
 
-static void nvme_free_dev(struct kref *kref)
+static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 {
-	struct nvme_dev *dev = container_of(kref, struct nvme_dev, kref);
+	struct nvme_dev *dev = to_nvme_dev(ctrl);
 
 	put_device(dev->dev);
 	put_device(dev->device);
@@ -2848,7 +2459,7 @@ static int nvme_dev_open(struct inode *inode, struct file *f)
 				ret = -EWOULDBLOCK;
 				break;
 			}
-			if (!kref_get_unless_zero(&dev->kref))
+			if (!kref_get_unless_zero(&dev->ctrl.kref))
 				break;
 			f->private_data = dev;
 			ret = 0;
@@ -2863,7 +2474,7 @@ static int nvme_dev_open(struct inode *inode, struct file *f)
 static int nvme_dev_release(struct inode *inode, struct file *f)
 {
 	struct nvme_dev *dev = f->private_data;
-	kref_put(&dev->kref, nvme_free_dev);
+	nvme_put_ctrl(&dev->ctrl);
 	return 0;
 }
 
@@ -2978,19 +2589,19 @@ static int nvme_remove_dead_ctrl(void *arg)
 
 	if (pci_get_drvdata(pdev))
 		pci_stop_and_remove_bus_device_locked(pdev);
-	kref_put(&dev->kref, nvme_free_dev);
+	nvme_put_ctrl(&dev->ctrl);
 	return 0;
 }
 
 static void nvme_dead_ctrl(struct nvme_dev *dev)
 {
 	dev_warn(dev->dev, "Device failed to resume\n");
-	kref_get(&dev->kref);
+	kref_get(&dev->ctrl.kref);
 	if (IS_ERR(kthread_run(nvme_remove_dead_ctrl, dev, "nvme%d",
 						dev->ctrl.instance))) {
 		dev_err(dev->dev,
 			"Failed to start controller remove task\n");
-		kref_put(&dev->kref, nvme_free_dev);
+		nvme_put_ctrl(&dev->ctrl);
 	}
 }
 
@@ -3068,6 +2679,7 @@ static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
 
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.reg_read32		= nvme_pci_reg_read32,
+	.free_ctrl		= nvme_pci_free_ctrl,
 };
 
 static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
@@ -3108,7 +2720,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto release;
 
-	kref_init(&dev->kref);
+	kref_init(&dev->ctrl.kref);
 	dev->device = device_create(nvme_class, &pdev->dev,
 				MKDEV(nvme_char_major, dev->ctrl.instance),
 				dev, "nvme%d", dev->ctrl.instance);
@@ -3181,7 +2793,7 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_free_queues(dev, 0);
 	nvme_release_cmb(dev);
 	nvme_release_prp_pools(dev);
-	kref_put(&dev->kref, nvme_free_dev);
+	nvme_put_ctrl(&dev->ctrl);
 }
 
 /* These functions are yet to be implemented */
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 19/47] nvme: add explicit quirk handling
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (17 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 18/47] nvme: move block_device_operations and ns/ctrl freeing to common code Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 20/47] nvme: add a common helper to read Identify Controller data Christoph Hellwig
                   ` (15 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


Add an enum for all workarounds not in the spec and identify the affected
controllers at probe time.
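
Wiring up a future workaround then only touches two places; a sketch
with a made-up quirk name and vendor/device IDs:

	enum nvme_quirks {
		NVME_QUIRK_STRIPE_SIZE		= (1 << 0),
		NVME_QUIRK_EXAMPLE		= (1 << 1),	/* hypothetical */
	};

	static const struct pci_device_id nvme_id_table[] = {
		{ PCI_VDEVICE(INTEL, 0x0953),
			.driver_data = NVME_QUIRK_STRIPE_SIZE, },
		{ PCI_DEVICE(0x1234, 0x5678),			/* made-up IDs */
			.driver_data = NVME_QUIRK_EXAMPLE, },
		{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
		{ 0, }
	};

	/* probe copies id->driver_data into ctrl->quirks, so users just do: */
	if (dev->ctrl.quirks & NVME_QUIRK_EXAMPLE)
		/* apply the workaround */;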

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/nvme.h | 13 +++++++++++++
 drivers/nvme/host/pci.c  |  8 +++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0756633..9838a92 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -32,6 +32,18 @@ enum {
 	NVME_NS_LIGHTNVM	= 1,
 };
 
+/*
+ * List of workarounds for devices that required behavior not specified in
+ * the standard.
+ */
+enum nvme_quirks {
+	/*
+	 * Prefers I/O aligned to a stripe size specified in a vendor
+	 * specific Identify field.
+	 */
+	NVME_QUIRK_STRIPE_SIZE			= (1 << 0),
+};
+
 struct nvme_ctrl {
 	const struct nvme_ctrl_ops *ops;
 	struct request_queue *admin_q;
@@ -48,6 +60,7 @@ struct nvme_ctrl {
 	u8 event_limit;
 	u8 vwc;
 	u16 vendor;
+	unsigned long quirks;
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 424ccea..03a0b3e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2021,7 +2021,6 @@ static void nvme_dev_scan(struct work_struct *work)
  */
 static int nvme_dev_add(struct nvme_dev *dev)
 {
-	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int res;
 	struct nvme_id_ctrl *ctrl;
 	int shift = NVME_CAP_MPSMIN(lo_hi_readq(dev->bar + NVME_REG_CAP)) + 12;
@@ -2042,8 +2041,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		dev->max_hw_sectors = 1 << (ctrl->mdts + shift - 9);
 	else
 		dev->max_hw_sectors = UINT_MAX;
-	if ((pdev->vendor == PCI_VENDOR_ID_INTEL) &&
-			(pdev->device == 0x0953) && ctrl->vs[3]) {
+
+	if ((dev->ctrl.quirks & NVME_QUIRK_STRIPE_SIZE) && ctrl->vs[3]) {
 		unsigned int max_hw_sectors;
 
 		dev->stripe_size = 1 << (ctrl->vs[3] + shift);
@@ -2711,6 +2710,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	dev->ctrl.vendor = pdev->vendor;
 	dev->ctrl.ops = &nvme_pci_ctrl_ops;
 	dev->ctrl.dev = dev->dev;
+	dev->ctrl.quirks = id->driver_data;
 
 	result = nvme_set_instance(dev);
 	if (result)
@@ -2838,6 +2838,8 @@ static const struct pci_error_handlers nvme_err_handler = {
 #define PCI_CLASS_STORAGE_EXPRESS	0x010802
 
 static const struct pci_device_id nvme_id_table[] = {
+	{ PCI_VDEVICE(INTEL, 0x0953),
+		.driver_data = NVME_QUIRK_STRIPE_SIZE, },
 	{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
 	{ 0, }
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 20/47] nvme: add a common helper to read Identify Controller data
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (18 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 19/47] nvme: add explicit quirk handling Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 21/47] nvme: move the call to nvme_init_identify earlier Christoph Hellwig
                   ` (14 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


And add the 64-bit register read operation for it.
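
The 64-bit read keeps nvme_init_identify() transport-agnostic: CAP is a
64-bit register, and not every future transport will be able to access
it with a single readq() the way the PCIe driver does.  A sketch of
what a transport without a 64-bit access primitive might do (the foo_*
names are assumptions):

	static int nvme_foo_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
	{
		u32 lo, hi;
		int ret;

		ret = nvme_foo_reg_read32(ctrl, off, &lo);	/* hypothetical helper */
		if (ret)
			return ret;
		ret = nvme_foo_reg_read32(ctrl, off + 4, &hi);
		if (ret)
			return ret;

		*val = ((u64)hi << 32) | lo;
		return 0;
	}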

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  4 ++++
 drivers/nvme/host/pci.c  | 53 ++++++++++++++----------------------------------
 3 files changed, 71 insertions(+), 38 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b858df6..6ee05b3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -670,6 +670,58 @@ const struct block_device_operations nvme_fops = {
 	.pr_ops		= &nvme_pr_ops,
 };
 
+/*
+ * Initialize the cached copies of the Identify data and various controller
+ * register in our nvme_ctrl structure.  This should be called as soon as
+ * the admin queue is fully up and running.
+ */
+int nvme_init_identify(struct nvme_ctrl *ctrl)
+{
+	struct nvme_id_ctrl *id;
+	u64 cap;
+	int ret, page_shift;
+
+	ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap);
+	if (ret) {
+		dev_err(ctrl->dev, "Reading CAP failed (%d)\n", ret);
+		return ret;
+	}
+	page_shift = NVME_CAP_MPSMIN(cap) + 12;
+
+	ret = nvme_identify_ctrl(ctrl, &id);
+	if (ret) {
+		dev_err(ctrl->dev, "Identify Controller failed (%d)\n", ret);
+		return -EIO;
+	}
+
+	ctrl->oncs = le16_to_cpup(&id->oncs);
+	ctrl->abort_limit = id->acl + 1;
+	ctrl->vwc = id->vwc;
+	memcpy(ctrl->serial, id->sn, sizeof(id->sn));
+	memcpy(ctrl->model, id->mn, sizeof(id->mn));
+	memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
+	if (id->mdts)
+		ctrl->max_hw_sectors = 1 << (id->mdts + page_shift - 9);
+	else
+		ctrl->max_hw_sectors = UINT_MAX;
+
+	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && id->vs[3]) {
+		unsigned int max_hw_sectors;
+
+		ctrl->stripe_size = 1 << (id->vs[3] + page_shift);
+		max_hw_sectors = ctrl->stripe_size >> (page_shift - 9);
+		if (ctrl->max_hw_sectors) {
+			ctrl->max_hw_sectors = min(max_hw_sectors,
+							ctrl->max_hw_sectors);
+		} else {
+			ctrl->max_hw_sectors = max_hw_sectors;
+		}
+	}
+
+	kfree(id);
+	return 0;
+}
+
 static void nvme_free_ctrl(struct kref *kref)
 {
 	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9838a92..2d68262 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -55,6 +55,8 @@ struct nvme_ctrl {
 	char serial[20];
 	char model[40];
 	char firmware_rev[8];
+	u32 max_hw_sectors;
+	u32 stripe_size;
 	u16 oncs;
 	u16 abort_limit;
 	u8 event_limit;
@@ -86,6 +88,7 @@ struct nvme_ns {
 
 struct nvme_ctrl_ops {
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
+	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
 	void (*free_ctrl)(struct nvme_ctrl *ctrl);
 };
 
@@ -167,6 +170,7 @@ static inline int nvme_error_status(u16 status)
 }
 
 void nvme_put_ctrl(struct nvme_ctrl *ctrl);
+int nvme_init_identify(struct nvme_ctrl *ctrl);
 void nvme_put_ns(struct nvme_ns *ns);
 
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 03a0b3e..7e04a32 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -129,8 +129,6 @@ struct nvme_dev {
 	struct work_struct probe_work;
 	struct work_struct scan_work;
 	bool subsystem;
-	u32 max_hw_sectors;
-	u32 stripe_size;
 	u32 page_size;
 	void __iomem *cmb;
 	dma_addr_t cmb_dma_addr;
@@ -1681,13 +1679,13 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
 	list_add_tail(&ns->list, &dev->namespaces);
 
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
-	if (dev->max_hw_sectors) {
-		blk_queue_max_hw_sectors(ns->queue, dev->max_hw_sectors);
+	if (dev->ctrl.max_hw_sectors) {
+		blk_queue_max_hw_sectors(ns->queue, dev->ctrl.max_hw_sectors);
 		blk_queue_max_segments(ns->queue,
-			(dev->max_hw_sectors / (dev->page_size >> 9)) + 1);
+			(dev->ctrl.max_hw_sectors / (dev->page_size >> 9)) + 1);
 	}
-	if (dev->stripe_size)
-		blk_queue_chunk_sectors(ns->queue, dev->stripe_size >> 9);
+	if (dev->ctrl.stripe_size)
+		blk_queue_chunk_sectors(ns->queue, dev->ctrl.stripe_size >> 9);
 	if (dev->ctrl.vwc & NVME_CTRL_VWC_PRESENT)
 		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
 	blk_queue_virt_boundary(ns->queue, dev->page_size - 1);
@@ -2022,38 +2020,10 @@ static void nvme_dev_scan(struct work_struct *work)
 static int nvme_dev_add(struct nvme_dev *dev)
 {
 	int res;
-	struct nvme_id_ctrl *ctrl;
-	int shift = NVME_CAP_MPSMIN(lo_hi_readq(dev->bar + NVME_REG_CAP)) + 12;
-
-	res = nvme_identify_ctrl(&dev->ctrl, &ctrl);
-	if (res) {
-		dev_err(dev->dev, "Identify Controller failed (%d)\n", res);
-		return -EIO;
-	}
-
-	dev->ctrl.oncs = le16_to_cpup(&ctrl->oncs);
-	dev->ctrl.abort_limit = ctrl->acl + 1;
-	dev->ctrl.vwc = ctrl->vwc;
-	memcpy(dev->ctrl.serial, ctrl->sn, sizeof(ctrl->sn));
-	memcpy(dev->ctrl.model, ctrl->mn, sizeof(ctrl->mn));
-	memcpy(dev->ctrl.firmware_rev, ctrl->fr, sizeof(ctrl->fr));
-	if (ctrl->mdts)
-		dev->max_hw_sectors = 1 << (ctrl->mdts + shift - 9);
-	else
-		dev->max_hw_sectors = UINT_MAX;
-
-	if ((dev->ctrl.quirks & NVME_QUIRK_STRIPE_SIZE) && ctrl->vs[3]) {
-		unsigned int max_hw_sectors;
 
-		dev->stripe_size = 1 << (ctrl->vs[3] + shift);
-		max_hw_sectors = dev->stripe_size >> (shift - 9);
-		if (dev->max_hw_sectors) {
-			dev->max_hw_sectors = min(max_hw_sectors,
-							dev->max_hw_sectors);
-		} else
-			dev->max_hw_sectors = max_hw_sectors;
-	}
-	kfree(ctrl);
+	res = nvme_init_identify(&dev->ctrl);
+	if (res)
+		return res;
 
 	if (!dev->tagset.tags) {
 		dev->tagset.ops = &nvme_mq_ops;
@@ -2676,8 +2646,15 @@ static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
 	return 0;
 }
 
+static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
+{
+	*val = readq(to_nvme_dev(ctrl)->bar + off);
+	return 0;
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.reg_read32		= nvme_pci_reg_read32,
+	.reg_read64		= nvme_pci_reg_read64,
 	.free_ctrl		= nvme_pci_free_ctrl,
 };
 
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 21/47] nvme: move the call to nvme_init_identify earlier
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (19 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 20/47] nvme: add a common helper to read Identify Controller data Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 22/47] nvme: move namespace scanning to common code Christoph Hellwig
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


We want to record the Identify and CAP values even if no I/O queue
is available.

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7e04a32..7eb975b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2019,12 +2019,6 @@ static void nvme_dev_scan(struct work_struct *work)
  */
 static int nvme_dev_add(struct nvme_dev *dev)
 {
-	int res;
-
-	res = nvme_init_identify(&dev->ctrl);
-	if (res)
-		return res;
-
 	if (!dev->tagset.tags) {
 		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
@@ -2516,6 +2510,10 @@ static void nvme_probe_work(struct work_struct *work)
 	if (result)
 		goto disable;
 
+	result = nvme_init_identify(&dev->ctrl);
+	if (result)
+		goto free_tags;
+
 	result = nvme_setup_io_queues(dev);
 	if (result)
 		goto free_tags;
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 22/47] nvme: move namespace scanning to common code
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (20 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 21/47] nvme: move the call to nvme_init_identify earlier Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 23/47] nvme: move chardev and sysfs interface " Christoph Hellwig
                   ` (12 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


The namespace scanning code is mostly generic already; we just need to
store a pointer to the tagset in the nvme_ctrl structure, and add a
method to check if a controller is I/O incapable.  The latter will
hopefully be replaced by a proper controller state machine soon.
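
As a rough sketch of what the common code now expects from a transport
driver (all foo_* names below are hypothetical and only serve as
illustration): store the tagset pointer once the I/O tag set exists and
supply the new method in the ops table:

	static bool foo_io_incapable(struct nvme_ctrl *ctrl)
	{
		/* incapable of I/O if only the admin queue is left */
		return to_foo_dev(ctrl)->online_queues < 2;
	}

	static const struct nvme_ctrl_ops foo_ctrl_ops = {
		.reg_read32	= foo_reg_read32,
		.reg_read64	= foo_reg_read64,
		.io_incapable	= foo_io_incapable,
		.free_ctrl	= foo_free_ctrl,
	};

	/* after blk_mq_alloc_tag_set() succeeded: */
	foo->ctrl.tagset = &foo->tagset;
	nvme_scan_namespaces(&foo->ctrl);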

Signed-off-by: Christoph Hellwig <hch at lst.de>
[Fixed pr conflicts]
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 193 ++++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h |  25 +++++-
 drivers/nvme/host/pci.c  | 220 +++++++----------------------------------------
 3 files changed, 242 insertions(+), 196 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6ee05b3..c8e7706 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -17,6 +17,8 @@
 #include <linux/errno.h>
 #include <linux/hdreg.h>
 #include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list_sort.h>
 #include <linux/slab.h>
 #include <linux/types.h>
 #include <linux/pr.h>
@@ -28,6 +30,9 @@
 
 #include "nvme.h"
 
+static int nvme_major;
+module_param(nvme_major, int, 0);
+
 DEFINE_SPINLOCK(dev_list_lock);
 
 static void nvme_free_ns(struct kref *kref)
@@ -46,7 +51,7 @@ static void nvme_free_ns(struct kref *kref)
 	kfree(ns);
 }
 
-void nvme_put_ns(struct nvme_ns *ns)
+static void nvme_put_ns(struct nvme_ns *ns)
 {
 	kref_put(&ns->kref, nvme_free_ns);
 }
@@ -495,7 +500,7 @@ static void nvme_config_discard(struct nvme_ns *ns)
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
 }
 
-int nvme_revalidate_disk(struct gendisk *disk)
+static int nvme_revalidate_disk(struct gendisk *disk)
 {
 	struct nvme_ns *ns = disk->private_data;
 	struct nvme_id_ns *id;
@@ -659,7 +664,7 @@ static const struct pr_ops nvme_pr_ops = {
 	.pr_clear	= nvme_pr_clear,
 };
 
-const struct block_device_operations nvme_fops = {
+static const struct block_device_operations nvme_fops = {
 	.owner		= THIS_MODULE,
 	.ioctl		= nvme_ioctl,
 	.compat_ioctl	= nvme_compat_ioctl,
@@ -687,6 +692,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		return ret;
 	}
 	page_shift = NVME_CAP_MPSMIN(cap) + 12;
+	ctrl->page_size = 1 << page_shift;
 
 	ret = nvme_identify_ctrl(ctrl, &id);
 	if (ret) {
@@ -734,3 +740,184 @@ void nvme_put_ctrl(struct nvme_ctrl *ctrl)
 	kref_put(&ctrl->kref, nvme_free_ctrl);
 }
 
+static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
+	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
+
+	return nsa->ns_id - nsb->ns_id;
+}
+
+static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+{
+	struct nvme_ns *ns;
+
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (ns->ns_id == nsid)
+			return ns;
+		if (ns->ns_id > nsid)
+			break;
+	}
+	return NULL;
+}
+
+static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+{
+	struct nvme_ns *ns;
+	struct gendisk *disk;
+	int node = dev_to_node(ctrl->dev);
+
+	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
+	if (!ns)
+		return;
+
+	ns->queue = blk_mq_init_queue(ctrl->tagset);
+	if (IS_ERR(ns->queue))
+		goto out_free_ns;
+	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
+	ns->queue->queuedata = ns;
+	ns->ctrl = ctrl;
+
+	disk = alloc_disk_node(0, node);
+	if (!disk)
+		goto out_free_queue;
+
+	kref_init(&ns->kref);
+	ns->ns_id = nsid;
+	ns->disk = disk;
+	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
+	list_add_tail(&ns->list, &ctrl->namespaces);
+
+	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
+	if (ctrl->max_hw_sectors) {
+		blk_queue_max_hw_sectors(ns->queue, ctrl->max_hw_sectors);
+		blk_queue_max_segments(ns->queue,
+			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
+	}
+	if (ctrl->stripe_size)
+		blk_queue_chunk_sectors(ns->queue, ctrl->stripe_size >> 9);
+	if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
+		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
+	blk_queue_virt_boundary(ns->queue, ctrl->page_size - 1);
+
+	disk->major = nvme_major;
+	disk->first_minor = 0;
+	disk->fops = &nvme_fops;
+	disk->private_data = ns;
+	disk->queue = ns->queue;
+	disk->driverfs_dev = ctrl->device;
+	disk->flags = GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance, nsid);
+
+	/*
+	 * Initialize capacity to 0 until we establish the namespace format and
+	 * set up integrity extensions if necessary. The revalidate_disk after
+	 * add_disk allows the driver to register with integrity if the format
+	 * requires it.
+	 */
+	set_capacity(disk, 0);
+	if (nvme_revalidate_disk(ns->disk))
+		goto out_free_disk;
+
+	kref_get(&ctrl->kref);
+	if (ns->type != NVME_NS_LIGHTNVM) {
+		add_disk(ns->disk);
+		if (ns->ms) {
+			struct block_device *bd = bdget_disk(ns->disk, 0);
+			if (!bd)
+				return;
+			if (blkdev_get(bd, FMODE_READ, NULL)) {
+				bdput(bd);
+				return;
+			}
+			blkdev_reread_part(bd);
+			blkdev_put(bd, FMODE_READ);
+		}
+	}
+
+	return;
+ out_free_disk:
+	kfree(disk);
+	list_del(&ns->list);
+ out_free_queue:
+	blk_cleanup_queue(ns->queue);
+ out_free_ns:
+	kfree(ns);
+}
+
+static void nvme_ns_remove(struct nvme_ns *ns)
+{
+	bool kill = nvme_io_incapable(ns->ctrl) &&
+			!blk_queue_dying(ns->queue);
+
+	if (kill)
+		blk_set_queue_dying(ns->queue);
+	if (ns->disk->flags & GENHD_FL_UP) {
+		if (blk_get_integrity(ns->disk))
+			blk_integrity_unregister(ns->disk);
+		del_gendisk(ns->disk);
+	}
+	if (kill || !blk_queue_dying(ns->queue)) {
+		blk_mq_abort_requeue_list(ns->queue);
+		blk_cleanup_queue(ns->queue);
+	}
+	list_del_init(&ns->list);
+	nvme_put_ns(ns);
+}
+
+static void __nvme_scan_namespaces(struct nvme_ctrl *ctrl, unsigned nn)
+{
+	struct nvme_ns *ns, *next;
+	unsigned i;
+
+	for (i = 1; i <= nn; i++) {
+		ns = nvme_find_ns(ctrl, i);
+		if (ns) {
+			if (revalidate_disk(ns->disk))
+				nvme_ns_remove(ns);
+		} else
+			nvme_alloc_ns(ctrl, i);
+	}
+	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
+		if (ns->ns_id > nn)
+			nvme_ns_remove(ns);
+	}
+	list_sort(NULL, &ctrl->namespaces, ns_cmp);
+}
+
+void nvme_scan_namespaces(struct nvme_ctrl *ctrl)
+{
+	struct nvme_id_ctrl *id;
+
+	if (nvme_identify_ctrl(ctrl, &id))
+		return;
+	__nvme_scan_namespaces(ctrl, le32_to_cpup(&id->nn));
+	kfree(id);
+}
+
+void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns, *next;
+
+	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list)
+		nvme_ns_remove(ns);
+}
+
+int __init nvme_core_init(void)
+{
+	int result;
+
+	result = register_blkdev(nvme_major, "nvme");
+	if (result < 0)
+		return result;
+	else if (result > 0)
+		nvme_major = result;
+
+	return 0;
+}
+
+void nvme_core_exit(void)
+{
+	unregister_blkdev(nvme_major, "nvme");
+}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2d68262..b097cdb 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -50,6 +50,9 @@ struct nvme_ctrl {
 	struct device *dev;
 	struct kref kref;
 	int instance;
+	struct blk_mq_tag_set *tagset;
+	struct list_head namespaces;
+	struct device *device;	/* char device */
 
 	char name[12];
 	char serial[20];
@@ -57,6 +60,7 @@ struct nvme_ctrl {
 	char firmware_rev[8];
 	u32 max_hw_sectors;
 	u32 stripe_size;
+	u32 page_size;
 	u16 oncs;
 	u16 abort_limit;
 	u8 event_limit;
@@ -89,6 +93,7 @@ struct nvme_ns {
 struct nvme_ctrl_ops {
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
+	bool (*io_incapable)(struct nvme_ctrl *ctrl);
 	void (*free_ctrl)(struct nvme_ctrl *ctrl);
 };
 
@@ -101,6 +106,17 @@ static inline bool nvme_ctrl_ready(struct nvme_ctrl *ctrl)
 	return val & NVME_CSTS_RDY;
 }
 
+static inline bool nvme_io_incapable(struct nvme_ctrl *ctrl)
+{
+	u32 val = 0;
+
+	if (ctrl->ops->io_incapable(ctrl))
+		return true;
+	if (ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &val))
+		return true;
+	return val & NVME_CSTS_CFS;
+}
+
 static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
 {
 	return (sector >> (ns->lba_shift - 9));
@@ -171,7 +187,9 @@ static inline int nvme_error_status(u16 status)
 
 void nvme_put_ctrl(struct nvme_ctrl *ctrl);
 int nvme_init_identify(struct nvme_ctrl *ctrl);
-void nvme_put_ns(struct nvme_ns *ns);
+
+void nvme_scan_namespaces(struct nvme_ctrl *ctrl);
+void nvme_remove_namespaces(struct nvme_ctrl *ctrl);
 
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		void *buf, unsigned bufflen);
@@ -193,10 +211,8 @@ int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
 			dma_addr_t dma_addr, u32 *result);
 
-extern const struct block_device_operations nvme_fops;
 extern spinlock_t dev_list_lock;
 
-int nvme_revalidate_disk(struct gendisk *disk);
 int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 			struct nvme_passthru_cmd __user *ucmd);
 
@@ -210,4 +226,7 @@ int nvme_nvm_ns_supported(struct nvme_ns *ns, struct nvme_id_ns *id);
 int nvme_nvm_register(struct request_queue *q, char *disk_name);
 void nvme_nvm_unregister(struct request_queue *q, char *disk_name);
 
+int __init nvme_core_init(void);
+void nvme_core_exit(void);
+
 #endif /* _NVME_H */
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7eb975b..6127bf6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,7 +28,6 @@
 #include <linux/kdev_t.h>
 #include <linux/kthread.h>
 #include <linux/kernel.h>
-#include <linux/list_sort.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
@@ -66,9 +65,6 @@ static unsigned char shutdown_timeout = 5;
 module_param(shutdown_timeout, byte, 0644);
 MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller shutdown");
 
-static int nvme_major;
-module_param(nvme_major, int, 0);
-
 static int nvme_char_major;
 module_param(nvme_char_major, int, 0);
 
@@ -123,8 +119,6 @@ struct nvme_dev {
 	u32 ctrl_config;
 	struct msix_entry *entry;
 	void __iomem *bar;
-	struct list_head namespaces;
-	struct device *device;
 	struct work_struct reset_work;
 	struct work_struct probe_work;
 	struct work_struct scan_work;
@@ -1650,90 +1644,6 @@ static int nvme_kthread(void *data)
 	return 0;
 }
 
-static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
-{
-	struct nvme_ns *ns;
-	struct gendisk *disk;
-	int node = dev_to_node(dev->dev);
-
-	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
-	if (!ns)
-		return;
-
-	ns->queue = blk_mq_init_queue(&dev->tagset);
-	if (IS_ERR(ns->queue))
-		goto out_free_ns;
-	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
-	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
-	ns->ctrl = &dev->ctrl;
-	ns->queue->queuedata = ns;
-
-	disk = alloc_disk_node(0, node);
-	if (!disk)
-		goto out_free_queue;
-
-	kref_init(&ns->kref);
-	ns->ns_id = nsid;
-	ns->disk = disk;
-	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
-	list_add_tail(&ns->list, &dev->namespaces);
-
-	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
-	if (dev->ctrl.max_hw_sectors) {
-		blk_queue_max_hw_sectors(ns->queue, dev->ctrl.max_hw_sectors);
-		blk_queue_max_segments(ns->queue,
-			(dev->ctrl.max_hw_sectors / (dev->page_size >> 9)) + 1);
-	}
-	if (dev->ctrl.stripe_size)
-		blk_queue_chunk_sectors(ns->queue, dev->ctrl.stripe_size >> 9);
-	if (dev->ctrl.vwc & NVME_CTRL_VWC_PRESENT)
-		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
-	blk_queue_virt_boundary(ns->queue, dev->page_size - 1);
-
-	disk->major = nvme_major;
-	disk->first_minor = 0;
-	disk->fops = &nvme_fops;
-	disk->private_data = ns;
-	disk->queue = ns->queue;
-	disk->driverfs_dev = dev->device;
-	disk->flags = GENHD_FL_EXT_DEVT;
-	sprintf(disk->disk_name, "nvme%dn%d", dev->ctrl.instance, nsid);
-
-	/*
-	 * Initialize capacity to 0 until we establish the namespace format and
-	 * setup integrity extentions if necessary. The revalidate_disk after
-	 * add_disk allows the driver to register with integrity if the format
-	 * requires it.
-	 */
-	set_capacity(disk, 0);
-	if (nvme_revalidate_disk(ns->disk))
-		goto out_free_disk;
-
-	kref_get(&dev->ctrl.kref);
-	if (ns->type != NVME_NS_LIGHTNVM) {
-		add_disk(ns->disk);
-		if (ns->ms) {
-			struct block_device *bd = bdget_disk(ns->disk, 0);
-			if (!bd)
-				return;
-			if (blkdev_get(bd, FMODE_READ, NULL)) {
-				bdput(bd);
-				return;
-			}
-			blkdev_reread_part(bd);
-			blkdev_put(bd, FMODE_READ);
-		}
-	}
-	return;
- out_free_disk:
-	kfree(disk);
-	list_del(&ns->list);
- out_free_queue:
-	blk_cleanup_queue(ns->queue);
- out_free_ns:
-	kfree(ns);
-}
-
 /*
  * Create I/O queues.  Failing to create an I/O queue is not an issue,
  * we can continue with less than the desired amount of queues, and
@@ -1916,71 +1826,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	return result;
 }
 
-static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
-{
-	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
-	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
-
-	return nsa->ns_id - nsb->ns_id;
-}
-
-static struct nvme_ns *nvme_find_ns(struct nvme_dev *dev, unsigned nsid)
-{
-	struct nvme_ns *ns;
-
-	list_for_each_entry(ns, &dev->namespaces, list) {
-		if (ns->ns_id == nsid)
-			return ns;
-		if (ns->ns_id > nsid)
-			break;
-	}
-	return NULL;
-}
-
-static inline bool nvme_io_incapable(struct nvme_dev *dev)
-{
-	return (!dev->bar ||
-		readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_CFS ||
-		dev->online_queues < 2);
-}
-
-static void nvme_ns_remove(struct nvme_ns *ns)
-{
-	bool kill = nvme_io_incapable(to_nvme_dev(ns->ctrl)) &&
-			!blk_queue_dying(ns->queue);
-
-	if (kill)
-		blk_set_queue_dying(ns->queue);
-	if (ns->disk->flags & GENHD_FL_UP)
-		del_gendisk(ns->disk);
-	if (kill || !blk_queue_dying(ns->queue)) {
-		blk_mq_abort_requeue_list(ns->queue);
-		blk_cleanup_queue(ns->queue);
-	}
-	list_del_init(&ns->list);
-	nvme_put_ns(ns);
-}
-
-static void nvme_scan_namespaces(struct nvme_dev *dev, unsigned nn)
-{
-	struct nvme_ns *ns, *next;
-	unsigned i;
-
-	for (i = 1; i <= nn; i++) {
-		ns = nvme_find_ns(dev, i);
-		if (ns) {
-			if (revalidate_disk(ns->disk))
-				nvme_ns_remove(ns);
-		} else
-			nvme_alloc_ns(dev, i);
-	}
-	list_for_each_entry_safe(ns, next, &dev->namespaces, list) {
-		if (ns->ns_id > nn)
-			nvme_ns_remove(ns);
-	}
-	list_sort(NULL, &dev->namespaces, ns_cmp);
-}
-
 static void nvme_set_irq_hints(struct nvme_dev *dev)
 {
 	struct nvme_queue *nvmeq;
@@ -2000,14 +1845,10 @@ static void nvme_set_irq_hints(struct nvme_dev *dev)
 static void nvme_dev_scan(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, scan_work);
-	struct nvme_id_ctrl *ctrl;
 
 	if (!dev->tagset.tags)
 		return;
-	if (nvme_identify_ctrl(&dev->ctrl, &ctrl))
-		return;
-	nvme_scan_namespaces(dev, le32_to_cpup(&ctrl->nn));
-	kfree(ctrl);
+	nvme_scan_namespaces(&dev->ctrl);
 	nvme_set_irq_hints(dev);
 }
 
@@ -2019,7 +1860,7 @@ static void nvme_dev_scan(struct work_struct *work)
  */
 static int nvme_dev_add(struct nvme_dev *dev)
 {
-	if (!dev->tagset.tags) {
+	if (!dev->ctrl.tagset) {
 		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
@@ -2032,6 +1873,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
+		dev->ctrl.tagset = &dev->tagset;
 	}
 	schedule_work(&dev->scan_work);
 	return 0;
@@ -2282,7 +2124,7 @@ static void nvme_freeze_queues(struct nvme_dev *dev)
 {
 	struct nvme_ns *ns;
 
-	list_for_each_entry(ns, &dev->namespaces, list) {
+	list_for_each_entry(ns, &dev->ctrl.namespaces, list) {
 		blk_mq_freeze_queue_start(ns->queue);
 
 		spin_lock_irq(ns->queue->queue_lock);
@@ -2298,7 +2140,7 @@ static void nvme_unfreeze_queues(struct nvme_dev *dev)
 {
 	struct nvme_ns *ns;
 
-	list_for_each_entry(ns, &dev->namespaces, list) {
+	list_for_each_entry(ns, &dev->ctrl.namespaces, list) {
 		queue_flag_clear_unlocked(QUEUE_FLAG_STOPPED, ns->queue);
 		blk_mq_unfreeze_queue(ns->queue);
 		blk_mq_start_stopped_hw_queues(ns->queue, true);
@@ -2333,14 +2175,6 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
 		nvme_clear_queue(dev->queues[i]);
 }
 
-static void nvme_dev_remove(struct nvme_dev *dev)
-{
-	struct nvme_ns *ns, *next;
-
-	list_for_each_entry_safe(ns, next, &dev->namespaces, list)
-		nvme_ns_remove(ns);
-}
-
 static int nvme_setup_prp_pools(struct nvme_dev *dev)
 {
 	dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
@@ -2398,7 +2232,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	struct nvme_dev *dev = to_nvme_dev(ctrl);
 
 	put_device(dev->dev);
-	put_device(dev->device);
+	put_device(ctrl->device);
 	nvme_release_instance(dev);
 	if (dev->tagset.tags)
 		blk_mq_free_tag_set(&dev->tagset);
@@ -2450,9 +2284,9 @@ static long nvme_dev_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 	case NVME_IOCTL_ADMIN_CMD:
 		return nvme_user_cmd(&dev->ctrl, NULL, (void __user *)arg);
 	case NVME_IOCTL_IO_CMD:
-		if (list_empty(&dev->namespaces))
+		if (list_empty(&dev->ctrl.namespaces))
 			return -ENOTTY;
-		ns = list_first_entry(&dev->namespaces, struct nvme_ns, list);
+		ns = list_first_entry(&dev->ctrl.namespaces, struct nvme_ns, list);
 		return nvme_user_cmd(&dev->ctrl, ns, (void __user *)arg);
 	case NVME_IOCTL_RESET:
 		dev_warn(dev->dev, "resetting controller\n");
@@ -2526,7 +2360,7 @@ static void nvme_probe_work(struct work_struct *work)
 	 */
 	if (dev->online_queues < 2) {
 		dev_warn(dev->dev, "IO queues not created\n");
-		nvme_dev_remove(dev);
+		nvme_remove_namespaces(&dev->ctrl);
 	} else {
 		nvme_unfreeze_queues(dev);
 		nvme_dev_add(dev);
@@ -2650,9 +2484,17 @@ static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
 	return 0;
 }
 
+static bool nvme_pci_io_incapable(struct nvme_ctrl *ctrl)
+{
+	struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+	return !dev->bar || dev->online_queues < 2;
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_read64		= nvme_pci_reg_read64,
+	.io_incapable		= nvme_pci_io_incapable,
 	.free_ctrl		= nvme_pci_free_ctrl,
 };
 
@@ -2677,7 +2519,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (!dev->queues)
 		goto free;
 
-	INIT_LIST_HEAD(&dev->namespaces);
+	INIT_LIST_HEAD(&dev->ctrl.namespaces);
 	INIT_WORK(&dev->reset_work, nvme_reset_work);
 	dev->dev = get_device(&pdev->dev);
 	pci_set_drvdata(pdev, dev);
@@ -2696,17 +2538,17 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto release;
 
 	kref_init(&dev->ctrl.kref);
-	dev->device = device_create(nvme_class, &pdev->dev,
+	dev->ctrl.device = device_create(nvme_class, &pdev->dev,
 				MKDEV(nvme_char_major, dev->ctrl.instance),
 				dev, "nvme%d", dev->ctrl.instance);
-	if (IS_ERR(dev->device)) {
-		result = PTR_ERR(dev->device);
+	if (IS_ERR(dev->ctrl.device)) {
+		result = PTR_ERR(dev->ctrl.device);
 		goto release_pools;
 	}
-	get_device(dev->device);
-	dev_set_drvdata(dev->device, dev);
+	get_device(dev->ctrl.device);
+	dev_set_drvdata(dev->ctrl.device, dev);
 
-	result = device_create_file(dev->device, &dev_attr_reset_controller);
+	result = device_create_file(dev->ctrl.device, &dev_attr_reset_controller);
 	if (result)
 		goto put_dev;
 
@@ -2718,7 +2560,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
  put_dev:
 	device_destroy(nvme_class, MKDEV(nvme_char_major, dev->ctrl.instance));
-	put_device(dev->device);
+	put_device(dev->ctrl.device);
  release_pools:
 	nvme_release_prp_pools(dev);
  release:
@@ -2760,8 +2602,8 @@ static void nvme_remove(struct pci_dev *pdev)
 	flush_work(&dev->probe_work);
 	flush_work(&dev->reset_work);
 	flush_work(&dev->scan_work);
-	device_remove_file(dev->device, &dev_attr_reset_controller);
-	nvme_dev_remove(dev);
+	device_remove_file(dev->ctrl.device, &dev_attr_reset_controller);
+	nvme_remove_namespaces(&dev->ctrl);
 	nvme_dev_shutdown(dev);
 	nvme_dev_remove_admin(dev);
 	device_destroy(nvme_class, MKDEV(nvme_char_major, dev->ctrl.instance));
@@ -2843,11 +2685,9 @@ static int __init nvme_init(void)
 	if (!nvme_workq)
 		return -ENOMEM;
 
-	result = register_blkdev(nvme_major, "nvme");
+	result = nvme_core_init();
 	if (result < 0)
 		goto kill_workq;
-	else if (result > 0)
-		nvme_major = result;
 
 	result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
 							&nvme_dev_fops);
@@ -2872,7 +2712,7 @@ static int __init nvme_init(void)
  unregister_chrdev:
 	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
  unregister_blkdev:
-	unregister_blkdev(nvme_major, "nvme");
+	nvme_core_exit();
  kill_workq:
 	destroy_workqueue(nvme_workq);
 	return result;
@@ -2881,7 +2721,7 @@ static int __init nvme_init(void)
 static void __exit nvme_exit(void)
 {
 	pci_unregister_driver(&nvme_driver);
-	unregister_blkdev(nvme_major, "nvme");
+	nvme_core_exit();
 	destroy_workqueue(nvme_workq);
 	class_destroy(nvme_class);
 	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 23/47] nvme: move chardev and sysfs interface to common code
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (21 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 22/47] nvme: move namespace scanning to common code Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 24/47] nvme: only add a controller to dev_list after it's been fully initialized Christoph Hellwig
                   ` (11 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


For this we need to add a proper controller init routine and a list of
all controllers, in addition to the list of PCIe controllers, which
stays in pci.c.  Note that we now remove the sysfs device when the last
reference to a controller is dropped - the old code would have kept it
around longer, which doesn't make much sense.

This requires a new ->reset_ctrl operation to implement controller
resets, and a new ->reg_write32 operation that is required to implement
subsystem resets.  We also now store cached copies of the NVMe
specification version and of the flag indicating whether a controller
is attached to a subsystem in the generic controller structure.
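
The ioctl ABI on the character device is unchanged by the move; as a
minimal userspace sketch (error handling omitted), the two reset paths
that now go through the new operations can be exercised like this:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/nvme_ioctl.h>

	int main(void)
	{
		int fd = open("/dev/nvme0", O_RDWR);

		if (fd < 0)
			return 1;
		ioctl(fd, NVME_IOCTL_RESET);		/* -> ->reset_ctrl */
		ioctl(fd, NVME_IOCTL_SUBSYS_RESET);	/* NSSR via ->reg_write32 */
		return 0;
	}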

Signed-off-by: Christoph Hellwig <hch at lst.de>
[Fixes for pr merge]
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 217 +++++++++++++++++++++++++++++++++++++++++++++--
 drivers/nvme/host/nvme.h |  20 +++--
 drivers/nvme/host/pci.c  | 203 +++++---------------------------------------
 drivers/nvme/host/scsi.c |   9 +-
 4 files changed, 250 insertions(+), 199 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c8e7706..4d83586 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -30,11 +30,19 @@
 
 #include "nvme.h"
 
+#define NVME_MINORS		(1U << MINORBITS)
+
 static int nvme_major;
 module_param(nvme_major, int, 0);
 
+static int nvme_char_major;
+module_param(nvme_char_major, int, 0);
+
+static LIST_HEAD(nvme_ctrl_list);
 DEFINE_SPINLOCK(dev_list_lock);
 
+static struct class *nvme_class;
+
 static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
@@ -366,7 +374,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 			metadata, meta_len, io.slba, NULL, 0);
 }
 
-int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 			struct nvme_passthru_cmd __user *ucmd)
 {
 	struct nvme_passthru_cmd cmd;
@@ -686,6 +694,12 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	u64 cap;
 	int ret, page_shift;
 
+	ret = ctrl->ops->reg_read32(ctrl, NVME_REG_VS, &ctrl->vs);
+	if (ret) {
+		dev_err(ctrl->dev, "Reading VS failed (%d)\n", ret);
+		return ret;
+	}
+
 	ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap);
 	if (ret) {
 		dev_err(ctrl->dev, "Reading CAP failed (%d)\n", ret);
@@ -694,6 +708,9 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	page_shift = NVME_CAP_MPSMIN(cap) + 12;
 	ctrl->page_size = 1 << page_shift;
 
+	if (ctrl->vs >= NVME_VS(1, 1))
+		ctrl->subsystem = NVME_CAP_NSSRC(cap);
+
 	ret = nvme_identify_ctrl(ctrl, &id);
 	if (ret) {
 		dev_err(ctrl->dev, "Identify Controller failed (%d)\n", ret);
@@ -728,18 +745,85 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	return 0;
 }
 
-static void nvme_free_ctrl(struct kref *kref)
+static int nvme_dev_open(struct inode *inode, struct file *file)
 {
-	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
+	struct nvme_ctrl *ctrl;
+	int instance = iminor(inode);
+	int ret = -ENODEV;
 
-	ctrl->ops->free_ctrl(ctrl);
+	spin_lock(&dev_list_lock);
+	list_for_each_entry(ctrl, &nvme_ctrl_list, node) {
+		if (ctrl->instance != instance)
+			continue;
+
+		if (!ctrl->admin_q) {
+			ret = -EWOULDBLOCK;
+			break;
+		}
+		if (!kref_get_unless_zero(&ctrl->kref))
+			break;
+		file->private_data = ctrl;
+		ret = 0;
+		break;
+	}
+	spin_unlock(&dev_list_lock);
+
+	return ret;
 }
 
-void nvme_put_ctrl(struct nvme_ctrl *ctrl)
+static int nvme_dev_release(struct inode *inode, struct file *file)
 {
-	kref_put(&ctrl->kref, nvme_free_ctrl);
+	nvme_put_ctrl(file->private_data);
+	return 0;
+}
+
+static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct nvme_ctrl *ctrl = file->private_data;
+	void __user *argp = (void __user *)arg;
+	struct nvme_ns *ns;
+
+	switch (cmd) {
+	case NVME_IOCTL_ADMIN_CMD:
+		return nvme_user_cmd(ctrl, NULL, argp);
+	case NVME_IOCTL_IO_CMD:
+		if (list_empty(&ctrl->namespaces))
+			return -ENOTTY;
+		ns = list_first_entry(&ctrl->namespaces, struct nvme_ns, list);
+		return nvme_user_cmd(ctrl, ns, argp);
+	case NVME_IOCTL_RESET:
+		dev_warn(ctrl->dev, "resetting controller\n");
+		return ctrl->ops->reset_ctrl(ctrl);
+	case NVME_IOCTL_SUBSYS_RESET:
+		return nvme_reset_subsystem(ctrl);
+	default:
+		return -ENOTTY;
+	}
 }
 
+static const struct file_operations nvme_dev_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nvme_dev_open,
+	.release	= nvme_dev_release,
+	.unlocked_ioctl	= nvme_dev_ioctl,
+	.compat_ioctl	= nvme_dev_ioctl,
+};
+
+static ssize_t nvme_sysfs_reset(struct device *dev,
+				struct device_attribute *attr, const char *buf,
+				size_t count)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+	int ret;
+
+	ret = ctrl->ops->reset_ctrl(ctrl);
+	if (ret < 0)
+		return ret;
+	return count;
+}
+static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
+
 static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
 {
 	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
@@ -904,6 +988,106 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 		nvme_ns_remove(ns);
 }
 
+static DEFINE_IDA(nvme_instance_ida);
+
+static int nvme_set_instance(struct nvme_ctrl *ctrl)
+{
+	int instance, error;
+
+	do {
+		if (!ida_pre_get(&nvme_instance_ida, GFP_KERNEL))
+			return -ENODEV;
+
+		spin_lock(&dev_list_lock);
+		error = ida_get_new(&nvme_instance_ida, &instance);
+		spin_unlock(&dev_list_lock);
+	} while (error == -EAGAIN);
+
+	if (error)
+		return -ENODEV;
+
+	ctrl->instance = instance;
+	return 0;
+}
+
+static void nvme_release_instance(struct nvme_ctrl *ctrl)
+{
+	spin_lock(&dev_list_lock);
+	ida_remove(&nvme_instance_ida, ctrl->instance);
+	spin_unlock(&dev_list_lock);
+}
+
+static void nvme_free_ctrl(struct kref *kref)
+{
+	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
+
+	spin_lock(&dev_list_lock);
+	list_del(&ctrl->node);
+	spin_unlock(&dev_list_lock);
+
+	put_device(ctrl->device);
+	nvme_release_instance(ctrl);
+	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
+
+	ctrl->ops->free_ctrl(ctrl);
+}
+
+void nvme_put_ctrl(struct nvme_ctrl *ctrl)
+{
+	kref_put(&ctrl->kref, nvme_free_ctrl);
+}
+
+/*
+ * Initialize an NVMe controller structure.  This needs to be called during
+ * the earliest initialization so that we have the initialized structure
+ * around during probing.
+ */
+int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
+		const struct nvme_ctrl_ops *ops, u16 vendor,
+		unsigned long quirks)
+{
+	int ret;
+
+	INIT_LIST_HEAD(&ctrl->namespaces);
+	kref_init(&ctrl->kref);
+	ctrl->dev = dev;
+	ctrl->ops = ops;
+	ctrl->vendor = vendor;
+	ctrl->quirks = quirks;
+
+	ret = nvme_set_instance(ctrl);
+	if (ret)
+		goto out;
+
+	ctrl->device = device_create(nvme_class, ctrl->dev,
+				MKDEV(nvme_char_major, ctrl->instance),
+				dev, "nvme%d", ctrl->instance);
+	if (IS_ERR(ctrl->device)) {
+		ret = PTR_ERR(ctrl->device);
+		goto out_release_instance;
+	}
+	get_device(ctrl->device);
+	dev_set_drvdata(ctrl->device, ctrl);
+
+	ret = device_create_file(ctrl->device, &dev_attr_reset_controller);
+	if (ret)
+		goto out_put_device;
+
+	spin_lock(&dev_list_lock);
+	list_add_tail(&ctrl->node, &nvme_ctrl_list);
+	spin_unlock(&dev_list_lock);
+
+	return 0;
+
+out_put_device:
+	put_device(ctrl->device);
+	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
+out_release_instance:
+	nvme_release_instance(ctrl);
+out:
+	return ret;
+}
+
 int __init nvme_core_init(void)
 {
 	int result;
@@ -914,10 +1098,31 @@ int __init nvme_core_init(void)
 	else if (result > 0)
 		nvme_major = result;
 
+	result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
+							&nvme_dev_fops);
+	if (result < 0)
+		goto unregister_blkdev;
+	else if (result > 0)
+		nvme_char_major = result;
+
+	nvme_class = class_create(THIS_MODULE, "nvme");
+	if (IS_ERR(nvme_class)) {
+		result = PTR_ERR(nvme_class);
+		goto unregister_chrdev;
+	}
+
 	return 0;
+
+ unregister_chrdev:
+	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
+ unregister_blkdev:
+	unregister_blkdev(nvme_major, "nvme");
+	return result;
 }
 
 void nvme_core_exit(void)
 {
 	unregister_blkdev(nvme_major, "nvme");
+	class_destroy(nvme_class);
+	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index b097cdb..40c9fed 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -19,8 +19,6 @@
 #include <linux/kref.h>
 #include <linux/blk-mq.h>
 
-struct nvme_passthru_cmd;
-
 extern unsigned char nvme_io_timeout;
 #define NVME_IO_TIMEOUT	(nvme_io_timeout * HZ)
 
@@ -53,6 +51,7 @@ struct nvme_ctrl {
 	struct blk_mq_tag_set *tagset;
 	struct list_head namespaces;
 	struct device *device;	/* char device */
+	struct list_head node;
 
 	char name[12];
 	char serial[20];
@@ -65,6 +64,8 @@ struct nvme_ctrl {
 	u16 abort_limit;
 	u8 event_limit;
 	u8 vwc;
+	u32 vs;
+	bool subsystem;
 	u16 vendor;
 	unsigned long quirks;
 };
@@ -93,7 +94,9 @@ struct nvme_ns {
 struct nvme_ctrl_ops {
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
+	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
 	bool (*io_incapable)(struct nvme_ctrl *ctrl);
+	int (*reset_ctrl)(struct nvme_ctrl *ctrl);
 	void (*free_ctrl)(struct nvme_ctrl *ctrl);
 };
 
@@ -117,6 +120,13 @@ static inline bool nvme_io_incapable(struct nvme_ctrl *ctrl)
 	return val & NVME_CSTS_CFS;
 }
 
+static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
+{
+	if (!ctrl->subsystem)
+		return -ENOTTY;
+	return ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65); /* "NVMe" */
+}
+
 static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
 {
 	return (sector >> (ns->lba_shift - 9));
@@ -185,6 +195,9 @@ static inline int nvme_error_status(u16 status)
 	}
 }
 
+int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
+		const struct nvme_ctrl_ops *ops, u16 vendor,
+		unsigned long quirks);
 void nvme_put_ctrl(struct nvme_ctrl *ctrl);
 int nvme_init_identify(struct nvme_ctrl *ctrl);
 
@@ -213,9 +226,6 @@ int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
 
 extern spinlock_t dev_list_lock;
 
-int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
-			struct nvme_passthru_cmd __user *ucmd);
-
 struct sg_io_hdr;
 
 int nvme_sg_io(struct nvme_ns *ns, struct sg_io_hdr __user *u_hdr);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6127bf6..8251735 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -38,15 +38,11 @@
 #include <linux/slab.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
-#include <linux/pr.h>
-#include <scsi/sg.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <asm/unaligned.h>
 
-#include <uapi/linux/nvme_ioctl.h>
 #include "nvme.h"
 
-#define NVME_MINORS		(1U << MINORBITS)
 #define NVME_Q_DEPTH		1024
 #define NVME_AQ_DEPTH		256
 #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
@@ -65,9 +61,6 @@ static unsigned char shutdown_timeout = 5;
 module_param(shutdown_timeout, byte, 0644);
 MODULE_PARM_DESC(shutdown_timeout, "timeout in seconds for controller shutdown");
 
-static int nvme_char_major;
-module_param(nvme_char_major, int, 0);
-
 static int use_threaded_interrupts;
 module_param(use_threaded_interrupts, int, 0);
 
@@ -80,8 +73,6 @@ static struct task_struct *nvme_thread;
 static struct workqueue_struct *nvme_workq;
 static wait_queue_head_t nvme_kthread_wait;
 
-static struct class *nvme_class;
-
 struct nvme_dev;
 struct nvme_queue;
 
@@ -1594,15 +1585,6 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	return result;
 }
 
-static int nvme_subsys_reset(struct nvme_dev *dev)
-{
-	if (!dev->subsystem)
-		return -ENOTTY;
-
-	writel(0x4E564D65, dev->bar + NVME_REG_NSSR); /* "NVMe" */
-	return 0;
-}
-
 static int nvme_kthread(void *data)
 {
 	struct nvme_dev *dev, *next;
@@ -2198,42 +2180,11 @@ static void nvme_release_prp_pools(struct nvme_dev *dev)
 	dma_pool_destroy(dev->prp_small_pool);
 }
 
-static DEFINE_IDA(nvme_instance_ida);
-
-static int nvme_set_instance(struct nvme_dev *dev)
-{
-	int instance, error;
-
-	do {
-		if (!ida_pre_get(&nvme_instance_ida, GFP_KERNEL))
-			return -ENODEV;
-
-		spin_lock(&dev_list_lock);
-		error = ida_get_new(&nvme_instance_ida, &instance);
-		spin_unlock(&dev_list_lock);
-	} while (error == -EAGAIN);
-
-	if (error)
-		return -ENODEV;
-
-	dev->ctrl.instance = instance;
-	return 0;
-}
-
-static void nvme_release_instance(struct nvme_dev *dev)
-{
-	spin_lock(&dev_list_lock);
-	ida_remove(&nvme_instance_ida, dev->ctrl.instance);
-	spin_unlock(&dev_list_lock);
-}
-
 static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 {
 	struct nvme_dev *dev = to_nvme_dev(ctrl);
 
 	put_device(dev->dev);
-	put_device(ctrl->device);
-	nvme_release_instance(dev);
 	if (dev->tagset.tags)
 		blk_mq_free_tag_set(&dev->tagset);
 	if (dev->ctrl.admin_q)
@@ -2243,69 +2194,6 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	kfree(dev);
 }
 
-static int nvme_dev_open(struct inode *inode, struct file *f)
-{
-	struct nvme_dev *dev;
-	int instance = iminor(inode);
-	int ret = -ENODEV;
-
-	spin_lock(&dev_list_lock);
-	list_for_each_entry(dev, &dev_list, node) {
-		if (dev->ctrl.instance == instance) {
-			if (!dev->ctrl.admin_q) {
-				ret = -EWOULDBLOCK;
-				break;
-			}
-			if (!kref_get_unless_zero(&dev->ctrl.kref))
-				break;
-			f->private_data = dev;
-			ret = 0;
-			break;
-		}
-	}
-	spin_unlock(&dev_list_lock);
-
-	return ret;
-}
-
-static int nvme_dev_release(struct inode *inode, struct file *f)
-{
-	struct nvme_dev *dev = f->private_data;
-	nvme_put_ctrl(&dev->ctrl);
-	return 0;
-}
-
-static long nvme_dev_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
-{
-	struct nvme_dev *dev = f->private_data;
-	struct nvme_ns *ns;
-
-	switch (cmd) {
-	case NVME_IOCTL_ADMIN_CMD:
-		return nvme_user_cmd(&dev->ctrl, NULL, (void __user *)arg);
-	case NVME_IOCTL_IO_CMD:
-		if (list_empty(&dev->ctrl.namespaces))
-			return -ENOTTY;
-		ns = list_first_entry(&dev->ctrl.namespaces, struct nvme_ns, list);
-		return nvme_user_cmd(&dev->ctrl, ns, (void __user *)arg);
-	case NVME_IOCTL_RESET:
-		dev_warn(dev->dev, "resetting controller\n");
-		return nvme_reset(dev);
-	case NVME_IOCTL_SUBSYS_RESET:
-		return nvme_subsys_reset(dev);
-	default:
-		return -ENOTTY;
-	}
-}
-
-static const struct file_operations nvme_dev_fops = {
-	.owner		= THIS_MODULE,
-	.open		= nvme_dev_open,
-	.release	= nvme_dev_release,
-	.unlocked_ioctl	= nvme_dev_ioctl,
-	.compat_ioctl	= nvme_dev_ioctl,
-};
-
 static void nvme_probe_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, probe_work);
@@ -2457,21 +2345,6 @@ static int nvme_reset(struct nvme_dev *dev)
 	return ret;
 }
 
-static ssize_t nvme_sysfs_reset(struct device *dev,
-				struct device_attribute *attr, const char *buf,
-				size_t count)
-{
-	struct nvme_dev *ndev = dev_get_drvdata(dev);
-	int ret;
-
-	ret = nvme_reset(ndev);
-	if (ret < 0)
-		return ret;
-
-	return count;
-}
-static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
-
 static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
 {
 	*val = readl(to_nvme_dev(ctrl)->bar + off);
@@ -2484,6 +2357,12 @@ static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val)
 	return 0;
 }
 
+static int nvme_pci_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val)
+{
+	writel(val, to_nvme_dev(ctrl)->bar + off);
+	return 0;
+}
+
 static bool nvme_pci_io_incapable(struct nvme_ctrl *ctrl)
 {
 	struct nvme_dev *dev = to_nvme_dev(ctrl);
@@ -2491,10 +2370,17 @@ static bool nvme_pci_io_incapable(struct nvme_ctrl *ctrl)
 	return !dev->bar || dev->online_queues < 2;
 }
 
+static int nvme_pci_reset_ctrl(struct nvme_ctrl *ctrl)
+{
+	return nvme_reset(to_nvme_dev(ctrl));
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_read64		= nvme_pci_reg_read64,
+	.reg_write32		= nvme_pci_reg_write32,
 	.io_incapable		= nvme_pci_io_incapable,
+	.reset_ctrl		= nvme_pci_reset_ctrl,
 	.free_ctrl		= nvme_pci_free_ctrl,
 };
 
@@ -2519,52 +2405,28 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (!dev->queues)
 		goto free;
 
-	INIT_LIST_HEAD(&dev->ctrl.namespaces);
-	INIT_WORK(&dev->reset_work, nvme_reset_work);
 	dev->dev = get_device(&pdev->dev);
 	pci_set_drvdata(pdev, dev);
 
-	dev->ctrl.vendor = pdev->vendor;
-	dev->ctrl.ops = &nvme_pci_ctrl_ops;
-	dev->ctrl.dev = dev->dev;
-	dev->ctrl.quirks = id->driver_data;
+	INIT_LIST_HEAD(&dev->node);
+	INIT_WORK(&dev->scan_work, nvme_dev_scan);
+	INIT_WORK(&dev->probe_work, nvme_probe_work);
+	INIT_WORK(&dev->reset_work, nvme_reset_work);
 
-	result = nvme_set_instance(dev);
+	result = nvme_setup_prp_pools(dev);
 	if (result)
 		goto put_pci;
 
-	result = nvme_setup_prp_pools(dev);
+	result = nvme_init_ctrl(&dev->ctrl, &pdev->dev, &nvme_pci_ctrl_ops,
+			pdev->vendor, id->driver_data);
 	if (result)
-		goto release;
-
-	kref_init(&dev->ctrl.kref);
-	dev->ctrl.device = device_create(nvme_class, &pdev->dev,
-				MKDEV(nvme_char_major, dev->ctrl.instance),
-				dev, "nvme%d", dev->ctrl.instance);
-	if (IS_ERR(dev->ctrl.device)) {
-		result = PTR_ERR(dev->ctrl.device);
 		goto release_pools;
-	}
-	get_device(dev->ctrl.device);
-	dev_set_drvdata(dev->ctrl.device, dev);
 
-	result = device_create_file(dev->ctrl.device, &dev_attr_reset_controller);
-	if (result)
-		goto put_dev;
-
-	INIT_LIST_HEAD(&dev->node);
-	INIT_WORK(&dev->scan_work, nvme_dev_scan);
-	INIT_WORK(&dev->probe_work, nvme_probe_work);
 	schedule_work(&dev->probe_work);
 	return 0;
 
- put_dev:
-	device_destroy(nvme_class, MKDEV(nvme_char_major, dev->ctrl.instance));
-	put_device(dev->ctrl.device);
  release_pools:
 	nvme_release_prp_pools(dev);
- release:
-	nvme_release_instance(dev);
  put_pci:
 	put_device(dev->dev);
  free:
@@ -2602,11 +2464,9 @@ static void nvme_remove(struct pci_dev *pdev)
 	flush_work(&dev->probe_work);
 	flush_work(&dev->reset_work);
 	flush_work(&dev->scan_work);
-	device_remove_file(dev->ctrl.device, &dev_attr_reset_controller);
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_dev_shutdown(dev);
 	nvme_dev_remove_admin(dev);
-	device_destroy(nvme_class, MKDEV(nvme_char_major, dev->ctrl.instance));
 	nvme_free_queues(dev, 0);
 	nvme_release_cmb(dev);
 	nvme_release_prp_pools(dev);
@@ -2689,29 +2549,12 @@ static int __init nvme_init(void)
 	if (result < 0)
 		goto kill_workq;
 
-	result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
-							&nvme_dev_fops);
-	if (result < 0)
-		goto unregister_blkdev;
-	else if (result > 0)
-		nvme_char_major = result;
-
-	nvme_class = class_create(THIS_MODULE, "nvme");
-	if (IS_ERR(nvme_class)) {
-		result = PTR_ERR(nvme_class);
-		goto unregister_chrdev;
-	}
-
 	result = pci_register_driver(&nvme_driver);
 	if (result)
-		goto destroy_class;
+		goto core_exit;
 	return 0;
 
- destroy_class:
-	class_destroy(nvme_class);
- unregister_chrdev:
-	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
- unregister_blkdev:
+ core_exit:
 	nvme_core_exit();
  kill_workq:
 	destroy_workqueue(nvme_workq);
@@ -2723,8 +2566,6 @@ static void __exit nvme_exit(void)
 	pci_unregister_driver(&nvme_driver);
 	nvme_core_exit();
 	destroy_workqueue(nvme_workq);
-	class_destroy(nvme_class);
-	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
 	BUG_ON(nvme_thread && !IS_ERR(nvme_thread));
 	_nvme_check_size();
 }
diff --git a/drivers/nvme/host/scsi.c b/drivers/nvme/host/scsi.c
index b673fe4..34c65b9 100644
--- a/drivers/nvme/host/scsi.c
+++ b/drivers/nvme/host/scsi.c
@@ -606,16 +606,11 @@ static int nvme_trans_device_id_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	int res;
 	int nvme_sc;
 	int xfer_len;
-	u32 vs;
 	__be32 tmp_id = cpu_to_be32(ns->ns_id);
 
-	res = ctrl->ops->reg_read32(ctrl, NVME_REG_VS, &vs);
-	if (res)
-		return res;
-
 	memset(inq_response, 0, alloc_len);
 	inq_response[1] = INQ_DEVICE_IDENTIFICATION_PAGE;    /* Page Code */
-	if (vs >= NVME_VS(1, 1)) {
+	if (ctrl->vs >= NVME_VS(1, 1)) {
 		struct nvme_id_ns *id_ns;
 		void *eui;
 		int len;
@@ -627,7 +622,7 @@ static int nvme_trans_device_id_page(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 
 		eui = id_ns->eui64;
 		len = sizeof(id_ns->eui64);
-		if (vs >= NVME_VS(1, 2)) {
+		if (ctrl->vs >= NVME_VS(1, 2)) {
 			if (bitmap_empty(eui, len * 8)) {
 				eui = id_ns->nguid;
 				len = sizeof(id_ns->nguid);
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 24/47] nvme: only add a controller to dev_list after it's been fully initialized
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (22 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 23/47] nvme: move chardev and sysfs interface " Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 25/47] nvme: don't take the I/O queue q_lock in nvme_timeout Christoph Hellwig
                   ` (10 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


Without this we can easily get bad dereferences on nvmeq->q_db when the
nvme kthread tries to poll the CQs for controllers that are in a half
initialized state.
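
The rule being enforced is the usual publish-after-init ordering: an
object may only be added to a list that another thread polls once every
field the poller dereferences is valid.  As an abstract sketch (all
names invented here, not the actual driver code):

	static void foo_probe(struct foo_dev *dev)
	{
		dev->q_db = dev->bar + 4096;	/* 1: init what pollers touch */

		spin_lock(&foo_dev_lock);	/* 2: only then publish */
		list_add(&dev->node, &foo_dev_list);
		spin_unlock(&foo_dev_lock);
	}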

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 51 +++++++++++++++++++++++++++++--------------------
 1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8251735..c8e381b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2082,6 +2082,30 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 	kthread_stop(kworker_task);
 }
 
+static int nvme_dev_list_add(struct nvme_dev *dev)
+{
+	bool start_thread = false;
+
+	spin_lock(&dev_list_lock);
+	if (list_empty(&dev_list) && IS_ERR_OR_NULL(nvme_thread)) {
+		start_thread = true;
+		nvme_thread = NULL;
+	}
+	list_add(&dev->node, &dev_list);
+	spin_unlock(&dev_list_lock);
+
+	if (start_thread) {
+		nvme_thread = kthread_run(nvme_kthread, NULL, "nvme");
+		wake_up_all(&nvme_kthread_wait);
+	} else
+		wait_event_killable(nvme_kthread_wait, nvme_thread);
+
+	if (IS_ERR_OR_NULL(nvme_thread))
+		return nvme_thread ? PTR_ERR(nvme_thread) : -EINTR;
+
+	return 0;
+}
+
 /*
 * Remove the node from the device list and check
 * for whether or not we need to stop the nvme_thread.
@@ -2197,7 +2221,6 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 static void nvme_probe_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, probe_work);
-	bool start_thread = false;
 	int result;
 
 	result = nvme_dev_map(dev);
@@ -2208,25 +2231,6 @@ static void nvme_probe_work(struct work_struct *work)
 	if (result)
 		goto unmap;
 
-	spin_lock(&dev_list_lock);
-	if (list_empty(&dev_list) && IS_ERR_OR_NULL(nvme_thread)) {
-		start_thread = true;
-		nvme_thread = NULL;
-	}
-	list_add(&dev->node, &dev_list);
-	spin_unlock(&dev_list_lock);
-
-	if (start_thread) {
-		nvme_thread = kthread_run(nvme_kthread, NULL, "nvme");
-		wake_up_all(&nvme_kthread_wait);
-	} else
-		wait_event_killable(nvme_kthread_wait, nvme_thread);
-
-	if (IS_ERR_OR_NULL(nvme_thread)) {
-		result = nvme_thread ? PTR_ERR(nvme_thread) : -EINTR;
-		goto disable;
-	}
-
 	nvme_init_queue(dev->queues[0], 0);
 	result = nvme_alloc_admin_tags(dev);
 	if (result)
@@ -2242,6 +2246,10 @@ static void nvme_probe_work(struct work_struct *work)
 
 	dev->ctrl.event_limit = 1;
 
+	result = nvme_dev_list_add(dev);
+	if (result)
+		goto remove;
+
 	/*
 	 * Keep the controller around but remove all namespaces if we don't have
 	 * any working I/O queue.
@@ -2256,6 +2264,8 @@ static void nvme_probe_work(struct work_struct *work)
 
 	return;
 
+ remove:
+	nvme_dev_list_remove(dev);
  free_tags:
 	nvme_dev_remove_admin(dev);
 	blk_put_queue(dev->ctrl.admin_q);
@@ -2263,7 +2273,6 @@ static void nvme_probe_work(struct work_struct *work)
 	dev->queues[0]->tags = NULL;
  disable:
 	nvme_disable_queue(dev, 0);
-	nvme_dev_list_remove(dev);
  unmap:
 	nvme_dev_unmap(dev);
  out:
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 25/47] nvme: don't take the I/O queue q_lock in nvme_timeout
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (23 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 24/47] nvme: only add a controller to dev_list after it's been fully initialized Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2017-03-10 12:51   ` David Woodhouse
  2015-11-20 16:35 ` [PATCH 26/47] nvme: merge nvme_abort_req and nvme_timeout Christoph Hellwig
                   ` (9 subsequent siblings)
  34 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


There is nothing it protects, but it makes lockdep unhappy in many different
ways.

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c8e381b..df3ddeb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1086,13 +1086,13 @@ static void nvme_abort_req(struct request *req)
 	struct nvme_command cmd;
 
 	if (!nvmeq->qid || cmd_rq->aborted) {
-		spin_lock(&dev_list_lock);
+		spin_lock_irq(&dev_list_lock);
 		if (!__nvme_reset(dev)) {
 			dev_warn(dev->dev,
 				 "I/O %d QID %d timeout, reset controller\n",
 				 req->tag, nvmeq->qid);
 		}
-		spin_unlock(&dev_list_lock);
+		spin_unlock_irq(&dev_list_lock);
 		return;
 	}
 
@@ -1156,9 +1156,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 	dev_warn(nvmeq->q_dmadev, "Timeout I/O %d QID %d\n", req->tag,
 							nvmeq->qid);
-	spin_lock_irq(&nvmeq->q_lock);
 	nvme_abort_req(req);
-	spin_unlock_irq(&nvmeq->q_lock);
 
 	/*
 	 * The aborted req will be completed on receiving the abort req.
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 26/47] nvme: merge nvme_abort_req and nvme_timeout
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (24 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 25/47] nvme: don't take the I/O queue q_lock in nvme_timeout Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 27/47] nvme: do not restart the request timeout if we're resetting the controller Christoph Hellwig
                   ` (8 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


We want to be able to return better error values from nvme_timeout,
which is significantly easier if the two functions are merged.  Also
clean up and reduce the printk spew so that we only get one message per
abort.

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 47 ++++++++++++++++++-----------------------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index df3ddeb..58cff75 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1070,13 +1070,7 @@ static int adapter_delete_sq(struct nvme_dev *dev, u16 sqid)
 	return adapter_delete_queue(dev, nvme_admin_delete_sq, sqid);
 }
 
-/**
- * nvme_abort_req - Attempt aborting a request
- *
- * Schedule controller reset if the command was already aborted once before and
- * still hasn't been returned to the driver, or if this is the admin queue.
- */
-static void nvme_abort_req(struct request *req)
+static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 {
 	struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
 	struct nvme_queue *nvmeq = cmd_rq->nvmeq;
@@ -1085,6 +1079,11 @@ static void nvme_abort_req(struct request *req)
 	struct nvme_cmd_info *abort_cmd;
 	struct nvme_command cmd;
 
+	/*
+	 * Schedule controller reset if the command was already aborted once
+	 * before and still hasn't been returned to the driver, or if this is
+	 * the admin queue.
+	 */
 	if (!nvmeq->qid || cmd_rq->aborted) {
 		spin_lock_irq(&dev_list_lock);
 		if (!__nvme_reset(dev)) {
@@ -1093,16 +1092,16 @@ static void nvme_abort_req(struct request *req)
 				 req->tag, nvmeq->qid);
 		}
 		spin_unlock_irq(&dev_list_lock);
-		return;
+		return BLK_EH_RESET_TIMER;
 	}
 
 	if (!dev->ctrl.abort_limit)
-		return;
+		return BLK_EH_RESET_TIMER;
 
 	abort_req = blk_mq_alloc_request(dev->ctrl.admin_q, WRITE,
 			BLK_MQ_REQ_NOWAIT);
 	if (IS_ERR(abort_req))
-		return;
+		return BLK_EH_RESET_TIMER;
 
 	abort_cmd = blk_mq_rq_to_pdu(abort_req);
 	nvme_set_info(abort_cmd, abort_req, abort_completion);
@@ -1116,9 +1115,16 @@ static void nvme_abort_req(struct request *req)
 	--dev->ctrl.abort_limit;
 	cmd_rq->aborted = 1;
 
-	dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", req->tag,
-							nvmeq->qid);
+	dev_warn(nvmeq->q_dmadev, "I/O %d QID %d timeout, aborting\n",
+				 req->tag, nvmeq->qid);
 	nvme_submit_cmd(dev->queues[0], &cmd);
+
+	/*
+	 * The aborted req will be completed on receiving the abort req.
+	 * We enable the timer again. If hit twice, it'll cause a device reset,
+	 * as the device then is in a faulty state.
+	 */
+	return BLK_EH_RESET_TIMER;
 }
 
 static void nvme_cancel_queue_ios(struct request *req, void *data, bool reserved)
@@ -1149,23 +1155,6 @@ static void nvme_cancel_queue_ios(struct request *req, void *data, bool reserved
 	fn(nvmeq, ctx, &cqe);
 }
 
-static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
-{
-	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
-	struct nvme_queue *nvmeq = cmd->nvmeq;
-
-	dev_warn(nvmeq->q_dmadev, "Timeout I/O %d QID %d\n", req->tag,
-							nvmeq->qid);
-	nvme_abort_req(req);
-
-	/*
-	 * The aborted req will be completed on receiving the abort req.
-	 * We enable the timer again. If hit twice, it'll cause a device reset,
-	 * as the device then is in a faulty state.
-	 */
-	return BLK_EH_RESET_TIMER;
-}
-
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 27/47] nvme: do not restart the request timeout if we're resetting the controller
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (25 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 26/47] nvme: merge nvme_abort_req and nvme_timeout Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 28/47] nvme: simplify resets Christoph Hellwig
                   ` (7 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


Otherwise we are never going to complete a command that is restarted
just after we completed all other outstanding commands in
nvme_clear_queue.
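
For context, a blk-mq ->timeout handler picks one of three behaviours;
BLK_EH_QUIESCED is the return value added earlier in this series, the
other two predate it.  A sketch with hypothetical foo_* helpers:

	static enum blk_eh_timer_return foo_timeout(struct request *req,
			bool reserved)
	{
		if (foo_reset_in_progress(req))
			/* don't touch req until the reset completes it */
			return BLK_EH_QUIESCED;
		if (foo_abort_in_flight(req))
			/* re-arm the timer and wait for the abort */
			return BLK_EH_RESET_TIMER;
		req->errors = -EIO;
		return BLK_EH_HANDLED;	/* complete the request right away */
	}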

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 58cff75..9a6e47e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1092,7 +1092,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 				 req->tag, nvmeq->qid);
 		}
 		spin_unlock_irq(&dev_list_lock);
-		return BLK_EH_RESET_TIMER;
+
+		/*
+		 * Mark the request as quiesced, that is don't do anything with
+		 * it until the error completion from the controller reset
+		 * comes in.
+		 */
+		return BLK_EH_QUIESCED;
 	}
 
 	if (!dev->ctrl.abort_limit)
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 28/47] nvme: simplify resets
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (26 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 27/47] nvme: do not restart the request timeout if we're resetting the controller Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 29/47] nvme: merge probe_work and reset_work Christoph Hellwig
                   ` (6 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


Don't delete the controller from dev_list before queuing a reset;
instead just check whether it is being reset in the polling kthread.
This allows us to remove the dev_list_lock in various places, and in
addition we can simply rely on the queue_work return value to see
whether a reset could be queued.
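
Schematically, the queue_work() contract this relies on is that only
the first caller queues a still-pending work item:

	bool first  = queue_work(wq, &dev->reset_work);	/* true: queued */
	bool second = queue_work(wq, &dev->reset_work);	/* false: no-op */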

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 39 +++++++++++++--------------------------
 1 file changed, 13 insertions(+), 26 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9a6e47e..3173f0c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -76,7 +76,6 @@ static wait_queue_head_t nvme_kthread_wait;
 struct nvme_dev;
 struct nvme_queue;
 
-static int __nvme_reset(struct nvme_dev *dev);
 static int nvme_reset(struct nvme_dev *dev);
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dead_ctrl(struct nvme_dev *dev);
@@ -1085,13 +1084,11 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 * the admin queue.
 	 */
 	if (!nvmeq->qid || cmd_rq->aborted) {
-		spin_lock_irq(&dev_list_lock);
-		if (!__nvme_reset(dev)) {
+		if (queue_work(nvme_workq, &dev->reset_work)) {
 			dev_warn(dev->dev,
 				 "I/O %d QID %d timeout, reset controller\n",
 				 req->tag, nvmeq->qid);
 		}
-		spin_unlock_irq(&dev_list_lock);
 
 		/*
 		 * Mark the request as quiesced, that is don't do anything with
@@ -1589,9 +1586,15 @@ static int nvme_kthread(void *data)
 			int i;
 			u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
+			/*
+			 * Skip controllers currently under reset.
+			 */
+			if (work_pending(&dev->reset_work) || work_busy(&dev->reset_work))
+				continue;
+
 			if ((dev->subsystem && (csts & NVME_CSTS_NSSRO)) ||
 							csts & NVME_CSTS_CFS) {
-				if (!__nvme_reset(dev)) {
+				if (queue_work(nvme_workq, &dev->reset_work)) {
 					dev_warn(dev->dev,
 						"Failed status: %x, reset controller\n",
 						readl(dev->bar + NVME_REG_CSTS));
@@ -2318,33 +2321,17 @@ static void nvme_reset_work(struct work_struct *ws)
 	schedule_work(&dev->probe_work);
 }
 
-static int __nvme_reset(struct nvme_dev *dev)
-{
-	if (work_pending(&dev->reset_work))
-		return -EBUSY;
-	list_del_init(&dev->node);
-	queue_work(nvme_workq, &dev->reset_work);
-	return 0;
-}
-
 static int nvme_reset(struct nvme_dev *dev)
 {
-	int ret;
-
 	if (!dev->ctrl.admin_q || blk_queue_dying(dev->ctrl.admin_q))
 		return -ENODEV;
 
-	spin_lock(&dev_list_lock);
-	ret = __nvme_reset(dev);
-	spin_unlock(&dev_list_lock);
-
-	if (!ret) {
-		flush_work(&dev->reset_work);
-		flush_work(&dev->probe_work);
-		return 0;
-	}
+	if (!queue_work(nvme_workq, &dev->reset_work))
+		return -EBUSY;
 
-	return ret;
+	flush_work(&dev->reset_work);
+	flush_work(&dev->probe_work);
+	return 0;
 }
 
 static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 29/47] nvme: merge probe_work and reset_work
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (27 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 28/47] nvme: simplify resets Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 30/47] nvme: remove dead controllers from a work item Christoph Hellwig
                   ` (5 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


If we're using two work queues we're always going to run into races where
one item is tearing down what the other one is initializing.  So instead
merge the two work queues, and let the old probe_work also tear the
controller down first if it was alive.  Together with the better detection
of the probe path using a flag this gives us a properly serialized
reset/probe path that also doesn't accidentally trigger when two commands
time out and the second one tries to reset the controller while the first
reset is still in progress.
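
In sketch form, the serialization is a single flag bit bracketing the
(re)initialization sequence (names taken from the diff below):

	set_bit(NVME_CTRL_RESETTING, &dev->flags);
	/* ... shut down and reinitialize the controller ... */
	clear_bit(NVME_CTRL_RESETTING, &dev->flags);

nvme_timeout() then checks test_bit(NVME_CTRL_RESETTING, &dev->flags) to
fail commands immediately instead of escalating to yet another reset.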

Signed-off-by: Christoph Hellwig <hch at lst.de>
---
 drivers/nvme/host/pci.c | 64 ++++++++++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3173f0c..fdd0df9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -110,7 +110,6 @@ struct nvme_dev {
 	struct msix_entry *entry;
 	void __iomem *bar;
 	struct work_struct reset_work;
-	struct work_struct probe_work;
 	struct work_struct scan_work;
 	bool subsystem;
 	u32 page_size;
@@ -118,6 +117,8 @@ struct nvme_dev {
 	dma_addr_t cmb_dma_addr;
 	u64 cmb_size;
 	u32 cmbsz;
+	unsigned long flags;
+#define NVME_CTRL_RESETTING	0
 
 	struct nvme_ctrl ctrl;
 };
@@ -1079,6 +1080,17 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	struct nvme_command cmd;
 
 	/*
+	 * Just return an error to reset_work if we're already called from
+	 * probe context.  We don't worry about request / tag reuse as
+	 * reset_work will immediately unwind and remove the controller
+	 * once we fail a command.
+	 */
+	if (test_bit(NVME_CTRL_RESETTING, &dev->flags)) {
+		req->errors = -EIO;
+		return BLK_EH_HANDLED;
+	}
+
+	/*
 	 * Schedule controller reset if the command was already aborted once
 	 * before and still hasn't been returned to the driver, or if this is
 	 * the admin queue.
@@ -2214,11 +2226,23 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	kfree(dev);
 }
 
-static void nvme_probe_work(struct work_struct *work)
+static void nvme_reset_work(struct work_struct *work)
 {
-	struct nvme_dev *dev = container_of(work, struct nvme_dev, probe_work);
+	struct nvme_dev *dev = container_of(work, struct nvme_dev, reset_work);
 	int result;
 
+	if (WARN_ON(test_bit(NVME_CTRL_RESETTING, &dev->flags)))
+		goto out;
+
+	/*
+	 * If we're called to reset a live controller first shut it down before
+	 * moving on.
+	 */
+	if (dev->bar)
+		nvme_dev_shutdown(dev);
+
+	set_bit(NVME_CTRL_RESETTING, &dev->flags);
+
 	result = nvme_dev_map(dev);
 	if (result)
 		goto out;
@@ -2258,6 +2282,7 @@ static void nvme_probe_work(struct work_struct *work)
 		nvme_dev_add(dev);
 	}
 
+	clear_bit(NVME_CTRL_RESETTING, &dev->flags);
 	return;
 
  remove:
@@ -2272,7 +2297,7 @@ static void nvme_probe_work(struct work_struct *work)
  unmap:
 	nvme_dev_unmap(dev);
  out:
-	if (!work_busy(&dev->reset_work))
+	if (!work_pending(&dev->reset_work))
 		nvme_dead_ctrl(dev);
 }
 
@@ -2299,28 +2324,6 @@ static void nvme_dead_ctrl(struct nvme_dev *dev)
 	}
 }
 
-static void nvme_reset_work(struct work_struct *ws)
-{
-	struct nvme_dev *dev = container_of(ws, struct nvme_dev, reset_work);
-	bool in_probe = work_busy(&dev->probe_work);
-
-	nvme_dev_shutdown(dev);
-
-	/* Synchronize with device probe so that work will see failure status
-	 * and exit gracefully without trying to schedule another reset */
-	flush_work(&dev->probe_work);
-
-	/* Fail this device if reset occured during probe to avoid
-	 * infinite initialization loops. */
-	if (in_probe) {
-		nvme_dead_ctrl(dev);
-		return;
-	}
-	/* Schedule device resume asynchronously so the reset work is available
-	 * to cleanup errors that may occur during reinitialization */
-	schedule_work(&dev->probe_work);
-}
-
 static int nvme_reset(struct nvme_dev *dev)
 {
 	if (!dev->ctrl.admin_q || blk_queue_dying(dev->ctrl.admin_q))
@@ -2330,7 +2333,6 @@ static int nvme_reset(struct nvme_dev *dev)
 		return -EBUSY;
 
 	flush_work(&dev->reset_work);
-	flush_work(&dev->probe_work);
 	return 0;
 }
 
@@ -2399,7 +2401,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	INIT_LIST_HEAD(&dev->node);
 	INIT_WORK(&dev->scan_work, nvme_dev_scan);
-	INIT_WORK(&dev->probe_work, nvme_probe_work);
 	INIT_WORK(&dev->reset_work, nvme_reset_work);
 
 	result = nvme_setup_prp_pools(dev);
@@ -2411,7 +2412,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto release_pools;
 
-	schedule_work(&dev->probe_work);
+	schedule_work(&dev->reset_work);
 	return 0;
 
  release_pools:
@@ -2432,7 +2433,7 @@ static void nvme_reset_notify(struct pci_dev *pdev, bool prepare)
 	if (prepare)
 		nvme_dev_shutdown(dev);
 	else
-		schedule_work(&dev->probe_work);
+		schedule_work(&dev->reset_work);
 }
 
 static void nvme_shutdown(struct pci_dev *pdev)
@@ -2450,7 +2451,6 @@ static void nvme_remove(struct pci_dev *pdev)
 	spin_unlock(&dev_list_lock);
 
 	pci_set_drvdata(pdev, NULL);
-	flush_work(&dev->probe_work);
 	flush_work(&dev->reset_work);
 	flush_work(&dev->scan_work);
 	nvme_remove_namespaces(&dev->ctrl);
@@ -2484,7 +2484,7 @@ static int nvme_resume(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
-	schedule_work(&ndev->probe_work);
+	schedule_work(&ndev->reset_work);
 	return 0;
 }
 #endif
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 30/47] nvme: remove dead controllers from a work item
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (28 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 29/47] nvme: merge probe_work and reset_work Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 31/47] nvme: switch abort_limit to an atomic_t Christoph Hellwig
                   ` (4 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


Compared to the kthread this gives us multiple call prevention for free.
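
In sketch form (matching the hunk below), the work item owns a
controller reference, and schedule_work() returning false means a
removal is already queued, so the extra reference is dropped again:

	kref_get(&dev->ctrl.kref);		/* reference owned by remove_work */
	if (!schedule_work(&dev->remove_work))
		nvme_put_ctrl(&dev->ctrl);	/* already queued, drop our ref */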

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fdd0df9..f64c2bb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -78,7 +78,7 @@ struct nvme_queue;
 
 static int nvme_reset(struct nvme_dev *dev);
 static void nvme_process_cq(struct nvme_queue *nvmeq);
-static void nvme_dead_ctrl(struct nvme_dev *dev);
+static void nvme_remove_dead_ctrl(struct nvme_dev *dev);
 
 struct async_cmd_info {
 	struct kthread_work work;
@@ -111,6 +111,7 @@ struct nvme_dev {
 	void __iomem *bar;
 	struct work_struct reset_work;
 	struct work_struct scan_work;
+	struct work_struct remove_work;
 	bool subsystem;
 	u32 page_size;
 	void __iomem *cmb;
@@ -2297,31 +2298,25 @@ static void nvme_reset_work(struct work_struct *work)
  unmap:
 	nvme_dev_unmap(dev);
  out:
-	if (!work_pending(&dev->reset_work))
-		nvme_dead_ctrl(dev);
+	nvme_remove_dead_ctrl(dev);
 }
 
-static int nvme_remove_dead_ctrl(void *arg)
+static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 {
-	struct nvme_dev *dev = (struct nvme_dev *)arg;
+	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
 	if (pci_get_drvdata(pdev))
 		pci_stop_and_remove_bus_device_locked(pdev);
 	nvme_put_ctrl(&dev->ctrl);
-	return 0;
 }
 
-static void nvme_dead_ctrl(struct nvme_dev *dev)
+static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
 {
-	dev_warn(dev->dev, "Device failed to resume\n");
+	dev_warn(dev->dev, "Removing after probe failure\n");
 	kref_get(&dev->ctrl.kref);
-	if (IS_ERR(kthread_run(nvme_remove_dead_ctrl, dev, "nvme%d",
-						dev->ctrl.instance))) {
-		dev_err(dev->dev,
-			"Failed to start controller remove task\n");
+	if (!schedule_work(&dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
-	}
 }
 
 static int nvme_reset(struct nvme_dev *dev)
@@ -2402,6 +2397,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	INIT_LIST_HEAD(&dev->node);
 	INIT_WORK(&dev->scan_work, nvme_dev_scan);
 	INIT_WORK(&dev->reset_work, nvme_reset_work);
+	INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
 
 	result = nvme_setup_prp_pools(dev);
 	if (result)
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 31/47] nvme: switch abort_limit to an atomic_t
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (29 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 30/47] nvme: remove dead controllers from a work item Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 32/47] NVMe: Implement namespace list scanning Christoph Hellwig
                   ` (3 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


There is no lock to synchronize access to the abort_limit field of
struct nvme_ctrl, so switch it to an atomic_t.
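
The resulting pattern, paraphrased from the pci.c hunk below: decrement
to claim an abort slot, and re-increment on any path that does not end
up sending an abort command:

	if (atomic_dec_and_test(&ctrl->abort_limit))
		return BLK_EH_RESET_TIMER;	/* no abort slots left */

	abort_req = blk_mq_alloc_request(ctrl->admin_q, WRITE, BLK_MQ_REQ_NOWAIT);
	if (IS_ERR(abort_req)) {
		atomic_inc(&ctrl->abort_limit);	/* give the slot back */
		return BLK_EH_RESET_TIMER;
	}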

Signed-off-by: Christoph Hellwig <hch at lst.de>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 2 +-
 drivers/nvme/host/nvme.h | 2 +-
 drivers/nvme/host/pci.c  | 9 +++++----
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4d83586..655295f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -718,7 +718,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	}
 
 	ctrl->oncs = le16_to_cpup(&id->oncs);
-	ctrl->abort_limit = id->acl + 1;
+	atomic_set(&ctrl->abort_limit, id->acl + 1);
 	ctrl->vwc = id->vwc;
 	memcpy(ctrl->serial, id->sn, sizeof(id->sn));
 	memcpy(ctrl->model, id->mn, sizeof(id->mn));
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 40c9fed..993767e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -61,7 +61,7 @@ struct nvme_ctrl {
 	u32 stripe_size;
 	u32 page_size;
 	u16 oncs;
-	u16 abort_limit;
+	atomic_t abort_limit;
 	u8 event_limit;
 	u8 vwc;
 	u32 vs;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f64c2bb..425e48a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -385,7 +385,7 @@ static void abort_completion(struct nvme_queue *nvmeq, void *ctx,
 	blk_mq_free_request(req);
 
 	dev_warn(nvmeq->q_dmadev, "Abort status:%x result:%x", status, result);
-	++nvmeq->dev->ctrl.abort_limit;
+	atomic_inc(&nvmeq->dev->ctrl.abort_limit);
 }
 
 static void async_completion(struct nvme_queue *nvmeq, void *ctx,
@@ -1111,13 +1111,15 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		return BLK_EH_QUIESCED;
 	}
 
-	if (!dev->ctrl.abort_limit)
+	if (atomic_dec_and_test(&dev->ctrl.abort_limit))
 		return BLK_EH_RESET_TIMER;
 
 	abort_req = blk_mq_alloc_request(dev->ctrl.admin_q, WRITE,
 			BLK_MQ_REQ_NOWAIT);
-	if (IS_ERR(abort_req))
+	if (IS_ERR(abort_req)) {
+		atomic_inc(&dev->ctrl.abort_limit);
 		return BLK_EH_RESET_TIMER;
+	}
 
 	abort_cmd = blk_mq_rq_to_pdu(abort_req);
 	nvme_set_info(abort_cmd, abort_req, abort_completion);
@@ -1128,7 +1130,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	cmd.abort.sqid = cpu_to_le16(nvmeq->qid);
 	cmd.abort.command_id = abort_req->tag;
 
-	--dev->ctrl.abort_limit;
 	cmd_rq->aborted = 1;
 
 	dev_warn(nvmeq->q_dmadev, "I/O %d QID %d timeout, aborting\n",
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 32/47] NVMe: Implement namespace list scanning
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (30 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 31/47] nvme: switch abort_limit to an atomic_t Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 33/47] NVMe: Use unbounded work queue for all work Christoph Hellwig
                   ` (2 subsequent siblings)
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


From: Keith Busch <keith.busch@intel.com>

The NVMe 1.1 specification provides an identify mode to return a
list of active namespaces. This makes it more efficient to discover
which namespace identifiers are active on a controller, potentially
providing a significant improvement in scan time for controllers with
sparsely populated namespaces.
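
Roughly, the new scan path (nvme_scan_ns_list() below) walks the active
namespaces in 1024-entry chunks: each Identify command with CNS=2
returns the next page of active NSIDs greater than the NSID passed in,
and a zero entry terminates the list:

	c.identify.opcode = nvme_admin_identify;
	c.identify.cns = cpu_to_le32(2);	/* active namespace ID list */
	c.identify.nsid = cpu_to_le32(prev);	/* return NSIDs > prev */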

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 79 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 70 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 655295f..38d2ecc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -255,6 +255,16 @@ int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 	return error;
 }
 
+static int nvme_identify_ns_list(struct nvme_ctrl *dev, unsigned nsid, __le32 *ns_list)
+{
+	struct nvme_command c = { };
+
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = cpu_to_le32(2);
+	c.identify.nsid = cpu_to_le32(nsid);
+	return nvme_submit_sync_cmd(dev->admin_q, &c, ns_list, 0x1000);
+}
+
 int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 		struct nvme_id_ns **id)
 {
@@ -950,33 +960,84 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	nvme_put_ns(ns);
 }
 
+static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+{
+	struct nvme_ns *ns;
+
+	ns = nvme_find_ns(ctrl, nsid);
+	if (ns) {
+		if (revalidate_disk(ns->disk))
+			nvme_ns_remove(ns);
+	} else
+		nvme_alloc_ns(ctrl, nsid);
+}
+
+static int nvme_scan_ns_list(struct nvme_ctrl *ctrl, unsigned nn)
+{
+	struct nvme_ns *ns;
+	__le32 *ns_list;
+	unsigned i, j, nsid, prev = 0, num_lists = DIV_ROUND_UP(nn, 1024);
+	int ret = 0;
+
+	ns_list = kzalloc(0x1000, GFP_KERNEL);
+	if (!ns_list)
+		return -ENOMEM;
+
+	for (i = 0; i < num_lists; i++) {
+		ret = nvme_identify_ns_list(ctrl, prev, ns_list);
+		if (ret)
+			goto out;
+
+		for (j = 0; j < min(nn, 1024U); j++) {
+			nsid = le32_to_cpu(ns_list[j]);
+			if (!nsid)
+				goto out;
+
+			nvme_validate_ns(ctrl, nsid);
+
+			while (++prev < nsid) {
+				ns = nvme_find_ns(ctrl, prev);
+				if (ns)
+					nvme_ns_remove(ns);
+			}
+		}
+		nn -= j;
+	}
+ out:
+	kfree(ns_list);
+	return ret;
+}
+
 static void __nvme_scan_namespaces(struct nvme_ctrl *ctrl, unsigned nn)
 {
 	struct nvme_ns *ns, *next;
 	unsigned i;
 
-	for (i = 1; i <= nn; i++) {
-		ns = nvme_find_ns(ctrl, i);
-		if (ns) {
-			if (revalidate_disk(ns->disk))
-				nvme_ns_remove(ns);
-		} else
-			nvme_alloc_ns(ctrl, i);
-	}
+	for (i = 1; i <= nn; i++)
+		nvme_validate_ns(ctrl, i);
+
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
 		if (ns->ns_id > nn)
 			nvme_ns_remove(ns);
 	}
-	list_sort(NULL, &ctrl->namespaces, ns_cmp);
 }
 
 void nvme_scan_namespaces(struct nvme_ctrl *ctrl)
 {
 	struct nvme_id_ctrl *id;
+	unsigned nn;
 
 	if (nvme_identify_ctrl(ctrl, &id))
 		return;
+
+	nn = le32_to_cpu(id->nn);
+	if (ctrl->vs >= NVME_VS(1, 1)) {
+		if (!nvme_scan_ns_list(ctrl, nn))
+			goto done;
+	}
 	__nvme_scan_namespaces(ctrl, le32_to_cpup(&id->nn));
+ done:
+	list_sort(NULL, &ctrl->namespaces, ns_cmp);
 	kfree(id);
 }
 
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 33/47] NVMe: Use unbounded work queue for all work
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (31 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 32/47] NVMe: Implement namespace list scanning Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:35 ` [PATCH 34/47] NVMe: Remove device management handles on remove Christoph Hellwig
  2015-11-20 16:50 ` NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


From: Keith Busch <keith.busch@intel.com>

Removes all usage of the global work queue so work can't be
scheduled on two different work queues, and makes nvme's work queue
multithreaded so controllers can be driven in parallel.
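
The core of the change, in sketch form (see the last hunk below); a
max_active of 0 means the workqueue default, so work items for
different controllers may now run concurrently:

	nvme_workq = alloc_workqueue("nvme", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);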

Signed-off-by: Keith Busch <keith.busch at intel.com>
[hch: keep the dead controller removal on the system workqueue to avoid
 deadlocks]
Signed-off-by: Christoph Hellwig <hch at lst.de>
---
 drivers/nvme/host/pci.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 425e48a..3a61331 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -368,7 +368,7 @@ static void async_req_completion(struct nvme_queue *nvmeq, void *ctx,
 	switch (result & 0xff07) {
 	case NVME_AER_NOTICE_NS_CHANGED:
 		dev_info(nvmeq->q_dmadev, "rescanning\n");
-		schedule_work(&nvmeq->dev->scan_work);
+		queue_work(nvme_workq, &nvmeq->dev->scan_work);
 	default:
 		dev_warn(nvmeq->q_dmadev, "async event result %08x\n", result);
 	}
@@ -1867,7 +1867,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 			return 0;
 		dev->ctrl.tagset = &dev->tagset;
 	}
-	schedule_work(&dev->scan_work);
+	queue_work(nvme_workq, &dev->scan_work);
 	return 0;
 }
 
@@ -2409,7 +2409,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto release_pools;
 
-	schedule_work(&dev->reset_work);
+	queue_work(nvme_workq, &dev->reset_work);
 	return 0;
 
  release_pools:
@@ -2430,7 +2430,7 @@ static void nvme_reset_notify(struct pci_dev *pdev, bool prepare)
 	if (prepare)
 		nvme_dev_shutdown(dev);
 	else
-		schedule_work(&dev->reset_work);
+		queue_work(nvme_workq, &dev->reset_work);
 }
 
 static void nvme_shutdown(struct pci_dev *pdev)
@@ -2481,7 +2481,7 @@ static int nvme_resume(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
-	schedule_work(&ndev->reset_work);
+	queue_work(nvme_workq, &ndev->reset_work);
 	return 0;
 }
 #endif
@@ -2527,7 +2527,7 @@ static int __init nvme_init(void)
 
 	init_waitqueue_head(&nvme_kthread_wait);
 
-	nvme_workq = create_singlethread_workqueue("nvme");
+	nvme_workq = alloc_workqueue("nvme", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
 	if (!nvme_workq)
 		return -ENOMEM;
 
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 34/47] NVMe: Remove device management handles on remove
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (32 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 33/47] NVMe: Use unbounded work queue for all work Christoph Hellwig
@ 2015-11-20 16:35 ` Christoph Hellwig
  2015-11-20 16:50 ` NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:35 UTC (permalink / raw)


From: Keith Busch <keith.busch@intel.com>

We don't want to allow new references to open on a device that is
removed. This ties the lifetime of these handles to the physical device's
presence rather than to the open reference count.
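
In sketch form, the resulting teardown order in nvme_remove() (see the
pci.c hunk below):

	nvme_remove_namespaces(&dev->ctrl);
	nvme_uninit_ctrl(&dev->ctrl);	/* destroy char dev and sysfs file now */
	nvme_dev_shutdown(dev);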

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 13 +++++++++----
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  |  1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 38d2ecc..379003b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1078,17 +1078,22 @@ static void nvme_release_instance(struct nvme_ctrl *ctrl)
 	spin_unlock(&dev_list_lock);
 }
 
-static void nvme_free_ctrl(struct kref *kref)
-{
-	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
+void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
+{
+	device_remove_file(ctrl->device, &dev_attr_reset_controller);
+	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
 
 	spin_lock(&dev_list_lock);
 	list_del(&ctrl->node);
 	spin_unlock(&dev_list_lock);
+}
+
+static void nvme_free_ctrl(struct kref *kref)
+{
+	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
 
 	put_device(ctrl->device);
 	nvme_release_instance(ctrl);
-	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
 
 	ctrl->ops->free_ctrl(ctrl);
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 993767e..84bde22 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -195,6 +195,7 @@ static inline int nvme_error_status(u16 status)
 	}
 }
 
+void nvme_uninit_ctrl(struct nvme_ctrl *ctrl);
 int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 		const struct nvme_ctrl_ops *ops, u16 vendor,
 		unsigned long quirks);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3a61331..c5187b5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2451,6 +2451,7 @@ static void nvme_remove(struct pci_dev *pdev)
 	flush_work(&dev->reset_work);
 	flush_work(&dev->scan_work);
 	nvme_remove_namespaces(&dev->ctrl);
+	nvme_uninit_ctrl(&dev->ctrl);
 	nvme_dev_shutdown(dev);
 	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread

* NVMe mega patchbomb for Linux 4.5-rc
  2015-11-20 16:34 NVMe mega patchbomb for Linux 4.5-rc Christoph Hellwig
                   ` (33 preceding siblings ...)
  2015-11-20 16:35 ` [PATCH 34/47] NVMe: Remove device management handles on remove Christoph Hellwig
@ 2015-11-20 16:50 ` Christoph Hellwig
  34 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-20 16:50 UTC (permalink / raw)


I ran into a "max SMTP connections reached" error while patch bombing.
I'll try to resend at a quieter time, but in the meantime the git tree
is available for review.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers
  2015-11-20 16:34 ` [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers Christoph Hellwig
@ 2015-11-20 21:43   ` Jeff Moyer
  2015-11-20 21:47     ` Jens Axboe
  2015-11-20 22:20   ` Jeff Moyer
  1 sibling, 1 reply; 63+ messages in thread
From: Jeff Moyer @ 2015-11-20 21:43 UTC (permalink / raw)


Christoph Hellwig <hch at lst.de> writes:

> We only added the request to the request list for the !blk-mq case,
> so we should only delete it in that case as well.

Sorry, I don't follow.  Do you mean the timeout_list?  It appears to me
that blk_add_timer is called for both the old and mq paths.

-Jeff

> Signed-off-by: Christoph Hellwig <hch at lst.de>
> ---
>  block/blk-timeout.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index 246dfb1..aa40aa9 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -158,11 +158,13 @@ void blk_abort_request(struct request *req)
>  {
>  	if (blk_mark_rq_complete(req))
>  		return;
> -	blk_delete_timer(req);
> -	if (req->q->mq_ops)
> +
> +	if (req->q->mq_ops) {
>  		blk_mq_rq_timed_out(req, false);
> -	else
> +	} else {
> +		blk_delete_timer(req);
>  		blk_rq_timed_out(req);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(blk_abort_request);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers
  2015-11-20 21:43   ` Jeff Moyer
@ 2015-11-20 21:47     ` Jens Axboe
  2015-11-20 21:54       ` Jeff Moyer
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2015-11-20 21:47 UTC (permalink / raw)


On 11/20/2015 02:43 PM, Jeff Moyer wrote:
> Christoph Hellwig <hch at lst.de> writes:
>
>> We only added the request to the request list for the !blk-mq case,
>> so we should only delete it in that case as well.
>
> Sorry, I don't follow.  Do you mean the timeout_list?  It appears to me
> that blk_add_timer is called for both the old and mq paths.

It is called for both, but only !mq adds it to the list. We don't need 
that tracking on the mq side, as we can look it up in our tags.
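
For reference, the branch in question in blk_add_timer(), paraphrased
from the code of that era (treat it as a sketch):

	if (!req->q->mq_ops)
		list_add_tail(&req->timeout_list, &req->q->timeout_list);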

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers
  2015-11-20 21:47     ` Jens Axboe
@ 2015-11-20 21:54       ` Jeff Moyer
  0 siblings, 0 replies; 63+ messages in thread
From: Jeff Moyer @ 2015-11-20 21:54 UTC (permalink / raw)


Jens Axboe <axboe at fb.com> writes:

> On 11/20/2015 02:43 PM, Jeff Moyer wrote:
>> Christoph Hellwig <hch at lst.de> writes:
>>
>>> We only added the request to the request list for the !blk-mq case,
>>> so we should only delete it in that case as well.
>>
>> Sorry, I don't follow.  Do you mean the timeout_list?  It appears to me
>> that blk_add_timer is called for both the old and mq paths.
>
> It is called for both, but only !mq adds it to the list. We don't need
> that tracking on the mq side, as we can look it up in our tags.

Yeah, just saw that.  Thanks.

-Jeff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers
  2015-11-20 16:34 ` [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers Christoph Hellwig
  2015-11-20 21:43   ` Jeff Moyer
@ 2015-11-20 22:20   ` Jeff Moyer
  2015-11-21  7:34     ` Christoph Hellwig
  1 sibling, 1 reply; 63+ messages in thread
From: Jeff Moyer @ 2015-11-20 22:20 UTC (permalink / raw)


Christoph Hellwig <hch at lst.de> writes:

> We only added the request to the request list for the !blk-mq case,
> so we should only delete it in that case as well.

I think I'd rather see the conditional moved into blk_delete_timer.
Breaking the balance of add/delete seems ugly to me.  I had expected to
see a later patch make use of the list, but nothing does.  So, this is
just a cleanup (as opposed to a preparatory patch), yes?

-Jeff

> Signed-off-by: Christoph Hellwig <hch at lst.de>
> ---
>  block/blk-timeout.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index 246dfb1..aa40aa9 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -158,11 +158,13 @@ void blk_abort_request(struct request *req)
>  {
>  	if (blk_mark_rq_complete(req))
>  		return;
> -	blk_delete_timer(req);
> -	if (req->q->mq_ops)
> +
> +	if (req->q->mq_ops) {
>  		blk_mq_rq_timed_out(req, false);
> -	else
> +	} else {
> +		blk_delete_timer(req);
>  		blk_rq_timed_out(req);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(blk_abort_request);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 02/47] block: fix blk_abort_request for blk-mq drivers
  2015-11-20 22:20   ` Jeff Moyer
@ 2015-11-21  7:34     ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-21  7:34 UTC (permalink / raw)


On Fri, Nov 20, 2015@05:20:21PM -0500, Jeff Moyer wrote:
> > We only added the request to the request list for the !blk-mq case,
> > so we should only delete it in that case as well.
> 
> I think I'd rather see the conditional moved into blk_delete_timer.
> Breaking the balance of add/delete seems ugly to me.  I had expected to
> see a later patch make use of the list, but nothing does.  So, this is
> just a cleanup (as opposed to a preparatory patch), yes ?

It was needed in an earlier version of the series.  Note that it's still
needed e.g. for libsas.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 03/47] block: defer timeouts to a workqueue
  2015-11-20 16:34 ` [PATCH 03/47] block: defer timeouts to a workqueue Christoph Hellwig
@ 2015-11-23 20:31   ` Jeff Moyer
  2015-11-23 20:48     ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Jeff Moyer @ 2015-11-23 20:31 UTC (permalink / raw)


Christoph Hellwig <hch at lst.de> writes:

> Timer context is not very useful for drivers to perform any meaningful abort
> action from.  So instead of calling the driver from this useless context
> defer it to a workqueue as soon as possible.
>
> Note that while a delayed_work item would seem the right thing here I didn't
> dare to use it due to the magic in blk_add_timer that pokes deep into timer
> internals.  But maybe this encourages Tejun to add a sensible API for that to
> the workqueue API and we'll all be fine in the end :)

I don't see where the blk-mq timeout work is ever scheduled.  You
removed the call to setup_timer for the mq case, so what causes the mq
timeout work to run?  I must be missing something.

-Jeff

>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> ---
>  block/blk-core.c       | 8 ++++++++
>  block/blk-mq.c         | 8 +++++---
>  block/blk-timeout.c    | 5 +++--
>  block/blk.h            | 2 +-
>  include/linux/blkdev.h | 1 +
>  5 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5131993b..1de0974 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -664,6 +664,13 @@ static void blk_queue_usage_counter_release(struct percpu_ref *ref)
>  	wake_up_all(&q->mq_freeze_wq);
>  }
>  
> +static void blk_rq_timed_out_timer(unsigned long data)
> +{
> +	struct request_queue *q = (struct request_queue *)data;
> +
> +	kblockd_schedule_work(&q->timeout_work);
> +}
> +
>  struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  {
>  	struct request_queue *q;
> @@ -825,6 +832,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
>  	if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
>  		goto fail;
>  
> +	INIT_WORK(&q->timeout_work, blk_timeout_work);
>  	q->request_fn		= rfn;
>  	q->prep_rq_fn		= NULL;
>  	q->unprep_rq_fn		= NULL;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ae09de..8354601 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -85,6 +85,7 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
>  	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
>  	if (freeze_depth == 1) {
>  		percpu_ref_kill(&q->q_usage_counter);
> +		cancel_work_sync(&q->timeout_work);
>  		blk_mq_run_hw_queues(q, false);
>  	}
>  }
> @@ -617,9 +618,10 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>  	}
>  }
>  
> -static void blk_mq_rq_timer(unsigned long priv)
> +static void blk_mq_timeout_work(struct work_struct *work)
>  {
> -	struct request_queue *q = (struct request_queue *)priv;
> +	struct request_queue *q =
> +		container_of(work, struct request_queue, timeout_work);
>  	struct blk_mq_timeout_data data = {
>  		.next		= 0,
>  		.next_set	= 0,
> @@ -2015,7 +2017,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  		hctxs[i]->queue_num = i;
>  	}
>  
> -	setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
> +	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
>  	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
>  
>  	q->nr_queues = nr_cpu_ids;
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index aa40aa9..aedd128 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -127,9 +127,10 @@ static void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout
>  	}
>  }
>  
> -void blk_rq_timed_out_timer(unsigned long data)
> +void blk_timeout_work(struct work_struct *work)
>  {
> -	struct request_queue *q = (struct request_queue *) data;
> +	struct request_queue *q =
> +		container_of(work, struct request_queue, timeout_work);
>  	unsigned long flags, next = 0;
>  	struct request *rq, *tmp;
>  	int next_set = 0;
> diff --git a/block/blk.h b/block/blk.h
> index da722eb..37b9165 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -95,7 +95,7 @@ static inline void blk_flush_integrity(void)
>  }
>  #endif
>  
> -void blk_rq_timed_out_timer(unsigned long data);
> +void blk_timeout_work(struct work_struct *work);
>  unsigned long blk_rq_timeout(unsigned long timeout);
>  void blk_add_timer(struct request *req);
>  void blk_delete_timer(struct request *);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 3fe27f8..9a8424a 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -407,6 +407,7 @@ struct request_queue {
>  
>  	unsigned int		rq_timeout;
>  	struct timer_list	timeout;
> +	struct work_struct	timeout_work;
>  	struct list_head	timeout_list;
>  
>  	struct list_head	icq_list;

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 03/47] block: defer timeouts to a workqueue
  2015-11-23 20:31   ` Jeff Moyer
@ 2015-11-23 20:48     ` Christoph Hellwig
  2015-11-23 20:59       ` Jeff Moyer
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-23 20:48 UTC (permalink / raw)


On Mon, Nov 23, 2015@03:31:32PM -0500, Jeff Moyer wrote:
> I don't see where the blk-mq timeout work is ever scheduled.  You
> removed the call to setup_timer for the mq case, so what causes the mq
> timeout work to run?  I must be missing something.

No, we use the timer which we initialize in blk_alloc_queue_node instead
of overwriting it later in the blk-mq case.  The timer gets added in
blk_add_timer either way.
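
So the flow after the patch is: blk_add_timer() arms q->timeout as
before, and when the timer fires it merely punts to the workqueue,
where blk_timeout_work() or blk_mq_timeout_work() runs depending on
the queue type:

	static void blk_rq_timed_out_timer(unsigned long data)
	{
		struct request_queue *q = (struct request_queue *)data;

		kblockd_schedule_work(&q->timeout_work);
	}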

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 03/47] block: defer timeouts to a workqueue
  2015-11-23 20:48     ` Christoph Hellwig
@ 2015-11-23 20:59       ` Jeff Moyer
  0 siblings, 0 replies; 63+ messages in thread
From: Jeff Moyer @ 2015-11-23 20:59 UTC (permalink / raw)


Christoph Hellwig <hch at lst.de> writes:

> On Mon, Nov 23, 2015@03:31:32PM -0500, Jeff Moyer wrote:
>> I don't see where the blk-mq timeout work is ever scheduled.  You
>> removed the call to setup_timer for the mq case, so what causes the mq
>> timeout work to run?  I must be missing something.
>
> No we use the timer which we initialize in blk_alloc_queue_node instead
> of overwriting it later in the blk-mq case.  The timer gets added in
> blk_add_timer either way.

Ah, I see.  I missed the call to blk_alloc_queue_node from
blk_mq_init_queue.  Thanks.

The patch looks good to me.  Everything will still be running in the
same context (irqs off).

Reviewed-by: Jeff Moyer <jmoyer at redhat.com>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 05/47] block: remove REQ_ATOM_COMPLETE wrappers
  2015-11-20 16:35 ` [PATCH 05/47] block: remove REQ_ATOM_COMPLETE wrappers Christoph Hellwig
@ 2015-11-23 21:23   ` Jeff Moyer
  2015-11-24 21:22   ` Jens Axboe
  1 sibling, 0 replies; 63+ messages in thread
From: Jeff Moyer @ 2015-11-23 21:23 UTC (permalink / raw)


Christoph Hellwig <hch at lst.de> writes:

> We only use them inconsistently, and the other flags don't use wrappers like
> this either.

Fine by me.

Reviewed-by: Jeff Moyer <jmoyer at redhat.com>

>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> ---
>  block/blk-core.c    |  2 +-
>  block/blk-mq.c      |  6 +++---
>  block/blk-softirq.c |  2 +-
>  block/blk-timeout.c |  6 +++---
>  block/blk.h         | 18 ++++--------------
>  5 files changed, 12 insertions(+), 22 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 1de0974..af9c315 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1375,7 +1375,7 @@ EXPORT_SYMBOL(blk_rq_set_block_pc);
>  void blk_requeue_request(struct request_queue *q, struct request *rq)
>  {
>  	blk_delete_timer(rq);
> -	blk_clear_rq_complete(rq);
> +	clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
>  	trace_block_rq_requeue(q, rq);
>  
>  	if (rq->cmd_flags & REQ_QUEUED)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 76773dc..c932605 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -383,7 +383,7 @@ void blk_mq_complete_request(struct request *rq, int error)
>  
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -	if (!blk_mark_rq_complete(rq) ||
> +	if (!test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags) ||
>  	    test_and_clear_bit(REQ_ATOM_QUIESCED, &rq->atomic_flags)) {
>  		rq->errors = error;
>  		__blk_mq_complete_request(rq);
> @@ -583,7 +583,7 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		break;
>  	case BLK_EH_RESET_TIMER:
>  		blk_add_timer(req);
> -		blk_clear_rq_complete(req);
> +		clear_bit(REQ_ATOM_COMPLETE, &req->atomic_flags);
>  		break;
>  	case BLK_EH_NOT_HANDLED:
>  		break;
> @@ -614,7 +614,7 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>  		return;
>  
>  	if (time_after_eq(jiffies, rq->deadline)) {
> -		if (!blk_mark_rq_complete(rq))
> +		if (!test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
>  			blk_mq_rq_timed_out(rq, reserved);
>  	} else if (!data->next_set || time_after(data->next, rq->deadline)) {
>  		data->next = rq->deadline;
> diff --git a/block/blk-softirq.c b/block/blk-softirq.c
> index 9d47fbc..b89f655 100644
> --- a/block/blk-softirq.c
> +++ b/block/blk-softirq.c
> @@ -167,7 +167,7 @@ void blk_complete_request(struct request *req)
>  {
>  	if (unlikely(blk_should_fake_timeout(req->q)))
>  		return;
> -	if (!blk_mark_rq_complete(req) ||
> +	if (!test_and_set_bit(REQ_ATOM_COMPLETE, &req->atomic_flags) ||
>  	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))
>  		__blk_complete_request(req);
>  }
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index b3a7f20..cc10db2 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -94,7 +94,7 @@ static void blk_rq_timed_out(struct request *req)
>  		break;
>  	case BLK_EH_RESET_TIMER:
>  		blk_add_timer(req);
> -		blk_clear_rq_complete(req);
> +		clear_bit(REQ_ATOM_COMPLETE, &req->atomic_flags);
>  		break;
>  	case BLK_EH_QUIESCED:
>  		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
> @@ -122,7 +122,7 @@ static void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout
>  		/*
>  		 * Check if we raced with end io completion
>  		 */
> -		if (!blk_mark_rq_complete(rq))
> +		if (!test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
>  			blk_rq_timed_out(rq);
>  	} else if (!*next_set || time_after(*next_timeout, rq->deadline)) {
>  		*next_timeout = rq->deadline;
> @@ -160,7 +160,7 @@ void blk_timeout_work(struct work_struct *work)
>   */
>  void blk_abort_request(struct request *req)
>  {
> -	if (blk_mark_rq_complete(req))
> +	if (test_and_set_bit(REQ_ATOM_COMPLETE, &req->atomic_flags))
>  		return;
>  
>  	if (req->q->mq_ops) {
> diff --git a/block/blk.h b/block/blk.h
> index f4c98f8..1d95107 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -118,26 +118,16 @@ void blk_account_io_done(struct request *req);
>   * Internal atomic flags for request handling
>   */
>  enum rq_atomic_flags {
> +	/*
> +	 * EH timer and IO completion will both attempt to 'grab' the request,
> +	 * make sure that only one of them succeeds by setting this flag.
> +	 */
>  	REQ_ATOM_COMPLETE = 0,
>  	REQ_ATOM_STARTED,
>  	REQ_ATOM_QUIESCED,
>  };
>  
>  /*
> - * EH timer and IO completion will both attempt to 'grab' the request, make
> - * sure that only one of them succeeds
> - */
> -static inline int blk_mark_rq_complete(struct request *rq)
> -{
> -	return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
> -}
> -
> -static inline void blk_clear_rq_complete(struct request *rq)
> -{
> -	clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
> -}
> -
> -/*
>   * Internal elevator interface
>   */
>  #define ELV_ON_HASH(rq) ((rq)->cmd_flags & REQ_HASHED)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-20 16:34 ` [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value Christoph Hellwig
@ 2015-11-24 15:16   ` Jeff Moyer
  2015-11-24 15:40     ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Jeff Moyer @ 2015-11-24 15:16 UTC (permalink / raw)


Hi Christoph,

Christoph Hellwig <hch at lst.de> writes:

> This marks the request as one that's not actually completed yet, but
> should be reaped next time blk_mq_complete_request comes in.  This is
> useful it the abort handler kicked of a reset that will complete all
> pending requests.

What's the purpose, though?  Is this an optimization?

We've had "fun" problems with races between completion and timeout
before.  I can't say I'm too keen on adding more complexity to this code
path.  Have you considered what happens in your new code when this race
occurs?  I don't expect it to cause any issues in the mq case, since the
timeout handler should run on the same cpu as the completion code for a
given request (right?).  However, for the old code path, they could run
in parallel.

blk_complete_request:
A  if (!blk_mark_rq_complete(rq) ||
B      test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags)) {
C        __blk_complete_request(rq);

could run alongside of:

blk_rq_check_expired:
1 if (!blk_mark_rq_complete(rq))
2   blk_rq_timed_out(rq);

So, if 1 comes before A, we have two cases to consider:

i.  the expiration path does not yet set REQ_ATOM_QUIESCED before the
    completion code runs, and so the completion code does nothing.
ii. the expiration path *does* set REQ_ATOM_QUIESCED.  In this instance,
    will we get yet another completion for the request when the command
    is ultimately retired by the adapter reset?
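
To make case ii concrete, one possible interleaving (hypothetical, not
taken from the patch):

	/*
	 * timeout:    blk_mark_rq_complete(rq)        - grabs the request
	 * timeout:    set_bit(REQ_ATOM_QUIESCED)      - via BLK_EH_QUIESCED
	 * completion: blk_mark_rq_complete() fails, but
	 *             test_and_clear_bit(QUIESCED) succeeds
	 *             -> __blk_complete_request() runs
	 * reset:      the adapter reset retires the command and completes
	 *             the request a second time - is that safely ignored?
	 */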

Cheers,
Jeff

>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> ---
>  block/blk-mq.c         | 6 +++++-
>  block/blk-softirq.c    | 3 ++-
>  block/blk-timeout.c    | 3 +++
>  block/blk.h            | 1 +
>  include/linux/blkdev.h | 1 +
>  5 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8354601..76773dc 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -383,7 +383,8 @@ void blk_mq_complete_request(struct request *rq, int error)
>  
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -	if (!blk_mark_rq_complete(rq)) {
> +	if (!blk_mark_rq_complete(rq) ||
> +	    test_and_clear_bit(REQ_ATOM_QUIESCED, &rq->atomic_flags)) {
>  		rq->errors = error;
>  		__blk_mq_complete_request(rq);
>  	}
> @@ -586,6 +587,9 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		break;
>  	case BLK_EH_NOT_HANDLED:
>  		break;
> +	case BLK_EH_QUIESCED:
> +		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
> +		break;
>  	default:
>  		printk(KERN_ERR "block: bad eh return: %d\n", ret);
>  		break;
> diff --git a/block/blk-softirq.c b/block/blk-softirq.c
> index 53b1737..9d47fbc 100644
> --- a/block/blk-softirq.c
> +++ b/block/blk-softirq.c
> @@ -167,7 +167,8 @@ void blk_complete_request(struct request *req)
>  {
>  	if (unlikely(blk_should_fake_timeout(req->q)))
>  		return;
> -	if (!blk_mark_rq_complete(req))
> +	if (!blk_mark_rq_complete(req) ||
> +	    test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))
>  		__blk_complete_request(req);
>  }
>  EXPORT_SYMBOL(blk_complete_request);
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index aedd128..b3a7f20 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -96,6 +96,9 @@ static void blk_rq_timed_out(struct request *req)
>  		blk_add_timer(req);
>  		blk_clear_rq_complete(req);
>  		break;
> +	case BLK_EH_QUIESCED:
> +		set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
> +		break;
>  	case BLK_EH_NOT_HANDLED:
>  		/*
>  		 * LLD handles this for now but in the future
> diff --git a/block/blk.h b/block/blk.h
> index 37b9165..f4c98f8 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -120,6 +120,7 @@ void blk_account_io_done(struct request *req);
>  enum rq_atomic_flags {
>  	REQ_ATOM_COMPLETE = 0,
>  	REQ_ATOM_STARTED,
> +	REQ_ATOM_QUIESCED,
>  };
>  
>  /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 9a8424a..5df5fb13 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -223,6 +223,7 @@ enum blk_eh_timer_return {
>  	BLK_EH_NOT_HANDLED,
>  	BLK_EH_HANDLED,
>  	BLK_EH_RESET_TIMER,
> +	BLK_EH_QUIESCED,
>  };
>  
>  typedef enum blk_eh_timer_return (rq_timed_out_fn)(struct request *);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request
  2015-11-20 16:35 ` [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request Christoph Hellwig
@ 2015-11-24 15:19   ` Jeff Moyer
  2015-11-24 15:32     ` Christoph Hellwig
  2015-11-24 21:21   ` Jens Axboe
  1 sibling, 1 reply; 63+ messages in thread
From: Jeff Moyer @ 2015-11-24 15:19 UTC (permalink / raw)


Christoph Hellwig <hch at lst.de> writes:

> We already have the reserved flag, and a nowait flag awkwardly encoded as
> a gfp_t.  Add a real flags argument to make the scheme more extensible and
> allow for a nicer calling convention.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> ---
>  block/blk-core.c                  |   11 +-
>  block/blk-mq-tag.c                |   11 +-
>  block/blk-mq.c                    |   20 +-
>  block/blk-mq.h                    |   11 +-
>  block/blk.h                       |    2 +-
>  drivers/block/mtip32xx/mtip32xx.c |    2 +-
>  drivers/nvme/host/core.c          | 1172 +++++++++++++++++++++++++++++++++++++

Christoph, I think you included a bit too much in this patch!  ;-)

-Jeff

>  drivers/nvme/host/pci.c           |   11 +-
>  include/linux/blk-mq.h            |    8 +-
>  9 files changed, 1210 insertions(+), 38 deletions(-)
>  create mode 100644 drivers/nvme/host/core.c
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index af9c315..d2100aa 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -630,7 +630,7 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(blk_alloc_queue);
>  
> -int blk_queue_enter(struct request_queue *q, gfp_t gfp)
> +int blk_queue_enter(struct request_queue *q, bool nowait)
>  {
>  	while (true) {
>  		int ret;
> @@ -638,7 +638,7 @@ int blk_queue_enter(struct request_queue *q, gfp_t gfp)
>  		if (percpu_ref_tryget_live(&q->q_usage_counter))
>  			return 0;
>  
> -		if (!gfpflags_allow_blocking(gfp))
> +		if (nowait)
>  			return -EBUSY;
>  
>  		ret = wait_event_interruptible(q->mq_freeze_wq,
> @@ -1284,7 +1284,9 @@ static struct request *blk_old_get_request(struct request_queue *q, int rw,
>  struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
>  {
>  	if (q->mq_ops)
> -		return blk_mq_alloc_request(q, rw, gfp_mask, false);
> +		return blk_mq_alloc_request(q, rw,
> +			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
> +				0 : BLK_MQ_REQ_NOWAIT);
>  	else
>  		return blk_old_get_request(q, rw, gfp_mask);
>  }
> @@ -2052,8 +2054,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
> -		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
> -
> +		if (likely(blk_queue_enter(q, false) == 0)) {
>  			ret = q->make_request_fn(q, bio);
>  
>  			blk_queue_exit(q);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index a07ca34..abdbb47 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -268,7 +268,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
>  	if (tag != -1)
>  		return tag;
>  
> -	if (!gfpflags_allow_blocking(data->gfp))
> +	if (data->flags & BLK_MQ_REQ_NOWAIT)
>  		return -1;
>  
>  	bs = bt_wait_ptr(bt, hctx);
> @@ -303,7 +303,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
>  		data->ctx = blk_mq_get_ctx(data->q);
>  		data->hctx = data->q->mq_ops->map_queue(data->q,
>  				data->ctx->cpu);
> -		if (data->reserved) {
> +		if (data->flags & BLK_MQ_REQ_RESERVED) {
>  			bt = &data->hctx->tags->breserved_tags;
>  		} else {
>  			last_tag = &data->ctx->last_tag;
> @@ -349,10 +349,9 @@ static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_alloc_data *data)
>  
>  unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  {
> -	if (!data->reserved)
> -		return __blk_mq_get_tag(data);
> -
> -	return __blk_mq_get_reserved_tag(data);
> +	if (data->flags & BLK_MQ_REQ_RESERVED)
> +		return __blk_mq_get_reserved_tag(data);
> +	return __blk_mq_get_tag(data);
>  }
>  
>  static struct bt_wait_state *bt_wake_ptr(struct blk_mq_bitmap_tags *bt)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c932605..6da03f1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -230,8 +230,8 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, int rw)
>  	return NULL;
>  }
>  
> -struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
> -		bool reserved)
> +struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
> +		unsigned int flags)
>  {
>  	struct blk_mq_ctx *ctx;
>  	struct blk_mq_hw_ctx *hctx;
> @@ -239,24 +239,22 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
>  	struct blk_mq_alloc_data alloc_data;
>  	int ret;
>  
> -	ret = blk_queue_enter(q, gfp);
> +	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
>  	if (ret)
>  		return ERR_PTR(ret);
>  
>  	ctx = blk_mq_get_ctx(q);
>  	hctx = q->mq_ops->map_queue(q, ctx->cpu);
> -	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
> -			reserved, ctx, hctx);
> +	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
>  
>  	rq = __blk_mq_alloc_request(&alloc_data, rw);
> -	if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
> +	if (!rq && !(flags & BLK_MQ_REQ_NOWAIT)) {
>  		__blk_mq_run_hw_queue(hctx);
>  		blk_mq_put_ctx(ctx);
>  
>  		ctx = blk_mq_get_ctx(q);
>  		hctx = q->mq_ops->map_queue(q, ctx->cpu);
> -		blk_mq_set_alloc_data(&alloc_data, q, gfp, reserved, ctx,
> -				hctx);
> +		blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
>  		rq =  __blk_mq_alloc_request(&alloc_data, rw);
>  		ctx = alloc_data.ctx;
>  	}
> @@ -1181,8 +1179,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
>  		rw |= REQ_SYNC;
>  
>  	trace_block_getrq(q, bio, rw);
> -	blk_mq_set_alloc_data(&alloc_data, q, GFP_ATOMIC, false, ctx,
> -			hctx);
> +	blk_mq_set_alloc_data(&alloc_data, q, BLK_MQ_REQ_NOWAIT, ctx, hctx);
>  	rq = __blk_mq_alloc_request(&alloc_data, rw);
>  	if (unlikely(!rq)) {
>  		__blk_mq_run_hw_queue(hctx);
> @@ -1191,8 +1188,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
>  
>  		ctx = blk_mq_get_ctx(q);
>  		hctx = q->mq_ops->map_queue(q, ctx->cpu);
> -		blk_mq_set_alloc_data(&alloc_data, q,
> -				__GFP_RECLAIM|__GFP_HIGH, false, ctx, hctx);
> +		blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
>  		rq = __blk_mq_alloc_request(&alloc_data, rw);
>  		ctx = alloc_data.ctx;
>  		hctx = alloc_data.hctx;
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 713820b..eaede8e 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -96,8 +96,7 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
>  struct blk_mq_alloc_data {
>  	/* input parameter */
>  	struct request_queue *q;
> -	gfp_t gfp;
> -	bool reserved;
> +	unsigned int flags;
>  
>  	/* input & output parameter */
>  	struct blk_mq_ctx *ctx;
> @@ -105,13 +104,11 @@ struct blk_mq_alloc_data {
>  };
>  
>  static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
> -		struct request_queue *q, gfp_t gfp, bool reserved,
> -		struct blk_mq_ctx *ctx,
> -		struct blk_mq_hw_ctx *hctx)
> +		struct request_queue *q, unsigned int flags,
> +		struct blk_mq_ctx *ctx, struct blk_mq_hw_ctx *hctx)
>  {
>  	data->q = q;
> -	data->gfp = gfp;
> -	data->reserved = reserved;
> +	data->flags = flags;
>  	data->ctx = ctx;
>  	data->hctx = hctx;
>  }
> diff --git a/block/blk.h b/block/blk.h
> index 1d95107..38bf997 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -72,7 +72,7 @@ void blk_dequeue_request(struct request *rq);
>  void __blk_queue_free_tags(struct request_queue *q);
>  bool __blk_end_bidi_request(struct request *rq, int error,
>  			    unsigned int nr_bytes, unsigned int bidi_bytes);
> -int blk_queue_enter(struct request_queue *q, gfp_t gfp);
> +int blk_queue_enter(struct request_queue *q, bool nowait);
>  void blk_queue_exit(struct request_queue *q);
>  void blk_freeze_queue(struct request_queue *q);
>  
> diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
> index a28a562..cf3b51a 100644
> --- a/drivers/block/mtip32xx/mtip32xx.c
> +++ b/drivers/block/mtip32xx/mtip32xx.c
> @@ -173,7 +173,7 @@ static struct mtip_cmd *mtip_get_int_command(struct driver_data *dd)
>  {
>  	struct request *rq;
>  
> -	rq = blk_mq_alloc_request(dd->queue, 0, __GFP_RECLAIM, true);
> +	rq = blk_mq_alloc_request(dd->queue, 0, BLK_MQ_REQ_RESERVED);
>  	return blk_mq_rq_to_pdu(rq);
>  }
>  
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> new file mode 100644
> index 0000000..53cf507
> --- /dev/null
> +++ b/drivers/nvme/host/core.c
> @@ -0,0 +1,1172 @@
> +/*
> + * NVM Express device driver
> + * Copyright (c) 2011-2014, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/blkdev.h>
> +#include <linux/blk-mq.h>
> +#include <linux/errno.h>
> +#include <linux/hdreg.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/list_sort.h>
> +#include <linux/slab.h>
> +#include <linux/types.h>
> +#include <linux/pr.h>
> +#include <linux/ptrace.h>
> +#include <linux/nvme_ioctl.h>
> +#include <linux/t10-pi.h>
> +#include <scsi/sg.h>
> +#include <asm/unaligned.h>
> +
> +#include "nvme.h"
> +
> +#define NVME_MINORS		(1U << MINORBITS)
> +
> +static int nvme_major;
> +module_param(nvme_major, int, 0);
> +
> +static int nvme_char_major;
> +module_param(nvme_char_major, int, 0);
> +
> +static LIST_HEAD(nvme_ctrl_list);
> +DEFINE_SPINLOCK(dev_list_lock);
> +
> +static struct class *nvme_class;
> +
> +static void nvme_free_ns(struct kref *kref)
> +{
> +	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
> +
> +	if (ns->type == NVME_NS_LIGHTNVM)
> +		nvme_nvm_unregister(ns->queue, ns->disk->disk_name);
> +
> +	spin_lock(&dev_list_lock);
> +	ns->disk->private_data = NULL;
> +	spin_unlock(&dev_list_lock);
> +
> +	nvme_put_ctrl(ns->ctrl);
> +	put_disk(ns->disk);
> +	kfree(ns);
> +}
> +
> +static void nvme_put_ns(struct nvme_ns *ns)
> +{
> +	kref_put(&ns->kref, nvme_free_ns);
> +}
> +
> +static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
> +{
> +	struct nvme_ns *ns;
> +
> +	spin_lock(&dev_list_lock);
> +	ns = disk->private_data;
> +	if (ns && !kref_get_unless_zero(&ns->kref))
> +		ns = NULL;
> +	spin_unlock(&dev_list_lock);
> +
> +	return ns;
> +}
> +
> +static struct request *nvme_alloc_request(struct request_queue *q,
> +		struct nvme_command *cmd)
> +{
> +	bool write = cmd->common.opcode & 1;
> +	struct request *req;
> +
> +	req = blk_mq_alloc_request(q, write, 0);
> +	if (IS_ERR(req))
> +		return req;
> +
> +	req->cmd_type = REQ_TYPE_DRV_PRIV;
> +	req->cmd_flags |= REQ_FAILFAST_DRIVER;
> +	req->__data_len = 0;
> +	req->__sector = (sector_t) -1;
> +	req->bio = req->biotail = NULL;
> +
> +	req->cmd = (unsigned char *)cmd;
> +	req->cmd_len = sizeof(struct nvme_command);
> +	req->special = (void *)0;
> +
> +	return req;
> +}
> +
> +/*
> + * Returns 0 on success.  If the result is negative, it's a Linux error code;
> + * if the result is positive, it's an NVM Express status code.
> + */
> +int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
> +		void *buffer, unsigned bufflen, u32 *result, unsigned timeout)
> +{
> +	struct request *req;
> +	int ret;
> +
> +	req = nvme_alloc_request(q, cmd);
> +	if (IS_ERR(req))
> +		return PTR_ERR(req);
> +
> +	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
> +
> +	if (buffer && bufflen) {
> +		ret = blk_rq_map_kern(q, req, buffer, bufflen, GFP_KERNEL);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	blk_execute_rq(req->q, NULL, req, 0);
> +	if (result)
> +		*result = (u32)(uintptr_t)req->special;
> +	ret = req->errors;
> + out:
> +	blk_mq_free_request(req);
> +	return ret;
> +}
> +
> +int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
> +		void *buffer, unsigned bufflen)
> +{
> +	return __nvme_submit_sync_cmd(q, cmd, buffer, bufflen, NULL, 0);
> +}
> +
> +int __nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
> +		void __user *ubuffer, unsigned bufflen,
> +		void __user *meta_buffer, unsigned meta_len, u32 meta_seed,
> +		u32 *result, unsigned timeout)
> +{
> +	bool write = cmd->common.opcode & 1;
> +	struct nvme_ns *ns = q->queuedata;
> +	struct gendisk *disk = ns ? ns->disk : NULL;
> +	struct request *req;
> +	struct bio *bio = NULL;
> +	void *meta = NULL;
> +	int ret;
> +
> +	req = nvme_alloc_request(q, cmd);
> +	if (IS_ERR(req))
> +		return PTR_ERR(req);
> +
> +	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
> +
> +	if (ubuffer && bufflen) {
> +		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
> +				GFP_KERNEL);
> +		if (ret)
> +			goto out;
> +		bio = req->bio;
> +
> +		if (!disk)
> +			goto submit;
> +		bio->bi_bdev = bdget_disk(disk, 0);
> +		if (!bio->bi_bdev) {
> +			ret = -ENODEV;
> +			goto out_unmap;
> +		}
> +
> +		if (meta_buffer) {
> +			struct bio_integrity_payload *bip;
> +
> +			meta = kmalloc(meta_len, GFP_KERNEL);
> +			if (!meta) {
> +				ret = -ENOMEM;
> +				goto out_unmap;
> +			}
> +
> +			if (write) {
> +				if (copy_from_user(meta, meta_buffer,
> +						meta_len)) {
> +					ret = -EFAULT;
> +					goto out_free_meta;
> +				}
> +			}
> +
> +			bip = bio_integrity_alloc(bio, GFP_KERNEL, 1);
> +			if (!bip) {
> +				ret = -ENOMEM;
> +				goto out_free_meta;
> +			}
> +
> +			bip->bip_iter.bi_size = meta_len;
> +			bip->bip_iter.bi_sector = meta_seed;
> +
> +			ret = bio_integrity_add_page(bio, virt_to_page(meta),
> +					meta_len, offset_in_page(meta));
> +			if (ret != meta_len) {
> +				ret = -ENOMEM;
> +				goto out_free_meta;
> +			}
> +		}
> +	}
> + submit:
> +	blk_execute_rq(req->q, disk, req, 0);
> +	ret = req->errors;
> +	if (result)
> +		*result = (u32)(uintptr_t)req->special;
> +	if (meta && !ret && !write) {
> +		if (copy_to_user(meta_buffer, meta, meta_len))
> +			ret = -EFAULT;
> +	}
> + out_free_meta:
> +	kfree(meta);
> + out_unmap:
> +	if (bio) {
> +		if (disk && bio->bi_bdev)
> +			bdput(bio->bi_bdev);
> +		blk_rq_unmap_user(bio);
> +	}
> + out:
> +	blk_mq_free_request(req);
> +	return ret;
> +}
> +
> +int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
> +		void __user *ubuffer, unsigned bufflen, u32 *result,
> +		unsigned timeout)
> +{
> +	return __nvme_submit_user_cmd(q, cmd, ubuffer, bufflen, NULL, 0, 0,
> +			result, timeout);
> +}
> +
> +int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
> +{
> +	struct nvme_command c = { };
> +	int error;
> +
> +	/* gcc-4.4.4 (at least) has issues with initializers and anon unions */
> +	c.identify.opcode = nvme_admin_identify;
> +	c.identify.cns = cpu_to_le32(1);
> +
> +	*id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
> +	if (!*id)
> +		return -ENOMEM;
> +
> +	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
> +			sizeof(struct nvme_id_ctrl));
> +	if (error)
> +		kfree(*id);
> +	return error;
> +}
> +
> +static int nvme_identify_ns_list(struct nvme_ctrl *dev, unsigned nsid, __le32 *ns_list)
> +{
> +	struct nvme_command c = { };
> +
> +	c.identify.opcode = nvme_admin_identify;
> +	c.identify.cns = cpu_to_le32(2);
> +	c.identify.nsid = cpu_to_le32(nsid);
> +	return nvme_submit_sync_cmd(dev->admin_q, &c, ns_list, 0x1000);
> +}
> +
> +int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
> +		struct nvme_id_ns **id)
> +{
> +	struct nvme_command c = { };
> +	int error;
> +
> +	/* gcc-4.4.4 (at least) has issues with initializers and anon unions */
> +	c.identify.opcode = nvme_admin_identify;
> +	c.identify.nsid = cpu_to_le32(nsid);
> +
> +	*id = kmalloc(sizeof(struct nvme_id_ns), GFP_KERNEL);
> +	if (!*id)
> +		return -ENOMEM;
> +
> +	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
> +			sizeof(struct nvme_id_ns));
> +	if (error)
> +		kfree(*id);
> +	return error;
> +}
> +
> +int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
> +					dma_addr_t dma_addr, u32 *result)
> +{
> +	struct nvme_command c;
> +
> +	memset(&c, 0, sizeof(c));
> +	c.features.opcode = nvme_admin_get_features;
> +	c.features.nsid = cpu_to_le32(nsid);
> +	c.features.prp1 = cpu_to_le64(dma_addr);
> +	c.features.fid = cpu_to_le32(fid);
> +
> +	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
> +}
> +
> +int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
> +					dma_addr_t dma_addr, u32 *result)
> +{
> +	struct nvme_command c;
> +
> +	memset(&c, 0, sizeof(c));
> +	c.features.opcode = nvme_admin_set_features;
> +	c.features.prp1 = cpu_to_le64(dma_addr);
> +	c.features.fid = cpu_to_le32(fid);
> +	c.features.dword11 = cpu_to_le32(dword11);
> +
> +	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
> +}
> +
> +int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log)
> +{
> +	struct nvme_command c = { };
> +	int error;
> +
> +	c.common.opcode = nvme_admin_get_log_page;
> +	c.common.nsid = cpu_to_le32(0xFFFFFFFF);
> +	c.common.cdw10[0] = cpu_to_le32(
> +			(((sizeof(struct nvme_smart_log) / 4) - 1) << 16) |
> +			 NVME_LOG_SMART);
> +
> +	*log = kmalloc(sizeof(struct nvme_smart_log), GFP_KERNEL);
> +	if (!*log)
> +		return -ENOMEM;
> +
> +	error = nvme_submit_sync_cmd(dev->admin_q, &c, *log,
> +			sizeof(struct nvme_smart_log));
> +	if (error)
> +		kfree(*log);
> +	return error;
> +}
> +
> +static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
> +{
> +	struct nvme_user_io io;
> +	struct nvme_command c;
> +	unsigned length, meta_len;
> +	void __user *metadata;
> +
> +	if (copy_from_user(&io, uio, sizeof(io)))
> +		return -EFAULT;
> +
> +	switch (io.opcode) {
> +	case nvme_cmd_write:
> +	case nvme_cmd_read:
> +	case nvme_cmd_compare:
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	length = (io.nblocks + 1) << ns->lba_shift;
> +	meta_len = (io.nblocks + 1) * ns->ms;
> +	metadata = (void __user *)(uintptr_t)io.metadata;
> +
> +	if (ns->ext) {
> +		length += meta_len;
> +		meta_len = 0;
> +	} else if (meta_len) {
> +		if ((io.metadata & 3) || !io.metadata)
> +			return -EINVAL;
> +	}
> +
> +	memset(&c, 0, sizeof(c));
> +	c.rw.opcode = io.opcode;
> +	c.rw.flags = io.flags;
> +	c.rw.nsid = cpu_to_le32(ns->ns_id);
> +	c.rw.slba = cpu_to_le64(io.slba);
> +	c.rw.length = cpu_to_le16(io.nblocks);
> +	c.rw.control = cpu_to_le16(io.control);
> +	c.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
> +	c.rw.reftag = cpu_to_le32(io.reftag);
> +	c.rw.apptag = cpu_to_le16(io.apptag);
> +	c.rw.appmask = cpu_to_le16(io.appmask);
> +
> +	return __nvme_submit_user_cmd(ns->queue, &c,
> +			(void __user *)(uintptr_t)io.addr, length,
> +			metadata, meta_len, io.slba, NULL, 0);
> +}
> +
> +static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> +			struct nvme_passthru_cmd __user *ucmd)
> +{
> +	struct nvme_passthru_cmd cmd;
> +	struct nvme_command c;
> +	unsigned timeout = 0;
> +	int status;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;
> +	if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
> +		return -EFAULT;
> +
> +	memset(&c, 0, sizeof(c));
> +	c.common.opcode = cmd.opcode;
> +	c.common.flags = cmd.flags;
> +	c.common.nsid = cpu_to_le32(cmd.nsid);
> +	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> +	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> +	c.common.cdw10[0] = cpu_to_le32(cmd.cdw10);
> +	c.common.cdw10[1] = cpu_to_le32(cmd.cdw11);
> +	c.common.cdw10[2] = cpu_to_le32(cmd.cdw12);
> +	c.common.cdw10[3] = cpu_to_le32(cmd.cdw13);
> +	c.common.cdw10[4] = cpu_to_le32(cmd.cdw14);
> +	c.common.cdw10[5] = cpu_to_le32(cmd.cdw15);
> +
> +	if (cmd.timeout_ms)
> +		timeout = msecs_to_jiffies(cmd.timeout_ms);
> +
> +	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
> +			(void __user *)cmd.addr, cmd.data_len,
> +			&cmd.result, timeout);
> +	if (status >= 0) {
> +		if (put_user(cmd.result, &ucmd->result))
> +			return -EFAULT;
> +	}
> +
> +	return status;
> +}
> +
> +static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
> +		unsigned int cmd, unsigned long arg)
> +{
> +	struct nvme_ns *ns = bdev->bd_disk->private_data;
> +
> +	switch (cmd) {
> +	case NVME_IOCTL_ID:
> +		force_successful_syscall_return();
> +		return ns->ns_id;
> +	case NVME_IOCTL_ADMIN_CMD:
> +		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
> +	case NVME_IOCTL_IO_CMD:
> +		return nvme_user_cmd(ns->ctrl, ns, (void __user *)arg);
> +	case NVME_IOCTL_SUBMIT_IO:
> +		return nvme_submit_io(ns, (void __user *)arg);
> +	case SG_GET_VERSION_NUM:
> +		return nvme_sg_get_version_num((void __user *)arg);
> +	case SG_IO:
> +		return nvme_sg_io(ns, (void __user *)arg);
> +	default:
> +		return -ENOTTY;
> +	}
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
> +			unsigned int cmd, unsigned long arg)
> +{
> +	switch (cmd) {
> +	case SG_IO:
> +		return -ENOIOCTLCMD;
> +	}
> +	return nvme_ioctl(bdev, mode, cmd, arg);
> +}
> +#else
> +#define nvme_compat_ioctl	NULL
> +#endif
> +
> +static int nvme_open(struct block_device *bdev, fmode_t mode)
> +{
> +	return nvme_get_ns_from_disk(bdev->bd_disk) ? 0 : -ENXIO;
> +}
> +
> +static void nvme_release(struct gendisk *disk, fmode_t mode)
> +{
> +	nvme_put_ns(disk->private_data);
> +}
> +
> +static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
> +{
> +	/* some standard values */
> +	geo->heads = 1 << 6;
> +	geo->sectors = 1 << 5;
> +	geo->cylinders = get_capacity(bdev->bd_disk) >> 11;
> +	return 0;
> +}
> +
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> +static void nvme_init_integrity(struct nvme_ns *ns)
> +{
> +	struct blk_integrity integrity;
> +
> +	switch (ns->pi_type) {
> +	case NVME_NS_DPS_PI_TYPE3:
> +		integrity.profile = &t10_pi_type3_crc;
> +		break;
> +	case NVME_NS_DPS_PI_TYPE1:
> +	case NVME_NS_DPS_PI_TYPE2:
> +		integrity.profile = &t10_pi_type1_crc;
> +		break;
> +	default:
> +		integrity.profile = NULL;
> +		break;
> +	}
> +	integrity.tuple_size = ns->ms;
> +	blk_integrity_register(ns->disk, &integrity);
> +	blk_queue_max_integrity_segments(ns->queue, 1);
> +}
> +#else
> +static void nvme_init_integrity(struct nvme_ns *ns)
> +{
> +}
> +#endif /* CONFIG_BLK_DEV_INTEGRITY */
> +
> +static void nvme_config_discard(struct nvme_ns *ns)
> +{
> +	u32 logical_block_size = queue_logical_block_size(ns->queue);
> +	ns->queue->limits.discard_zeroes_data = 0;
> +	ns->queue->limits.discard_alignment = logical_block_size;
> +	ns->queue->limits.discard_granularity = logical_block_size;
> +	blk_queue_max_discard_sectors(ns->queue, 0xffffffff);
> +	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
> +}
> +
> +static int nvme_revalidate_disk(struct gendisk *disk)
> +{
> +	struct nvme_ns *ns = disk->private_data;
> +	struct nvme_id_ns *id;
> +	u8 lbaf, pi_type;
> +	u16 old_ms;
> +	unsigned short bs;
> +
> +	if (nvme_identify_ns(ns->ctrl, ns->ns_id, &id)) {
> +		dev_warn(ns->ctrl->dev, "%s: Identify failure nvme%dn%d\n",
> +				__func__, ns->ctrl->instance, ns->ns_id);
> +		return -ENODEV;
> +	}
> +	if (id->ncap == 0) {
> +		kfree(id);
> +		return -ENODEV;
> +	}
> +
> +	if (nvme_nvm_ns_supported(ns, id) && ns->type != NVME_NS_LIGHTNVM) {
> +		if (nvme_nvm_register(ns->queue, disk->disk_name)) {
> +			dev_warn(ns->ctrl->dev,
> +				"%s: LightNVM init failure\n", __func__);
> +			kfree(id);
> +			return -ENODEV;
> +		}
> +		ns->type = NVME_NS_LIGHTNVM;
> +	}
> +
> +	old_ms = ns->ms;
> +	lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
> +	ns->lba_shift = id->lbaf[lbaf].ds;
> +	ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
> +	ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);
> +
> +	/*
> +	 * If Identify Namespace failed, use a default 512 byte block size so
> +	 * the block layer can be used before failing reads/writes for zero capacity.
> +	 */
> +	if (ns->lba_shift == 0)
> +		ns->lba_shift = 9;
> +	bs = 1 << ns->lba_shift;
> +	/* XXX: PI implementation requires metadata size equal to T10 PI tuple size */
> +	pi_type = ns->ms == sizeof(struct t10_pi_tuple) ?
> +					id->dps & NVME_NS_DPS_PI_MASK : 0;
> +
> +	blk_mq_freeze_queue(disk->queue);
> +	if (blk_get_integrity(disk) && (ns->pi_type != pi_type ||
> +				ns->ms != old_ms ||
> +				bs != queue_logical_block_size(disk->queue) ||
> +				(ns->ms && ns->ext)))
> +		blk_integrity_unregister(disk);
> +
> +	ns->pi_type = pi_type;
> +	blk_queue_logical_block_size(ns->queue, bs);
> +
> +	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
> +		nvme_init_integrity(ns);
> +	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
> +		set_capacity(disk, 0);
> +	else
> +		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
> +
> +	if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM)
> +		nvme_config_discard(ns);
> +	blk_mq_unfreeze_queue(disk->queue);
> +
> +	kfree(id);
> +	return 0;
> +}
> +
> +static char nvme_pr_type(enum pr_type type)
> +{
> +	switch (type) {
> +	case PR_WRITE_EXCLUSIVE:
> +		return 1;
> +	case PR_EXCLUSIVE_ACCESS:
> +		return 2;
> +	case PR_WRITE_EXCLUSIVE_REG_ONLY:
> +		return 3;
> +	case PR_EXCLUSIVE_ACCESS_REG_ONLY:
> +		return 4;
> +	case PR_WRITE_EXCLUSIVE_ALL_REGS:
> +		return 5;
> +	case PR_EXCLUSIVE_ACCESS_ALL_REGS:
> +		return 6;
> +	default:
> +		return 0;
> +	}
> +}
> +
> +static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
> +				u64 key, u64 sa_key, u8 op)
> +{
> +	struct nvme_ns *ns = bdev->bd_disk->private_data;
> +	struct nvme_command c;
> +	u8 data[16] = { 0, };
> +
> +	put_unaligned_le64(key, &data[0]);
> +	put_unaligned_le64(sa_key, &data[8]);
> +
> +	memset(&c, 0, sizeof(c));
> +	c.common.opcode = op;
> +	c.common.nsid = cpu_to_le32(ns->ns_id);
> +	c.common.cdw10[0] = cpu_to_le32(cdw10);
> +
> +	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
> +}
> +
> +static int nvme_pr_register(struct block_device *bdev, u64 old,
> +		u64 new, unsigned flags)
> +{
> +	u32 cdw10;
> +
> +	if (flags & ~PR_FL_IGNORE_KEY)
> +		return -EOPNOTSUPP;
> +
> +	cdw10 = old ? 2 : 0;
> +	cdw10 |= (flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0;
> +	cdw10 |= (1 << 30) | (1 << 31); /* PTPL=1 */
> +	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_register);
> +}
> +
> +static int nvme_pr_reserve(struct block_device *bdev, u64 key,
> +		enum pr_type type, unsigned flags)
> +{
> +	u32 cdw10;
> +
> +	if (flags & ~PR_FL_IGNORE_KEY)
> +		return -EOPNOTSUPP;
> +
> +	cdw10 = nvme_pr_type(type) << 8;
> +	cdw10 |= ((flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0);
> +	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_acquire);
> +}
> +
> +static int nvme_pr_preempt(struct block_device *bdev, u64 old, u64 new,
> +		enum pr_type type, bool abort)
> +{
> +	u32 cdw10 = nvme_pr_type(type) << 8 | (abort ? 2 : 1);
> +	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_acquire);
> +}
> +
> +static int nvme_pr_clear(struct block_device *bdev, u64 key)
> +{
> +	u32 cdw10 = 1 | (key ? 1 << 3 : 0);
> +	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_register);
> +}
> +
> +static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type)
> +{
> +	u32 cdw10 = nvme_pr_type(type) << 8 | (key ? 1 << 3 : 0);
> +	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
> +}
> +
> +static const struct pr_ops nvme_pr_ops = {
> +	.pr_register	= nvme_pr_register,
> +	.pr_reserve	= nvme_pr_reserve,
> +	.pr_release	= nvme_pr_release,
> +	.pr_preempt	= nvme_pr_preempt,
> +	.pr_clear	= nvme_pr_clear,
> +};
> +
> +static const struct block_device_operations nvme_fops = {
> +	.owner		= THIS_MODULE,
> +	.ioctl		= nvme_ioctl,
> +	.compat_ioctl	= nvme_compat_ioctl,
> +	.open		= nvme_open,
> +	.release	= nvme_release,
> +	.getgeo		= nvme_getgeo,
> +	.revalidate_disk= nvme_revalidate_disk,
> +	.pr_ops		= &nvme_pr_ops,
> +};
> +
> +/*
> + * Initialize the cached copies of the Identify data and various controller
> + * registers in our nvme_ctrl structure.  This should be called as soon as
> + * the admin queue is fully up and running.
> + */
> +int nvme_init_identify(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_id_ctrl *id;
> +	u64 cap;
> +	int ret, page_shift;
> +
> +	ret = ctrl->ops->reg_read32(ctrl, NVME_REG_VS, &ctrl->vs);
> +	if (ret) {
> +		dev_err(ctrl->dev, "Reading VS failed (%d)\n", ret);
> +		return ret;
> +	}
> +
> +	ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap);
> +	if (ret) {
> +		dev_err(ctrl->dev, "Reading CAP failed (%d)\n", ret);
> +		return ret;
> +	}
> +	page_shift = NVME_CAP_MPSMIN(cap) + 12;
> +	ctrl->page_size = 1 << page_shift;
> +
> +	if (ctrl->vs >= NVME_VS(1, 1))
> +		ctrl->subsystem = NVME_CAP_NSSRC(cap);
> +
> +	ret = nvme_identify_ctrl(ctrl, &id);
> +	if (ret) {
> +		dev_err(ctrl->dev, "Identify Controller failed (%d)\n", ret);
> +		return -EIO;
> +	}
> +
> +	ctrl->oncs = le16_to_cpup(&id->oncs);
> +	atomic_set(&ctrl->abort_limit, id->acl + 1);
> +	ctrl->vwc = id->vwc;
> +	memcpy(ctrl->serial, id->sn, sizeof(id->sn));
> +	memcpy(ctrl->model, id->mn, sizeof(id->mn));
> +	memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
> +	if (id->mdts)
> +		ctrl->max_hw_sectors = 1 << (id->mdts + page_shift - 9);
> +	else
> +		ctrl->max_hw_sectors = UINT_MAX;
> +
> +	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && id->vs[3]) {
> +		unsigned int max_hw_sectors;
> +
> +		ctrl->stripe_size = 1 << (id->vs[3] + page_shift);
> +		max_hw_sectors = ctrl->stripe_size >> (page_shift - 9);
> +		if (ctrl->max_hw_sectors) {
> +			ctrl->max_hw_sectors = min(max_hw_sectors,
> +							ctrl->max_hw_sectors);
> +		} else {
> +			ctrl->max_hw_sectors = max_hw_sectors;
> +		}
> +	}
> +
> +	kfree(id);
> +	return 0;
> +}
> +
> +static int nvme_dev_open(struct inode *inode, struct file *file)
> +{
> +	struct nvme_ctrl *ctrl;
> +	int instance = iminor(inode);
> +	int ret = -ENODEV;
> +
> +	spin_lock(&dev_list_lock);
> +	list_for_each_entry(ctrl, &nvme_ctrl_list, node) {
> +		if (ctrl->instance != instance)
> +			continue;
> +
> +		if (!ctrl->admin_q) {
> +			ret = -EWOULDBLOCK;
> +			break;
> +		}
> +		if (!kref_get_unless_zero(&ctrl->kref))
> +			break;
> +		file->private_data = ctrl;
> +		ret = 0;
> +		break;
> +	}
> +	spin_unlock(&dev_list_lock);
> +
> +	return ret;
> +}
> +
> +static int nvme_dev_release(struct inode *inode, struct file *file)
> +{
> +	nvme_put_ctrl(file->private_data);
> +	return 0;
> +}
> +
> +static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
> +		unsigned long arg)
> +{
> +	struct nvme_ctrl *ctrl = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	struct nvme_ns *ns;
> +
> +	switch (cmd) {
> +	case NVME_IOCTL_ADMIN_CMD:
> +		return nvme_user_cmd(ctrl, NULL, argp);
> +	case NVME_IOCTL_IO_CMD:
> +		if (list_empty(&ctrl->namespaces))
> +			return -ENOTTY;
> +		ns = list_first_entry(&ctrl->namespaces, struct nvme_ns, list);
> +		return nvme_user_cmd(ctrl, ns, argp);
> +	case NVME_IOCTL_RESET:
> +		dev_warn(ctrl->dev, "resetting controller\n");
> +		return ctrl->ops->reset_ctrl(ctrl);
> +	case NVME_IOCTL_SUBSYS_RESET:
> +		return nvme_reset_subsystem(ctrl);
> +	default:
> +		return -ENOTTY;
> +	}
> +}
> +
> +static const struct file_operations nvme_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= nvme_dev_open,
> +	.release	= nvme_dev_release,
> +	.unlocked_ioctl	= nvme_dev_ioctl,
> +	.compat_ioctl	= nvme_dev_ioctl,
> +};
> +
> +static ssize_t nvme_sysfs_reset(struct device *dev,
> +				struct device_attribute *attr, const char *buf,
> +				size_t count)
> +{
> +	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> +	int ret;
> +
> +	ret = ctrl->ops->reset_ctrl(ctrl);
> +	if (ret < 0)
> +		return ret;
> +	return count;
> +}
> +static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
> +
> +static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
> +{
> +	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
> +	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
> +
> +	return nsa->ns_id - nsb->ns_id;
> +}
> +
> +static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> +{
> +	struct nvme_ns *ns;
> +
> +	list_for_each_entry(ns, &ctrl->namespaces, list) {
> +		if (ns->ns_id == nsid)
> +			return ns;
> +		if (ns->ns_id > nsid)
> +			break;
> +	}
> +	return NULL;
> +}
> +
> +static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> +{
> +	struct nvme_ns *ns;
> +	struct gendisk *disk;
> +	int node = dev_to_node(ctrl->dev);
> +
> +	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
> +	if (!ns)
> +		return;
> +
> +	ns->queue = blk_mq_init_queue(ctrl->tagset);
> +	if (IS_ERR(ns->queue))
> +		goto out_free_ns;
> +	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
> +	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
> +	ns->queue->queuedata = ns;
> +	ns->ctrl = ctrl;
> +
> +	disk = alloc_disk_node(0, node);
> +	if (!disk)
> +		goto out_free_queue;
> +
> +	kref_init(&ns->kref);
> +	ns->ns_id = nsid;
> +	ns->disk = disk;
> +	ns->lba_shift = 9; /* default to 512 byte blocks until the disk is validated */
> +
> +	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
> +	if (ctrl->max_hw_sectors) {
> +		blk_queue_max_hw_sectors(ns->queue, ctrl->max_hw_sectors);
> +		blk_queue_max_segments(ns->queue,
> +			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
> +	}
> +	if (ctrl->stripe_size)
> +		blk_queue_chunk_sectors(ns->queue, ctrl->stripe_size >> 9);
> +	if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
> +		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
> +	blk_queue_virt_boundary(ns->queue, ctrl->page_size - 1);
> +
> +	disk->major = nvme_major;
> +	disk->first_minor = 0;
> +	disk->fops = &nvme_fops;
> +	disk->private_data = ns;
> +	disk->queue = ns->queue;
> +	disk->driverfs_dev = ctrl->device;
> +	disk->flags = GENHD_FL_EXT_DEVT;
> +	sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance, nsid);
> +
> +	if (nvme_revalidate_disk(ns->disk))
> +		goto out_free_disk;
> +
> +	list_add_tail(&ns->list, &ctrl->namespaces);
> +	kref_get(&ctrl->kref);
> +	if (ns->type != NVME_NS_LIGHTNVM)
> +		add_disk(ns->disk);
> +
> +	return;
> + out_free_disk:
> +	kfree(disk);
> + out_free_queue:
> +	blk_cleanup_queue(ns->queue);
> + out_free_ns:
> +	kfree(ns);
> +}
> +
> +static void nvme_ns_remove(struct nvme_ns *ns)
> +{
> +	bool kill = nvme_io_incapable(ns->ctrl) &&
> +			!blk_queue_dying(ns->queue);
> +
> +	if (kill)
> +		blk_set_queue_dying(ns->queue);
> +	if (ns->disk->flags & GENHD_FL_UP) {
> +		if (blk_get_integrity(ns->disk))
> +			blk_integrity_unregister(ns->disk);
> +		del_gendisk(ns->disk);
> +	}
> +	if (kill || !blk_queue_dying(ns->queue)) {
> +		blk_mq_abort_requeue_list(ns->queue);
> +		blk_cleanup_queue(ns->queue);
> +	}
> +	list_del_init(&ns->list);
> +	nvme_put_ns(ns);
> +}
> +
> +static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> +{
> +	struct nvme_ns *ns;
> +
> +	ns = nvme_find_ns(ctrl, nsid);
> +	if (ns) {
> +		if (revalidate_disk(ns->disk))
> +			nvme_ns_remove(ns);
> +	} else
> +		nvme_alloc_ns(ctrl, nsid);
> +}
> +
> +static int nvme_scan_ns_list(struct nvme_ctrl *ctrl, unsigned nn)
> +{
> +	struct nvme_ns *ns;
> +	__le32 *ns_list;
> +	unsigned i, j, nsid, prev = 0, num_lists = DIV_ROUND_UP(nn, 1024);
> +	int ret = 0;
> +
> +	ns_list = kzalloc(0x1000, GFP_KERNEL);
> +	if (!ns_list)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < num_lists; i++) {
> +		ret = nvme_identify_ns_list(ctrl, prev, ns_list);
> +		if (ret)
> +			goto out;
> +
> +		for (j = 0; j < min(nn, 1024U); j++) {
> +			nsid = le32_to_cpu(ns_list[j]);
> +			if (!nsid)
> +				goto out;
> +
> +			nvme_validate_ns(ctrl, nsid);
> +
> +			while (++prev < nsid) {
> +				ns = nvme_find_ns(ctrl, prev);
> +				if (ns)
> +					nvme_ns_remove(ns);
> +			}
> +		}
> +		nn -= j;
> +	}
> + out:
> +	kfree(ns_list);
> +	return ret;
> +}
> +
> +static void __nvme_scan_namespaces(struct nvme_ctrl *ctrl, unsigned nn)
> +{
> +	struct nvme_ns *ns, *next;
> +	unsigned i;
> +
> +	for (i = 1; i <= nn; i++)
> +		nvme_validate_ns(ctrl, i);
> +
> +	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
> +		if (ns->ns_id > nn)
> +			nvme_ns_remove(ns);
> +	}
> +}
> +
> +void nvme_scan_namespaces(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_id_ctrl *id;
> +	unsigned nn;
> +
> +	if (nvme_identify_ctrl(ctrl, &id))
> +		return;
> +
> +	nn = le32_to_cpu(id->nn);
> +	if (ctrl->vs >= NVME_VS(1, 1)) {
> +		if (!nvme_scan_ns_list(ctrl, nn))
> +			goto done;
> +	}
> +	__nvme_scan_namespaces(ctrl, le32_to_cpup(&id->nn));
> + done:
> +	list_sort(NULL, &ctrl->namespaces, ns_cmp);
> +	kfree(id);
> +}
> +
> +void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_ns *ns, *next;
> +
> +	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list)
> +		nvme_ns_remove(ns);
> +}
> +
> +static DEFINE_IDA(nvme_instance_ida);
> +
> +static int nvme_set_instance(struct nvme_ctrl *ctrl)
> +{
> +	int instance, error;
> +
> +	do {
> +		if (!ida_pre_get(&nvme_instance_ida, GFP_KERNEL))
> +			return -ENODEV;
> +
> +		spin_lock(&dev_list_lock);
> +		error = ida_get_new(&nvme_instance_ida, &instance);
> +		spin_unlock(&dev_list_lock);
> +	} while (error == -EAGAIN);
> +
> +	if (error)
> +		return -ENODEV;
> +
> +	ctrl->instance = instance;
> +	return 0;
> +}
> +
> +static void nvme_release_instance(struct nvme_ctrl *ctrl)
> +{
> +	spin_lock(&dev_list_lock);
> +	ida_remove(&nvme_instance_ida, ctrl->instance);
> +	spin_unlock(&dev_list_lock);
> +}
> +
> +void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
> +{
> +	device_remove_file(ctrl->device, &dev_attr_reset_controller);
> +	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
> +
> +	spin_lock(&dev_list_lock);
> +	list_del(&ctrl->node);
> +	spin_unlock(&dev_list_lock);
> +}
> +
> +static void nvme_free_ctrl(struct kref *kref)
> +{
> +	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
> +
> +	put_device(ctrl->device);
> +	nvme_release_instance(ctrl);
> +
> +	ctrl->ops->free_ctrl(ctrl);
> +}
> +
> +void nvme_put_ctrl(struct nvme_ctrl *ctrl)
> +{
> +	kref_put(&ctrl->kref, nvme_free_ctrl);
> +}
> +
> +/*
> + * Initialize an NVMe controller structure.  This needs to be called during
> + * the earliest initialization so that we have the initialized structure
> + * around during probing.
> + */
> +int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
> +		const struct nvme_ctrl_ops *ops, u16 vendor,
> +		unsigned long quirks)
> +{
> +	int ret;
> +
> +	INIT_LIST_HEAD(&ctrl->namespaces);
> +	kref_init(&ctrl->kref);
> +	ctrl->dev = dev;
> +	ctrl->ops = ops;
> +	ctrl->vendor = vendor;
> +	ctrl->quirks = quirks;
> +
> +	ret = nvme_set_instance(ctrl);
> +	if (ret)
> +		goto out;
> +
> +	ctrl->device = device_create(nvme_class, ctrl->dev,
> +				MKDEV(nvme_char_major, ctrl->instance),
> +				dev, "nvme%d", ctrl->instance);
> +	if (IS_ERR(ctrl->device)) {
> +		ret = PTR_ERR(ctrl->device);
> +		goto out_release_instance;
> +	}
> +	get_device(ctrl->device);
> +	dev_set_drvdata(ctrl->device, ctrl);
> +
> +	ret = device_create_file(ctrl->device, &dev_attr_reset_controller);
> +	if (ret)
> +		goto out_put_device;
> +
> +	spin_lock(&dev_list_lock);
> +	list_add_tail(&ctrl->node, &nvme_ctrl_list);
> +	spin_unlock(&dev_list_lock);
> +
> +	return 0;
> +
> +out_put_device:
> +	put_device(ctrl->device);
> +	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
> +out_release_instance:
> +	nvme_release_instance(ctrl);
> +out:
> +	return ret;
> +}
> +
> +int __init nvme_core_init(void)
> +{
> +	int result;
> +
> +	result = register_blkdev(nvme_major, "nvme");
> +	if (result < 0)
> +		return result;
> +	else if (result > 0)
> +		nvme_major = result;
> +
> +	result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
> +							&nvme_dev_fops);
> +	if (result < 0)
> +		goto unregister_blkdev;
> +	else if (result > 0)
> +		nvme_char_major = result;
> +
> +	nvme_class = class_create(THIS_MODULE, "nvme");
> +	if (IS_ERR(nvme_class)) {
> +		result = PTR_ERR(nvme_class);
> +		goto unregister_chrdev;
> +	}
> +
> +	return 0;
> +
> + unregister_chrdev:
> +	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
> + unregister_blkdev:
> +	unregister_blkdev(nvme_major, "nvme");
> +	return result;
> +}
> +
> +void nvme_core_exit(void)
> +{
> +	unregister_blkdev(nvme_major, "nvme");
> +	class_destroy(nvme_class);
> +	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
> +}
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 9444884..5c5f455 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1040,7 +1040,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
>  	struct request *req;
>  	int ret;
>  
> -	req = blk_mq_alloc_request(q, write, GFP_KERNEL, false);
> +	req = blk_mq_alloc_request(q, write, 0);
>  	if (IS_ERR(req))
>  		return PTR_ERR(req);
>  
> @@ -1093,7 +1093,8 @@ static int nvme_submit_async_admin_req(struct nvme_dev *dev)
>  	struct nvme_cmd_info *cmd_info;
>  	struct request *req;
>  
> -	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC, true);
> +	req = blk_mq_alloc_request(dev->admin_q, WRITE,
> +			BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED);
>  	if (IS_ERR(req))
>  		return PTR_ERR(req);
>  
> @@ -1118,7 +1119,7 @@ static int nvme_submit_admin_async_cmd(struct nvme_dev *dev,
>  	struct request *req;
>  	struct nvme_cmd_info *cmd_rq;
>  
> -	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
> +	req = blk_mq_alloc_request(dev->admin_q, WRITE, 0);
>  	if (IS_ERR(req))
>  		return PTR_ERR(req);
>  
> @@ -1319,8 +1320,8 @@ static void nvme_abort_req(struct request *req)
>  	if (!dev->abort_limit)
>  		return;
>  
> -	abort_req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC,
> -									false);
> +	abort_req = blk_mq_alloc_request(dev->admin_q, WRITE,
> +			BLK_MQ_REQ_NOWAIT);
>  	if (IS_ERR(abort_req))
>  		return;
>  
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index daf17d7..7fc9296 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -188,8 +188,14 @@ void blk_mq_insert_request(struct request *, bool, bool, bool);
>  void blk_mq_free_request(struct request *rq);
>  void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
>  bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
> +
> +enum {
> +	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
> +	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
> +};
> +
>  struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
> -		gfp_t gfp, bool reserved);
> +		unsigned int flags);
>  struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag);
>  struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request
  2015-11-24 15:19   ` Jeff Moyer
@ 2015-11-24 15:32     ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-24 15:32 UTC (permalink / raw)


On Tue, Nov 24, 2015@10:19:54AM -0500, Jeff Moyer wrote:
> >  drivers/block/mtip32xx/mtip32xx.c |    2 +-
> >  drivers/nvme/host/core.c          | 1172 +++++++++++++++++++++++++++++++++++++
> 
> Christoph, I think you included a bit too much in this patch!  ;-)

Oops, looks like one of these weird git-rebase failure cases that
are totally non-obvious.  I'll push out another rebase with these
issues fixed after a few more review comments come in.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-24 15:16   ` Jeff Moyer
@ 2015-11-24 15:40     ` Christoph Hellwig
  2015-11-24 15:51       ` Jeff Moyer
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-24 15:40 UTC (permalink / raw)


On Tue, Nov 24, 2015@10:16:51AM -0500, Jeff Moyer wrote:
> Hi Christoph,
> 
> Christoph Hellwig <hch at lst.de> writes:
> 
> > This marks the request as one that's not actually completed yet, but
> > should be reaped next time blk_mq_complete_request comes in.  This is
> > useful if the abort handler kicked off a reset that will complete all
> > pending requests.
> 
> What's the purpose, though?  Is this an optimization?

It allows us to correctly implement controller reset (like SCSI target
resets) from the timeout handler.  The current HANDLED/NOT_HANDLED returns
are not very useful if you want to eventually kick off a reset that will
abort all requests, but need to ensure that the requests don't get reused
before that.  Only SCSI handles that for now, and it needs its own per-LUN
command list and a lot of complex code for that - something we'd like
to avoid for NVMe and other new drivers.
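
To make the intended flow concrete, here is a driver-side sketch
(illustrative only -- the handler and work item names are made up,
this is not the actual patch):

	static enum blk_eh_timer_return example_timeout(struct request *req,
			bool reserved)
	{
		struct example_dev *dev = req->q->queuedata;

		/*
		 * Kick off a reset that will eventually complete all
		 * outstanding requests, including this one ...
		 */
		queue_work(example_wq, &dev->reset_work);

		/*
		 * ... and don't complete the request here; just leave it
		 * quiesced so it is reaped when the reset path completes
		 * it through blk_mq_complete_request.
		 */
		return BLK_EH_QUIESCED;
	}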

> We've had "fun" problems with races between completion and timeout
> before.  I can't say I'm too keen on adding more complexity to this code
> path.  Have you considered what happens in your new code when this race
> occurs?  I don't expect it to cause any issues in the mq case, since the
> timeout handler should run on the same cpu as the completion code for a
> given request (right?).  However, for the old code path, they could run
> in parallel.
> 
> blk_complete_request:
> A  if (!blk_mark_rq_complete(rq) ||
> B      test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags)) {
> C        __blk_mq_complete_request(rq);
> 
> could run alongside of:
> 
> blk_rq_check_expired:
> 1 if (!blk_mark_rq_complete(rq))
> 2   blk_rq_timed_out(rq);
> 
> So, if 1 comes before A, we have two cases to consider:
> 
> i.  the expiration path does not yet set REQ_ATOM_QUIESCED before the
>     completion code runs, and so the completion code does nothing.

The command has timed out and sets REQ_ATOM_COMPLETE first,
then the actual completion comes in and indeed does nothing.  We now
set REQ_ATOM_QUIESCED and kick off a controller reset, which will
ultimately complete all commands using blk_mq_complete_request.
Now REQ_ATOM_QUIESCED is set on the command that caused the timeout,
so it will be completed as well.
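
In timeline form, case i is roughly:

	timeout:	blk_mark_rq_complete(req)	/* COMPLETE set first */
	irq:		blk_mark_rq_complete(req) fails, and QUIESCED is not
			set yet, so the completion is a no-op
	timeout:	set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags)
	reset:		blk_mq_complete_request(req)	/* QUIESCED set, reaped */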

> ii. the expiration path *does* SET REQ_ATOM_QUIESCED.  In this instance,
>     will we get yet another completion for the request when the command
>     is ultimately retired by the adapter reset?

The command has timed out and sets REQ_ATOM_COMPLETE first, then
REQ_ATOM_QUIESCED as well.  Now the actual completion comes in and does
nothing because REQ_ATOM_COMPLETE was set.  We will then kick off the
controller reset, which will ultimately complete all commands using
blk_mq_complete_request. Now REQ_ATOM_QUIESCED is set on the command that
caused the timeout, so it will be completed as well.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-24 15:40     ` Christoph Hellwig
@ 2015-11-24 15:51       ` Jeff Moyer
  2015-11-24 15:56         ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Jeff Moyer @ 2015-11-24 15:51 UTC (permalink / raw)


Christoph Hellwig <hch at lst.de> writes:

> On Tue, Nov 24, 2015@10:16:51AM -0500, Jeff Moyer wrote:
>> Hi Christoph,
>> 
>> Christoph Hellwig <hch at lst.de> writes:
>> 
>> > This marks the request as one that's not actually completed yet, but
>> > should be reaped next time blk_mq_complete_request comes in.  This is
>> > useful if the abort handler kicked off a reset that will complete all
>> > pending requests.
>> 
>> What's the purpose, though?  Is this an optimization?
>
> It allows us to correctly implement controller reset (like SCSI target
> resets) from the timeout handler.  The current HANDLED/NOT_HANDLED returns
> are not very useful if you want to eventually kick off a reset that will
> abort all requests, but need to ensure that the requests don't get reused
> before that.  Only SCSI handles that for now, and it needs its own per-LUN
> command list and a lot of complex code for that - something we'd like
> to avoid for NVMe and other new drivers.

Thanks for the explanation.  One more question below.

>> We've had "fun" problems with races between completion and timeout
>> before.  I can't say I'm too keen on adding more complexity to this code
>> path.  Have you considered what happens in your new code when this race
>> occurs?  I don't expect it to cause any issues in the mq case, since the
>> timeout handler should run on the same cpu as the completion code for a
>> given request (right?).  However, for the old code path, they could run
>> in parallel.
>> 
>> blk_complete_request:
>> A  if (!blk_mark_rq_complete(rq) ||
>> B      test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags)) {
>> C        __blk_mq_complete_request(rq);
>> 
>> could run alongside of:
>> 
>> blk_rq_check_expired:
>> 1 if (!blk_mark_rq_complete(rq))
>> 2   blk_rq_timed_out(rq);
>> 
>> So, if 1 comes before A, we have two cases to consider:
>> 
>> i.  the expiration path does not yet set REQ_ATOM_QUIESCED before the
>>     completion code runs, and so the completion code does nothing.
>
> The command has timed out and sets REQ_ATOM_COMPLETE first,
> then the actual completion comes in and indeed does nothing.  We now
> set REQ_ATOM_QUIESCED and kick off a controller reset, which will
> ultimately complete all commands using blk_mq_complete_request.
> Now REQ_ATOM_QUIESCED is set on the command that caused the timeout,
> so it will be completed as well.
>
>> ii. the expiration path *does* SET REQ_ATOM_QUIESCED.  In this instance,
>>     will we get yet another completion for the request when the command
>>     is ultimately retired by the adapter reset?
>
> The command has timed out and sets REQ_ATOM_COMPLETE first, then
> REQ_ATOM_QUIESCED as well.  Now the actual completion comes in and does
> nothing because REQ_ATOM_COMPLETE was set.  We will then kick off the

See B above.  REQ_ATOM_COMPLETE is set, so the first half of that
statement is false, but then test_and_clear_bit(REQ_ATOM_QUIESCED...)
returns true, so we call __blk_complete_request.  So the question is,
will we get a double completion for that request after the reset is
performed?

-Jeff

p.s. That should be __blk_complete_request up there in 'C'.

> controller reset, which will ultimately complete all commands using
> blk_mq_complete_request. Now REQ_ATOM_QUIESCED is set on the command that
> caused the timeout, so it will be completed as well.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-24 15:51       ` Jeff Moyer
@ 2015-11-24 15:56         ` Christoph Hellwig
  2015-11-24 16:34           ` Jeff Moyer
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-24 15:56 UTC (permalink / raw)


On Tue, Nov 24, 2015@10:51:04AM -0500, Jeff Moyer wrote:
> >> 
> >> blk_complete_request:
> >> A  if (!blk_mark_rq_complete(rq) ||
> >> B      test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))  // this succeeds
> >> C        __blk_mq_complete_request(rq);
> >> 
> >> could run alongside of:
> >> 
> >> blk_rq_check_expired:
> >> 1 if (!blk_mark_rq_complete(rq))
> >> 2   blk_rq_timed_out(rq);
> >> 
> >> So, if 1 comes before A, we have two cases to consider:
> >> 
> >> i.  the expiration path does not yet set REQ_ATOM_QUIESCED before the
> >>     completion code runs, and so the completion code does nothing.
> >
> > The command has timed out and sets REQ_ATOM_COMPLETE first,
> > then the actual completion comes in and indeed does nothing.  We now
> > set REQ_ATOM_QUIESCED and kick off a controller reset, which will
> > ultimately complete all commands using blk_mq_complete_request.
> > Now REQ_ATOM_QUIESCED is set on the command that caused the timeout,
> > so it will be completed as well.
> >
> >> ii. the expiration path *does* SET REQ_ATOM_QUIESCED.  In this instance,
> >>     will we get yet another completion for the request when the command
> >>     is ultimately retired by the adapter reset?
> >
> > The command has timed out and sets REQ_ATOM_COMPLETE first, then
> > REQ_ATOM_QUIESCED as well.  Now the actual completion comes in and does
> > nothing because REQ_ATOM_COMPLETE was set.  We will then kick off the
> 
> See B above.  REQ_ATOM_COMPLETE is set, so the first half of that
> statement is false, but then test_and_clear_bit(REQ_ATOM_QUIESCED...)
> returns true, so we call __blk_complete_request.  So the question is,
> will we get a double completion for that request after the reset is
> performed?

We always set REQ_ATOM_COMPLETE before calling into ->timeout, and thus
before even having a chance to set REQ_ATOM_QUIESCED.  Maybe we're talking
past each other, so if it feels like I'm off track try to explain the
race in a bit more detail.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-24 15:56         ` Christoph Hellwig
@ 2015-11-24 16:34           ` Jeff Moyer
  2015-11-24 16:47             ` Keith Busch
  2015-11-24 17:56             ` Christoph Hellwig
  0 siblings, 2 replies; 63+ messages in thread
From: Jeff Moyer @ 2015-11-24 16:34 UTC (permalink / raw)


Christoph Hellwig <hch at lst.de> writes:

> We always set REQ_ATOM_COMPLETE before calling into ->timeout, and thus
> before even having a chance to set REQ_ATOM_QUIESCED.  Maybe we're talking
> past each other, so if it feels like I'm off track try to explain the
> race in a bit more detail.

Sure.

CPU 0 executes the following:

blk_rq_check_expired():
        if (!blk_mark_rq_complete(rq))  // this succeeds, so we call blk_rq_timed_out
                blk_rq_timed_out(rq);
blk_rq_timed_out():
        if (q->rq_timed_out_fn)
                ret = q->rq_timed_out_fn(req);
        switch (ret) {  // QUIESCED is returned
...
        case BLK_EH_QUIESCED:
                set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
                break;

CPU 1 takes an interrupt for the completion of the same request:

blk_complete_request():
        if (!blk_mark_rq_complete(req) ||  // this fails, as it's already marked complete
            test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))  // this succeeds
                __blk_complete_request(req); // so we complete the request

Later, after the adapter reset is finished, you mentioned that all
requests will be completed.  My question is: will this result in a
double completion for this particular request?

I hope that's more clear.

-Jeff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-24 16:34           ` Jeff Moyer
@ 2015-11-24 16:47             ` Keith Busch
  2015-11-24 17:56             ` Christoph Hellwig
  1 sibling, 0 replies; 63+ messages in thread
From: Keith Busch @ 2015-11-24 16:47 UTC (permalink / raw)


On Tue, Nov 24, 2015@11:34:22AM -0500, Jeff Moyer wrote:
> CPU 0 executes the following:
> 
> blk_rq_check_expired():
>         if (!blk_mark_rq_complete(rq))  // this succeeds, so we call blk_rq_timed_out
>                 blk_rq_timed_out(rq);
> blk_rq_timed_out():
>         if (q->rq_timed_out_fn)
>                 ret = q->rq_timed_out_fn(req);
>         switch (ret) {  // QUIESCED is returned
> ...
>         case BLK_EH_QUIESCED:
>                 set_bit(REQ_ATOM_QUIESCED, &req->atomic_flags);
>                 break;
> 
> CPU 1 takes an interrupt for the completion of the same request:
> 
> blk_complete_request():
>         if (!blk_mark_rq_complete(req) ||  // this fails, as it's already marked complete
>             test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))  // this succeeds
>                 __blk_complete_request(req); // so we complete the request
> 
> Later, after the adapter reset is finished, you mentioned that all
> requests will be completed.  My question is: will this result in a
> double completion for this particular request?

The reset action clears the queue's busy tags only after the adapter
is disabled. At that point, no interrupts can happen and h/w will never
complete more active requests. If an interrupt completes the request
while the reset is in progress, the request tag won't be "busy" when
the final clean-up occurs.
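
In rough pseudo-code the reset path is (illustrative only, from memory --
the exact helper names in the 4.4-era driver may differ slightly):

	nvme_dev_shutdown(dev);		/* disable the adapter; no further
					   interrupts can complete requests */

	/* final clean-up: fail whatever is still marked busy in the tag set */
	blk_mq_all_tag_busy_iter(hctx->tags, nvme_cancel_queue_ios, nvmeq);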

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-24 16:34           ` Jeff Moyer
  2015-11-24 16:47             ` Keith Busch
@ 2015-11-24 17:56             ` Christoph Hellwig
  2015-11-24 18:12               ` Jeff Moyer
  1 sibling, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-24 17:56 UTC (permalink / raw)


On Tue, Nov 24, 2015@11:34:22AM -0500, Jeff Moyer wrote:
> CPU 1 takes an interrupt for the completion of the same request:
> 
> blk_complete_request():
>         if (!blk_mark_rq_complete(req) ||  // this fails, as it's already marked complete
>             test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))  // this succeeds

and clears the flag, so we'd need a race between this call to
blk_mq_complete_request and the later completion of all outstanding
commands from the reset.  For NVMe we prevent this by not taking completions
once we start the reset, but we probably need to document this better.  I
will ensure all this is properly documented in the next version!
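
Concretely, the driver quiesces completions before the final reaping,
roughly like this (a sketch; exact helpers per the 4.4-era driver):

	nvme_suspend_queue(nvmeq);	/* frees the irq, so no further
					   completions can race with reset */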

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-24 17:56             ` Christoph Hellwig
@ 2015-11-24 18:12               ` Jeff Moyer
  2015-11-24 19:40                 ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Jeff Moyer @ 2015-11-24 18:12 UTC (permalink / raw)


Christoph Hellwig <hch at infradead.org> writes:

> On Tue, Nov 24, 2015@11:34:22AM -0500, Jeff Moyer wrote:
>> CPU 1 takes an interrupt for the completion of the same request:
>> 
>> blk_complete_request():
>>         if (!blk_mark_rq_complete(req) ||  // this fails, as it's already marked complete
>>             test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))  // this succeeds
>
> and clears the flag, so we'd need a race between this call to
> blk_mq_complete_request and the later completion of all outstanding
> commands from the reset.  For NVMe we prevent this by not taking completions

But we're completing the request, so the request will actually be freed.
Any references to this request are bogus at this point, no?

> once we start the reset, but we probably need to document this better.  I
> will ensure all this is properly documented in the next version!

Thanks, I think that will help a lot.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 04/47] block: provide a new BLK_EH_QUIESCED timeout return value
  2015-11-24 18:12               ` Jeff Moyer
@ 2015-11-24 19:40                 ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-24 19:40 UTC (permalink / raw)


On Tue, Nov 24, 2015@01:12:52PM -0500, Jeff Moyer wrote:
> >> blk_complete_request():
> >>         if (!blk_mark_rq_complete(req) ||  // this fails, as it's already marked complete
> >>             test_and_clear_bit(REQ_ATOM_QUIESCED, &req->atomic_flags))  // this succeeds
> >
> > and clears the flag, so we'd need a race between this call to
> > blk_mq_complete_request and the later completion of all outstanding
> > commands from the reset.  For NVMe we prevent this by not taking completions
> 
> But we're completing the request, so the request will actually be freed.
> Any references to this request are bogus at this point, no?

For blk-mq it won't be freed.  For the non-blk-mq case the
tag-to-request lookup will have to handle that case, but that path isn't
exercised by nvme.
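
(blk-mq preallocates all requests for a tag set up front, so completing
a request only releases its tag.  Roughly:

	struct request *rq = tags->rqs[tag];	/* stays valid memory; the
						   request is simply reused
						   on the next allocation */

The memory only goes away when the tag set itself is torn down.)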

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request
  2015-11-20 16:35 ` [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request Christoph Hellwig
  2015-11-24 15:19   ` Jeff Moyer
@ 2015-11-24 21:21   ` Jens Axboe
  2015-11-24 22:22     ` Christoph Hellwig
  1 sibling, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2015-11-24 21:21 UTC (permalink / raw)


On 11/20/2015 09:35 AM, Christoph Hellwig wrote:
> We already have the reserved flag, and a nowait flag awkwardly encoded as
> a gfp_t.  Add a real flags argument to make the scheme more extensible and
> allow for a nicer calling convention.

I've been reviewing these in the nvme-req.9 branch, I'll pick them out 
when for-4.5/xx is forked off. That should happen when Linus pulls the 
recent for-linus pull request.

This one looks fine, but:

>   create mode 100644 drivers/nvme/host/core.c

that should obviously not be there.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 05/47] block: remove REQ_ATOM_COMPLETE wrappers
  2015-11-20 16:35 ` [PATCH 05/47] block: remove REQ_ATOM_COMPLETE wrappers Christoph Hellwig
  2015-11-23 21:23   ` Jeff Moyer
@ 2015-11-24 21:22   ` Jens Axboe
  1 sibling, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2015-11-24 21:22 UTC (permalink / raw)


On 11/20/2015 09:35 AM, Christoph Hellwig wrote:
> We only use them inconsistently, and the other flags don't use wrappers like
> this either.

Also fine with this, but it should go earlier in the series (before the 
previous one, to be specific).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request
  2015-11-24 21:21   ` Jens Axboe
@ 2015-11-24 22:22     ` Christoph Hellwig
  2015-11-24 22:25       ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-24 22:22 UTC (permalink / raw)


On Tue, Nov 24, 2015@02:21:10PM -0700, Jens Axboe wrote:
> On 11/20/2015 09:35 AM, Christoph Hellwig wrote:
>> We already have the reserved flag, and a nowait flag awkwardly encoded as
>> a gfp_t.  Add a real flags argument to make the scheme more extensible and
>> allow for a nicer calling convention.
>
> I've been reviewing these in the nvme-req.9 branch, I'll pick them out when 
> for-4.5/xx is forked off. That should happen when Linus pulls the recent 
> for-linus pull request.

Ok.  It would be good to only fork off after the NVMe page size fix,
as that is bound to create merge conflicts otherwise.

While you're at it, can you send "nvme: add missing unmaps in nvme_queue_rq"
and "block: fix blk_abort_request for blk-mq drivers" from this series to
Linus for 4.4-rc as well?

>>   create mode 100644 drivers/nvme/host/core.c
>
> that should obviously not be there.

Yeah, Jeff already pointed that out.  I'll fix it for the next resend.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request
  2015-11-24 22:22     ` Christoph Hellwig
@ 2015-11-24 22:25       ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2015-11-24 22:25 UTC (permalink / raw)


On 11/24/2015 03:22 PM, Christoph Hellwig wrote:
> On Tue, Nov 24, 2015@02:21:10PM -0700, Jens Axboe wrote:
>> On 11/20/2015 09:35 AM, Christoph Hellwig wrote:
>>> We already have the reserved flag, and a nowait flag awkwardly encoded as
>>> a gfp_t.  Add a real flags argument to make the scheme more extensible and
>>> allow for a nicer calling convention.
>>
>> I've been reviewing these in the nvme-req.9 branch, I'll pick them out when
>> for-4.5/xx is forked off. That should happen when Linus pulls the recent
>> for-linus pull request.
>
> Ok.  It would be good to only fork off after the NVMe page size fix,
> as that is bound to create merge conflicts otherwise.
>
> While you're at it, can you send "nvme: add missing unmaps in nvme_queue_rq"
> and "block: fix blk_abort_request for blk-mq drivers" from this series to
> Linus for 4.4-rc as well?

Looks fine for this series.  Added, and for-4.5/core kicked off.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 25/47] nvme: don't take the I/O queue q_lock in nvme_timeout
  2015-11-20 16:35 ` [PATCH 25/47] nvme: don't take the I/O queue q_lock in nvme_timeout Christoph Hellwig
@ 2017-03-10 12:51   ` David Woodhouse
  2017-03-10 14:24     ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: David Woodhouse @ 2017-03-10 12:51 UTC (permalink / raw)


> There is nothing it protects, but it makes lockdep unhappy in many different
> ways.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> Signed-off-by: Keith Busch <keith.busch at intel.com>

This is commit 4c9f748f0ee88447b28546991f60f43a7319aafd in 4.5.

Doesn't it want to be applied to 4.4 stable too?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 25/47] nvme: don't take the I/O queue q_lock in nvme_timeout
  2017-03-10 12:51   ` David Woodhouse
@ 2017-03-10 14:24     ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2017-03-10 14:24 UTC (permalink / raw)


On Fri, Mar 10, 2017@01:51:11PM +0100, David Woodhouse wrote:
> > There is nothing it protects, but it makes lockdep unhappy in many different
> > ways.
> >
> > Signed-off-by: Christoph Hellwig <hch at lst.de>
> > Signed-off-by: Keith Busch <keith.busch at intel.com>
> 
> This is commit 4c9f748f0ee88447b28546991f60f43a7319aafd in 4.5.
> 
> Doesn't it want to be applied to 4.4 stable too?

No idea.  For that, someone would have to look at 4.4-stable, check if
it makes sense, and hopefully do some NVMe regression testing, including
testing timeouts.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 06/47] blk-mq: add a flags parameter to blk_mq_alloc_request
  2015-11-21  7:19 NVMe mega patchbomb for Linux 4.5-rc (resend) Christoph Hellwig
@ 2015-11-21  7:19 ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2015-11-21  7:19 UTC (permalink / raw)


We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t.  Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.
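
For example, a blocking caller that used to pass GFP_KERNEL now simply
passes 0, and GFP_ATOMIC callers pass BLK_MQ_REQ_NOWAIT, as in the
conversions below (illustrative):

	req = blk_mq_alloc_request(q, WRITE, 0);
	req = blk_mq_alloc_request(q, WRITE, BLK_MQ_REQ_NOWAIT);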

Signed-off-by: Christoph Hellwig <hch at lst.de>
---
 block/blk-core.c                  |   11 +-
 block/blk-mq-tag.c                |   11 +-
 block/blk-mq.c                    |   20 +-
 block/blk-mq.h                    |   11 +-
 block/blk.h                       |    2 +-
 drivers/block/mtip32xx/mtip32xx.c |    2 +-
 drivers/nvme/host/core.c          | 1172 +++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/pci.c           |   11 +-
 include/linux/blk-mq.h            |    8 +-
 9 files changed, 1210 insertions(+), 38 deletions(-)
 create mode 100644 drivers/nvme/host/core.c

diff --git a/block/blk-core.c b/block/blk-core.c
index af9c315..d2100aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -630,7 +630,7 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, gfp_t gfp)
+int blk_queue_enter(struct request_queue *q, bool nowait)
 {
 	while (true) {
 		int ret;
@@ -638,7 +638,7 @@ int blk_queue_enter(struct request_queue *q, gfp_t gfp)
 		if (percpu_ref_tryget_live(&q->q_usage_counter))
 			return 0;
 
-		if (!gfpflags_allow_blocking(gfp))
+		if (nowait)
 			return -EBUSY;
 
 		ret = wait_event_interruptible(q->mq_freeze_wq,
@@ -1284,7 +1284,9 @@ static struct request *blk_old_get_request(struct request_queue *q, int rw,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	if (q->mq_ops)
-		return blk_mq_alloc_request(q, rw, gfp_mask, false);
+		return blk_mq_alloc_request(q, rw,
+			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
+				0 : BLK_MQ_REQ_NOWAIT);
 	else
 		return blk_old_get_request(q, rw, gfp_mask);
 }
@@ -2052,8 +2054,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
-
+		if (likely(blk_queue_enter(q, false) == 0)) {
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a07ca34..abdbb47 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -268,7 +268,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
 	if (tag != -1)
 		return tag;
 
-	if (!gfpflags_allow_blocking(data->gfp))
+	if (data->flags & BLK_MQ_REQ_NOWAIT)
 		return -1;
 
 	bs = bt_wait_ptr(bt, hctx);
@@ -303,7 +303,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
 		data->ctx = blk_mq_get_ctx(data->q);
 		data->hctx = data->q->mq_ops->map_queue(data->q,
 				data->ctx->cpu);
-		if (data->reserved) {
+		if (data->flags & BLK_MQ_REQ_RESERVED) {
 			bt = &data->hctx->tags->breserved_tags;
 		} else {
 			last_tag = &data->ctx->last_tag;
@@ -349,10 +349,9 @@ static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_alloc_data *data)
 
 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 {
-	if (!data->reserved)
-		return __blk_mq_get_tag(data);
-
-	return __blk_mq_get_reserved_tag(data);
+	if (data->flags & BLK_MQ_REQ_RESERVED)
+		return __blk_mq_get_reserved_tag(data);
+	return __blk_mq_get_tag(data);
 }
 
 static struct bt_wait_state *bt_wake_ptr(struct blk_mq_bitmap_tags *bt)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c932605..6da03f1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -230,8 +230,8 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, int rw)
 	return NULL;
 }
 
-struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
-		bool reserved)
+struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
+		unsigned int flags)
 {
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_hw_ctx *hctx;
@@ -239,24 +239,22 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	struct blk_mq_alloc_data alloc_data;
 	int ret;
 
-	ret = blk_queue_enter(q, gfp);
+	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
 	if (ret)
 		return ERR_PTR(ret);
 
 	ctx = blk_mq_get_ctx(q);
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
-	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
-			reserved, ctx, hctx);
+	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
 
 	rq = __blk_mq_alloc_request(&alloc_data, rw);
-	if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
+	if (!rq && !(flags & BLK_MQ_REQ_NOWAIT)) {
 		__blk_mq_run_hw_queue(hctx);
 		blk_mq_put_ctx(ctx);
 
 		ctx = blk_mq_get_ctx(q);
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
-		blk_mq_set_alloc_data(&alloc_data, q, gfp, reserved, ctx,
-				hctx);
+		blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
 		rq =  __blk_mq_alloc_request(&alloc_data, rw);
 		ctx = alloc_data.ctx;
 	}
@@ -1181,8 +1179,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 		rw |= REQ_SYNC;
 
 	trace_block_getrq(q, bio, rw);
-	blk_mq_set_alloc_data(&alloc_data, q, GFP_ATOMIC, false, ctx,
-			hctx);
+	blk_mq_set_alloc_data(&alloc_data, q, BLK_MQ_REQ_NOWAIT, ctx, hctx);
 	rq = __blk_mq_alloc_request(&alloc_data, rw);
 	if (unlikely(!rq)) {
 		__blk_mq_run_hw_queue(hctx);
@@ -1191,8 +1188,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 
 		ctx = blk_mq_get_ctx(q);
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
-		blk_mq_set_alloc_data(&alloc_data, q,
-				__GFP_RECLAIM|__GFP_HIGH, false, ctx, hctx);
+		blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
 		rq = __blk_mq_alloc_request(&alloc_data, rw);
 		ctx = alloc_data.ctx;
 		hctx = alloc_data.hctx;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 713820b..eaede8e 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -96,8 +96,7 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
 struct blk_mq_alloc_data {
 	/* input parameter */
 	struct request_queue *q;
-	gfp_t gfp;
-	bool reserved;
+	unsigned int flags;
 
 	/* input & output parameter */
 	struct blk_mq_ctx *ctx;
@@ -105,13 +104,11 @@ struct blk_mq_alloc_data {
 };
 
 static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
-		struct request_queue *q, gfp_t gfp, bool reserved,
-		struct blk_mq_ctx *ctx,
-		struct blk_mq_hw_ctx *hctx)
+		struct request_queue *q, unsigned int flags,
+		struct blk_mq_ctx *ctx, struct blk_mq_hw_ctx *hctx)
 {
 	data->q = q;
-	data->gfp = gfp;
-	data->reserved = reserved;
+	data->flags = flags;
 	data->ctx = ctx;
 	data->hctx = hctx;
 }
diff --git a/block/blk.h b/block/blk.h
index 1d95107..38bf997 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,7 +72,7 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+int blk_queue_enter(struct request_queue *q, bool nowait);
 void blk_queue_exit(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index a28a562..cf3b51a 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -173,7 +173,7 @@ static struct mtip_cmd *mtip_get_int_command(struct driver_data *dd)
 {
 	struct request *rq;
 
-	rq = blk_mq_alloc_request(dd->queue, 0, __GFP_RECLAIM, true);
+	rq = blk_mq_alloc_request(dd->queue, 0, BLK_MQ_REQ_RESERVED);
 	return blk_mq_rq_to_pdu(rq);
 }
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
new file mode 100644
index 0000000..53cf507
--- /dev/null
+++ b/drivers/nvme/host/core.c
@@ -0,0 +1,1172 @@
+/*
+ * NVM Express device driver
+ * Copyright (c) 2011-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/errno.h>
+#include <linux/hdreg.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list_sort.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/pr.h>
+#include <linux/ptrace.h>
+#include <linux/nvme_ioctl.h>
+#include <linux/t10-pi.h>
+#include <scsi/sg.h>
+#include <asm/unaligned.h>
+
+#include "nvme.h"
+
+#define NVME_MINORS		(1U << MINORBITS)
+
+static int nvme_major;
+module_param(nvme_major, int, 0);
+
+static int nvme_char_major;
+module_param(nvme_char_major, int, 0);
+
+static LIST_HEAD(nvme_ctrl_list);
+DEFINE_SPINLOCK(dev_list_lock);
+
+static struct class *nvme_class;
+
+static void nvme_free_ns(struct kref *kref)
+{
+	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
+
+	if (ns->type == NVME_NS_LIGHTNVM)
+		nvme_nvm_unregister(ns->queue, ns->disk->disk_name);
+
+	spin_lock(&dev_list_lock);
+	ns->disk->private_data = NULL;
+	spin_unlock(&dev_list_lock);
+
+	nvme_put_ctrl(ns->ctrl);
+	put_disk(ns->disk);
+	kfree(ns);
+}
+
+static void nvme_put_ns(struct nvme_ns *ns)
+{
+	kref_put(&ns->kref, nvme_free_ns);
+}
+
+static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
+{
+	struct nvme_ns *ns;
+
+	spin_lock(&dev_list_lock);
+	ns = disk->private_data;
+	if (ns && !kref_get_unless_zero(&ns->kref))
+		ns = NULL;
+	spin_unlock(&dev_list_lock);
+
+	return ns;
+}
+
+static struct request *nvme_alloc_request(struct request_queue *q,
+		struct nvme_command *cmd)
+{
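+	/* NVMe opcodes with bit 0 set transfer data from host to device */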
+	bool write = cmd->common.opcode & 1;
+	struct request *req;
+
+	req = blk_mq_alloc_request(q, write, 0);
+	if (IS_ERR(req))
+		return req;
+
+	req->cmd_type = REQ_TYPE_DRV_PRIV;
+	req->cmd_flags |= REQ_FAILFAST_DRIVER;
+	req->__data_len = 0;
+	req->__sector = (sector_t) -1;
+	req->bio = req->biotail = NULL;
+
+	req->cmd = (unsigned char *)cmd;
+	req->cmd_len = sizeof(struct nvme_command);
+	req->special = (void *)0;
+
+	return req;
+}
+
+/*
+ * Returns 0 on success.  If the result is negative, it's a Linux error code;
+ * if the result is positive, it's an NVM Express status code.
+ */
+int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void *buffer, unsigned bufflen, u32 *result, unsigned timeout)
+{
+	struct request *req;
+	int ret;
+
+	req = nvme_alloc_request(q, cmd);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
+
+	if (buffer && bufflen) {
+		ret = blk_rq_map_kern(q, req, buffer, bufflen, GFP_KERNEL);
+		if (ret)
+			goto out;
+	}
+
+	blk_execute_rq(req->q, NULL, req, 0);
+	if (result)
+		*result = (u32)(uintptr_t)req->special;
+	ret = req->errors;
+ out:
+	blk_mq_free_request(req);
+	return ret;
+}
+
+int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void *buffer, unsigned bufflen)
+{
+	return __nvme_submit_sync_cmd(q, cmd, buffer, bufflen, NULL, 0);
+}
+
+int __nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void __user *ubuffer, unsigned bufflen,
+		void __user *meta_buffer, unsigned meta_len, u32 meta_seed,
+		u32 *result, unsigned timeout)
+{
+	bool write = cmd->common.opcode & 1;
+	struct nvme_ns *ns = q->queuedata;
+	struct gendisk *disk = ns ? ns->disk : NULL;
+	struct request *req;
+	struct bio *bio = NULL;
+	void *meta = NULL;
+	int ret;
+
+	req = nvme_alloc_request(q, cmd);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	req->timeout = timeout ? timeout : ADMIN_TIMEOUT;
+
+	if (ubuffer && bufflen) {
+		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
+				GFP_KERNEL);
+		if (ret)
+			goto out;
+		bio = req->bio;
+
+		if (!disk)
+			goto submit;
+		bio->bi_bdev = bdget_disk(disk, 0);
+		if (!bio->bi_bdev) {
+			ret = -ENODEV;
+			goto out_unmap;
+		}
+
+		if (meta_buffer) {
+			struct bio_integrity_payload *bip;
+
+			meta = kmalloc(meta_len, GFP_KERNEL);
+			if (!meta) {
+				ret = -ENOMEM;
+				goto out_unmap;
+			}
+
+			if (write) {
+				if (copy_from_user(meta, meta_buffer,
+						meta_len)) {
+					ret = -EFAULT;
+					goto out_free_meta;
+				}
+			}
+
+			bip = bio_integrity_alloc(bio, GFP_KERNEL, 1);
+			if (!bip) {
+				ret = -ENOMEM;
+				goto out_free_meta;
+			}
+
+			bip->bip_iter.bi_size = meta_len;
+			bip->bip_iter.bi_sector = meta_seed;
+
+			ret = bio_integrity_add_page(bio, virt_to_page(meta),
+					meta_len, offset_in_page(meta));
+			if (ret != meta_len) {
+				ret = -ENOMEM;
+				goto out_free_meta;
+			}
+		}
+	}
+ submit:
+	blk_execute_rq(req->q, disk, req, 0);
+	ret = req->errors;
+	if (result)
+		*result = (u32)(uintptr_t)req->special;
+	if (meta && !ret && !write) {
+		if (copy_to_user(meta_buffer, meta, meta_len))
+			ret = -EFAULT;
+	}
+ out_free_meta:
+	kfree(meta);
+ out_unmap:
+	if (bio) {
+		if (disk && bio->bi_bdev)
+			bdput(bio->bi_bdev);
+		blk_rq_unmap_user(bio);
+	}
+ out:
+	blk_mq_free_request(req);
+	return ret;
+}
+
+int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd,
+		void __user *ubuffer, unsigned bufflen, u32 *result,
+		unsigned timeout)
+{
+	return __nvme_submit_user_cmd(q, cmd, ubuffer, bufflen, NULL, 0, 0,
+			result, timeout);
+}
+
+int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
+{
+	struct nvme_command c = { };
+	int error;
+
+	/* gcc-4.4.4 (at least) has issues with initializers and anon unions */
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = cpu_to_le32(1);
+
+	*id = kmalloc(sizeof(struct nvme_id_ctrl), GFP_KERNEL);
+	if (!*id)
+		return -ENOMEM;
+
+	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
+			sizeof(struct nvme_id_ctrl));
+	if (error)
+		kfree(*id);
+	return error;
+}
+
+static int nvme_identify_ns_list(struct nvme_ctrl *dev, unsigned nsid, __le32 *ns_list)
+{
+	struct nvme_command c = { };
+
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = cpu_to_le32(2);
+	c.identify.nsid = cpu_to_le32(nsid);
+	return nvme_submit_sync_cmd(dev->admin_q, &c, ns_list, 0x1000);
+}
+
+int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
+		struct nvme_id_ns **id)
+{
+	struct nvme_command c = { };
+	int error;
+
+	/* gcc-4.4.4 (at least) has issues with initializers and anon unions */
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.nsid = cpu_to_le32(nsid);
+
+	*id = kmalloc(sizeof(struct nvme_id_ns), GFP_KERNEL);
+	if (!*id)
+		return -ENOMEM;
+
+	error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
+			sizeof(struct nvme_id_ns));
+	if (error)
+		kfree(*id);
+	return error;
+}
+
+int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
+					dma_addr_t dma_addr, u32 *result)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.features.opcode = nvme_admin_get_features;
+	c.features.nsid = cpu_to_le32(nsid);
+	c.features.prp1 = cpu_to_le64(dma_addr);
+	c.features.fid = cpu_to_le32(fid);
+
+	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
+}
+
+int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
+					dma_addr_t dma_addr, u32 *result)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	c.features.opcode = nvme_admin_set_features;
+	c.features.prp1 = cpu_to_le64(dma_addr);
+	c.features.fid = cpu_to_le32(fid);
+	c.features.dword11 = cpu_to_le32(dword11);
+
+	return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, 0, result, 0);
+}
+
+int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log)
+{
+	struct nvme_command c = { };
+	int error;
+
+	c.common.opcode = nvme_admin_get_log_page;
+	c.common.nsid = cpu_to_le32(0xFFFFFFFF);
+	c.common.cdw10[0] = cpu_to_le32(
+			(((sizeof(struct nvme_smart_log) / 4) - 1) << 16) |
+			 NVME_LOG_SMART);
+
+	*log = kmalloc(sizeof(struct nvme_smart_log), GFP_KERNEL);
+	if (!*log)
+		return -ENOMEM;
+
+	error = nvme_submit_sync_cmd(dev->admin_q, &c, *log,
+			sizeof(struct nvme_smart_log));
+	if (error)
+		kfree(*log);
+	return error;
+}
+
+static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
+{
+	struct nvme_user_io io;
+	struct nvme_command c;
+	unsigned length, meta_len;
+	void __user *metadata;
+
+	if (copy_from_user(&io, uio, sizeof(io)))
+		return -EFAULT;
+
+	switch (io.opcode) {
+	case nvme_cmd_write:
+	case nvme_cmd_read:
+	case nvme_cmd_compare:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	length = (io.nblocks + 1) << ns->lba_shift;
+	meta_len = (io.nblocks + 1) * ns->ms;
+	metadata = (void __user *)(uintptr_t)io.metadata;
+
+	if (ns->ext) {
+		length += meta_len;
+		meta_len = 0;
+	} else if (meta_len) {
+		if ((io.metadata & 3) || !io.metadata)
+			return -EINVAL;
+	}
+
+	memset(&c, 0, sizeof(c));
+	c.rw.opcode = io.opcode;
+	c.rw.flags = io.flags;
+	c.rw.nsid = cpu_to_le32(ns->ns_id);
+	c.rw.slba = cpu_to_le64(io.slba);
+	c.rw.length = cpu_to_le16(io.nblocks);
+	c.rw.control = cpu_to_le16(io.control);
+	c.rw.dsmgmt = cpu_to_le32(io.dsmgmt);
+	c.rw.reftag = cpu_to_le32(io.reftag);
+	c.rw.apptag = cpu_to_le16(io.apptag);
+	c.rw.appmask = cpu_to_le16(io.appmask);
+
+	return __nvme_submit_user_cmd(ns->queue, &c,
+			(void __user *)(uintptr_t)io.addr, length,
+			metadata, meta_len, io.slba, NULL, 0);
+}
+
+static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+			struct nvme_passthru_cmd __user *ucmd)
+{
+	struct nvme_passthru_cmd cmd;
+	struct nvme_command c;
+	unsigned timeout = 0;
+	int status;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+	if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+		return -EFAULT;
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = cmd.opcode;
+	c.common.flags = cmd.flags;
+	c.common.nsid = cpu_to_le32(cmd.nsid);
+	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
+	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+	c.common.cdw10[0] = cpu_to_le32(cmd.cdw10);
+	c.common.cdw10[1] = cpu_to_le32(cmd.cdw11);
+	c.common.cdw10[2] = cpu_to_le32(cmd.cdw12);
+	c.common.cdw10[3] = cpu_to_le32(cmd.cdw13);
+	c.common.cdw10[4] = cpu_to_le32(cmd.cdw14);
+	c.common.cdw10[5] = cpu_to_le32(cmd.cdw15);
+
+	if (cmd.timeout_ms)
+		timeout = msecs_to_jiffies(cmd.timeout_ms);
+
+	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
+			(void __user *)cmd.addr, cmd.data_len,
+			&cmd.result, timeout);
+	if (status >= 0) {
+		if (put_user(cmd.result, &ucmd->result))
+			return -EFAULT;
+	}
+
+	return status;
+}
+
+static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
+		unsigned int cmd, unsigned long arg)
+{
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+
+	switch (cmd) {
+	case NVME_IOCTL_ID:
+		force_successful_syscall_return();
+		return ns->ns_id;
+	case NVME_IOCTL_ADMIN_CMD:
+		return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
+	case NVME_IOCTL_IO_CMD:
+		return nvme_user_cmd(ns->ctrl, ns, (void __user *)arg);
+	case NVME_IOCTL_SUBMIT_IO:
+		return nvme_submit_io(ns, (void __user *)arg);
+	case SG_GET_VERSION_NUM:
+		return nvme_sg_get_version_num((void __user *)arg);
+	case SG_IO:
+		return nvme_sg_io(ns, (void __user *)arg);
+	default:
+		return -ENOTTY;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
+			unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case SG_IO:
+		return -ENOIOCTLCMD;
+	}
+	return nvme_ioctl(bdev, mode, cmd, arg);
+}
+#else
+#define nvme_compat_ioctl	NULL
+#endif
+
+static int nvme_open(struct block_device *bdev, fmode_t mode)
+{
+	return nvme_get_ns_from_disk(bdev->bd_disk) ? 0 : -ENXIO;
+}
+
+static void nvme_release(struct gendisk *disk, fmode_t mode)
+{
+	nvme_put_ns(disk->private_data);
+}
+
+static int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bdev->bd_disk) >> 11;
+	return 0;
+}
+
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+static void nvme_init_integrity(struct nvme_ns *ns)
+{
+	struct blk_integrity integrity;
+
+	switch (ns->pi_type) {
+	case NVME_NS_DPS_PI_TYPE3:
+		integrity.profile = &t10_pi_type3_crc;
+		break;
+	case NVME_NS_DPS_PI_TYPE1:
+	case NVME_NS_DPS_PI_TYPE2:
+		integrity.profile = &t10_pi_type1_crc;
+		break;
+	default:
+		integrity.profile = NULL;
+		break;
+	}
+	integrity.tuple_size = ns->ms;
+	blk_integrity_register(ns->disk, &integrity);
+	blk_queue_max_integrity_segments(ns->queue, 1);
+}
+#else
+static void nvme_init_integrity(struct nvme_ns *ns)
+{
+}
+#endif /* CONFIG_BLK_DEV_INTEGRITY */
+
+static void nvme_config_discard(struct nvme_ns *ns)
+{
+	u32 logical_block_size = queue_logical_block_size(ns->queue);
+	ns->queue->limits.discard_zeroes_data = 0;
+	ns->queue->limits.discard_alignment = logical_block_size;
+	ns->queue->limits.discard_granularity = logical_block_size;
+	blk_queue_max_discard_sectors(ns->queue, 0xffffffff);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
+}
+
+static int nvme_revalidate_disk(struct gendisk *disk)
+{
+	struct nvme_ns *ns = disk->private_data;
+	struct nvme_id_ns *id;
+	u8 lbaf, pi_type;
+	u16 old_ms;
+	unsigned short bs;
+
+	if (nvme_identify_ns(ns->ctrl, ns->ns_id, &id)) {
+		dev_warn(ns->ctrl->dev, "%s: Identify failure nvme%dn%d\n",
+				__func__, ns->ctrl->instance, ns->ns_id);
+		return -ENODEV;
+	}
+	if (id->ncap == 0) {
+		kfree(id);
+		return -ENODEV;
+	}
+
+	if (nvme_nvm_ns_supported(ns, id) && ns->type != NVME_NS_LIGHTNVM) {
+		if (nvme_nvm_register(ns->queue, disk->disk_name)) {
+			dev_warn(ns->ctrl->dev,
+				"%s: LightNVM init failure\n", __func__);
+			kfree(id);
+			return -ENODEV;
+		}
+		ns->type = NVME_NS_LIGHTNVM;
+	}
+
+	old_ms = ns->ms;
+	lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
+	ns->lba_shift = id->lbaf[lbaf].ds;
+	ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
+	ns->ext = ns->ms && (id->flbas & NVME_NS_FLBAS_META_EXT);
+
+	/*
+	 * If Identify Namespace failed to report a block size, fall back to
+	 * the default 512 byte block size so the block layer can still use
+	 * the queue before failing reads/writes for the 0 capacity device.
+	 */
+	if (ns->lba_shift == 0)
+		ns->lba_shift = 9;
+	bs = 1 << ns->lba_shift;
+	/* XXX: PI implementation requires metadata equal t10 pi tuple size */
+	pi_type = ns->ms == sizeof(struct t10_pi_tuple) ?
+					id->dps & NVME_NS_DPS_PI_MASK : 0;
+
+	blk_mq_freeze_queue(disk->queue);
+	if (blk_get_integrity(disk) && (ns->pi_type != pi_type ||
+				ns->ms != old_ms ||
+				bs != queue_logical_block_size(disk->queue) ||
+				(ns->ms && ns->ext)))
+		blk_integrity_unregister(disk);
+
+	ns->pi_type = pi_type;
+	blk_queue_logical_block_size(ns->queue, bs);
+
+	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
+		nvme_init_integrity(ns);
+	if (ns->ms && !(ns->ms == 8 && ns->pi_type) && !blk_get_integrity(disk))
+		set_capacity(disk, 0);
+	else
+		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
+
+	if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM)
+		nvme_config_discard(ns);
+	blk_mq_unfreeze_queue(disk->queue);
+
+	kfree(id);
+	return 0;
+}
+
+static char nvme_pr_type(enum pr_type type)
+{
+	switch (type) {
+	case PR_WRITE_EXCLUSIVE:
+		return 1;
+	case PR_EXCLUSIVE_ACCESS:
+		return 2;
+	case PR_WRITE_EXCLUSIVE_REG_ONLY:
+		return 3;
+	case PR_EXCLUSIVE_ACCESS_REG_ONLY:
+		return 4;
+	case PR_WRITE_EXCLUSIVE_ALL_REGS:
+		return 5;
+	case PR_EXCLUSIVE_ACCESS_ALL_REGS:
+		return 6;
+	default:
+		return 0;
+	}
+}
+
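+/*
+ * cdw10 layout shared by the reservation commands below: bits 2:0 select
+ * the reservation action, bit 3 means "ignore existing key", and, where
+ * applicable, bits 15:8 carry the reservation type.
+ */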
+static int nvme_pr_command(struct block_device *bdev, u32 cdw10,
+				u64 key, u64 sa_key, u8 op)
+{
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+	struct nvme_command c;
+	u8 data[16] = { 0, };
+
+	put_unaligned_le64(key, &data[0]);
+	put_unaligned_le64(sa_key, &data[8]);
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = op;
+	c.common.nsid = cpu_to_le32(ns->ns_id);
+	c.common.cdw10[0] = cpu_to_le32(cdw10);
+
+	return nvme_submit_sync_cmd(ns->queue, &c, data, 16);
+}
+
+static int nvme_pr_register(struct block_device *bdev, u64 old,
+		u64 new, unsigned flags)
+{
+	u32 cdw10;
+
+	if (flags & ~PR_FL_IGNORE_KEY)
+		return -EOPNOTSUPP;
+
+	cdw10 = old ? 2 : 0;
+	cdw10 |= (flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0;
+	cdw10 |= (1 << 30) | (1 << 31); /* PTPL=1 */
+	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_register);
+}
+
+static int nvme_pr_reserve(struct block_device *bdev, u64 key,
+		enum pr_type type, unsigned flags)
+{
+	u32 cdw10;
+
+	if (flags & ~PR_FL_IGNORE_KEY)
+		return -EOPNOTSUPP;
+
+	cdw10 = nvme_pr_type(type) << 8;
+	cdw10 |= ((flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0);
+	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_acquire);
+}
+
+static int nvme_pr_preempt(struct block_device *bdev, u64 old, u64 new,
+		enum pr_type type, bool abort)
+{
+	u32 cdw10 = nvme_pr_type(type) << 8 | (abort ? 2 : 1);
+	return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_acquire);
+}
+
+static int nvme_pr_clear(struct block_device *bdev, u64 key)
+{
+	u32 cdw10 = 1 | (key ? 1 << 3 : 0);
+	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_register);
+}
+
+static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type)
+{
+	u32 cdw10 = nvme_pr_type(type) << 8 | (key ? 1 << 3 : 0);
+	return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
+}
+
+static const struct pr_ops nvme_pr_ops = {
+	.pr_register	= nvme_pr_register,
+	.pr_reserve	= nvme_pr_reserve,
+	.pr_release	= nvme_pr_release,
+	.pr_preempt	= nvme_pr_preempt,
+	.pr_clear	= nvme_pr_clear,
+};
+
+static const struct block_device_operations nvme_fops = {
+	.owner		= THIS_MODULE,
+	.ioctl		= nvme_ioctl,
+	.compat_ioctl	= nvme_compat_ioctl,
+	.open		= nvme_open,
+	.release	= nvme_release,
+	.getgeo		= nvme_getgeo,
+	.revalidate_disk = nvme_revalidate_disk,
+	.pr_ops		= &nvme_pr_ops,
+};
+
+/*
+ * Initialize the cached copies of the Identify data and various controller
+ * registers in our nvme_ctrl structure.  This should be called as soon as
+ * the admin queue is fully up and running.
+ */
+int nvme_init_identify(struct nvme_ctrl *ctrl)
+{
+	struct nvme_id_ctrl *id;
+	u64 cap;
+	int ret, page_shift;
+
+	ret = ctrl->ops->reg_read32(ctrl, NVME_REG_VS, &ctrl->vs);
+	if (ret) {
+		dev_err(ctrl->dev, "Reading VS failed (%d)\n", ret);
+		return ret;
+	}
+
+	ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap);
+	if (ret) {
+		dev_err(ctrl->dev, "Reading CAP failed (%d)\n", ret);
+		return ret;
+	}
+	page_shift = NVME_CAP_MPSMIN(cap) + 12;
+	ctrl->page_size = 1 << page_shift;
+
+	if (ctrl->vs >= NVME_VS(1, 1))
+		ctrl->subsystem = NVME_CAP_NSSRC(cap);
+
+	ret = nvme_identify_ctrl(ctrl, &id);
+	if (ret) {
+		dev_err(ctrl->dev, "Identify Controller failed (%d)\n", ret);
+		return -EIO;
+	}
+
+	ctrl->oncs = le16_to_cpup(&id->oncs);
+	atomic_set(&ctrl->abort_limit, id->acl + 1);
+	ctrl->vwc = id->vwc;
+	memcpy(ctrl->serial, id->sn, sizeof(id->sn));
+	memcpy(ctrl->model, id->mn, sizeof(id->mn));
+	memcpy(ctrl->firmware_rev, id->fr, sizeof(id->fr));
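+	/*
+	 * MDTS is a power of two in units of the minimum page size (hence
+	 * "+ page_shift"); subtracting 9 converts to 512-byte sectors.
+	 */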
+	if (id->mdts)
+		ctrl->max_hw_sectors = 1 << (id->mdts + page_shift - 9);
+	else
+		ctrl->max_hw_sectors = UINT_MAX;
+
+	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) && id->vs[3]) {
+		unsigned int max_hw_sectors;
+
+		ctrl->stripe_size = 1 << (id->vs[3] + page_shift);
+		max_hw_sectors = ctrl->stripe_size >> (page_shift - 9);
+		if (ctrl->max_hw_sectors) {
+			ctrl->max_hw_sectors = min(max_hw_sectors,
+							ctrl->max_hw_sectors);
+		} else {
+			ctrl->max_hw_sectors = max_hw_sectors;
+		}
+	}
+
+	kfree(id);
+	return 0;
+}
+
+static int nvme_dev_open(struct inode *inode, struct file *file)
+{
+	struct nvme_ctrl *ctrl;
+	int instance = iminor(inode);
+	int ret = -ENODEV;
+
+	spin_lock(&dev_list_lock);
+	list_for_each_entry(ctrl, &nvme_ctrl_list, node) {
+		if (ctrl->instance != instance)
+			continue;
+
+		if (!ctrl->admin_q) {
+			ret = -EWOULDBLOCK;
+			break;
+		}
+		if (!kref_get_unless_zero(&ctrl->kref))
+			break;
+		file->private_data = ctrl;
+		ret = 0;
+		break;
+	}
+	spin_unlock(&dev_list_lock);
+
+	return ret;
+}
+
+static int nvme_dev_release(struct inode *inode, struct file *file)
+{
+	nvme_put_ctrl(file->private_data);
+	return 0;
+}
+
+static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct nvme_ctrl *ctrl = file->private_data;
+	void __user *argp = (void __user *)arg;
+	struct nvme_ns *ns;
+
+	switch (cmd) {
+	case NVME_IOCTL_ADMIN_CMD:
+		return nvme_user_cmd(ctrl, NULL, argp);
+	case NVME_IOCTL_IO_CMD:
+		if (list_empty(&ctrl->namespaces))
+			return -ENOTTY;
+		ns = list_first_entry(&ctrl->namespaces, struct nvme_ns, list);
+		return nvme_user_cmd(ctrl, ns, argp);
+	case NVME_IOCTL_RESET:
+		dev_warn(ctrl->dev, "resetting controller\n");
+		return ctrl->ops->reset_ctrl(ctrl);
+	case NVME_IOCTL_SUBSYS_RESET:
+		return nvme_reset_subsystem(ctrl);
+	default:
+		return -ENOTTY;
+	}
+}
+
+static const struct file_operations nvme_dev_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nvme_dev_open,
+	.release	= nvme_dev_release,
+	.unlocked_ioctl	= nvme_dev_ioctl,
+	.compat_ioctl	= nvme_dev_ioctl,
+};
+
+static ssize_t nvme_sysfs_reset(struct device *dev,
+				struct device_attribute *attr, const char *buf,
+				size_t count)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+	int ret;
+
+	ret = ctrl->ops->reset_ctrl(ctrl);
+	if (ret < 0)
+		return ret;
+	return count;
+}
+static DEVICE_ATTR(reset_controller, S_IWUSR, NULL, nvme_sysfs_reset);
+
+static int ns_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+	struct nvme_ns *nsa = container_of(a, struct nvme_ns, list);
+	struct nvme_ns *nsb = container_of(b, struct nvme_ns, list);
+
+	return nsa->ns_id - nsb->ns_id;
+}
+
+static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+{
+	struct nvme_ns *ns;
+
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (ns->ns_id == nsid)
+			return ns;
+		if (ns->ns_id > nsid)
+			break;
+	}
+	return NULL;
+}
+
+static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+{
+	struct nvme_ns *ns;
+	struct gendisk *disk;
+	int node = dev_to_node(ctrl->dev);
+
+	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
+	if (!ns)
+		return;
+
+	ns->queue = blk_mq_init_queue(ctrl->tagset);
+	if (IS_ERR(ns->queue))
+		goto out_free_ns;
+	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
+	ns->queue->queuedata = ns;
+	ns->ctrl = ctrl;
+
+	disk = alloc_disk_node(0, node);
+	if (!disk)
+		goto out_free_queue;
+
+	kref_init(&ns->kref);
+	ns->ns_id = nsid;
+	ns->disk = disk;
+	ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
+
+	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
+	if (ctrl->max_hw_sectors) {
+		blk_queue_max_hw_sectors(ns->queue, ctrl->max_hw_sectors);
+		blk_queue_max_segments(ns->queue,
+			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
+	}
+	if (ctrl->stripe_size)
+		blk_queue_chunk_sectors(ns->queue, ctrl->stripe_size >> 9);
+	if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
+		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
+	blk_queue_virt_boundary(ns->queue, ctrl->page_size - 1);
+
+	disk->major = nvme_major;
+	disk->first_minor = 0;
+	disk->fops = &nvme_fops;
+	disk->private_data = ns;
+	disk->queue = ns->queue;
+	disk->driverfs_dev = ctrl->device;
+	disk->flags = GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance, nsid);
+
+	if (nvme_revalidate_disk(ns->disk))
+		goto out_free_disk;
+
+	list_add_tail(&ns->list, &ctrl->namespaces);
+	kref_get(&ctrl->kref);
+	if (ns->type != NVME_NS_LIGHTNVM)
+		add_disk(ns->disk);
+
+	return;
+ out_free_disk:
+	kfree(disk);
+ out_free_queue:
+	blk_cleanup_queue(ns->queue);
+ out_free_ns:
+	kfree(ns);
+}
+
+static void nvme_ns_remove(struct nvme_ns *ns)
+{
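+	/*
+	 * If the controller can no longer process I/O, mark the queue dying
+	 * up front so outstanding requests fail instead of blocking removal.
+	 */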
+	bool kill = nvme_io_incapable(ns->ctrl) &&
+			!blk_queue_dying(ns->queue);
+
+	if (kill)
+		blk_set_queue_dying(ns->queue);
+	if (ns->disk->flags & GENHD_FL_UP) {
+		if (blk_get_integrity(ns->disk))
+			blk_integrity_unregister(ns->disk);
+		del_gendisk(ns->disk);
+	}
+	if (kill || !blk_queue_dying(ns->queue)) {
+		blk_mq_abort_requeue_list(ns->queue);
+		blk_cleanup_queue(ns->queue);
+	}
+	list_del_init(&ns->list);
+	nvme_put_ns(ns);
+}
+
+static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
+{
+	struct nvme_ns *ns;
+
+	ns = nvme_find_ns(ctrl, nsid);
+	if (ns) {
+		if (revalidate_disk(ns->disk))
+			nvme_ns_remove(ns);
+	} else
+		nvme_alloc_ns(ctrl, nsid);
+}
+
+static int nvme_scan_ns_list(struct nvme_ctrl *ctrl, unsigned nn)
+{
+	struct nvme_ns *ns;
+	__le32 *ns_list;
+	unsigned i, j, nsid, prev = 0, num_lists = DIV_ROUND_UP(nn, 1024);
+	int ret = 0;
+
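+	/* one Identify page is 4KB, i.e. up to 1024 little-endian NSIDs */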
+	ns_list = kzalloc(0x1000, GFP_KERNEL);
+	if (!ns_list)
+		return -ENOMEM;
+
+	for (i = 0; i < num_lists; i++) {
+		ret = nvme_identify_ns_list(ctrl, prev, ns_list);
+		if (ret)
+			goto out;
+
+		for (j = 0; j < min(nn, 1024U); j++) {
+			nsid = le32_to_cpu(ns_list[j]);
+			if (!nsid)
+				goto out;
+
+			nvme_validate_ns(ctrl, nsid);
+
+			while (++prev < nsid) {
+				ns = nvme_find_ns(ctrl, prev);
+				if (ns)
+					nvme_ns_remove(ns);
+			}
+		}
+		nn -= j;
+	}
+ out:
+	kfree(ns_list);
+	return ret;
+}
+
+static void __nvme_scan_namespaces(struct nvme_ctrl *ctrl, unsigned nn)
+{
+	struct nvme_ns *ns, *next;
+	unsigned i;
+
+	for (i = 1; i <= nn; i++)
+		nvme_validate_ns(ctrl, i);
+
+	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
+		if (ns->ns_id > nn)
+			nvme_ns_remove(ns);
+	}
+}
+
+void nvme_scan_namespaces(struct nvme_ctrl *ctrl)
+{
+	struct nvme_id_ctrl *id;
+	unsigned nn;
+
+	if (nvme_identify_ctrl(ctrl, &id))
+		return;
+
+	nn = le32_to_cpu(id->nn);
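+	/* the Identify namespace list (CNS 2) was added in NVMe 1.1 */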
+	if (ctrl->vs >= NVME_VS(1, 1)) {
+		if (!nvme_scan_ns_list(ctrl, nn))
+			goto done;
+	}
+	__nvme_scan_namespaces(ctrl, nn);
+ done:
+	list_sort(NULL, &ctrl->namespaces, ns_cmp);
+	kfree(id);
+}
+
+void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns, *next;
+
+	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list)
+		nvme_ns_remove(ns);
+}
+
+static DEFINE_IDA(nvme_instance_ida);
+
+static int nvme_set_instance(struct nvme_ctrl *ctrl)
+{
+	int instance, error;
+
+	do {
+		if (!ida_pre_get(&nvme_instance_ida, GFP_KERNEL))
+			return -ENODEV;
+
+		spin_lock(&dev_list_lock);
+		error = ida_get_new(&nvme_instance_ida, &instance);
+		spin_unlock(&dev_list_lock);
+	} while (error == -EAGAIN);
+
+	if (error)
+		return -ENODEV;
+
+	ctrl->instance = instance;
+	return 0;
+}
+
+static void nvme_release_instance(struct nvme_ctrl *ctrl)
+{
+	spin_lock(&dev_list_lock);
+	ida_remove(&nvme_instance_ida, ctrl->instance);
+	spin_unlock(&dev_list_lock);
+}
+
+void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
+{
+	device_remove_file(ctrl->device, &dev_attr_reset_controller);
+	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
+
+	spin_lock(&dev_list_lock);
+	list_del(&ctrl->node);
+	spin_unlock(&dev_list_lock);
+}
+
+static void nvme_free_ctrl(struct kref *kref)
+{
+	struct nvme_ctrl *ctrl = container_of(kref, struct nvme_ctrl, kref);
+
+	put_device(ctrl->device);
+	nvme_release_instance(ctrl);
+
+	ctrl->ops->free_ctrl(ctrl);
+}
+
+void nvme_put_ctrl(struct nvme_ctrl *ctrl)
+{
+	kref_put(&ctrl->kref, nvme_free_ctrl);
+}
+
+/*
+ * Initialize an NVMe controller structure.  This needs to be called during
+ * the earliest initialization so that we have the initialized structure
+ * around during probing.
+ */
+int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
+		const struct nvme_ctrl_ops *ops, u16 vendor,
+		unsigned long quirks)
+{
+	int ret;
+
+	INIT_LIST_HEAD(&ctrl->namespaces);
+	kref_init(&ctrl->kref);
+	ctrl->dev = dev;
+	ctrl->ops = ops;
+	ctrl->vendor = vendor;
+	ctrl->quirks = quirks;
+
+	ret = nvme_set_instance(ctrl);
+	if (ret)
+		goto out;
+
+	ctrl->device = device_create(nvme_class, ctrl->dev,
+				MKDEV(nvme_char_major, ctrl->instance),
+				dev, "nvme%d", ctrl->instance);
+	if (IS_ERR(ctrl->device)) {
+		ret = PTR_ERR(ctrl->device);
+		goto out_release_instance;
+	}
+	get_device(ctrl->device);
+	dev_set_drvdata(ctrl->device, ctrl);
+
+	ret = device_create_file(ctrl->device, &dev_attr_reset_controller);
+	if (ret)
+		goto out_put_device;
+
+	spin_lock(&dev_list_lock);
+	list_add_tail(&ctrl->node, &nvme_ctrl_list);
+	spin_unlock(&dev_list_lock);
+
+	return 0;
+
+out_put_device:
+	put_device(ctrl->device);
+	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
+out_release_instance:
+	nvme_release_instance(ctrl);
+out:
+	return ret;
+}
+
+int __init nvme_core_init(void)
+{
+	int result;
+
+	result = register_blkdev(nvme_major, "nvme");
+	if (result < 0)
+		return result;
+	else if (result > 0)
+		nvme_major = result;
+
+	result = __register_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme",
+							&nvme_dev_fops);
+	if (result < 0)
+		goto unregister_blkdev;
+	else if (result > 0)
+		nvme_char_major = result;
+
+	nvme_class = class_create(THIS_MODULE, "nvme");
+	if (IS_ERR(nvme_class)) {
+		result = PTR_ERR(nvme_class);
+		goto unregister_chrdev;
+	}
+
+	return 0;
+
+ unregister_chrdev:
+	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
+ unregister_blkdev:
+	unregister_blkdev(nvme_major, "nvme");
+	return result;
+}
+
+void nvme_core_exit(void)
+{
+	unregister_blkdev(nvme_major, "nvme");
+	class_destroy(nvme_class);
+	__unregister_chrdev(nvme_char_major, 0, NVME_MINORS, "nvme");
+}
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9444884..5c5f455 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1040,7 +1040,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	struct request *req;
 	int ret;
 
-	req = blk_mq_alloc_request(q, write, GFP_KERNEL, false);
+	req = blk_mq_alloc_request(q, write, 0);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
@@ -1093,7 +1093,8 @@ static int nvme_submit_async_admin_req(struct nvme_dev *dev)
 	struct nvme_cmd_info *cmd_info;
 	struct request *req;
 
-	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC, true);
+	req = blk_mq_alloc_request(dev->admin_q, WRITE,
+			BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
@@ -1118,7 +1119,7 @@ static int nvme_submit_admin_async_cmd(struct nvme_dev *dev,
 	struct request *req;
 	struct nvme_cmd_info *cmd_rq;
 
-	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+	req = blk_mq_alloc_request(dev->admin_q, WRITE, 0);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
@@ -1319,8 +1320,8 @@ static void nvme_abort_req(struct request *req)
 	if (!dev->abort_limit)
 		return;
 
-	abort_req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC,
-									false);
+	abort_req = blk_mq_alloc_request(dev->admin_q, WRITE,
+			BLK_MQ_REQ_NOWAIT);
 	if (IS_ERR(abort_req))
 		return;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index daf17d7..7fc9296 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -188,8 +188,14 @@ void blk_mq_insert_request(struct request *, bool, bool, bool);
 void blk_mq_free_request(struct request *rq);
 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
+
+enum {
+	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
+	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
+};
+
 struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
-		gfp_t gfp, bool reserved);
+		unsigned int flags);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag);
 struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags);
 
-- 
1.9.1

^ permalink raw reply	[flat|nested] 63+ messages in thread
