* [PATCH v11] Convert NVMe driver to blk-mq
From: Matias Bjørling @ 2014-07-26  9:07 UTC
  To: willy, keith.busch, sbradshaw, axboe, linux-kernel, linux-nvme,
	hch, rlnelson, tom.leiming
  Cc: Matias Bjørling

Hi all,

Thanks for the feedback. I've rebased the patch on top of 3.16-rc6 and applied
the current patches from Willy's master tree.

A branch with the patch on top can be found here:

  https://github.com/MatiasBjorling/linux-collab nvmemq_review

and the separate changes can be found in the nvmemq_v11 branch.

Changes since v10:
 * Rebased on top of Linus' v3.16-rc6.
 * Incorporated the feedback from Christoph:
    a. Inserted a comment describing the timeout flow.
    b. Moved tags into nvmeq instead of hctx.
    c. Moved initialization of tags and nvmeq outside of init_hctx.
    d. Refactored submission of commands in the request queue path.
    e. Fixed up WARN_ON and BUG_ON usage.
 * Fixed a missing blk_put_request during abort.
 * Converted the "Async event request" patch into the request model.

Changes since v9:
 * Rebased on top of Linus' v3.16-rc3.
 * Ming noted that we should remember to kick the request queue after a
   requeue (see the sketch after this list).
 * Jens noted a couple of superfluous warnings.
 * Christoph was removed from the contributions section; he will instead be
   credited with a Reviewed-by tag.
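
For reference, the requeue pattern Ming pointed out is shown in this minimal
sketch (3.16-era blk-mq API; the helper name is illustrative, not part of the
patch). A request parked with blk_mq_requeue_request() is not re-dispatched
until the requeue list is kicked:

  #include <linux/blk-mq.h>

  static void sketch_retry_request(struct request *req)
  {
  	/* park the request on its queue's requeue list */
  	blk_mq_requeue_request(req);
  	/* without this kick, the parked request is never re-dispatched */
  	blk_mq_kick_requeue_list(req->q);
  }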

Changes since v8:
 * QUEUE_FLAG_VIRT_HOLE was renamed to QUEUE_FLAG_SG_GAPS (see the sketch
   after this list)
 * A previous revert of patches had lost the IRQ affinity hint; it is restored
 * Removed test code in nvme_reset_notify
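
As a minimal sketch of how the renamed flag is used (it mirrors the queue
setup in the patch below; the helper name is illustrative), the driver simply
advertises that it cannot handle SG lists with gaps, and the block layer then
avoids building such requests:

  #include <linux/blkdev.h>

  static void sketch_advertise_no_sg_gaps(struct request_queue *q)
  {
  	/* NVMe PRP lists need virtually contiguous buffers, so ask the
  	 * block layer not to merge bios that would leave gaps */
  	queue_flag_set_unlocked(QUEUE_FLAG_SG_GAPS, q);
  }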

Changes since v7:
 * Jens implemented support for QUEUE_FLAG_VIRT_HOLE to limit
   requests to a contiguous range of virtual memory.
 * Keith fixed up the abort logic.
 * Usual style fixups

Changes since v6:
 * Rebased on top of Matthew's master and Jens' for-linus
 * A couple of style fixups

Changes since v5:
 * Splits are now supported directly within blk-mq
 * Remove nvme_queue->cpu_mask variable
 * Remove unnecessary null check
 * Style fixups

Changes since v4:
 * Fix timeout retries
 * Fix naming in nvme_init_hctx
 * Fix racy behavior of admin queue in nvme_dev_remove
 * Fix wrong return values in nvme_queue_request
 * Put cqe_seen back
 * Introduce abort_completion for killing timed out I/Os
 * Move locks outside of nvme_submit_iod
 * Various renaming and style fixes

Changes since v3:
 * Added abort logic
 * Fixed a possible race on abort
 * Removed req data with flush; now handled by blk-mq
 * Added safety check for submitting user rq to admin queue.
 * Use dev->online_queues for nr_hw_queues
 * Fix loop with initialization in nvme_create_io_queues
 * Style fixups

Changes since v2:
  * rebased on top of current 3.16/core.
  * use blk-mq queue management for spreading io queues
  * removed rcu handling and allocated all io queues up front for mgmt by blk-mq
  * removed the need for hotplugging notification
  * fixed flush data handling
  * fixed double free of spinlock
  * various cleanup

Matias Bjørling (1):
  NVMe: Convert to blk-mq

 drivers/block/nvme-core.c | 1303 ++++++++++++++++++---------------------------
 drivers/block/nvme-scsi.c |    8 +-
 include/linux/nvme.h      |   15 +-
 3 files changed, 534 insertions(+), 792 deletions(-)

-- 
1.9.1

* [PATCH v11] NVMe: Convert to blk-mq
From: Matias Bjørling @ 2014-07-26  9:07 UTC
  To: willy, keith.busch, sbradshaw, axboe, linux-kernel, linux-nvme,
	hch, rlnelson, tom.leiming
  Cc: Matias Bjørling

This converts the NVMe driver to a blk-mq request-based driver.

The NVMe driver is currently bio-based and implements queue logic within
itself. By using blk-mq, most of these responsibilities can be moved into the
block layer and simplified.

The patch is divided into the following blocks:

 * Per-command data and the cmdid have been moved into struct request. The
   per-command data can be retrieved using blk_mq_rq_to_pdu(), and id
   maintenance is now handled by blk-mq through the rq->tag field (a minimal
   sketch follows this list).

 * The logic for splitting bios has been moved into the blk-mq layer. The
   driver instead notifies the block layer about limited gap support in SG
   lists.

 * Timeouts are handled by blk-mq; the driver's handling is reimplemented
   within nvme_timeout(), covering both abort handling and command
   cancellation (see the second sketch after this list).

 * Assignment of nvme queues to CPUs is replaced with the blk-mq version. The
   current blk-mq strategy is to assign the number of mapped queues and CPUs
   to provide synergy, while the nvme driver assigns as many nvme hw queues
   as possible. This can be implemented in blk-mq if needed.

 * NVMe queues are merged with the tags structure of blk-mq.

 * blk-mq takes care of setup/teardown of nvme queues and guards invalid
   accesses. Therefore, RCU-usage for nvme queues can be removed.

 * IO tracing and accounting are handled by blk-mq and have therefore been
   removed from the driver.

 * Setup and teardown of nvme queues are now handled by nvme_[init/exit]_hctx.
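
A minimal sketch of the per-command data flow after the conversion (struct and
field names are taken from the patch below; the helper is illustrative, error
handling is elided, and nvme_completion_fn/struct nvme_command come from the
driver):

  #include <linux/blk-mq.h>
  #include <linux/nvme.h>

  /* Per-command pdu; blk-mq allocates it behind each struct request once
   * the tag set declares: tagset.cmd_size = sizeof(struct nvme_cmd_info). */
  struct nvme_cmd_info {
  	nvme_completion_fn fn;
  	void *ctx;
  	int aborted;
  	struct nvme_queue *nvmeq;
  };

  static void sketch_prep_command(struct request *req,
  				struct nvme_command *cmnd)
  {
  	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);

  	cmd->aborted = 0;
  	/* The blk-mq tag doubles as the NVMe command id, so no driver-side
  	 * cmdid bitmap is needed any more. */
  	cmnd->rw.command_id = req->tag;
  }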
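
And the timeout side, with the ops table and handler signature as used in this
patch (the abort call is elided; the sketch_* names are illustrative):

  static enum blk_eh_timer_return sketch_timeout(struct request *req)
  {
  	/* issue an NVMe abort for req here, as nvme_abort_req() does */

  	/* rearm the timer; a second expiry escalates to a controller reset */
  	return BLK_EH_RESET_TIMER;
  }

  static struct blk_mq_ops sketch_mq_ops = {
  	.queue_rq	= nvme_queue_rq,	/* from this patch */
  	.map_queue	= blk_mq_map_queue,
  	.timeout	= sketch_timeout,
  };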

Contributions in this patch from:

  Sam Bradshaw <sbradshaw@micron.com>
  Jens Axboe <axboe@fb.com>
  Keith Busch <keith.busch@intel.com>
  Robert Nelson <rlnelson@google.com>

Acked-by: Keith Busch <keith.busch@intel.com>
Acked-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 drivers/block/nvme-core.c | 1303 ++++++++++++++++++---------------------------
 drivers/block/nvme-scsi.c |    8 +-
 include/linux/nvme.h      |   15 +-
 3 files changed, 534 insertions(+), 792 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 28aec2d..384dc91 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -13,9 +13,9 @@
  */
 
 #include <linux/nvme.h>
-#include <linux/bio.h>
 #include <linux/bitops.h>
 #include <linux/blkdev.h>
+#include <linux/blk-mq.h>
 #include <linux/cpu.h>
 #include <linux/delay.h>
 #include <linux/errno.h>
@@ -33,7 +33,6 @@
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/pci.h>
-#include <linux/percpu.h>
 #include <linux/poison.h>
 #include <linux/ptrace.h>
 #include <linux/sched.h>
@@ -42,9 +41,8 @@
 #include <scsi/sg.h>
 #include <asm-generic/io-64-nonatomic-lo-hi.h>
 
-#include <trace/events/block.h>
-
 #define NVME_Q_DEPTH		1024
+#define NVME_AQ_DEPTH		64
 #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
 #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
 #define ADMIN_TIMEOUT		(admin_timeout * HZ)
@@ -76,10 +74,12 @@ static wait_queue_head_t nvme_kthread_wait;
 static struct notifier_block nvme_nb;
 
 static void nvme_reset_failed_dev(struct work_struct *ws);
+static int nvme_process_cq(struct nvme_queue *nvmeq);
 
 struct async_cmd_info {
 	struct kthread_work work;
 	struct kthread_worker *worker;
+	struct request *req;
 	u32 result;
 	int status;
 	void *ctx;
@@ -90,7 +90,6 @@ struct async_cmd_info {
  * commands and one for I/O commands).
  */
 struct nvme_queue {
-	struct rcu_head r_head;
 	struct device *q_dmadev;
 	struct nvme_dev *dev;
 	char irqname[24];	/* nvme4294967295-65535\0 */
@@ -99,10 +98,6 @@ struct nvme_queue {
 	volatile struct nvme_completion *cqes;
 	dma_addr_t sq_dma_addr;
 	dma_addr_t cq_dma_addr;
-	wait_queue_head_t sq_full;
-	wait_queue_t sq_cong_wait;
-	struct bio_list sq_cong;
-	struct list_head iod_bio;
 	u32 __iomem *q_db;
 	u16 q_depth;
 	u16 cq_vector;
@@ -113,9 +108,8 @@ struct nvme_queue {
 	u8 cq_phase;
 	u8 cqe_seen;
 	u8 q_suspended;
-	cpumask_var_t cpu_mask;
 	struct async_cmd_info cmdinfo;
-	unsigned long cmdid_data[];
+	struct blk_mq_tags *tags;
 };
 
 /*
@@ -143,62 +137,65 @@ typedef void (*nvme_completion_fn)(struct nvme_queue *, void *,
 struct nvme_cmd_info {
 	nvme_completion_fn fn;
 	void *ctx;
-	unsigned long timeout;
 	int aborted;
+	struct nvme_queue *nvmeq;
 };
 
-static struct nvme_cmd_info *nvme_cmd_info(struct nvme_queue *nvmeq)
+static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+				unsigned int hctx_idx)
 {
-	return (void *)&nvmeq->cmdid_data[BITS_TO_LONGS(nvmeq->q_depth)];
+	struct nvme_dev *dev = data;
+	struct nvme_queue *nvmeq = dev->queues[0];
+
+	hctx->driver_data = nvmeq;
+	return 0;
 }
 
-static unsigned nvme_queue_extra(int depth)
+static int nvme_admin_init_request(void *data, struct request *req,
+				unsigned int hctx_idx, unsigned int rq_idx,
+				unsigned int numa_node)
 {
-	return DIV_ROUND_UP(depth, 8) + (depth * sizeof(struct nvme_cmd_info));
+	struct nvme_dev *dev = data;
+	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = dev->queues[0];
+
+	BUG_ON(!nvmeq);
+	cmd->nvmeq = nvmeq;
+	return 0;
 }
 
-/**
- * alloc_cmdid() - Allocate a Command ID
- * @nvmeq: The queue that will be used for this command
- * @ctx: A pointer that will be passed to the handler
- * @handler: The function to call on completion
- *
- * Allocate a Command ID for a queue.  The data passed in will
- * be passed to the completion handler.  This is implemented by using
- * the bottom two bits of the ctx pointer to store the handler ID.
- * Passing in a pointer that's not 4-byte aligned will cause a BUG.
- * We can change this if it becomes a problem.
- *
- * May be called with local interrupts disabled and the q_lock held,
- * or with interrupts enabled and no locks held.
- */
-static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
-				nvme_completion_fn handler, unsigned timeout)
+static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+			  unsigned int hctx_idx)
 {
-	int depth = nvmeq->q_depth - 1;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-	int cmdid;
-
-	do {
-		cmdid = find_first_zero_bit(nvmeq->cmdid_data, depth);
-		if (cmdid >= depth)
-			return -EBUSY;
-	} while (test_and_set_bit(cmdid, nvmeq->cmdid_data));
-
-	info[cmdid].fn = handler;
-	info[cmdid].ctx = ctx;
-	info[cmdid].timeout = jiffies + timeout;
-	info[cmdid].aborted = 0;
-	return cmdid;
+	struct nvme_dev *dev = data;
+	struct nvme_queue *nvmeq = dev->queues[
+					(hctx_idx % dev->queue_count) + 1];
+
+	irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
+								hctx->cpumask);
+	hctx->driver_data = nvmeq;
+	return 0;
+}
+
+static int nvme_init_request(void *data, struct request *req,
+				unsigned int hctx_idx, unsigned int rq_idx,
+				unsigned int numa_node)
+{
+	struct nvme_dev *dev = data;
+	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];
+
+	BUG_ON(!nvmeq);
+	cmd->nvmeq = nvmeq;
+	return 0;
 }
 
-static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
-				nvme_completion_fn handler, unsigned timeout)
+static void nvme_set_info(struct nvme_cmd_info *cmd, void *ctx,
+				nvme_completion_fn handler)
 {
-	int cmdid;
-	wait_event_killable(nvmeq->sq_full,
-		(cmdid = alloc_cmdid(nvmeq, ctx, handler, timeout)) >= 0);
-	return (cmdid < 0) ? -EINTR : cmdid;
+	cmd->fn = handler;
+	cmd->ctx = ctx;
+	cmd->aborted = 0;
 }
 
 /* Special values must be less than 0x1000 */
@@ -206,18 +203,12 @@ static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
 #define CMD_CTX_CANCELLED	(0x30C + CMD_CTX_BASE)
 #define CMD_CTX_COMPLETED	(0x310 + CMD_CTX_BASE)
 #define CMD_CTX_INVALID		(0x314 + CMD_CTX_BASE)
-#define CMD_CTX_ABORT		(0x318 + CMD_CTX_BASE)
-#define CMD_CTX_ASYNC		(0x31C + CMD_CTX_BASE)
 
 static void special_completion(struct nvme_queue *nvmeq, void *ctx,
 						struct nvme_completion *cqe)
 {
 	if (ctx == CMD_CTX_CANCELLED)
 		return;
-	if (ctx == CMD_CTX_ABORT) {
-		++nvmeq->dev->abort_limit;
-		return;
-	}
 	if (ctx == CMD_CTX_COMPLETED) {
 		dev_warn(nvmeq->q_dmadev,
 				"completed id %d twice on queue %d\n",
@@ -230,21 +221,52 @@ static void special_completion(struct nvme_queue *nvmeq, void *ctx,
 				cqe->command_id, le16_to_cpup(&cqe->sq_id));
 		return;
 	}
-	if (ctx == CMD_CTX_ASYNC) {
-		u32 result = le32_to_cpup(&cqe->result);
-		u16 status = le16_to_cpup(&cqe->status) >> 1;
-
-		if (status == NVME_SC_SUCCESS || status == NVME_SC_ABORT_REQ)
-			++nvmeq->dev->event_limit;
-		if (status == NVME_SC_SUCCESS)
-			dev_warn(nvmeq->q_dmadev,
-				"async event result %08x\n", result);
-		return;
-	}
-
 	dev_warn(nvmeq->q_dmadev, "Unknown special completion %p\n", ctx);
 }
 
+static void *cancel_cmd_info(struct nvme_cmd_info *cmd, nvme_completion_fn *fn)
+{
+	void *ctx;
+
+	if (fn)
+		*fn = cmd->fn;
+	ctx = cmd->ctx;
+	cmd->fn = special_completion;
+	cmd->ctx = CMD_CTX_CANCELLED;
+	return ctx;
+}
+
+static void async_req_completion(struct nvme_queue *nvmeq, void *ctx,
+						struct nvme_completion *cqe)
+{
+	struct request *req = ctx;
+
+	u32 result = le32_to_cpup(&cqe->result);
+	u16 status = le16_to_cpup(&cqe->status) >> 1;
+
+	if (status == NVME_SC_SUCCESS || status == NVME_SC_ABORT_REQ)
+		++nvmeq->dev->event_limit;
+	if (status == NVME_SC_SUCCESS)
+		dev_warn(nvmeq->q_dmadev,
+			"async event result %08x\n", result);
+
+	blk_put_request(req);
+}
+
+static void abort_completion(struct nvme_queue *nvmeq, void *ctx,
+						struct nvme_completion *cqe)
+{
+	struct request *req = ctx;
+
+	u16 status = le16_to_cpup(&cqe->status) >> 1;
+	u32 result = le32_to_cpup(&cqe->result);
+
+	blk_put_request(req);
+
+	dev_warn(nvmeq->q_dmadev, "Abort status:%x result:%x", status, result);
+	++nvmeq->dev->abort_limit;
+}
+
 static void async_completion(struct nvme_queue *nvmeq, void *ctx,
 						struct nvme_completion *cqe)
 {
@@ -252,90 +274,37 @@ static void async_completion(struct nvme_queue *nvmeq, void *ctx,
 	cmdinfo->result = le32_to_cpup(&cqe->result);
 	cmdinfo->status = le16_to_cpup(&cqe->status) >> 1;
 	queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
+	blk_put_request(cmdinfo->req);
+}
+
+static inline struct nvme_cmd_info *get_cmd_from_tag(struct nvme_queue *nvmeq,
+				  unsigned int tag)
+{
+	struct request *req = blk_mq_tag_to_rq(nvmeq->tags, tag);
+
+	return blk_mq_rq_to_pdu(req);
 }
 
 /*
  * Called with local interrupts disabled and the q_lock held.  May not sleep.
  */
-static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *nvme_finish_cmd(struct nvme_queue *nvmeq, int tag,
 						nvme_completion_fn *fn)
 {
+	struct nvme_cmd_info *cmd = get_cmd_from_tag(nvmeq, tag);
 	void *ctx;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-
-	if (cmdid >= nvmeq->q_depth || !info[cmdid].fn) {
-		if (fn)
-			*fn = special_completion;
+	if (tag >= nvmeq->q_depth) {
+		*fn = special_completion;
 		return CMD_CTX_INVALID;
 	}
 	if (fn)
-		*fn = info[cmdid].fn;
-	ctx = info[cmdid].ctx;
-	info[cmdid].fn = special_completion;
-	info[cmdid].ctx = CMD_CTX_COMPLETED;
-	clear_bit(cmdid, nvmeq->cmdid_data);
-	wake_up(&nvmeq->sq_full);
+		*fn = cmd->fn;
+	ctx = cmd->ctx;
+	cmd->fn = special_completion;
+	cmd->ctx = CMD_CTX_COMPLETED;
 	return ctx;
 }
 
-static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid,
-						nvme_completion_fn *fn)
-{
-	void *ctx;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-	if (fn)
-		*fn = info[cmdid].fn;
-	ctx = info[cmdid].ctx;
-	info[cmdid].fn = special_completion;
-	info[cmdid].ctx = CMD_CTX_CANCELLED;
-	return ctx;
-}
-
-static struct nvme_queue *raw_nvmeq(struct nvme_dev *dev, int qid)
-{
-	return rcu_dereference_raw(dev->queues[qid]);
-}
-
-static struct nvme_queue *get_nvmeq(struct nvme_dev *dev) __acquires(RCU)
-{
-	struct nvme_queue *nvmeq;
-	unsigned queue_id = get_cpu_var(*dev->io_queue);
-
-	rcu_read_lock();
-	nvmeq = rcu_dereference(dev->queues[queue_id]);
-	if (nvmeq)
-		return nvmeq;
-
-	rcu_read_unlock();
-	put_cpu_var(*dev->io_queue);
-	return NULL;
-}
-
-static void put_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
-	rcu_read_unlock();
-	put_cpu_var(nvmeq->dev->io_queue);
-}
-
-static struct nvme_queue *lock_nvmeq(struct nvme_dev *dev, int q_idx)
-							__acquires(RCU)
-{
-	struct nvme_queue *nvmeq;
-
-	rcu_read_lock();
-	nvmeq = rcu_dereference(dev->queues[q_idx]);
-	if (nvmeq)
-		return nvmeq;
-
-	rcu_read_unlock();
-	return NULL;
-}
-
-static void unlock_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
-	rcu_read_unlock();
-}
-
 /**
  * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
  * @nvmeq: The queue to use
@@ -343,24 +312,31 @@ static void unlock_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
  *
  * Safe to use from interrupt context
  */
-static int nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
+static int __nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
 {
-	unsigned long flags;
 	u16 tail;
-	spin_lock_irqsave(&nvmeq->q_lock, flags);
-	if (nvmeq->q_suspended) {
-		spin_unlock_irqrestore(&nvmeq->q_lock, flags);
+
+	if (nvmeq->q_suspended)
 		return -EBUSY;
-	}
+
 	tail = nvmeq->sq_tail;
 	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
 	writel(tail, nvmeq->q_db);
 	nvmeq->sq_tail = tail;
+
+	return 0;
+}
+
+static int nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
+{
+	unsigned long flags;
+	int ret;
+	spin_lock_irqsave(&nvmeq->q_lock, flags);
+	ret = __nvme_submit_cmd(nvmeq, cmd);
 	spin_unlock_irqrestore(&nvmeq->q_lock, flags);
-
-	return 0;
+	return ret;
 }
 
 static __le64 **iod_list(struct nvme_iod *iod)
@@ -392,7 +368,6 @@ nvme_alloc_iod(unsigned nseg, unsigned nbytes, struct nvme_dev *dev, gfp_t gfp)
 		iod->length = nbytes;
 		iod->nents = 0;
 		iod->first_dma = 0ULL;
-		iod->start_time = jiffies;
 	}
 
 	return iod;
@@ -416,65 +391,37 @@ void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod)
 	kfree(iod);
 }
 
-static void nvme_start_io_acct(struct bio *bio)
-{
-	struct gendisk *disk = bio->bi_bdev->bd_disk;
-	if (blk_queue_io_stat(disk->queue)) {
-		const int rw = bio_data_dir(bio);
-		int cpu = part_stat_lock();
-		part_round_stats(cpu, &disk->part0);
-		part_stat_inc(cpu, &disk->part0, ios[rw]);
-		part_stat_add(cpu, &disk->part0, sectors[rw],
-							bio_sectors(bio));
-		part_inc_in_flight(&disk->part0, rw);
-		part_stat_unlock();
-	}
-}
-
-static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
-{
-	struct gendisk *disk = bio->bi_bdev->bd_disk;
-	if (blk_queue_io_stat(disk->queue)) {
-		const int rw = bio_data_dir(bio);
-		unsigned long duration = jiffies - start_time;
-		int cpu = part_stat_lock();
-		part_stat_add(cpu, &disk->part0, ticks[rw], duration);
-		part_round_stats(cpu, &disk->part0);
-		part_dec_in_flight(&disk->part0, rw);
-		part_stat_unlock();
-	}
-}
-
-static void bio_completion(struct nvme_queue *nvmeq, void *ctx,
+static void req_completion(struct nvme_queue *nvmeq, void *ctx,
 						struct nvme_completion *cqe)
 {
 	struct nvme_iod *iod = ctx;
-	struct bio *bio = iod->private;
+	struct request *req = iod->private;
+	struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+
 	u16 status = le16_to_cpup(&cqe->status) >> 1;
-	int error = 0;
 
 	if (unlikely(status)) {
-		if (!(status & NVME_SC_DNR ||
-				bio->bi_rw & REQ_FAILFAST_MASK) &&
-				(jiffies - iod->start_time) < IOD_TIMEOUT) {
-			if (!waitqueue_active(&nvmeq->sq_full))
-				add_wait_queue(&nvmeq->sq_full,
-							&nvmeq->sq_cong_wait);
-			list_add_tail(&iod->node, &nvmeq->iod_bio);
-			wake_up(&nvmeq->sq_full);
+		if (!(status & NVME_SC_DNR || blk_noretry_request(req))
+		    && (jiffies - req->start_time) < req->timeout) {
+			blk_mq_requeue_request(req);
+			blk_mq_kick_requeue_list(req->q);
 			return;
 		}
-		error = -EIO;
-	}
-	if (iod->nents) {
-		dma_unmap_sg(nvmeq->q_dmadev, iod->sg, iod->nents,
-			bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
-		nvme_end_io_acct(bio, iod->start_time);
-	}
+		req->errors = -EIO;
+	} else
+		req->errors = 0;
+
+	if (cmd_rq->aborted)
+		dev_warn(&nvmeq->dev->pci_dev->dev,
+			"completing aborted command with status:%04x\n",
+			status);
+
+	if (iod->nents)
+		dma_unmap_sg(&nvmeq->dev->pci_dev->dev, iod->sg, iod->nents,
+			rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
 	nvme_free_iod(nvmeq->dev, iod);
 
-	trace_block_bio_complete(bdev_get_queue(bio->bi_bdev), bio, error);
-	bio_endio(bio, error);
+	blk_mq_complete_request(req);
 }
 
 /* length is in bytes.  gfp flags indicates whether we may sleep. */
@@ -557,88 +504,25 @@ int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod, int total_len,
 	return total_len;
 }
 
-static int nvme_split_and_submit(struct bio *bio, struct nvme_queue *nvmeq,
-				 int len)
-{
-	struct bio *split = bio_split(bio, len >> 9, GFP_ATOMIC, NULL);
-	if (!split)
-		return -ENOMEM;
-
-	trace_block_split(bdev_get_queue(bio->bi_bdev), bio,
-					split->bi_iter.bi_sector);
-	bio_chain(split, bio);
-
-	if (!waitqueue_active(&nvmeq->sq_full))
-		add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
-	bio_list_add(&nvmeq->sq_cong, split);
-	bio_list_add(&nvmeq->sq_cong, bio);
-	wake_up(&nvmeq->sq_full);
-
-	return 0;
-}
-
-/* NVMe scatterlists require no holes in the virtual address */
-#define BIOVEC_NOT_VIRT_MERGEABLE(vec1, vec2)	((vec2)->bv_offset || \
-			(((vec1)->bv_offset + (vec1)->bv_len) % PAGE_SIZE))
-
-static int nvme_map_bio(struct nvme_queue *nvmeq, struct nvme_iod *iod,
-		struct bio *bio, enum dma_data_direction dma_dir, int psegs)
-{
-	struct bio_vec bvec, bvprv;
-	struct bvec_iter iter;
-	struct scatterlist *sg = NULL;
-	int length = 0, nsegs = 0, split_len = bio->bi_iter.bi_size;
-	int first = 1;
-
-	if (nvmeq->dev->stripe_size)
-		split_len = nvmeq->dev->stripe_size -
-			((bio->bi_iter.bi_sector << 9) &
-			 (nvmeq->dev->stripe_size - 1));
-
-	sg_init_table(iod->sg, psegs);
-	bio_for_each_segment(bvec, bio, iter) {
-		if (!first && BIOVEC_PHYS_MERGEABLE(&bvprv, &bvec)) {
-			sg->length += bvec.bv_len;
-		} else {
-			if (!first && BIOVEC_NOT_VIRT_MERGEABLE(&bvprv, &bvec))
-				return nvme_split_and_submit(bio, nvmeq,
-							     length);
-
-			sg = sg ? sg + 1 : iod->sg;
-			sg_set_page(sg, bvec.bv_page,
-				    bvec.bv_len, bvec.bv_offset);
-			nsegs++;
-		}
-
-		if (split_len - length < bvec.bv_len)
-			return nvme_split_and_submit(bio, nvmeq, split_len);
-		length += bvec.bv_len;
-		bvprv = bvec;
-		first = 0;
-	}
-	iod->nents = nsegs;
-	sg_mark_end(sg);
-	if (dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir) == 0)
-		return -ENOMEM;
-
-	BUG_ON(length != bio->bi_iter.bi_size);
-	return length;
-}
-
-static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
-		struct bio *bio, struct nvme_iod *iod, int cmdid)
+/*
+ * We reuse the small pool to allocate the 16-byte range here as it is not
+ * worth having a special pool for these or additional cases to handle freeing
+ * the iod.
+ */
+static void nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+		struct request *req, struct nvme_iod *iod)
 {
 	struct nvme_dsm_range *range =
 				(struct nvme_dsm_range *)iod_list(iod)[0];
 	struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
 
 	range->cattr = cpu_to_le32(0);
-	range->nlb = cpu_to_le32(bio->bi_iter.bi_size >> ns->lba_shift);
-	range->slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
+	range->nlb = cpu_to_le32(blk_rq_bytes(req) >> ns->lba_shift);
+	range->slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->dsm.opcode = nvme_cmd_dsm;
-	cmnd->dsm.command_id = cmdid;
+	cmnd->dsm.command_id = req->tag;
 	cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
 	cmnd->dsm.prp1 = cpu_to_le64(iod->first_dma);
 	cmnd->dsm.nr = 0;
@@ -647,11 +531,9 @@ static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	if (++nvmeq->sq_tail == nvmeq->q_depth)
 		nvmeq->sq_tail = 0;
 	writel(nvmeq->sq_tail, nvmeq->q_db);
-
-	return 0;
 }
 
-static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+static void nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 								int cmdid)
 {
 	struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
@@ -664,49 +546,34 @@ static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	if (++nvmeq->sq_tail == nvmeq->q_depth)
 		nvmeq->sq_tail = 0;
 	writel(nvmeq->sq_tail, nvmeq->q_db);
-
-	return 0;
 }
 
-static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
+static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
+							struct nvme_ns *ns)
 {
-	struct bio *bio = iod->private;
-	struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
+	struct request *req = iod->private;
 	struct nvme_command *cmnd;
-	int cmdid;
-	u16 control;
-	u32 dsmgmt;
+	u16 control = 0;
+	u32 dsmgmt = 0;
 
-	cmdid = alloc_cmdid(nvmeq, iod, bio_completion, NVME_IO_TIMEOUT);
-	if (unlikely(cmdid < 0))
-		return cmdid;
-
-	if (bio->bi_rw & REQ_DISCARD)
-		return nvme_submit_discard(nvmeq, ns, bio, iod, cmdid);
-	if (bio->bi_rw & REQ_FLUSH)
-		return nvme_submit_flush(nvmeq, ns, cmdid);
-
-	control = 0;
-	if (bio->bi_rw & REQ_FUA)
+	if (req->cmd_flags & REQ_FUA)
 		control |= NVME_RW_FUA;
-	if (bio->bi_rw & (REQ_FAILFAST_DEV | REQ_RAHEAD))
+	if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
 		control |= NVME_RW_LR;
 
-	dsmgmt = 0;
-	if (bio->bi_rw & REQ_RAHEAD)
+	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
 	cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
 	memset(cmnd, 0, sizeof(*cmnd));
 
-	cmnd->rw.opcode = bio_data_dir(bio) ? nvme_cmd_write : nvme_cmd_read;
-	cmnd->rw.command_id = cmdid;
+	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
+	cmnd->rw.command_id = req->tag;
 	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
 	cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
 	cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
-	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
-	cmnd->rw.length =
-		cpu_to_le16((bio->bi_iter.bi_size >> ns->lba_shift) - 1);
+	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 	cmnd->rw.control = cpu_to_le16(control);
 	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
 
@@ -717,45 +584,32 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
 	return 0;
 }
 
-static int nvme_split_flush_data(struct nvme_queue *nvmeq, struct bio *bio)
-{
-	struct bio *split = bio_clone(bio, GFP_ATOMIC);
-	if (!split)
-		return -ENOMEM;
-
-	split->bi_iter.bi_size = 0;
-	split->bi_phys_segments = 0;
-	bio->bi_rw &= ~REQ_FLUSH;
-	bio_chain(split, bio);
-
-	if (!waitqueue_active(&nvmeq->sq_full))
-		add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
-	bio_list_add(&nvmeq->sq_cong, split);
-	bio_list_add(&nvmeq->sq_cong, bio);
-	wake_up_process(nvme_thread);
-
-	return 0;
-}
-
-/*
- * Called with local interrupts disabled and the q_lock held.  May not sleep.
- */
-static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
-								struct bio *bio)
+static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
+	struct nvme_ns *ns = hctx->queue->queuedata;
+	struct nvme_queue *nvmeq = hctx->driver_data;
+	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
 	struct nvme_iod *iod;
-	int psegs = bio_phys_segments(ns->queue, bio);
-	int result;
+	enum dma_data_direction dma_dir;
+	int psegs = req->nr_phys_segments;
+	int result = BLK_MQ_RQ_QUEUE_BUSY;
+	/*
+	 * Requeued IO has already been prepped
+	 */
+	iod = req->special;
+	if (iod)
+		goto submit_iod;
 
-	if ((bio->bi_rw & REQ_FLUSH) && psegs)
-		return nvme_split_flush_data(nvmeq, bio);
-
-	iod = nvme_alloc_iod(psegs, bio->bi_iter.bi_size, ns->dev, GFP_ATOMIC);
+	iod = nvme_alloc_iod(psegs, blk_rq_bytes(req), ns->dev, GFP_ATOMIC);
 	if (!iod)
-		return -ENOMEM;
+		return result;
 
-	iod->private = bio;
-	if (bio->bi_rw & REQ_DISCARD) {
+	iod->private = req;
+	req->special = iod;
+
+	nvme_set_info(cmd, iod, req_completion);
+
+	if (req->cmd_flags & REQ_DISCARD) {
 		void *range;
 		/*
 		 * We reuse the small pool to allocate the 16-byte range here
@@ -765,33 +619,49 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 		range = dma_pool_alloc(nvmeq->dev->prp_small_pool,
 						GFP_ATOMIC,
 						&iod->first_dma);
-		if (!range) {
-			result = -ENOMEM;
-			goto free_iod;
-		}
+		if (!range)
+			goto finish_cmd;
 		iod_list(iod)[0] = (__le64 *)range;
 		iod->npages = 0;
 	} else if (psegs) {
-		result = nvme_map_bio(nvmeq, iod, bio,
-			bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
-			psegs);
-		if (result <= 0)
-			goto free_iod;
-		if (nvme_setup_prps(nvmeq->dev, iod, result, GFP_ATOMIC) !=
-								result) {
-			result = -ENOMEM;
-			goto free_iod;
+		dma_dir = rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+
+		sg_init_table(iod->sg, psegs);
+		iod->nents = blk_rq_map_sg(req->q, req, iod->sg);
+		if (!iod->nents) {
+			result = BLK_MQ_RQ_QUEUE_ERROR;
+			goto finish_cmd;
 		}
-		nvme_start_io_acct(bio);
+
+		if (!dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir))
+			goto finish_cmd;
+
+		if (blk_rq_bytes(req) != nvme_setup_prps(nvmeq->dev, iod,
+						blk_rq_bytes(req), GFP_ATOMIC))
+			goto finish_cmd;
 	}
-	if (unlikely(nvme_submit_iod(nvmeq, iod))) {
-		if (!waitqueue_active(&nvmeq->sq_full))
-			add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
-		list_add_tail(&iod->node, &nvmeq->iod_bio);
+
+ submit_iod:
+	spin_lock_irq(&nvmeq->q_lock);
+	if (nvmeq->q_suspended) {
+		spin_unlock_irq(&nvmeq->q_lock);
+		goto finish_cmd;
 	}
-	return 0;
 
- free_iod:
+	if (req->cmd_flags & REQ_DISCARD)
+		nvme_submit_discard(nvmeq, ns, req, iod);
+	else if (req->cmd_flags & REQ_FLUSH)
+		nvme_submit_flush(nvmeq, ns, req->tag);
+	else
+		nvme_submit_iod(nvmeq, iod, ns);
+
+ queued:
+	nvme_process_cq(nvmeq);
+	spin_unlock_irq(&nvmeq->q_lock);
+	return BLK_MQ_RQ_QUEUE_OK;
+
+ finish_cmd:
+	nvme_finish_cmd(nvmeq, req->tag, NULL);
 	nvme_free_iod(nvmeq->dev, iod);
 	return result;
 }
@@ -814,8 +684,7 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
 			head = 0;
 			phase = !phase;
 		}
-
-		ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
+		ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
 		fn(nvmeq, ctx, &cqe);
 	}
 
@@ -836,29 +705,12 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
 	return 1;
 }
 
-static void nvme_make_request(struct request_queue *q, struct bio *bio)
+/* Admin queue isn't initialized as a request queue. If at some point this
+ * happens anyway, make sure to notify the user */
+static int nvme_admin_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
-	struct nvme_ns *ns = q->queuedata;
-	struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
-	int result = -EBUSY;
-
-	if (!nvmeq) {
-		bio_endio(bio, -EIO);
-		return;
-	}
-
-	spin_lock_irq(&nvmeq->q_lock);
-	if (!nvmeq->q_suspended && bio_list_empty(&nvmeq->sq_cong))
-		result = nvme_submit_bio_queue(nvmeq, ns, bio);
-	if (unlikely(result)) {
-		if (!waitqueue_active(&nvmeq->sq_full))
-			add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
-		bio_list_add(&nvmeq->sq_cong, bio);
-	}
-
-	nvme_process_cq(nvmeq);
-	spin_unlock_irq(&nvmeq->q_lock);
-	put_nvmeq(nvmeq);
+	WARN_ON_ONCE(1);
+	return BLK_MQ_RQ_QUEUE_ERROR;
 }
 
 static irqreturn_t nvme_irq(int irq, void *data)
@@ -882,10 +734,11 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_WAKE_THREAD;
 }
 
-static void nvme_abort_command(struct nvme_queue *nvmeq, int cmdid)
+static void nvme_abort_cmd_info(struct nvme_queue *nvmeq, struct nvme_cmd_info *
+								cmd_info)
 {
 	spin_lock_irq(&nvmeq->q_lock);
-	cancel_cmdid(nvmeq, cmdid, NULL);
+	cancel_cmd_info(cmd_info, NULL);
 	spin_unlock_irq(&nvmeq->q_lock);
 }
 
@@ -908,45 +761,31 @@ static void sync_completion(struct nvme_queue *nvmeq, void *ctx,
  * Returns 0 on success.  If the result is negative, it's a Linux error code;
  * if the result is positive, it's an NVM Express status code
  */
-static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
-						struct nvme_command *cmd,
+static int nvme_submit_sync_cmd(struct request *req, struct nvme_command *cmd,
 						u32 *result, unsigned timeout)
 {
-	int cmdid, ret;
+	int ret;
 	struct sync_cmd_info cmdinfo;
-	struct nvme_queue *nvmeq;
-
-	nvmeq = lock_nvmeq(dev, q_idx);
-	if (!nvmeq)
-		return -ENODEV;
+	struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = cmd_rq->nvmeq;
 
 	cmdinfo.task = current;
 	cmdinfo.status = -EINTR;
 
-	cmdid = alloc_cmdid(nvmeq, &cmdinfo, sync_completion, timeout);
-	if (cmdid < 0) {
-		unlock_nvmeq(nvmeq);
-		return cmdid;
-	}
-	cmd->common.command_id = cmdid;
+	cmd->common.command_id = req->tag;
+
+	nvme_set_info(cmd_rq, &cmdinfo, sync_completion);
 
 	set_current_state(TASK_KILLABLE);
 	ret = nvme_submit_cmd(nvmeq, cmd);
 	if (ret) {
-		free_cmdid(nvmeq, cmdid, NULL);
-		unlock_nvmeq(nvmeq);
+		nvme_finish_cmd(nvmeq, req->tag, NULL);
 		set_current_state(TASK_RUNNING);
-		return ret;
 	}
-	unlock_nvmeq(nvmeq);
 	schedule_timeout(timeout);
 
 	if (cmdinfo.status == -EINTR) {
-		nvmeq = lock_nvmeq(dev, q_idx);
-		if (nvmeq) {
-			nvme_abort_command(nvmeq, cmdid);
-			unlock_nvmeq(nvmeq);
-		}
+		nvme_abort_cmd_info(nvmeq, blk_mq_rq_to_pdu(req));
 		return -EINTR;
 	}
 
@@ -956,59 +795,99 @@ static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
 	return cmdinfo.status;
 }
 
-static int nvme_submit_async_cmd(struct nvme_queue *nvmeq,
+static int nvme_submit_async_admin_req(struct nvme_dev *dev)
+{
+	struct nvme_queue *nvmeq = dev->queues[0];
+	struct nvme_command c;
+	struct nvme_cmd_info *cmd_info;
+	struct request *req;
+
+	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+	if (!req)
+		return -ENOMEM;
+
+	cmd_info = blk_mq_rq_to_pdu(req);
+	nvme_set_info(cmd_info, req, async_req_completion);
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = nvme_admin_async_event;
+	c.common.command_id = req->tag;
+
+	return __nvme_submit_cmd(nvmeq, &c);
+}
+
+static int nvme_submit_admin_async_cmd(struct nvme_dev *dev,
 			struct nvme_command *cmd,
 			struct async_cmd_info *cmdinfo, unsigned timeout)
 {
-	int cmdid;
+	struct nvme_queue *nvmeq = dev->queues[0];
+	struct request *req;
+	struct nvme_cmd_info *cmd_rq;
 
-	cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, async_completion, timeout);
-	if (cmdid < 0)
-		return cmdid;
+	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+	if (!req)
+		return -ENOMEM;
+
+	req->timeout = timeout;
+	cmd_rq = blk_mq_rq_to_pdu(req);
+	cmdinfo->req = req;
+	nvme_set_info(cmd_rq, cmdinfo, async_completion);
 	cmdinfo->status = -EINTR;
-	cmd->common.command_id = cmdid;
+
+	cmd->common.command_id = req->tag;
+
 	return nvme_submit_cmd(nvmeq, cmd);
 }
 
+int __nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
+						u32 *result, unsigned timeout)
+{
+	int res;
+	struct request *req;
+
+	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+	if (!req)
+		return -ENOMEM;
+	res = nvme_submit_sync_cmd(req, cmd, result, timeout);
+	blk_put_request(req);
+	return res;
+}
+
 int nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
 								u32 *result)
 {
-	return nvme_submit_sync_cmd(dev, 0, cmd, result, ADMIN_TIMEOUT);
+	return __nvme_submit_admin_cmd(dev, cmd, result, ADMIN_TIMEOUT);
 }
 
-int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
-								u32 *result)
+int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_ns *ns,
+					struct nvme_command *cmd, u32 *result)
 {
-	return nvme_submit_sync_cmd(dev, smp_processor_id() + 1, cmd, result,
-							NVME_IO_TIMEOUT);
-}
+	int res;
+	struct request *req;
 
-static int nvme_submit_admin_cmd_async(struct nvme_dev *dev,
-		struct nvme_command *cmd, struct async_cmd_info *cmdinfo)
-{
-	return nvme_submit_async_cmd(raw_nvmeq(dev, 0), cmd, cmdinfo,
-								ADMIN_TIMEOUT);
+	req = blk_mq_alloc_request(ns->queue, WRITE, (GFP_KERNEL|__GFP_WAIT),
+									false);
+	if (!req)
+		return -ENOMEM;
+	res = nvme_submit_sync_cmd(req, cmd, result, NVME_IO_TIMEOUT);
+	blk_put_request(req);
+	return res;
 }
 
 static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
 {
-	int status;
 	struct nvme_command c;
 
 	memset(&c, 0, sizeof(c));
 	c.delete_queue.opcode = opcode;
 	c.delete_queue.qid = cpu_to_le16(id);
 
-	status = nvme_submit_admin_cmd(dev, &c, NULL);
-	if (status)
-		return -EIO;
-	return 0;
+	return nvme_submit_admin_cmd(dev, &c, NULL);
 }
 
 static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 						struct nvme_queue *nvmeq)
 {
-	int status;
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
 
@@ -1020,16 +899,12 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cq_flags = cpu_to_le16(flags);
 	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
 
-	status = nvme_submit_admin_cmd(dev, &c, NULL);
-	if (status)
-		return -EIO;
-	return 0;
+	return nvme_submit_admin_cmd(dev, &c, NULL);
 }
 
 static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
 						struct nvme_queue *nvmeq)
 {
-	int status;
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_SQ_PRIO_MEDIUM;
 
@@ -1041,10 +916,7 @@ static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
 	c.create_sq.sq_flags = cpu_to_le16(flags);
 	c.create_sq.cqid = cpu_to_le16(qid);
 
-	status = nvme_submit_admin_cmd(dev, &c, NULL);
-	if (status)
-		return -EIO;
-	return 0;
+	return nvme_submit_admin_cmd(dev, &c, NULL);
 }
 
 static int adapter_delete_cq(struct nvme_dev *dev, u16 cqid)
@@ -1100,28 +972,27 @@ int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
 }
 
 /**
- * nvme_abort_cmd - Attempt aborting a command
- * @cmdid: Command id of a timed out IO
- * @queue: The queue with timed out IO
+ * nvme_abort_req - Attempt aborting a request
  *
  * Schedule controller reset if the command was already aborted once before and
  * still hasn't been returned to the driver, or if this is the admin queue.
  */
-static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
+static void nvme_abort_req(struct request *req)
 {
-	int a_cmdid;
+	struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = cmd_rq->nvmeq;
+	struct nvme_dev *dev = nvmeq->dev;
+	struct request *abort_req;
+	struct nvme_cmd_info *abort_cmd;
 	struct nvme_command cmd;
-	struct nvme_dev *dev = nvmeq->dev;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-	struct nvme_queue *adminq;
 
-	if (!nvmeq->qid || info[cmdid].aborted) {
+	if (!nvmeq->qid || cmd_rq->aborted) {
 		if (work_busy(&dev->reset_work))
 			return;
 		list_del_init(&dev->node);
 		dev_warn(&dev->pci_dev->dev,
-			"I/O %d QID %d timeout, reset controller\n", cmdid,
-								nvmeq->qid);
+			"I/O %d QID %d timeout, reset controller\n",
+							req->tag, nvmeq->qid);
 		dev->reset_workfn = nvme_reset_failed_dev;
 		queue_work(nvme_workq, &dev->reset_work);
 		return;
@@ -1130,91 +1001,93 @@ static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
 	if (!dev->abort_limit)
 		return;
 
-	adminq = rcu_dereference(dev->queues[0]);
-	a_cmdid = alloc_cmdid(adminq, CMD_CTX_ABORT, special_completion,
-								ADMIN_TIMEOUT);
-	if (a_cmdid < 0)
+	abort_req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC,
+									false);
+	if (!abort_req)
 		return;
 
+	abort_cmd = blk_mq_rq_to_pdu(abort_req);
+	nvme_set_info(abort_cmd, abort_req, abort_completion);
+
 	memset(&cmd, 0, sizeof(cmd));
 	cmd.abort.opcode = nvme_admin_abort_cmd;
-	cmd.abort.cid = cmdid;
+	cmd.abort.cid = req->tag;
 	cmd.abort.sqid = cpu_to_le16(nvmeq->qid);
-	cmd.abort.command_id = a_cmdid;
+	cmd.abort.command_id = abort_req->tag;
 
 	--dev->abort_limit;
-	info[cmdid].aborted = 1;
-	info[cmdid].timeout = jiffies + ADMIN_TIMEOUT;
+	cmd_rq->aborted = 1;
 
-	dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", cmdid,
+	dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", req->tag,
 							nvmeq->qid);
-	nvme_submit_cmd(adminq, &cmd);
+	if (nvme_submit_cmd(dev->queues[0], &cmd) < 0) {
+		dev_warn(nvmeq->q_dmadev,
+				"Could not abort I/O %d QID %d",
+				req->tag, nvmeq->qid);
+		blk_put_request(req);
+	}
 }
 
-/**
- * nvme_cancel_ios - Cancel outstanding I/Os
- * @queue: The queue to cancel I/Os on
- * @timeout: True to only cancel I/Os which have timed out
- */
-static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
+static void nvme_cancel_queue_ios(void *data, unsigned long *tag_map)
 {
-	int depth = nvmeq->q_depth - 1;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-	unsigned long now = jiffies;
-	int cmdid;
+	struct nvme_queue *nvmeq = data;
+	unsigned int tag = 0;
 
-	for_each_set_bit(cmdid, nvmeq->cmdid_data, depth) {
+	tag = 0;
+	do {
+		struct request *req;
 		void *ctx;
 		nvme_completion_fn fn;
+		struct nvme_cmd_info *cmd;
 		static struct nvme_completion cqe = {
 			.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
 		};
+		int qdepth = nvmeq == nvmeq->dev->queues[0] ?
+					nvmeq->dev->admin_tagset.queue_depth :
+					nvmeq->dev->tagset.queue_depth;
 
-		if (timeout && !time_after(now, info[cmdid].timeout))
-			continue;
-		if (info[cmdid].ctx == CMD_CTX_CANCELLED)
-			continue;
-		if (timeout && info[cmdid].ctx == CMD_CTX_ASYNC)
-			continue;
-		if (timeout && nvmeq->dev->initialized) {
-			nvme_abort_cmd(cmdid, nvmeq);
+		/* zero'd bits are free tags */
+		tag = find_next_zero_bit(tag_map, qdepth, tag);
+		if (tag >= qdepth)
+			break;
+
+		req = blk_mq_tag_to_rq(nvmeq->tags, tag++);
+		cmd = blk_mq_rq_to_pdu(req);
+
+		if (cmd->ctx == CMD_CTX_CANCELLED)
 			continue;
-		}
-		dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", cmdid,
-								nvmeq->qid);
-		ctx = cancel_cmdid(nvmeq, cmdid, &fn);
+
+		dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n",
+							req->tag, nvmeq->qid);
+		ctx = cancel_cmd_info(cmd, &fn);
 		fn(nvmeq, ctx, &cqe);
-	}
+	} while (1);
 }
 
-static void nvme_free_queue(struct rcu_head *r)
+static enum blk_eh_timer_return nvme_timeout(struct request *req)
 {
-	struct nvme_queue *nvmeq = container_of(r, struct nvme_queue, r_head);
+	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = cmd->nvmeq;
 
-	spin_lock_irq(&nvmeq->q_lock);
-	while (bio_list_peek(&nvmeq->sq_cong)) {
-		struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
-		bio_endio(bio, -EIO);
-	}
-	while (!list_empty(&nvmeq->iod_bio)) {
-		static struct nvme_completion cqe = {
-			.status = cpu_to_le16(
-				(NVME_SC_ABORT_REQ | NVME_SC_DNR) << 1),
-		};
-		struct nvme_iod *iod = list_first_entry(&nvmeq->iod_bio,
-							struct nvme_iod,
-							node);
-		list_del(&iod->node);
-		bio_completion(nvmeq, iod, &cqe);
-	}
-	spin_unlock_irq(&nvmeq->q_lock);
+	dev_warn(nvmeq->q_dmadev, "Timeout I/O %d QID %d\n", req->tag,
+							nvmeq->qid);
+	if (nvmeq->dev->initialized)
+		nvme_abort_req(req);
 
+	/*
+	 * The aborted req will be completed on receiving the abort req.
+	 * We enable the timer again. If hit twice, it'll cause a device reset,
+	 * as the device then is in a faulty state.
+	 */
+	return BLK_EH_RESET_TIMER;
+}
+
+static void nvme_free_queue(struct nvme_queue *nvmeq)
+{
 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
 				(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
 	dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
 					nvmeq->sq_cmds, nvmeq->sq_dma_addr);
-	if (nvmeq->qid)
-		free_cpumask_var(nvmeq->cpu_mask);
 	kfree(nvmeq);
 }
 
@@ -1223,10 +1096,10 @@ static void nvme_free_queues(struct nvme_dev *dev, int lowest)
 	int i;
 
 	for (i = dev->queue_count - 1; i >= lowest; i--) {
-		struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
-		rcu_assign_pointer(dev->queues[i], NULL);
-		call_rcu(&nvmeq->r_head, nvme_free_queue);
+		struct nvme_queue *nvmeq = dev->queues[i];
 		dev->queue_count--;
+		dev->queues[i] = NULL;
+		nvme_free_queue(nvmeq);
 	}
 }
 
@@ -1259,13 +1132,14 @@ static void nvme_clear_queue(struct nvme_queue *nvmeq)
 {
 	spin_lock_irq(&nvmeq->q_lock);
 	nvme_process_cq(nvmeq);
-	nvme_cancel_ios(nvmeq, false);
+	if (nvmeq->tags)
+		blk_mq_tag_busy_iter(nvmeq->tags, nvme_cancel_queue_ios, nvmeq);
 	spin_unlock_irq(&nvmeq->q_lock);
 }
 
 static void nvme_disable_queue(struct nvme_dev *dev, int qid)
 {
-	struct nvme_queue *nvmeq = raw_nvmeq(dev, qid);
+	struct nvme_queue *nvmeq = dev->queues[qid];
 
 	if (!nvmeq)
 		return;
@@ -1285,8 +1159,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 							int depth, int vector)
 {
 	struct device *dmadev = &dev->pci_dev->dev;
-	unsigned extra = nvme_queue_extra(depth);
-	struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq) + extra, GFP_KERNEL);
+	struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq), GFP_KERNEL);
 	if (!nvmeq)
 		return NULL;
 
@@ -1300,9 +1173,6 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	if (!nvmeq->sq_cmds)
 		goto free_cqdma;
 
-	if (qid && !zalloc_cpumask_var(&nvmeq->cpu_mask, GFP_KERNEL))
-		goto free_sqdma;
-
 	nvmeq->q_dmadev = dmadev;
 	nvmeq->dev = dev;
 	snprintf(nvmeq->irqname, sizeof(nvmeq->irqname), "nvme%dq%d",
@@ -1310,23 +1180,16 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	spin_lock_init(&nvmeq->q_lock);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
-	init_waitqueue_head(&nvmeq->sq_full);
-	init_waitqueue_entry(&nvmeq->sq_cong_wait, nvme_thread);
-	bio_list_init(&nvmeq->sq_cong);
-	INIT_LIST_HEAD(&nvmeq->iod_bio);
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
 	nvmeq->q_depth = depth;
 	nvmeq->cq_vector = vector;
 	nvmeq->qid = qid;
 	nvmeq->q_suspended = 1;
 	dev->queue_count++;
-	rcu_assign_pointer(dev->queues[qid], nvmeq);
+	dev->queues[qid] = nvmeq;
 
 	return nvmeq;
 
- free_sqdma:
-	dma_free_coherent(dmadev, SQ_SIZE(depth), (void *)nvmeq->sq_cmds,
-							nvmeq->sq_dma_addr);
  free_cqdma:
 	dma_free_coherent(dmadev, CQ_SIZE(depth), (void *)nvmeq->cqes,
 							nvmeq->cq_dma_addr);
@@ -1349,15 +1212,12 @@ static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
 static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
-	unsigned extra = nvme_queue_extra(nvmeq->q_depth);
 
 	nvmeq->sq_tail = 0;
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
-	memset(nvmeq->cmdid_data, 0, extra);
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
-	nvme_cancel_ios(nvmeq, false);
 	nvmeq->q_suspended = 0;
 	dev->online_queues++;
 }
@@ -1463,6 +1323,54 @@ static int nvme_shutdown_ctrl(struct nvme_dev *dev)
 	return 0;
 }
 
+static struct blk_mq_ops nvme_mq_admin_ops = {
+	.queue_rq	= nvme_admin_queue_rq,
+	.map_queue	= blk_mq_map_queue,
+	.init_hctx	= nvme_admin_init_hctx,
+	.init_request	= nvme_admin_init_request,
+	.timeout	= nvme_timeout,
+};
+
+static struct blk_mq_ops nvme_mq_ops = {
+	.queue_rq	= nvme_queue_rq,
+	.map_queue	= blk_mq_map_queue,
+	.init_hctx	= nvme_init_hctx,
+	.init_request	= nvme_init_request,
+	.timeout	= nvme_timeout,
+};
+
+static int nvme_alloc_admin_tags(struct nvme_dev *dev)
+{
+	if (!dev->admin_q) {
+		dev->admin_tagset.ops = &nvme_mq_admin_ops;
+		dev->admin_tagset.nr_hw_queues = 1;
+		dev->admin_tagset.queue_depth = NVME_AQ_DEPTH;
+		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
+		dev->admin_tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+		dev->admin_tagset.cmd_size = sizeof(struct nvme_cmd_info);
+		dev->admin_tagset.driver_data = dev;
+
+		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
+			return -ENOMEM;
+
+		dev->queues[0]->tags = dev->admin_tagset.tags[0];
+
+		dev->admin_q = blk_mq_init_queue(&dev->admin_tagset);
+		if (!dev->admin_q) {
+			blk_mq_free_tag_set(&dev->admin_tagset);
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+static void nvme_free_admin_tags(struct nvme_dev *dev)
+{
+	if (dev->admin_q)
+		blk_mq_free_tag_set(&dev->admin_tagset);
+}
+
 static int nvme_configure_admin_queue(struct nvme_dev *dev)
 {
 	int result;
@@ -1492,9 +1400,9 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	if (result < 0)
 		return result;
 
-	nvmeq = raw_nvmeq(dev, 0);
+	nvmeq = dev->queues[0];
 	if (!nvmeq) {
-		nvmeq = nvme_alloc_queue(dev, 0, 64, 0);
+		nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH, 0);
 		if (!nvmeq)
 			return -ENOMEM;
 	}
@@ -1515,16 +1423,26 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 
 	result = nvme_enable_ctrl(dev, cap);
 	if (result)
-		return result;
+		goto free_nvmeq;
+
+	result = nvme_alloc_admin_tags(dev);
+	if (result)
+		goto free_nvmeq;
 
 	result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
 	if (result)
-		return result;
+		goto free_tags;
 
 	spin_lock_irq(&nvmeq->q_lock);
 	nvme_init_queue(nvmeq, 0);
 	spin_unlock_irq(&nvmeq->q_lock);
 	return result;
+
+ free_tags:
+	nvme_free_admin_tags(dev);
+ free_nvmeq:
+	nvme_free_queues(dev, 0);
+	return result;
 }
 
 struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
@@ -1682,7 +1600,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	if (length != (io.nblocks + 1) << ns->lba_shift)
 		status = -ENOMEM;
 	else
-		status = nvme_submit_io_cmd(dev, &c, NULL);
+		status = nvme_submit_io_cmd(dev, ns, &c, NULL);
 
 	if (meta_len) {
 		if (status == NVME_SC_SUCCESS && !(io.opcode & 1)) {
@@ -1754,10 +1672,11 @@ static int nvme_user_admin_cmd(struct nvme_dev *dev,
 
 	timeout = cmd.timeout_ms ? msecs_to_jiffies(cmd.timeout_ms) :
 								ADMIN_TIMEOUT;
+
 	if (length != cmd.data_len)
 		status = -ENOMEM;
 	else
-		status = nvme_submit_sync_cmd(dev, 0, &c, &cmd.result, timeout);
+		status = __nvme_submit_admin_cmd(dev, &c, &cmd.result, timeout);
 
 	if (cmd.data_len) {
 		nvme_unmap_user_pages(dev, cmd.opcode & 1, iod);
@@ -1846,62 +1765,6 @@ static const struct block_device_operations nvme_fops = {
 	.getgeo		= nvme_getgeo,
 };
 
-static void nvme_resubmit_iods(struct nvme_queue *nvmeq)
-{
-	struct nvme_iod *iod, *next;
-
-	list_for_each_entry_safe(iod, next, &nvmeq->iod_bio, node) {
-		if (unlikely(nvme_submit_iod(nvmeq, iod)))
-			break;
-		list_del(&iod->node);
-		if (bio_list_empty(&nvmeq->sq_cong) &&
-						list_empty(&nvmeq->iod_bio))
-			remove_wait_queue(&nvmeq->sq_full,
-						&nvmeq->sq_cong_wait);
-	}
-}
-
-static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
-{
-	while (bio_list_peek(&nvmeq->sq_cong)) {
-		struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
-		struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
-
-		if (bio_list_empty(&nvmeq->sq_cong) &&
-						list_empty(&nvmeq->iod_bio))
-			remove_wait_queue(&nvmeq->sq_full,
-							&nvmeq->sq_cong_wait);
-		if (nvme_submit_bio_queue(nvmeq, ns, bio)) {
-			if (!waitqueue_active(&nvmeq->sq_full))
-				add_wait_queue(&nvmeq->sq_full,
-							&nvmeq->sq_cong_wait);
-			bio_list_add_head(&nvmeq->sq_cong, bio);
-			break;
-		}
-	}
-}
-
-static int nvme_submit_async_req(struct nvme_queue *nvmeq)
-{
-	struct nvme_command *c;
-	int cmdid;
-
-	cmdid = alloc_cmdid(nvmeq, CMD_CTX_ASYNC, special_completion, 0);
-	if (cmdid < 0)
-		return cmdid;
-
-	c = &nvmeq->sq_cmds[nvmeq->sq_tail];
-	memset(c, 0, sizeof(*c));
-	c->common.opcode = nvme_admin_async_event;
-	c->common.command_id = cmdid;
-
-	if (++nvmeq->sq_tail == nvmeq->q_depth)
-		nvmeq->sq_tail = 0;
-	writel(nvmeq->sq_tail, nvmeq->q_db);
-
-	return 0;
-}
-
 static int nvme_kthread(void *data)
 {
 	struct nvme_dev *dev, *next;
@@ -1917,34 +1780,29 @@ static int nvme_kthread(void *data)
 					continue;
 				list_del_init(&dev->node);
 				dev_warn(&dev->pci_dev->dev,
-					"Failed status, reset controller\n");
+					"Failed status: %x, reset controller\n",
+					readl(&dev->bar->csts));
 				dev->reset_workfn = nvme_reset_failed_dev;
 				queue_work(nvme_workq, &dev->reset_work);
 				continue;
 			}
-			rcu_read_lock();
 			for (i = 0; i < dev->queue_count; i++) {
-				struct nvme_queue *nvmeq =
-						rcu_dereference(dev->queues[i]);
+				struct nvme_queue *nvmeq = dev->queues[i];
 				if (!nvmeq)
 					continue;
 				spin_lock_irq(&nvmeq->q_lock);
 				if (nvmeq->q_suspended)
 					goto unlock;
 				nvme_process_cq(nvmeq);
-				nvme_cancel_ios(nvmeq, true);
-				nvme_resubmit_bios(nvmeq);
-				nvme_resubmit_iods(nvmeq);
 
 				while ((i == 0) && (dev->event_limit > 0)) {
-					if (nvme_submit_async_req(nvmeq))
+					if (nvme_submit_async_admin_req(dev))
 						break;
 					dev->event_limit--;
 				}
  unlock:
 				spin_unlock_irq(&nvmeq->q_lock);
 			}
-			rcu_read_unlock();
 		}
 		spin_unlock(&dev_list_lock);
 		schedule_timeout(round_jiffies_relative(HZ));
@@ -1967,27 +1825,30 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 {
 	struct nvme_ns *ns;
 	struct gendisk *disk;
+	int node = dev_to_node(&dev->pci_dev->dev);
 	int lbaf;
 
 	if (rt->attributes & NVME_LBART_ATTRIB_HIDE)
 		return NULL;
 
-	ns = kzalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
 		return NULL;
-	ns->queue = blk_alloc_queue(GFP_KERNEL);
+	ns->queue = blk_mq_init_queue(&dev->tagset);
 	if (!ns->queue)
 		goto out_free_ns;
-	ns->queue->queue_flags = QUEUE_FLAG_DEFAULT;
+	queue_flag_set_unlocked(QUEUE_FLAG_DEFAULT, ns->queue);
 	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
-	blk_queue_make_request(ns->queue, nvme_make_request);
+	queue_flag_set_unlocked(QUEUE_FLAG_SG_GAPS, ns->queue);
+	queue_flag_clear_unlocked(QUEUE_FLAG_IO_STAT, ns->queue);
 	ns->dev = dev;
 	ns->queue->queuedata = ns;
 
-	disk = alloc_disk(0);
+	disk = alloc_disk_node(0, node);
 	if (!disk)
 		goto out_free_queue;
+
 	ns->ns_id = nsid;
 	ns->disk = disk;
 	lbaf = id->flbas & 0xf;
@@ -1996,6 +1857,8 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
 	if (dev->max_hw_sectors)
 		blk_queue_max_hw_sectors(ns->queue, dev->max_hw_sectors);
+	if (dev->stripe_size)
+		blk_queue_chunk_sectors(ns->queue, dev->stripe_size >> 9);
 	if (dev->vwc & NVME_CTRL_VWC_PRESENT)
 		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
 
@@ -2021,143 +1884,19 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 	return NULL;
 }
 
-static int nvme_find_closest_node(int node)
-{
-	int n, val, min_val = INT_MAX, best_node = node;
-
-	for_each_online_node(n) {
-		if (n == node)
-			continue;
-		val = node_distance(node, n);
-		if (val < min_val) {
-			min_val = val;
-			best_node = n;
-		}
-	}
-	return best_node;
-}
-
-static void nvme_set_queue_cpus(cpumask_t *qmask, struct nvme_queue *nvmeq,
-								int count)
-{
-	int cpu;
-	for_each_cpu(cpu, qmask) {
-		if (cpumask_weight(nvmeq->cpu_mask) >= count)
-			break;
-		if (!cpumask_test_and_set_cpu(cpu, nvmeq->cpu_mask))
-			*per_cpu_ptr(nvmeq->dev->io_queue, cpu) = nvmeq->qid;
-	}
-}
-
-static void nvme_add_cpus(cpumask_t *mask, const cpumask_t *unassigned_cpus,
-	const cpumask_t *new_mask, struct nvme_queue *nvmeq, int cpus_per_queue)
-{
-	int next_cpu;
-	for_each_cpu(next_cpu, new_mask) {
-		cpumask_or(mask, mask, get_cpu_mask(next_cpu));
-		cpumask_or(mask, mask, topology_thread_cpumask(next_cpu));
-		cpumask_and(mask, mask, unassigned_cpus);
-		nvme_set_queue_cpus(mask, nvmeq, cpus_per_queue);
-	}
-}
-
 static void nvme_create_io_queues(struct nvme_dev *dev)
 {
-	unsigned i, max;
+	unsigned i;
 
-	max = min(dev->max_qid, num_online_cpus());
-	for (i = dev->queue_count; i <= max; i++)
+	for (i = dev->queue_count; i <= dev->max_qid; i++)
 		if (!nvme_alloc_queue(dev, i, dev->q_depth, i - 1))
 			break;
 
-	max = min(dev->queue_count - 1, num_online_cpus());
-	for (i = dev->online_queues; i <= max; i++)
-		if (nvme_create_queue(raw_nvmeq(dev, i), i))
+	for (i = dev->online_queues; i <= dev->queue_count - 1; i++)
+		if (nvme_create_queue(dev->queues[i], i))
 			break;
 }
 
-/*
- * If there are fewer queues than online cpus, this will try to optimally
- * assign a queue to multiple cpus by grouping cpus that are "close" together:
- * thread siblings, core, socket, closest node, then whatever else is
- * available.
- */
-static void nvme_assign_io_queues(struct nvme_dev *dev)
-{
-	unsigned cpu, cpus_per_queue, queues, remainder, i;
-	cpumask_var_t unassigned_cpus;
-
-	nvme_create_io_queues(dev);
-
-	queues = min(dev->online_queues - 1, num_online_cpus());
-	if (!queues)
-		return;
-
-	cpus_per_queue = num_online_cpus() / queues;
-	remainder = queues - (num_online_cpus() - queues * cpus_per_queue);
-
-	if (!alloc_cpumask_var(&unassigned_cpus, GFP_KERNEL))
-		return;
-
-	cpumask_copy(unassigned_cpus, cpu_online_mask);
-	cpu = cpumask_first(unassigned_cpus);
-	for (i = 1; i <= queues; i++) {
-		struct nvme_queue *nvmeq = lock_nvmeq(dev, i);
-		cpumask_t mask;
-
-		cpumask_clear(nvmeq->cpu_mask);
-		if (!cpumask_weight(unassigned_cpus)) {
-			unlock_nvmeq(nvmeq);
-			break;
-		}
-
-		mask = *get_cpu_mask(cpu);
-		nvme_set_queue_cpus(&mask, nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				topology_thread_cpumask(cpu),
-				nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				topology_core_cpumask(cpu),
-				nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				cpumask_of_node(cpu_to_node(cpu)),
-				nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				cpumask_of_node(
-					nvme_find_closest_node(
-						cpu_to_node(cpu))),
-				nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				unassigned_cpus,
-				nvmeq, cpus_per_queue);
-
-		WARN(cpumask_weight(nvmeq->cpu_mask) != cpus_per_queue,
-			"nvme%d qid:%d mis-matched queue-to-cpu assignment\n",
-			dev->instance, i);
-
-		irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
-							nvmeq->cpu_mask);
-		cpumask_andnot(unassigned_cpus, unassigned_cpus,
-						nvmeq->cpu_mask);
-		cpu = cpumask_next(cpu, unassigned_cpus);
-		if (remainder && !--remainder)
-			cpus_per_queue++;
-		unlock_nvmeq(nvmeq);
-	}
-	WARN(cpumask_weight(unassigned_cpus), "nvme%d unassigned online cpus\n",
-								dev->instance);
-	i = 0;
-	cpumask_andnot(unassigned_cpus, cpu_possible_mask, cpu_online_mask);
-	for_each_cpu(cpu, unassigned_cpus)
-		*per_cpu_ptr(dev->io_queue, cpu) = (i++ % queues) + 1;
-	free_cpumask_var(unassigned_cpus);
-}
-
 static int set_queue_count(struct nvme_dev *dev, int count)
 {
 	int status;
@@ -2181,33 +1920,9 @@ static size_t db_bar_size(struct nvme_dev *dev, unsigned nr_io_queues)
 	return 4096 + ((nr_io_queues + 1) * 8 * dev->db_stride);
 }
 
-static void nvme_cpu_workfn(struct work_struct *work)
-{
-	struct nvme_dev *dev = container_of(work, struct nvme_dev, cpu_work);
-	if (dev->initialized)
-		nvme_assign_io_queues(dev);
-}
-
-static int nvme_cpu_notify(struct notifier_block *self,
-				unsigned long action, void *hcpu)
-{
-	struct nvme_dev *dev;
-
-	switch (action) {
-	case CPU_ONLINE:
-	case CPU_DEAD:
-		spin_lock(&dev_list_lock);
-		list_for_each_entry(dev, &dev_list, node)
-			schedule_work(&dev->cpu_work);
-		spin_unlock(&dev_list_lock);
-		break;
-	}
-	return NOTIFY_OK;
-}
-
 static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
-	struct nvme_queue *adminq = raw_nvmeq(dev, 0);
+	struct nvme_queue *adminq = dev->queues[0];
 	struct pci_dev *pdev = dev->pci_dev;
 	int result, i, vecs, nr_io_queues, size;
 
@@ -2266,7 +1981,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 
 	/* Free previously allocated queues that are no longer usable */
 	nvme_free_queues(dev, nr_io_queues + 1);
-	nvme_assign_io_queues(dev);
+	nvme_create_io_queues(dev);
 
 	return 0;
 
@@ -2316,8 +2031,32 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	if (ctrl->mdts)
 		dev->max_hw_sectors = 1 << (ctrl->mdts + shift - 9);
 	if ((pdev->vendor == PCI_VENDOR_ID_INTEL) &&
-			(pdev->device == 0x0953) && ctrl->vs[3])
+			(pdev->device == 0x0953) && ctrl->vs[3]) {
+		unsigned int max_hw_sectors;
+
 		dev->stripe_size = 1 << (ctrl->vs[3] + shift);
+		max_hw_sectors = dev->stripe_size >> (shift - 9);
+		if (dev->max_hw_sectors) {
+			dev->max_hw_sectors = min(max_hw_sectors,
+							dev->max_hw_sectors);
+		} else
+			dev->max_hw_sectors = max_hw_sectors;
+	}
+
+	dev->tagset.ops = &nvme_mq_ops;
+	dev->tagset.nr_hw_queues = dev->online_queues - 1;
+	dev->tagset.timeout = NVME_IO_TIMEOUT;
+	dev->tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+	dev->tagset.queue_depth = min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH);
+	dev->tagset.cmd_size = sizeof(struct nvme_cmd_info);
+	dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
+	dev->tagset.driver_data = dev;
+
+	if (blk_mq_alloc_tag_set(&dev->tagset))
+		goto out;
+
+	for (i = 1; i < dev->online_queues; i++)
+		dev->queues[i]->tags = dev->tagset.tags[i - 1];
 
 	id_ns = mem;
 	for (i = 1; i <= nn; i++) {
@@ -2467,7 +2206,8 @@ static int adapter_async_del_queue(struct nvme_queue *nvmeq, u8 opcode,
 	c.delete_queue.qid = cpu_to_le16(nvmeq->qid);
 
 	init_kthread_work(&nvmeq->cmdinfo.work, fn);
-	return nvme_submit_admin_cmd_async(nvmeq->dev, &c, &nvmeq->cmdinfo);
+	return nvme_submit_admin_async_cmd(nvmeq->dev, &c, &nvmeq->cmdinfo,
+								ADMIN_TIMEOUT);
 }
 
 static void nvme_del_cq_work_handler(struct kthread_work *work)
@@ -2530,7 +2270,7 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 	atomic_set(&dq.refcount, 0);
 	dq.worker = &worker;
 	for (i = dev->queue_count - 1; i > 0; i--) {
-		struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+		struct nvme_queue *nvmeq = dev->queues[i];
 
 		if (nvme_suspend_queue(nvmeq))
 			continue;
@@ -2575,7 +2315,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
 		csts = readl(&dev->bar->csts);
 	if (csts & NVME_CSTS_CFS || !(csts & NVME_CSTS_RDY)) {
 		for (i = dev->queue_count - 1; i >= 0; i--) {
-			struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+			struct nvme_queue *nvmeq = dev->queues[i];
 			nvme_suspend_queue(nvmeq);
 			nvme_clear_queue(nvmeq);
 		}
@@ -2587,6 +2327,12 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
 	nvme_dev_unmap(dev);
 }
 
+static void nvme_dev_remove_admin(struct nvme_dev *dev)
+{
+	if (dev->admin_q && !blk_queue_dying(dev->admin_q))
+		blk_cleanup_queue(dev->admin_q);
+}
+
 static void nvme_dev_remove(struct nvme_dev *dev)
 {
 	struct nvme_ns *ns;
@@ -2668,7 +2414,7 @@ static void nvme_free_dev(struct kref *kref)
 	struct nvme_dev *dev = container_of(kref, struct nvme_dev, kref);
 
 	nvme_free_namespaces(dev);
-	free_percpu(dev->io_queue);
+	blk_mq_free_tag_set(&dev->tagset);
 	kfree(dev->queues);
 	kfree(dev->entry);
 	kfree(dev);
@@ -2795,7 +2541,7 @@ static void nvme_dev_reset(struct nvme_dev *dev)
 {
 	nvme_dev_shutdown(dev);
 	if (nvme_dev_resume(dev)) {
-		dev_err(&dev->pci_dev->dev, "Device failed to resume\n");
+		dev_warn(&dev->pci_dev->dev, "Device failed to resume\n");
 		kref_get(&dev->kref);
 		if (IS_ERR(kthread_run(nvme_remove_dead_ctrl, dev, "nvme%d",
 							dev->instance))) {
@@ -2820,28 +2566,28 @@ static void nvme_reset_workfn(struct work_struct *work)
 
 static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
-	int result = -ENOMEM;
+	int node, result = -ENOMEM;
 	struct nvme_dev *dev;
 
-	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	node = dev_to_node(&pdev->dev);
+	if (node == NUMA_NO_NODE)
+		set_dev_node(&pdev->dev, 0);
+
+	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
 	if (!dev)
 		return -ENOMEM;
-	dev->entry = kcalloc(num_possible_cpus(), sizeof(*dev->entry),
-								GFP_KERNEL);
+	dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
+							GFP_KERNEL, node);
 	if (!dev->entry)
 		goto free;
-	dev->queues = kcalloc(num_possible_cpus() + 1, sizeof(void *),
-								GFP_KERNEL);
+	dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
+							GFP_KERNEL, node);
 	if (!dev->queues)
 		goto free;
-	dev->io_queue = alloc_percpu(unsigned short);
-	if (!dev->io_queue)
-		goto free;
 
 	INIT_LIST_HEAD(&dev->namespaces);
 	dev->reset_workfn = nvme_reset_failed_dev;
 	INIT_WORK(&dev->reset_work, nvme_reset_workfn);
-	INIT_WORK(&dev->cpu_work, nvme_cpu_workfn);
 	dev->pci_dev = pdev;
 	pci_set_drvdata(pdev, dev);
 	result = nvme_set_instance(dev);
@@ -2876,6 +2622,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
  remove:
 	nvme_dev_remove(dev);
+	nvme_dev_remove_admin(dev);
 	nvme_free_namespaces(dev);
  shutdown:
 	nvme_dev_shutdown(dev);
@@ -2885,7 +2632,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
  release:
 	nvme_release_instance(dev);
  free:
-	free_percpu(dev->io_queue);
 	kfree(dev->queues);
 	kfree(dev->entry);
 	kfree(dev);
@@ -2918,12 +2664,12 @@ static void nvme_remove(struct pci_dev *pdev)
 
 	pci_set_drvdata(pdev, NULL);
 	flush_work(&dev->reset_work);
-	flush_work(&dev->cpu_work);
 	misc_deregister(&dev->miscdev);
 	nvme_dev_remove(dev);
 	nvme_dev_shutdown(dev);
+	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
-	rcu_barrier();
+	nvme_free_admin_tags(dev);
 	nvme_release_instance(dev);
 	nvme_release_prp_pools(dev);
 	kref_put(&dev->kref, nvme_free_dev);
@@ -3007,18 +2753,11 @@ static int __init nvme_init(void)
 	else if (result > 0)
 		nvme_major = result;
 
-	nvme_nb.notifier_call = &nvme_cpu_notify;
-	result = register_hotcpu_notifier(&nvme_nb);
-	if (result)
-		goto unregister_blkdev;
-
 	result = pci_register_driver(&nvme_driver);
 	if (result)
-		goto unregister_hotcpu;
+		goto unregister_blkdev;
 	return 0;
 
- unregister_hotcpu:
-	unregister_hotcpu_notifier(&nvme_nb);
  unregister_blkdev:
 	unregister_blkdev(nvme_major, "nvme");
  kill_workq:
diff --git a/drivers/block/nvme-scsi.c b/drivers/block/nvme-scsi.c
index a4cd6d6..52c0356 100644
--- a/drivers/block/nvme-scsi.c
+++ b/drivers/block/nvme-scsi.c
@@ -2105,7 +2105,7 @@ static int nvme_trans_do_nvme_io(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 
 		nvme_offset += unit_num_blocks;
 
-		nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+		nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
 		if (nvme_sc != NVME_SC_SUCCESS) {
 			nvme_unmap_user_pages(dev,
 				(is_write) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
@@ -2658,7 +2658,7 @@ static int nvme_trans_start_stop(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 			c.common.opcode = nvme_cmd_flush;
 			c.common.nsid = cpu_to_le32(ns->ns_id);
 
-			nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+			nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);
 			res = nvme_trans_status_code(hdr, nvme_sc);
 			if (res)
 				goto out;
@@ -2686,7 +2686,7 @@ static int nvme_trans_synchronize_cache(struct nvme_ns *ns,
 	c.common.opcode = nvme_cmd_flush;
 	c.common.nsid = cpu_to_le32(ns->ns_id);
 
-	nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+	nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);
 
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
@@ -2894,7 +2894,7 @@ static int nvme_trans_unmap(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	c.dsm.nr = cpu_to_le32(ndesc - 1);
 	c.dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
-	nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+	nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 
 	dma_free_coherent(&dev->pci_dev->dev, ndesc * sizeof(*range),
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index ed09074..258945f 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -19,6 +19,7 @@
 #include <linux/pci.h>
 #include <linux/miscdevice.h>
 #include <linux/kref.h>
+#include <linux/blk-mq.h>
 
 struct nvme_bar {
 	__u64			cap;	/* Controller Capabilities */
@@ -71,8 +72,10 @@ extern unsigned char nvme_io_timeout;
  */
 struct nvme_dev {
 	struct list_head node;
-	struct nvme_queue __rcu **queues;
-	unsigned short __percpu *io_queue;
+	struct nvme_queue **queues;
+	struct request_queue *admin_q;
+	struct blk_mq_tag_set tagset;
+	struct blk_mq_tag_set admin_tagset;
 	u32 __iomem *dbs;
 	struct pci_dev *pci_dev;
 	struct dma_pool *prp_page_pool;
@@ -91,7 +94,6 @@ struct nvme_dev {
 	struct miscdevice miscdev;
 	work_func_t reset_workfn;
 	struct work_struct reset_work;
-	struct work_struct cpu_work;
 	char name[12];
 	char serial[20];
 	char model[40];
@@ -135,7 +137,6 @@ struct nvme_iod {
 	int offset;		/* Of PRP list */
 	int nents;		/* Used in scatterlist */
 	int length;		/* Of data, in bytes */
-	unsigned long start_time;
 	dma_addr_t first_dma;
 	struct list_head node;
 	struct scatterlist sg[0];
@@ -153,12 +154,14 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
  */
 void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod);
 
-int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int , gfp_t);
+int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int, gfp_t);
 struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
 				unsigned long addr, unsigned length);
 void nvme_unmap_user_pages(struct nvme_dev *dev, int write,
 			struct nvme_iod *iod);
-int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_command *, u32 *);
+int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_ns *,
+						struct nvme_command *, u32 *);
+int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns);
 int nvme_submit_admin_cmd(struct nvme_dev *, struct nvme_command *,
 							u32 *result);
 int nvme_identify(struct nvme_dev *, unsigned nsid, unsigned cns,
-- 
1.9.1



* [PATCH v11] NVMe: Convert to blk-mq
@ 2014-07-26  9:07   ` Matias Bjørling
  0 siblings, 0 replies; 22+ messages in thread
From: Matias Bjørling @ 2014-07-26  9:07 UTC (permalink / raw)


This converts the NVMe driver to a blk-mq request-based driver.

The NVMe driver is currently bio-based and implements queue logic within
itself. By using blk-mq, a lot of these responsibilities can be moved into
the block layer and simplified.

The patch is divided into the following blocks:

 * Per-command data and the cmdid have been moved into the struct request's
   per-command pdu. The per-command data can be retrieved using
   blk_mq_rq_to_pdu(), and id maintenance is now handled by blk-mq through
   the rq->tag field (see the first sketch after this list).

 * The logic for splitting bios has been moved into the blk-mq layer. The
   driver instead notifies the block layer about limited gap support in SG
   lists.

 * blk-mq handles timeouts; the timeout handling is reimplemented within
   nvme_timeout(). This includes both abort handling and command cancellation.

 * Assignment of nvme queues to CPUs is replaced with the blk-mq version. The
   current blk-mq strategy is to assign the number of mapped queues and CPUs
   to provide locality, while the nvme driver previously assigned as many nvme
   hw queues as possible. This can be implemented in blk-mq if needed.

 * NVMe queues are merged with the tags structure of blk-mq (see the second
   sketch after this list).

 * blk-mq takes care of setup/teardown of nvme queues and guards against
   invalid accesses. Therefore, RCU usage for nvme queues can be removed.

 * IO tracing and accounting are handled by blk-mq and are therefore removed
   from the driver.

 * Setup and teardown of nvme queues are now handled by nvme_[init/exit]_hctx.
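
For reference, a minimal sketch of the new per-command data flow (the first
sketch). This is illustrative only and not part of the diff below;
example_prep_rw() is a hypothetical helper, while blk_mq_rq_to_pdu() and
rq->tag are the blk-mq interfaces the patch actually uses:

	/*
	 * blk-mq embeds the driver's per-command data after each request,
	 * and the command id is simply the request's tag, so the old cmdid
	 * bitmap and its alloc/free logic disappear.
	 */
	static void example_prep_rw(struct request *req,
				    struct nvme_command *cmnd)
	{
		struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);

		cmd->aborted = 0;
		cmnd->rw.command_id = req->tag;
	}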
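
The second sketch shows the shape of the tag set registration the conversion
relies on, condensed from the nvme_dev_add() and nvme_alloc_ns() hunks below
(error handling trimmed to a bare return):

	dev->tagset.ops = &nvme_mq_ops;
	dev->tagset.nr_hw_queues = dev->online_queues - 1;
	dev->tagset.timeout = NVME_IO_TIMEOUT;
	dev->tagset.queue_depth = min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH);
	dev->tagset.cmd_size = sizeof(struct nvme_cmd_info);	/* pdu size */
	dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
	dev->tagset.driver_data = dev;

	if (blk_mq_alloc_tag_set(&dev->tagset))
		return -ENOMEM;

	/* each namespace then gets a request queue on the shared tag set */
	ns->queue = blk_mq_init_queue(&dev->tagset);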

Contributions in this patch from:

  Sam Bradshaw <sbradshaw@micron.com>
  Jens Axboe <axboe@fb.com>
  Keith Busch <keith.busch@intel.com>
  Robert Nelson <rlnelson@google.com>

Acked-by: Keith Busch <keith.busch@intel.com>
Acked-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
---
 drivers/block/nvme-core.c | 1303 ++++++++++++++++++---------------------------
 drivers/block/nvme-scsi.c |    8 +-
 include/linux/nvme.h      |   15 +-
 3 files changed, 534 insertions(+), 792 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 28aec2d..384dc91 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -13,9 +13,9 @@
  */
 
 #include <linux/nvme.h>
-#include <linux/bio.h>
 #include <linux/bitops.h>
 #include <linux/blkdev.h>
+#include <linux/blk-mq.h>
 #include <linux/cpu.h>
 #include <linux/delay.h>
 #include <linux/errno.h>
@@ -33,7 +33,6 @@
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/pci.h>
-#include <linux/percpu.h>
 #include <linux/poison.h>
 #include <linux/ptrace.h>
 #include <linux/sched.h>
@@ -42,9 +41,8 @@
 #include <scsi/sg.h>
 #include <asm-generic/io-64-nonatomic-lo-hi.h>
 
-#include <trace/events/block.h>
-
 #define NVME_Q_DEPTH		1024
+#define NVME_AQ_DEPTH		64
 #define SQ_SIZE(depth)		(depth * sizeof(struct nvme_command))
 #define CQ_SIZE(depth)		(depth * sizeof(struct nvme_completion))
 #define ADMIN_TIMEOUT		(admin_timeout * HZ)
@@ -76,10 +74,12 @@ static wait_queue_head_t nvme_kthread_wait;
 static struct notifier_block nvme_nb;
 
 static void nvme_reset_failed_dev(struct work_struct *ws);
+static int nvme_process_cq(struct nvme_queue *nvmeq);
 
 struct async_cmd_info {
 	struct kthread_work work;
 	struct kthread_worker *worker;
+	struct request *req;
 	u32 result;
 	int status;
 	void *ctx;
@@ -90,7 +90,6 @@ struct async_cmd_info {
  * commands and one for I/O commands).
  */
 struct nvme_queue {
-	struct rcu_head r_head;
 	struct device *q_dmadev;
 	struct nvme_dev *dev;
 	char irqname[24];	/* nvme4294967295-65535\0 */
@@ -99,10 +98,6 @@ struct nvme_queue {
 	volatile struct nvme_completion *cqes;
 	dma_addr_t sq_dma_addr;
 	dma_addr_t cq_dma_addr;
-	wait_queue_head_t sq_full;
-	wait_queue_t sq_cong_wait;
-	struct bio_list sq_cong;
-	struct list_head iod_bio;
 	u32 __iomem *q_db;
 	u16 q_depth;
 	u16 cq_vector;
@@ -113,9 +108,8 @@ struct nvme_queue {
 	u8 cq_phase;
 	u8 cqe_seen;
 	u8 q_suspended;
-	cpumask_var_t cpu_mask;
 	struct async_cmd_info cmdinfo;
-	unsigned long cmdid_data[];
+	struct blk_mq_tags *tags;
 };
 
 /*
@@ -143,62 +137,65 @@ typedef void (*nvme_completion_fn)(struct nvme_queue *, void *,
 struct nvme_cmd_info {
 	nvme_completion_fn fn;
 	void *ctx;
-	unsigned long timeout;
 	int aborted;
+	struct nvme_queue *nvmeq;
 };
 
-static struct nvme_cmd_info *nvme_cmd_info(struct nvme_queue *nvmeq)
+static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+				unsigned int hctx_idx)
 {
-	return (void *)&nvmeq->cmdid_data[BITS_TO_LONGS(nvmeq->q_depth)];
+	struct nvme_dev *dev = data;
+	struct nvme_queue *nvmeq = dev->queues[0];
+
+	hctx->driver_data = nvmeq;
+	return 0;
 }
 
-static unsigned nvme_queue_extra(int depth)
+static int nvme_admin_init_request(void *data, struct request *req,
+				unsigned int hctx_idx, unsigned int rq_idx,
+				unsigned int numa_node)
 {
-	return DIV_ROUND_UP(depth, 8) + (depth * sizeof(struct nvme_cmd_info));
+	struct nvme_dev *dev = data;
+	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = dev->queues[0];
+
+	BUG_ON(!nvmeq);
+	cmd->nvmeq = nvmeq;
+	return 0;
 }
 
-/**
- * alloc_cmdid() - Allocate a Command ID
- * @nvmeq: The queue that will be used for this command
- * @ctx: A pointer that will be passed to the handler
- * @handler: The function to call on completion
- *
- * Allocate a Command ID for a queue.  The data passed in will
- * be passed to the completion handler.  This is implemented by using
- * the bottom two bits of the ctx pointer to store the handler ID.
- * Passing in a pointer that's not 4-byte aligned will cause a BUG.
- * We can change this if it becomes a problem.
- *
- * May be called with local interrupts disabled and the q_lock held,
- * or with interrupts enabled and no locks held.
- */
-static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
-				nvme_completion_fn handler, unsigned timeout)
+static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+			  unsigned int hctx_idx)
 {
-	int depth = nvmeq->q_depth - 1;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-	int cmdid;
-
-	do {
-		cmdid = find_first_zero_bit(nvmeq->cmdid_data, depth);
-		if (cmdid >= depth)
-			return -EBUSY;
-	} while (test_and_set_bit(cmdid, nvmeq->cmdid_data));
-
-	info[cmdid].fn = handler;
-	info[cmdid].ctx = ctx;
-	info[cmdid].timeout = jiffies + timeout;
-	info[cmdid].aborted = 0;
-	return cmdid;
+	struct nvme_dev *dev = data;
+	struct nvme_queue *nvmeq = dev->queues[
+					(hctx_idx % dev->queue_count) + 1];
+
+	irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
+								hctx->cpumask);
+	hctx->driver_data = nvmeq;
+	return 0;
+}
+
+static int nvme_init_request(void *data, struct request *req,
+				unsigned int hctx_idx, unsigned int rq_idx,
+				unsigned int numa_node)
+{
+	struct nvme_dev *dev = data;
+	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];
+
+	BUG_ON(!nvmeq);
+	cmd->nvmeq = nvmeq;
+	return 0;
 }
 
-static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
-				nvme_completion_fn handler, unsigned timeout)
+static void nvme_set_info(struct nvme_cmd_info *cmd, void *ctx,
+				nvme_completion_fn handler)
 {
-	int cmdid;
-	wait_event_killable(nvmeq->sq_full,
-		(cmdid = alloc_cmdid(nvmeq, ctx, handler, timeout)) >= 0);
-	return (cmdid < 0) ? -EINTR : cmdid;
+	cmd->fn = handler;
+	cmd->ctx = ctx;
+	cmd->aborted = 0;
 }
 
 /* Special values must be less than 0x1000 */
@@ -206,18 +203,12 @@ static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
 #define CMD_CTX_CANCELLED	(0x30C + CMD_CTX_BASE)
 #define CMD_CTX_COMPLETED	(0x310 + CMD_CTX_BASE)
 #define CMD_CTX_INVALID		(0x314 + CMD_CTX_BASE)
-#define CMD_CTX_ABORT		(0x318 + CMD_CTX_BASE)
-#define CMD_CTX_ASYNC		(0x31C + CMD_CTX_BASE)
 
 static void special_completion(struct nvme_queue *nvmeq, void *ctx,
 						struct nvme_completion *cqe)
 {
 	if (ctx == CMD_CTX_CANCELLED)
 		return;
-	if (ctx == CMD_CTX_ABORT) {
-		++nvmeq->dev->abort_limit;
-		return;
-	}
 	if (ctx == CMD_CTX_COMPLETED) {
 		dev_warn(nvmeq->q_dmadev,
 				"completed id %d twice on queue %d\n",
@@ -230,21 +221,52 @@ static void special_completion(struct nvme_queue *nvmeq, void *ctx,
 				cqe->command_id, le16_to_cpup(&cqe->sq_id));
 		return;
 	}
-	if (ctx == CMD_CTX_ASYNC) {
-		u32 result = le32_to_cpup(&cqe->result);
-		u16 status = le16_to_cpup(&cqe->status) >> 1;
-
-		if (status == NVME_SC_SUCCESS || status == NVME_SC_ABORT_REQ)
-			++nvmeq->dev->event_limit;
-		if (status == NVME_SC_SUCCESS)
-			dev_warn(nvmeq->q_dmadev,
-				"async event result %08x\n", result);
-		return;
-	}
-
 	dev_warn(nvmeq->q_dmadev, "Unknown special completion %p\n", ctx);
 }
 
+static void *cancel_cmd_info(struct nvme_cmd_info *cmd, nvme_completion_fn *fn)
+{
+	void *ctx;
+
+	if (fn)
+		*fn = cmd->fn;
+	ctx = cmd->ctx;
+	cmd->fn = special_completion;
+	cmd->ctx = CMD_CTX_CANCELLED;
+	return ctx;
+}
+
+static void async_req_completion(struct nvme_queue *nvmeq, void *ctx,
+						struct nvme_completion *cqe)
+{
+	struct request *req = ctx;
+
+	u32 result = le32_to_cpup(&cqe->result);
+	u16 status = le16_to_cpup(&cqe->status) >> 1;
+
+	if (status == NVME_SC_SUCCESS || status == NVME_SC_ABORT_REQ)
+		++nvmeq->dev->event_limit;
+	if (status == NVME_SC_SUCCESS)
+		dev_warn(nvmeq->q_dmadev,
+			"async event result %08x\n", result);
+
+	blk_put_request(req);
+}
+
+static void abort_completion(struct nvme_queue *nvmeq, void *ctx,
+						struct nvme_completion *cqe)
+{
+	struct request *req = ctx;
+
+	u16 status = le16_to_cpup(&cqe->status) >> 1;
+	u32 result = le32_to_cpup(&cqe->result);
+
+	blk_put_request(req);
+
+	dev_warn(nvmeq->q_dmadev, "Abort status:%x result:%x\n", status, result);
+	++nvmeq->dev->abort_limit;
+}
+
 static void async_completion(struct nvme_queue *nvmeq, void *ctx,
 						struct nvme_completion *cqe)
 {
@@ -252,90 +274,37 @@ static void async_completion(struct nvme_queue *nvmeq, void *ctx,
 	cmdinfo->result = le32_to_cpup(&cqe->result);
 	cmdinfo->status = le16_to_cpup(&cqe->status) >> 1;
 	queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
+	blk_put_request(cmdinfo->req);
+}
+
+static inline struct nvme_cmd_info *get_cmd_from_tag(struct nvme_queue *nvmeq,
+				  unsigned int tag)
+{
+	struct request *req = blk_mq_tag_to_rq(nvmeq->tags, tag);
+
+	return blk_mq_rq_to_pdu(req);
 }
 
 /*
  * Called with local interrupts disabled and the q_lock held.  May not sleep.
  */
-static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *nvme_finish_cmd(struct nvme_queue *nvmeq, int tag,
 						nvme_completion_fn *fn)
 {
+	struct nvme_cmd_info *cmd = get_cmd_from_tag(nvmeq, tag);
 	void *ctx;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-
-	if (cmdid >= nvmeq->q_depth || !info[cmdid].fn) {
-		if (fn)
-			*fn = special_completion;
+	if (tag >= nvmeq->q_depth) {
+		*fn = special_completion;
 		return CMD_CTX_INVALID;
 	}
 	if (fn)
-		*fn = info[cmdid].fn;
-	ctx = info[cmdid].ctx;
-	info[cmdid].fn = special_completion;
-	info[cmdid].ctx = CMD_CTX_COMPLETED;
-	clear_bit(cmdid, nvmeq->cmdid_data);
-	wake_up(&nvmeq->sq_full);
+		*fn = cmd->fn;
+	ctx = cmd->ctx;
+	cmd->fn = special_completion;
+	cmd->ctx = CMD_CTX_COMPLETED;
 	return ctx;
 }
 
-static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid,
-						nvme_completion_fn *fn)
-{
-	void *ctx;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-	if (fn)
-		*fn = info[cmdid].fn;
-	ctx = info[cmdid].ctx;
-	info[cmdid].fn = special_completion;
-	info[cmdid].ctx = CMD_CTX_CANCELLED;
-	return ctx;
-}
-
-static struct nvme_queue *raw_nvmeq(struct nvme_dev *dev, int qid)
-{
-	return rcu_dereference_raw(dev->queues[qid]);
-}
-
-static struct nvme_queue *get_nvmeq(struct nvme_dev *dev) __acquires(RCU)
-{
-	struct nvme_queue *nvmeq;
-	unsigned queue_id = get_cpu_var(*dev->io_queue);
-
-	rcu_read_lock();
-	nvmeq = rcu_dereference(dev->queues[queue_id]);
-	if (nvmeq)
-		return nvmeq;
-
-	rcu_read_unlock();
-	put_cpu_var(*dev->io_queue);
-	return NULL;
-}
-
-static void put_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
-	rcu_read_unlock();
-	put_cpu_var(nvmeq->dev->io_queue);
-}
-
-static struct nvme_queue *lock_nvmeq(struct nvme_dev *dev, int q_idx)
-							__acquires(RCU)
-{
-	struct nvme_queue *nvmeq;
-
-	rcu_read_lock();
-	nvmeq = rcu_dereference(dev->queues[q_idx]);
-	if (nvmeq)
-		return nvmeq;
-
-	rcu_read_unlock();
-	return NULL;
-}
-
-static void unlock_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
-	rcu_read_unlock();
-}
-
 /**
  * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
  * @nvmeq: The queue to use
@@ -343,24 +312,31 @@ static void unlock_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
  *
  * Safe to use from interrupt context
  */
-static int nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
+static int __nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
 {
-	unsigned long flags;
 	u16 tail;
-	spin_lock_irqsave(&nvmeq->q_lock, flags);
-	if (nvmeq->q_suspended) {
-		spin_unlock_irqrestore(&nvmeq->q_lock, flags);
+
+	if (nvmeq->q_suspended)
 		return -EBUSY;
-	}
+
 	tail = nvmeq->sq_tail;
 	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
 	writel(tail, nvmeq->q_db);
 	nvmeq->sq_tail = tail;
+
+	return 0;
+}
+
+static int nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
+{
+	unsigned long flags;
+	int ret;
+	spin_lock_irqsave(&nvmeq->q_lock, flags);
+	ret = __nvme_submit_cmd(nvmeq, cmd);
 	spin_unlock_irqrestore(&nvmeq->q_lock, flags);
-
-	return 0;
+	return ret;
 }
 
 static __le64 **iod_list(struct nvme_iod *iod)
@@ -392,7 +368,6 @@ nvme_alloc_iod(unsigned nseg, unsigned nbytes, struct nvme_dev *dev, gfp_t gfp)
 		iod->length = nbytes;
 		iod->nents = 0;
 		iod->first_dma = 0ULL;
-		iod->start_time = jiffies;
 	}
 
 	return iod;
@@ -416,65 +391,37 @@ void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod)
 	kfree(iod);
 }
 
-static void nvme_start_io_acct(struct bio *bio)
-{
-	struct gendisk *disk = bio->bi_bdev->bd_disk;
-	if (blk_queue_io_stat(disk->queue)) {
-		const int rw = bio_data_dir(bio);
-		int cpu = part_stat_lock();
-		part_round_stats(cpu, &disk->part0);
-		part_stat_inc(cpu, &disk->part0, ios[rw]);
-		part_stat_add(cpu, &disk->part0, sectors[rw],
-							bio_sectors(bio));
-		part_inc_in_flight(&disk->part0, rw);
-		part_stat_unlock();
-	}
-}
-
-static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
-{
-	struct gendisk *disk = bio->bi_bdev->bd_disk;
-	if (blk_queue_io_stat(disk->queue)) {
-		const int rw = bio_data_dir(bio);
-		unsigned long duration = jiffies - start_time;
-		int cpu = part_stat_lock();
-		part_stat_add(cpu, &disk->part0, ticks[rw], duration);
-		part_round_stats(cpu, &disk->part0);
-		part_dec_in_flight(&disk->part0, rw);
-		part_stat_unlock();
-	}
-}
-
-static void bio_completion(struct nvme_queue *nvmeq, void *ctx,
+static void req_completion(struct nvme_queue *nvmeq, void *ctx,
 						struct nvme_completion *cqe)
 {
 	struct nvme_iod *iod = ctx;
-	struct bio *bio = iod->private;
+	struct request *req = iod->private;
+	struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+
 	u16 status = le16_to_cpup(&cqe->status) >> 1;
-	int error = 0;
 
 	if (unlikely(status)) {
-		if (!(status & NVME_SC_DNR ||
-				bio->bi_rw & REQ_FAILFAST_MASK) &&
-				(jiffies - iod->start_time) < IOD_TIMEOUT) {
-			if (!waitqueue_active(&nvmeq->sq_full))
-				add_wait_queue(&nvmeq->sq_full,
-							&nvmeq->sq_cong_wait);
-			list_add_tail(&iod->node, &nvmeq->iod_bio);
-			wake_up(&nvmeq->sq_full);
+		if (!(status & NVME_SC_DNR || blk_noretry_request(req))
+		    && (jiffies - req->start_time) < req->timeout) {
+			blk_mq_requeue_request(req);
+			blk_mq_kick_requeue_list(req->q);
 			return;
 		}
-		error = -EIO;
-	}
-	if (iod->nents) {
-		dma_unmap_sg(nvmeq->q_dmadev, iod->sg, iod->nents,
-			bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
-		nvme_end_io_acct(bio, iod->start_time);
-	}
+		req->errors = -EIO;
+	} else
+		req->errors = 0;
+
+	if (cmd_rq->aborted)
+		dev_warn(&nvmeq->dev->pci_dev->dev,
+			"completing aborted command with status:%04x\n",
+			status);
+
+	if (iod->nents)
+		dma_unmap_sg(&nvmeq->dev->pci_dev->dev, iod->sg, iod->nents,
+			rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
 	nvme_free_iod(nvmeq->dev, iod);
 
-	trace_block_bio_complete(bdev_get_queue(bio->bi_bdev), bio, error);
-	bio_endio(bio, error);
+	blk_mq_complete_request(req);
 }
 
 /* length is in bytes.  gfp flags indicates whether we may sleep. */
@@ -557,88 +504,25 @@ int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod, int total_len,
 	return total_len;
 }
 
-static int nvme_split_and_submit(struct bio *bio, struct nvme_queue *nvmeq,
-				 int len)
-{
-	struct bio *split = bio_split(bio, len >> 9, GFP_ATOMIC, NULL);
-	if (!split)
-		return -ENOMEM;
-
-	trace_block_split(bdev_get_queue(bio->bi_bdev), bio,
-					split->bi_iter.bi_sector);
-	bio_chain(split, bio);
-
-	if (!waitqueue_active(&nvmeq->sq_full))
-		add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
-	bio_list_add(&nvmeq->sq_cong, split);
-	bio_list_add(&nvmeq->sq_cong, bio);
-	wake_up(&nvmeq->sq_full);
-
-	return 0;
-}
-
-/* NVMe scatterlists require no holes in the virtual address */
-#define BIOVEC_NOT_VIRT_MERGEABLE(vec1, vec2)	((vec2)->bv_offset || \
-			(((vec1)->bv_offset + (vec1)->bv_len) % PAGE_SIZE))
-
-static int nvme_map_bio(struct nvme_queue *nvmeq, struct nvme_iod *iod,
-		struct bio *bio, enum dma_data_direction dma_dir, int psegs)
-{
-	struct bio_vec bvec, bvprv;
-	struct bvec_iter iter;
-	struct scatterlist *sg = NULL;
-	int length = 0, nsegs = 0, split_len = bio->bi_iter.bi_size;
-	int first = 1;
-
-	if (nvmeq->dev->stripe_size)
-		split_len = nvmeq->dev->stripe_size -
-			((bio->bi_iter.bi_sector << 9) &
-			 (nvmeq->dev->stripe_size - 1));
-
-	sg_init_table(iod->sg, psegs);
-	bio_for_each_segment(bvec, bio, iter) {
-		if (!first && BIOVEC_PHYS_MERGEABLE(&bvprv, &bvec)) {
-			sg->length += bvec.bv_len;
-		} else {
-			if (!first && BIOVEC_NOT_VIRT_MERGEABLE(&bvprv, &bvec))
-				return nvme_split_and_submit(bio, nvmeq,
-							     length);
-
-			sg = sg ? sg + 1 : iod->sg;
-			sg_set_page(sg, bvec.bv_page,
-				    bvec.bv_len, bvec.bv_offset);
-			nsegs++;
-		}
-
-		if (split_len - length < bvec.bv_len)
-			return nvme_split_and_submit(bio, nvmeq, split_len);
-		length += bvec.bv_len;
-		bvprv = bvec;
-		first = 0;
-	}
-	iod->nents = nsegs;
-	sg_mark_end(sg);
-	if (dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir) == 0)
-		return -ENOMEM;
-
-	BUG_ON(length != bio->bi_iter.bi_size);
-	return length;
-}
-
-static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
-		struct bio *bio, struct nvme_iod *iod, int cmdid)
+/*
+ * We reuse the small pool to allocate the 16-byte range here as it is not
+ * worth having a special pool for these or additional cases to handle freeing
+ * the iod.
+ */
+static void nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+		struct request *req, struct nvme_iod *iod)
 {
 	struct nvme_dsm_range *range =
 				(struct nvme_dsm_range *)iod_list(iod)[0];
 	struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
 
 	range->cattr = cpu_to_le32(0);
-	range->nlb = cpu_to_le32(bio->bi_iter.bi_size >> ns->lba_shift);
-	range->slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
+	range->nlb = cpu_to_le32(blk_rq_bytes(req) >> ns->lba_shift);
+	range->slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 
 	memset(cmnd, 0, sizeof(*cmnd));
 	cmnd->dsm.opcode = nvme_cmd_dsm;
-	cmnd->dsm.command_id = cmdid;
+	cmnd->dsm.command_id = req->tag;
 	cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
 	cmnd->dsm.prp1 = cpu_to_le64(iod->first_dma);
 	cmnd->dsm.nr = 0;
@@ -647,11 +531,9 @@ static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	if (++nvmeq->sq_tail == nvmeq->q_depth)
 		nvmeq->sq_tail = 0;
 	writel(nvmeq->sq_tail, nvmeq->q_db);
-
-	return 0;
 }
 
-static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+static void nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 								int cmdid)
 {
 	struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
@@ -664,49 +546,34 @@ static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	if (++nvmeq->sq_tail == nvmeq->q_depth)
 		nvmeq->sq_tail = 0;
 	writel(nvmeq->sq_tail, nvmeq->q_db);
-
-	return 0;
 }
 
-static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
+static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
+							struct nvme_ns *ns)
 {
-	struct bio *bio = iod->private;
-	struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
+	struct request *req = iod->private;
 	struct nvme_command *cmnd;
-	int cmdid;
-	u16 control;
-	u32 dsmgmt;
+	u16 control = 0;
+	u32 dsmgmt = 0;
 
-	cmdid = alloc_cmdid(nvmeq, iod, bio_completion, NVME_IO_TIMEOUT);
-	if (unlikely(cmdid < 0))
-		return cmdid;
-
-	if (bio->bi_rw & REQ_DISCARD)
-		return nvme_submit_discard(nvmeq, ns, bio, iod, cmdid);
-	if (bio->bi_rw & REQ_FLUSH)
-		return nvme_submit_flush(nvmeq, ns, cmdid);
-
-	control = 0;
-	if (bio->bi_rw & REQ_FUA)
+	if (req->cmd_flags & REQ_FUA)
 		control |= NVME_RW_FUA;
-	if (bio->bi_rw & (REQ_FAILFAST_DEV | REQ_RAHEAD))
+	if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
 		control |= NVME_RW_LR;
 
-	dsmgmt = 0;
-	if (bio->bi_rw & REQ_RAHEAD)
+	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
 	cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
 	memset(cmnd, 0, sizeof(*cmnd));
 
-	cmnd->rw.opcode = bio_data_dir(bio) ? nvme_cmd_write : nvme_cmd_read;
-	cmnd->rw.command_id = cmdid;
+	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
+	cmnd->rw.command_id = req->tag;
 	cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
 	cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
 	cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
-	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
-	cmnd->rw.length =
-		cpu_to_le16((bio->bi_iter.bi_size >> ns->lba_shift) - 1);
+	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 	cmnd->rw.control = cpu_to_le16(control);
 	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
 
@@ -717,45 +584,32 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
 	return 0;
 }
 
-static int nvme_split_flush_data(struct nvme_queue *nvmeq, struct bio *bio)
-{
-	struct bio *split = bio_clone(bio, GFP_ATOMIC);
-	if (!split)
-		return -ENOMEM;
-
-	split->bi_iter.bi_size = 0;
-	split->bi_phys_segments = 0;
-	bio->bi_rw &= ~REQ_FLUSH;
-	bio_chain(split, bio);
-
-	if (!waitqueue_active(&nvmeq->sq_full))
-		add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
-	bio_list_add(&nvmeq->sq_cong, split);
-	bio_list_add(&nvmeq->sq_cong, bio);
-	wake_up_process(nvme_thread);
-
-	return 0;
-}
-
-/*
- * Called with local interrupts disabled and the q_lock held.  May not sleep.
- */
-static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
-								struct bio *bio)
+static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
+	struct nvme_ns *ns = hctx->queue->queuedata;
+	struct nvme_queue *nvmeq = hctx->driver_data;
+	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
 	struct nvme_iod *iod;
-	int psegs = bio_phys_segments(ns->queue, bio);
-	int result;
+	enum dma_data_direction dma_dir;
+	int psegs = req->nr_phys_segments;
+	int result = BLK_MQ_RQ_QUEUE_BUSY;
+	/*
+	 * Requeued IO has already been prepped
+	 */
+	iod = req->special;
+	if (iod)
+		goto submit_iod;
 
-	if ((bio->bi_rw & REQ_FLUSH) && psegs)
-		return nvme_split_flush_data(nvmeq, bio);
-
-	iod = nvme_alloc_iod(psegs, bio->bi_iter.bi_size, ns->dev, GFP_ATOMIC);
+	iod = nvme_alloc_iod(psegs, blk_rq_bytes(req), ns->dev, GFP_ATOMIC);
 	if (!iod)
-		return -ENOMEM;
+		return result;
 
-	iod->private = bio;
-	if (bio->bi_rw & REQ_DISCARD) {
+	iod->private = req;
+	req->special = iod;
+
+	nvme_set_info(cmd, iod, req_completion);
+
+	if (req->cmd_flags & REQ_DISCARD) {
 		void *range;
 		/*
 		 * We reuse the small pool to allocate the 16-byte range here
@@ -765,33 +619,49 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 		range = dma_pool_alloc(nvmeq->dev->prp_small_pool,
 						GFP_ATOMIC,
 						&iod->first_dma);
-		if (!range) {
-			result = -ENOMEM;
-			goto free_iod;
-		}
+		if (!range)
+			goto finish_cmd;
 		iod_list(iod)[0] = (__le64 *)range;
 		iod->npages = 0;
 	} else if (psegs) {
-		result = nvme_map_bio(nvmeq, iod, bio,
-			bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
-			psegs);
-		if (result <= 0)
-			goto free_iod;
-		if (nvme_setup_prps(nvmeq->dev, iod, result, GFP_ATOMIC) !=
-								result) {
-			result = -ENOMEM;
-			goto free_iod;
+		dma_dir = rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+
+		sg_init_table(iod->sg, psegs);
+		iod->nents = blk_rq_map_sg(req->q, req, iod->sg);
+		if (!iod->nents) {
+			result = BLK_MQ_RQ_QUEUE_ERROR;
+			goto finish_cmd;
 		}
-		nvme_start_io_acct(bio);
+
+		if (!dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir))
+			goto finish_cmd;
+
+		if (blk_rq_bytes(req) != nvme_setup_prps(nvmeq->dev, iod,
+						blk_rq_bytes(req), GFP_ATOMIC))
+			goto finish_cmd;
 	}
-	if (unlikely(nvme_submit_iod(nvmeq, iod))) {
-		if (!waitqueue_active(&nvmeq->sq_full))
-			add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
-		list_add_tail(&iod->node, &nvmeq->iod_bio);
+
+ submit_iod:
+	spin_lock_irq(&nvmeq->q_lock);
+	if (nvmeq->q_suspended) {
+		spin_unlock_irq(&nvmeq->q_lock);
+		goto finish_cmd;
 	}
-	return 0;
 
- free_iod:
+	if (req->cmd_flags & REQ_DISCARD)
+		nvme_submit_discard(nvmeq, ns, req, iod);
+	else if (req->cmd_flags & REQ_FLUSH)
+		nvme_submit_flush(nvmeq, ns, req->tag);
+	else
+		nvme_submit_iod(nvmeq, iod, ns);
+
+ queued:
+	nvme_process_cq(nvmeq);
+	spin_unlock_irq(&nvmeq->q_lock);
+	return BLK_MQ_RQ_QUEUE_OK;
+
+ finish_cmd:
+	nvme_finish_cmd(nvmeq, req->tag, NULL);
 	nvme_free_iod(nvmeq->dev, iod);
 	return result;
 }
@@ -814,8 +684,7 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
 			head = 0;
 			phase = !phase;
 		}
-
-		ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
+		ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
 		fn(nvmeq, ctx, &cqe);
 	}
 
@@ -836,29 +705,12 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
 	return 1;
 }
 
-static void nvme_make_request(struct request_queue *q, struct bio *bio)
+/* Admin queue isn't initialized as a request queue. If at some point this
+ * happens anyway, make sure to notify the user */
+static int nvme_admin_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
-	struct nvme_ns *ns = q->queuedata;
-	struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
-	int result = -EBUSY;
-
-	if (!nvmeq) {
-		bio_endio(bio, -EIO);
-		return;
-	}
-
-	spin_lock_irq(&nvmeq->q_lock);
-	if (!nvmeq->q_suspended && bio_list_empty(&nvmeq->sq_cong))
-		result = nvme_submit_bio_queue(nvmeq, ns, bio);
-	if (unlikely(result)) {
-		if (!waitqueue_active(&nvmeq->sq_full))
-			add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
-		bio_list_add(&nvmeq->sq_cong, bio);
-	}
-
-	nvme_process_cq(nvmeq);
-	spin_unlock_irq(&nvmeq->q_lock);
-	put_nvmeq(nvmeq);
+	WARN_ON_ONCE(1);
+	return BLK_MQ_RQ_QUEUE_ERROR;
 }
 
 static irqreturn_t nvme_irq(int irq, void *data)
@@ -882,10 +734,11 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_WAKE_THREAD;
 }
 
-static void nvme_abort_command(struct nvme_queue *nvmeq, int cmdid)
+static void nvme_abort_cmd_info(struct nvme_queue *nvmeq, struct nvme_cmd_info *
+								cmd_info)
 {
 	spin_lock_irq(&nvmeq->q_lock);
-	cancel_cmdid(nvmeq, cmdid, NULL);
+	cancel_cmd_info(cmd_info, NULL);
 	spin_unlock_irq(&nvmeq->q_lock);
 }
 
@@ -908,45 +761,31 @@ static void sync_completion(struct nvme_queue *nvmeq, void *ctx,
  * Returns 0 on success.  If the result is negative, it's a Linux error code;
  * if the result is positive, it's an NVM Express status code
  */
-static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
-						struct nvme_command *cmd,
+static int nvme_submit_sync_cmd(struct request *req, struct nvme_command *cmd,
 						u32 *result, unsigned timeout)
 {
-	int cmdid, ret;
+	int ret;
 	struct sync_cmd_info cmdinfo;
-	struct nvme_queue *nvmeq;
-
-	nvmeq = lock_nvmeq(dev, q_idx);
-	if (!nvmeq)
-		return -ENODEV;
+	struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = cmd_rq->nvmeq;
 
 	cmdinfo.task = current;
 	cmdinfo.status = -EINTR;
 
-	cmdid = alloc_cmdid(nvmeq, &cmdinfo, sync_completion, timeout);
-	if (cmdid < 0) {
-		unlock_nvmeq(nvmeq);
-		return cmdid;
-	}
-	cmd->common.command_id = cmdid;
+	cmd->common.command_id = req->tag;
+
+	nvme_set_info(cmd_rq, &cmdinfo, sync_completion);
 
 	set_current_state(TASK_KILLABLE);
 	ret = nvme_submit_cmd(nvmeq, cmd);
 	if (ret) {
-		free_cmdid(nvmeq, cmdid, NULL);
-		unlock_nvmeq(nvmeq);
+		nvme_finish_cmd(nvmeq, req->tag, NULL);
 		set_current_state(TASK_RUNNING);
-		return ret;
 	}
-	unlock_nvmeq(nvmeq);
 	schedule_timeout(timeout);
 
 	if (cmdinfo.status == -EINTR) {
-		nvmeq = lock_nvmeq(dev, q_idx);
-		if (nvmeq) {
-			nvme_abort_command(nvmeq, cmdid);
-			unlock_nvmeq(nvmeq);
-		}
+		nvme_abort_cmd_info(nvmeq, blk_mq_rq_to_pdu(req));
 		return -EINTR;
 	}
 
@@ -956,59 +795,99 @@ static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
 	return cmdinfo.status;
 }
 
-static int nvme_submit_async_cmd(struct nvme_queue *nvmeq,
+static int nvme_submit_async_admin_req(struct nvme_dev *dev)
+{
+	struct nvme_queue *nvmeq = dev->queues[0];
+	struct nvme_command c;
+	struct nvme_cmd_info *cmd_info;
+	struct request *req;
+
+	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+	if (!req)
+		return -ENOMEM;
+
+	cmd_info = blk_mq_rq_to_pdu(req);
+	nvme_set_info(cmd_info, req, async_req_completion);
+
+	memset(&c, 0, sizeof(c));
+	c.common.opcode = nvme_admin_async_event;
+	c.common.command_id = req->tag;
+
+	return __nvme_submit_cmd(nvmeq, &c);
+}
+
+static int nvme_submit_admin_async_cmd(struct nvme_dev *dev,
 			struct nvme_command *cmd,
 			struct async_cmd_info *cmdinfo, unsigned timeout)
 {
-	int cmdid;
+	struct nvme_queue *nvmeq = dev->queues[0];
+	struct request *req;
+	struct nvme_cmd_info *cmd_rq;
 
-	cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, async_completion, timeout);
-	if (cmdid < 0)
-		return cmdid;
+	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+	if (!req)
+		return -ENOMEM;
+
+	req->timeout = timeout;
+	cmd_rq = blk_mq_rq_to_pdu(req);
+	cmdinfo->req = req;
+	nvme_set_info(cmd_rq, cmdinfo, async_completion);
 	cmdinfo->status = -EINTR;
-	cmd->common.command_id = cmdid;
+
+	cmd->common.command_id = req->tag;
+
 	return nvme_submit_cmd(nvmeq, cmd);
 }
 
+int __nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
+						u32 *result, unsigned timeout)
+{
+	int res;
+	struct request *req;
+
+	req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+	if (!req)
+		return -ENOMEM;
+	res = nvme_submit_sync_cmd(req, cmd, result, timeout);
+	blk_put_request(req);
+	return res;
+}
+
 int nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
 								u32 *result)
 {
-	return nvme_submit_sync_cmd(dev, 0, cmd, result, ADMIN_TIMEOUT);
+	return __nvme_submit_admin_cmd(dev, cmd, result, ADMIN_TIMEOUT);
 }
 
-int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
-								u32 *result)
+int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_ns *ns,
+					struct nvme_command *cmd, u32 *result)
 {
-	return nvme_submit_sync_cmd(dev, smp_processor_id() + 1, cmd, result,
-							NVME_IO_TIMEOUT);
-}
+	int res;
+	struct request *req;
 
-static int nvme_submit_admin_cmd_async(struct nvme_dev *dev,
-		struct nvme_command *cmd, struct async_cmd_info *cmdinfo)
-{
-	return nvme_submit_async_cmd(raw_nvmeq(dev, 0), cmd, cmdinfo,
-								ADMIN_TIMEOUT);
+	req = blk_mq_alloc_request(ns->queue, WRITE, (GFP_KERNEL|__GFP_WAIT),
+									false);
+	if (!req)
+		return -ENOMEM;
+	res = nvme_submit_sync_cmd(req, cmd, result, NVME_IO_TIMEOUT);
+	blk_put_request(req);
+	return res;
 }
 
 static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
 {
-	int status;
 	struct nvme_command c;
 
 	memset(&c, 0, sizeof(c));
 	c.delete_queue.opcode = opcode;
 	c.delete_queue.qid = cpu_to_le16(id);
 
-	status = nvme_submit_admin_cmd(dev, &c, NULL);
-	if (status)
-		return -EIO;
-	return 0;
+	return nvme_submit_admin_cmd(dev, &c, NULL);
 }
 
 static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 						struct nvme_queue *nvmeq)
 {
-	int status;
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
 
@@ -1020,16 +899,12 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cq_flags = cpu_to_le16(flags);
 	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
 
-	status = nvme_submit_admin_cmd(dev, &c, NULL);
-	if (status)
-		return -EIO;
-	return 0;
+	return nvme_submit_admin_cmd(dev, &c, NULL);
 }
 
 static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
 						struct nvme_queue *nvmeq)
 {
-	int status;
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_SQ_PRIO_MEDIUM;
 
@@ -1041,10 +916,7 @@ static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
 	c.create_sq.sq_flags = cpu_to_le16(flags);
 	c.create_sq.cqid = cpu_to_le16(qid);
 
-	status = nvme_submit_admin_cmd(dev, &c, NULL);
-	if (status)
-		return -EIO;
-	return 0;
+	return nvme_submit_admin_cmd(dev, &c, NULL);
 }
 
 static int adapter_delete_cq(struct nvme_dev *dev, u16 cqid)
@@ -1100,28 +972,27 @@ int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
 }
 
 /**
- * nvme_abort_cmd - Attempt aborting a command
- * @cmdid: Command id of a timed out IO
- * @queue: The queue with timed out IO
+ * nvme_abort_req - Attempt aborting a request
  *
  * Schedule controller reset if the command was already aborted once before and
  * still hasn't been returned to the driver, or if this is the admin queue.
  */
-static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
+static void nvme_abort_req(struct request *req)
 {
-	int a_cmdid;
+	struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = cmd_rq->nvmeq;
+	struct nvme_dev *dev = nvmeq->dev;
+	struct request *abort_req;
+	struct nvme_cmd_info *abort_cmd;
 	struct nvme_command cmd;
-	struct nvme_dev *dev = nvmeq->dev;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-	struct nvme_queue *adminq;
 
-	if (!nvmeq->qid || info[cmdid].aborted) {
+	if (!nvmeq->qid || cmd_rq->aborted) {
 		if (work_busy(&dev->reset_work))
 			return;
 		list_del_init(&dev->node);
 		dev_warn(&dev->pci_dev->dev,
-			"I/O %d QID %d timeout, reset controller\n", cmdid,
-								nvmeq->qid);
+			"I/O %d QID %d timeout, reset controller\n",
+							req->tag, nvmeq->qid);
 		dev->reset_workfn = nvme_reset_failed_dev;
 		queue_work(nvme_workq, &dev->reset_work);
 		return;
@@ -1130,91 +1001,93 @@ static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
 	if (!dev->abort_limit)
 		return;
 
-	adminq = rcu_dereference(dev->queues[0]);
-	a_cmdid = alloc_cmdid(adminq, CMD_CTX_ABORT, special_completion,
-								ADMIN_TIMEOUT);
-	if (a_cmdid < 0)
+	abort_req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC,
+									false);
+	if (!abort_req)
 		return;
 
+	abort_cmd = blk_mq_rq_to_pdu(abort_req);
+	nvme_set_info(abort_cmd, abort_req, abort_completion);
+
 	memset(&cmd, 0, sizeof(cmd));
 	cmd.abort.opcode = nvme_admin_abort_cmd;
-	cmd.abort.cid = cmdid;
+	cmd.abort.cid = req->tag;
 	cmd.abort.sqid = cpu_to_le16(nvmeq->qid);
-	cmd.abort.command_id = a_cmdid;
+	cmd.abort.command_id = abort_req->tag;
 
 	--dev->abort_limit;
-	info[cmdid].aborted = 1;
-	info[cmdid].timeout = jiffies + ADMIN_TIMEOUT;
+	cmd_rq->aborted = 1;
 
-	dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", cmdid,
+	dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", req->tag,
 							nvmeq->qid);
-	nvme_submit_cmd(adminq, &cmd);
+	if (nvme_submit_cmd(dev->queues[0], &cmd) < 0) {
+		dev_warn(nvmeq->q_dmadev,
+				"Could not abort I/O %d QID %d\n",
+				req->tag, nvmeq->qid);
+		blk_put_request(abort_req);
+	}
 }
 
-/**
- * nvme_cancel_ios - Cancel outstanding I/Os
- * @queue: The queue to cancel I/Os on
- * @timeout: True to only cancel I/Os which have timed out
- */
-static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
+static void nvme_cancel_queue_ios(void *data, unsigned long *tag_map)
 {
-	int depth = nvmeq->q_depth - 1;
-	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-	unsigned long now = jiffies;
-	int cmdid;
+	struct nvme_queue *nvmeq = data;
+	unsigned int tag = 0;
 
-	for_each_set_bit(cmdid, nvmeq->cmdid_data, depth) {
+	tag = 0;
+	do {
+		struct request *req;
 		void *ctx;
 		nvme_completion_fn fn;
+		struct nvme_cmd_info *cmd;
 		static struct nvme_completion cqe = {
 			.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
 		};
+		int qdepth = nvmeq == nvmeq->dev->queues[0] ?
+					nvmeq->dev->admin_tagset.queue_depth :
+					nvmeq->dev->tagset.queue_depth;
 
-		if (timeout && !time_after(now, info[cmdid].timeout))
-			continue;
-		if (info[cmdid].ctx == CMD_CTX_CANCELLED)
-			continue;
-		if (timeout && info[cmdid].ctx == CMD_CTX_ASYNC)
-			continue;
-		if (timeout && nvmeq->dev->initialized) {
-			nvme_abort_cmd(cmdid, nvmeq);
+		/* zero'd bits are free tags */
+		tag = find_next_zero_bit(tag_map, qdepth, tag);
+		if (tag >= qdepth)
+			break;
+
+		req = blk_mq_tag_to_rq(nvmeq->tags, tag++);
+		cmd = blk_mq_rq_to_pdu(req);
+
+		if (cmd->ctx == CMD_CTX_CANCELLED)
 			continue;
-		}
-		dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", cmdid,
-								nvmeq->qid);
-		ctx = cancel_cmdid(nvmeq, cmdid, &fn);
+
+		dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n",
+							req->tag, nvmeq->qid);
+		ctx = cancel_cmd_info(cmd, &fn);
 		fn(nvmeq, ctx, &cqe);
-	}
+	} while (1);
 }
 
-static void nvme_free_queue(struct rcu_head *r)
+static enum blk_eh_timer_return nvme_timeout(struct request *req)
 {
-	struct nvme_queue *nvmeq = container_of(r, struct nvme_queue, r_head);
+	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = cmd->nvmeq;
 
-	spin_lock_irq(&nvmeq->q_lock);
-	while (bio_list_peek(&nvmeq->sq_cong)) {
-		struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
-		bio_endio(bio, -EIO);
-	}
-	while (!list_empty(&nvmeq->iod_bio)) {
-		static struct nvme_completion cqe = {
-			.status = cpu_to_le16(
-				(NVME_SC_ABORT_REQ | NVME_SC_DNR) << 1),
-		};
-		struct nvme_iod *iod = list_first_entry(&nvmeq->iod_bio,
-							struct nvme_iod,
-							node);
-		list_del(&iod->node);
-		bio_completion(nvmeq, iod, &cqe);
-	}
-	spin_unlock_irq(&nvmeq->q_lock);
+	dev_warn(nvmeq->q_dmadev, "Timeout I/O %d QID %d\n", req->tag,
+							nvmeq->qid);
+	if (nvmeq->dev->initialized)
+		nvme_abort_req(req);
 
+	/*
+	 * The aborted req will be completed on receiving the abort req.
+	 * We enable the timer again. If hit twice, it'll cause a device reset,
+	 * as the device then is in a faulty state.
+	 */
+	return BLK_EH_RESET_TIMER;
+}
+
+static void nvme_free_queue(struct nvme_queue *nvmeq)
+{
 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
 				(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
 	dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
 					nvmeq->sq_cmds, nvmeq->sq_dma_addr);
-	if (nvmeq->qid)
-		free_cpumask_var(nvmeq->cpu_mask);
 	kfree(nvmeq);
 }
 
@@ -1223,10 +1096,10 @@ static void nvme_free_queues(struct nvme_dev *dev, int lowest)
 	int i;
 
 	for (i = dev->queue_count - 1; i >= lowest; i--) {
-		struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
-		rcu_assign_pointer(dev->queues[i], NULL);
-		call_rcu(&nvmeq->r_head, nvme_free_queue);
+		struct nvme_queue *nvmeq = dev->queues[i];
 		dev->queue_count--;
+		dev->queues[i] = NULL;
+		nvme_free_queue(nvmeq);
 	}
 }
 
@@ -1259,13 +1132,14 @@ static void nvme_clear_queue(struct nvme_queue *nvmeq)
 {
 	spin_lock_irq(&nvmeq->q_lock);
 	nvme_process_cq(nvmeq);
-	nvme_cancel_ios(nvmeq, false);
+	if (nvmeq->tags)
+		blk_mq_tag_busy_iter(nvmeq->tags, nvme_cancel_queue_ios, nvmeq);
 	spin_unlock_irq(&nvmeq->q_lock);
 }
 
 static void nvme_disable_queue(struct nvme_dev *dev, int qid)
 {
-	struct nvme_queue *nvmeq = raw_nvmeq(dev, qid);
+	struct nvme_queue *nvmeq = dev->queues[qid];
 
 	if (!nvmeq)
 		return;
@@ -1285,8 +1159,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 							int depth, int vector)
 {
 	struct device *dmadev = &dev->pci_dev->dev;
-	unsigned extra = nvme_queue_extra(depth);
-	struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq) + extra, GFP_KERNEL);
+	struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq), GFP_KERNEL);
 	if (!nvmeq)
 		return NULL;
 
@@ -1300,9 +1173,6 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	if (!nvmeq->sq_cmds)
 		goto free_cqdma;
 
-	if (qid && !zalloc_cpumask_var(&nvmeq->cpu_mask, GFP_KERNEL))
-		goto free_sqdma;
-
 	nvmeq->q_dmadev = dmadev;
 	nvmeq->dev = dev;
 	snprintf(nvmeq->irqname, sizeof(nvmeq->irqname), "nvme%dq%d",
@@ -1310,23 +1180,16 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	spin_lock_init(&nvmeq->q_lock);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
-	init_waitqueue_head(&nvmeq->sq_full);
-	init_waitqueue_entry(&nvmeq->sq_cong_wait, nvme_thread);
-	bio_list_init(&nvmeq->sq_cong);
-	INIT_LIST_HEAD(&nvmeq->iod_bio);
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
 	nvmeq->q_depth = depth;
 	nvmeq->cq_vector = vector;
 	nvmeq->qid = qid;
 	nvmeq->q_suspended = 1;
 	dev->queue_count++;
-	rcu_assign_pointer(dev->queues[qid], nvmeq);
+	dev->queues[qid] = nvmeq;
 
 	return nvmeq;
 
- free_sqdma:
-	dma_free_coherent(dmadev, SQ_SIZE(depth), (void *)nvmeq->sq_cmds,
-							nvmeq->sq_dma_addr);
  free_cqdma:
 	dma_free_coherent(dmadev, CQ_SIZE(depth), (void *)nvmeq->cqes,
 							nvmeq->cq_dma_addr);
@@ -1349,15 +1212,12 @@ static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
 static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
-	unsigned extra = nvme_queue_extra(nvmeq->q_depth);
 
 	nvmeq->sq_tail = 0;
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
-	memset(nvmeq->cmdid_data, 0, extra);
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
-	nvme_cancel_ios(nvmeq, false);
 	nvmeq->q_suspended = 0;
 	dev->online_queues++;
 }
@@ -1463,6 +1323,54 @@ static int nvme_shutdown_ctrl(struct nvme_dev *dev)
 	return 0;
 }
 
+static struct blk_mq_ops nvme_mq_admin_ops = {
+	.queue_rq	= nvme_admin_queue_rq,
+	.map_queue	= blk_mq_map_queue,
+	.init_hctx	= nvme_admin_init_hctx,
+	.init_request	= nvme_admin_init_request,
+	.timeout	= nvme_timeout,
+};
+
+static struct blk_mq_ops nvme_mq_ops = {
+	.queue_rq	= nvme_queue_rq,
+	.map_queue	= blk_mq_map_queue,
+	.init_hctx	= nvme_init_hctx,
+	.init_request	= nvme_init_request,
+	.timeout	= nvme_timeout,
+};
+
+static int nvme_alloc_admin_tags(struct nvme_dev *dev)
+{
+	if (!dev->admin_q) {
+		dev->admin_tagset.ops = &nvme_mq_admin_ops;
+		dev->admin_tagset.nr_hw_queues = 1;
+		dev->admin_tagset.queue_depth = NVME_AQ_DEPTH;
+		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
+		dev->admin_tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+		dev->admin_tagset.cmd_size = sizeof(struct nvme_cmd_info);
+		dev->admin_tagset.driver_data = dev;
+
+		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
+			return -ENOMEM;
+
+		dev->queues[0]->tags = dev->admin_tagset.tags[0];
+
+		dev->admin_q = blk_mq_init_queue(&dev->admin_tagset);
+		if (!dev->admin_q) {
+			blk_mq_free_tag_set(&dev->admin_tagset);
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+static void nvme_free_admin_tags(struct nvme_dev *dev)
+{
+	if (dev->admin_q)
+		blk_mq_free_tag_set(&dev->admin_tagset);
+}
+
 static int nvme_configure_admin_queue(struct nvme_dev *dev)
 {
 	int result;
@@ -1492,9 +1400,9 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 	if (result < 0)
 		return result;
 
-	nvmeq = raw_nvmeq(dev, 0);
+	nvmeq = dev->queues[0];
 	if (!nvmeq) {
-		nvmeq = nvme_alloc_queue(dev, 0, 64, 0);
+		nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH, 0);
 		if (!nvmeq)
 			return -ENOMEM;
 	}
@@ -1515,16 +1423,26 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 
 	result = nvme_enable_ctrl(dev, cap);
 	if (result)
-		return result;
+		goto free_nvmeq;
+
+	result = nvme_alloc_admin_tags(dev);
+	if (result)
+		goto free_nvmeq;
 
 	result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
 	if (result)
-		return result;
+		goto free_tags;
 
 	spin_lock_irq(&nvmeq->q_lock);
 	nvme_init_queue(nvmeq, 0);
 	spin_unlock_irq(&nvmeq->q_lock);
 	return result;
+
+ free_tags:
+	nvme_free_admin_tags(dev);
+ free_nvmeq:
+	nvme_free_queues(dev, 0);
+	return result;
 }
 
 struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
@@ -1682,7 +1600,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	if (length != (io.nblocks + 1) << ns->lba_shift)
 		status = -ENOMEM;
 	else
-		status = nvme_submit_io_cmd(dev, &c, NULL);
+		status = nvme_submit_io_cmd(dev, ns, &c, NULL);
 
 	if (meta_len) {
 		if (status == NVME_SC_SUCCESS && !(io.opcode & 1)) {
@@ -1754,10 +1672,11 @@ static int nvme_user_admin_cmd(struct nvme_dev *dev,
 
 	timeout = cmd.timeout_ms ? msecs_to_jiffies(cmd.timeout_ms) :
 								ADMIN_TIMEOUT;
+
 	if (length != cmd.data_len)
 		status = -ENOMEM;
 	else
-		status = nvme_submit_sync_cmd(dev, 0, &c, &cmd.result, timeout);
+		status = __nvme_submit_admin_cmd(dev, &c, &cmd.result, timeout);
 
 	if (cmd.data_len) {
 		nvme_unmap_user_pages(dev, cmd.opcode & 1, iod);
@@ -1846,62 +1765,6 @@ static const struct block_device_operations nvme_fops = {
 	.getgeo		= nvme_getgeo,
 };
 
-static void nvme_resubmit_iods(struct nvme_queue *nvmeq)
-{
-	struct nvme_iod *iod, *next;
-
-	list_for_each_entry_safe(iod, next, &nvmeq->iod_bio, node) {
-		if (unlikely(nvme_submit_iod(nvmeq, iod)))
-			break;
-		list_del(&iod->node);
-		if (bio_list_empty(&nvmeq->sq_cong) &&
-						list_empty(&nvmeq->iod_bio))
-			remove_wait_queue(&nvmeq->sq_full,
-						&nvmeq->sq_cong_wait);
-	}
-}
-
-static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
-{
-	while (bio_list_peek(&nvmeq->sq_cong)) {
-		struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
-		struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
-
-		if (bio_list_empty(&nvmeq->sq_cong) &&
-						list_empty(&nvmeq->iod_bio))
-			remove_wait_queue(&nvmeq->sq_full,
-							&nvmeq->sq_cong_wait);
-		if (nvme_submit_bio_queue(nvmeq, ns, bio)) {
-			if (!waitqueue_active(&nvmeq->sq_full))
-				add_wait_queue(&nvmeq->sq_full,
-							&nvmeq->sq_cong_wait);
-			bio_list_add_head(&nvmeq->sq_cong, bio);
-			break;
-		}
-	}
-}
-
-static int nvme_submit_async_req(struct nvme_queue *nvmeq)
-{
-	struct nvme_command *c;
-	int cmdid;
-
-	cmdid = alloc_cmdid(nvmeq, CMD_CTX_ASYNC, special_completion, 0);
-	if (cmdid < 0)
-		return cmdid;
-
-	c = &nvmeq->sq_cmds[nvmeq->sq_tail];
-	memset(c, 0, sizeof(*c));
-	c->common.opcode = nvme_admin_async_event;
-	c->common.command_id = cmdid;
-
-	if (++nvmeq->sq_tail == nvmeq->q_depth)
-		nvmeq->sq_tail = 0;
-	writel(nvmeq->sq_tail, nvmeq->q_db);
-
-	return 0;
-}
-
 static int nvme_kthread(void *data)
 {
 	struct nvme_dev *dev, *next;
@@ -1917,34 +1780,29 @@ static int nvme_kthread(void *data)
 					continue;
 				list_del_init(&dev->node);
 				dev_warn(&dev->pci_dev->dev,
-					"Failed status, reset controller\n");
+					"Failed status: %x, reset controller\n",
+					readl(&dev->bar->csts));
 				dev->reset_workfn = nvme_reset_failed_dev;
 				queue_work(nvme_workq, &dev->reset_work);
 				continue;
 			}
-			rcu_read_lock();
 			for (i = 0; i < dev->queue_count; i++) {
-				struct nvme_queue *nvmeq =
-						rcu_dereference(dev->queues[i]);
+				struct nvme_queue *nvmeq = dev->queues[i];
 				if (!nvmeq)
 					continue;
 				spin_lock_irq(&nvmeq->q_lock);
 				if (nvmeq->q_suspended)
 					goto unlock;
 				nvme_process_cq(nvmeq);
-				nvme_cancel_ios(nvmeq, true);
-				nvme_resubmit_bios(nvmeq);
-				nvme_resubmit_iods(nvmeq);
 
 				while ((i == 0) && (dev->event_limit > 0)) {
-					if (nvme_submit_async_req(nvmeq))
+					if (nvme_submit_async_admin_req(dev))
 						break;
 					dev->event_limit--;
 				}
  unlock:
 				spin_unlock_irq(&nvmeq->q_lock);
 			}
-			rcu_read_unlock();
 		}
 		spin_unlock(&dev_list_lock);
 		schedule_timeout(round_jiffies_relative(HZ));
@@ -1967,27 +1825,30 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 {
 	struct nvme_ns *ns;
 	struct gendisk *disk;
+	int node = dev_to_node(&dev->pci_dev->dev);
 	int lbaf;
 
 	if (rt->attributes & NVME_LBART_ATTRIB_HIDE)
 		return NULL;
 
-	ns = kzalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
 		return NULL;
-	ns->queue = blk_alloc_queue(GFP_KERNEL);
+	ns->queue = blk_mq_init_queue(&dev->tagset);
 	if (!ns->queue)
 		goto out_free_ns;
-	ns->queue->queue_flags = QUEUE_FLAG_DEFAULT;
+	queue_flag_set_unlocked(QUEUE_FLAG_DEFAULT, ns->queue);
 	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
-	blk_queue_make_request(ns->queue, nvme_make_request);
+	queue_flag_set_unlocked(QUEUE_FLAG_SG_GAPS, ns->queue);
+	queue_flag_clear_unlocked(QUEUE_FLAG_IO_STAT, ns->queue);
 	ns->dev = dev;
 	ns->queue->queuedata = ns;
 
-	disk = alloc_disk(0);
+	disk = alloc_disk_node(0, node);
 	if (!disk)
 		goto out_free_queue;
+
 	ns->ns_id = nsid;
 	ns->disk = disk;
 	lbaf = id->flbas & 0xf;
@@ -1996,6 +1857,8 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
 	if (dev->max_hw_sectors)
 		blk_queue_max_hw_sectors(ns->queue, dev->max_hw_sectors);
+	if (dev->stripe_size)
+		blk_queue_chunk_sectors(ns->queue, dev->stripe_size >> 9);
 	if (dev->vwc & NVME_CTRL_VWC_PRESENT)
 		blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
 
@@ -2021,143 +1884,19 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
 	return NULL;
 }
 
-static int nvme_find_closest_node(int node)
-{
-	int n, val, min_val = INT_MAX, best_node = node;
-
-	for_each_online_node(n) {
-		if (n == node)
-			continue;
-		val = node_distance(node, n);
-		if (val < min_val) {
-			min_val = val;
-			best_node = n;
-		}
-	}
-	return best_node;
-}
-
-static void nvme_set_queue_cpus(cpumask_t *qmask, struct nvme_queue *nvmeq,
-								int count)
-{
-	int cpu;
-	for_each_cpu(cpu, qmask) {
-		if (cpumask_weight(nvmeq->cpu_mask) >= count)
-			break;
-		if (!cpumask_test_and_set_cpu(cpu, nvmeq->cpu_mask))
-			*per_cpu_ptr(nvmeq->dev->io_queue, cpu) = nvmeq->qid;
-	}
-}
-
-static void nvme_add_cpus(cpumask_t *mask, const cpumask_t *unassigned_cpus,
-	const cpumask_t *new_mask, struct nvme_queue *nvmeq, int cpus_per_queue)
-{
-	int next_cpu;
-	for_each_cpu(next_cpu, new_mask) {
-		cpumask_or(mask, mask, get_cpu_mask(next_cpu));
-		cpumask_or(mask, mask, topology_thread_cpumask(next_cpu));
-		cpumask_and(mask, mask, unassigned_cpus);
-		nvme_set_queue_cpus(mask, nvmeq, cpus_per_queue);
-	}
-}
-
 static void nvme_create_io_queues(struct nvme_dev *dev)
 {
-	unsigned i, max;
+	unsigned i;
 
-	max = min(dev->max_qid, num_online_cpus());
-	for (i = dev->queue_count; i <= max; i++)
+	for (i = dev->queue_count; i <= dev->max_qid; i++)
 		if (!nvme_alloc_queue(dev, i, dev->q_depth, i - 1))
 			break;
 
-	max = min(dev->queue_count - 1, num_online_cpus());
-	for (i = dev->online_queues; i <= max; i++)
-		if (nvme_create_queue(raw_nvmeq(dev, i), i))
+	for (i = dev->online_queues; i <= dev->queue_count - 1; i++)
+		if (nvme_create_queue(dev->queues[i], i))
 			break;
 }
 
-/*
- * If there are fewer queues than online cpus, this will try to optimally
- * assign a queue to multiple cpus by grouping cpus that are "close" together:
- * thread siblings, core, socket, closest node, then whatever else is
- * available.
- */
-static void nvme_assign_io_queues(struct nvme_dev *dev)
-{
-	unsigned cpu, cpus_per_queue, queues, remainder, i;
-	cpumask_var_t unassigned_cpus;
-
-	nvme_create_io_queues(dev);
-
-	queues = min(dev->online_queues - 1, num_online_cpus());
-	if (!queues)
-		return;
-
-	cpus_per_queue = num_online_cpus() / queues;
-	remainder = queues - (num_online_cpus() - queues * cpus_per_queue);
-
-	if (!alloc_cpumask_var(&unassigned_cpus, GFP_KERNEL))
-		return;
-
-	cpumask_copy(unassigned_cpus, cpu_online_mask);
-	cpu = cpumask_first(unassigned_cpus);
-	for (i = 1; i <= queues; i++) {
-		struct nvme_queue *nvmeq = lock_nvmeq(dev, i);
-		cpumask_t mask;
-
-		cpumask_clear(nvmeq->cpu_mask);
-		if (!cpumask_weight(unassigned_cpus)) {
-			unlock_nvmeq(nvmeq);
-			break;
-		}
-
-		mask = *get_cpu_mask(cpu);
-		nvme_set_queue_cpus(&mask, nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				topology_thread_cpumask(cpu),
-				nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				topology_core_cpumask(cpu),
-				nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				cpumask_of_node(cpu_to_node(cpu)),
-				nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				cpumask_of_node(
-					nvme_find_closest_node(
-						cpu_to_node(cpu))),
-				nvmeq, cpus_per_queue);
-		if (cpus_weight(mask) < cpus_per_queue)
-			nvme_add_cpus(&mask, unassigned_cpus,
-				unassigned_cpus,
-				nvmeq, cpus_per_queue);
-
-		WARN(cpumask_weight(nvmeq->cpu_mask) != cpus_per_queue,
-			"nvme%d qid:%d mis-matched queue-to-cpu assignment\n",
-			dev->instance, i);
-
-		irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
-							nvmeq->cpu_mask);
-		cpumask_andnot(unassigned_cpus, unassigned_cpus,
-						nvmeq->cpu_mask);
-		cpu = cpumask_next(cpu, unassigned_cpus);
-		if (remainder && !--remainder)
-			cpus_per_queue++;
-		unlock_nvmeq(nvmeq);
-	}
-	WARN(cpumask_weight(unassigned_cpus), "nvme%d unassigned online cpus\n",
-								dev->instance);
-	i = 0;
-	cpumask_andnot(unassigned_cpus, cpu_possible_mask, cpu_online_mask);
-	for_each_cpu(cpu, unassigned_cpus)
-		*per_cpu_ptr(dev->io_queue, cpu) = (i++ % queues) + 1;
-	free_cpumask_var(unassigned_cpus);
-}
-
 static int set_queue_count(struct nvme_dev *dev, int count)
 {
 	int status;
@@ -2181,33 +1920,9 @@ static size_t db_bar_size(struct nvme_dev *dev, unsigned nr_io_queues)
 	return 4096 + ((nr_io_queues + 1) * 8 * dev->db_stride);
 }
 
-static void nvme_cpu_workfn(struct work_struct *work)
-{
-	struct nvme_dev *dev = container_of(work, struct nvme_dev, cpu_work);
-	if (dev->initialized)
-		nvme_assign_io_queues(dev);
-}
-
-static int nvme_cpu_notify(struct notifier_block *self,
-				unsigned long action, void *hcpu)
-{
-	struct nvme_dev *dev;
-
-	switch (action) {
-	case CPU_ONLINE:
-	case CPU_DEAD:
-		spin_lock(&dev_list_lock);
-		list_for_each_entry(dev, &dev_list, node)
-			schedule_work(&dev->cpu_work);
-		spin_unlock(&dev_list_lock);
-		break;
-	}
-	return NOTIFY_OK;
-}
-
 static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
-	struct nvme_queue *adminq = raw_nvmeq(dev, 0);
+	struct nvme_queue *adminq = dev->queues[0];
 	struct pci_dev *pdev = dev->pci_dev;
 	int result, i, vecs, nr_io_queues, size;
 
@@ -2266,7 +1981,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 
 	/* Free previously allocated queues that are no longer usable */
 	nvme_free_queues(dev, nr_io_queues + 1);
-	nvme_assign_io_queues(dev);
+	nvme_create_io_queues(dev);
 
 	return 0;
 
@@ -2316,8 +2031,32 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	if (ctrl->mdts)
 		dev->max_hw_sectors = 1 << (ctrl->mdts + shift - 9);
 	if ((pdev->vendor == PCI_VENDOR_ID_INTEL) &&
-			(pdev->device == 0x0953) && ctrl->vs[3])
+			(pdev->device == 0x0953) && ctrl->vs[3]) {
+		unsigned int max_hw_sectors;
+
 		dev->stripe_size = 1 << (ctrl->vs[3] + shift);
+		max_hw_sectors = dev->stripe_size >> (shift - 9);
+		if (dev->max_hw_sectors) {
+			dev->max_hw_sectors = min(max_hw_sectors,
+							dev->max_hw_sectors);
+		} else
+			dev->max_hw_sectors = max_hw_sectors;
+	}
+
+	dev->tagset.ops = &nvme_mq_ops;
+	dev->tagset.nr_hw_queues = dev->online_queues - 1;
+	dev->tagset.timeout = NVME_IO_TIMEOUT;
+	dev->tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+	dev->tagset.queue_depth = min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH);
+	dev->tagset.cmd_size = sizeof(struct nvme_cmd_info);
+	dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
+	dev->tagset.driver_data = dev;
+
+	if (blk_mq_alloc_tag_set(&dev->tagset))
+		goto out;
+
+	for (i = 1; i < dev->online_queues; i++)
+		dev->queues[i]->tags = dev->tagset.tags[i - 1];
 
 	id_ns = mem;
 	for (i = 1; i <= nn; i++) {
@@ -2467,7 +2206,8 @@ static int adapter_async_del_queue(struct nvme_queue *nvmeq, u8 opcode,
 	c.delete_queue.qid = cpu_to_le16(nvmeq->qid);
 
 	init_kthread_work(&nvmeq->cmdinfo.work, fn);
-	return nvme_submit_admin_cmd_async(nvmeq->dev, &c, &nvmeq->cmdinfo);
+	return nvme_submit_admin_async_cmd(nvmeq->dev, &c, &nvmeq->cmdinfo,
+								ADMIN_TIMEOUT);
 }
 
 static void nvme_del_cq_work_handler(struct kthread_work *work)
@@ -2530,7 +2270,7 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 	atomic_set(&dq.refcount, 0);
 	dq.worker = &worker;
 	for (i = dev->queue_count - 1; i > 0; i--) {
-		struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+		struct nvme_queue *nvmeq = dev->queues[i];
 
 		if (nvme_suspend_queue(nvmeq))
 			continue;
@@ -2575,7 +2315,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
 		csts = readl(&dev->bar->csts);
 	if (csts & NVME_CSTS_CFS || !(csts & NVME_CSTS_RDY)) {
 		for (i = dev->queue_count - 1; i >= 0; i--) {
-			struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+			struct nvme_queue *nvmeq = dev->queues[i];
 			nvme_suspend_queue(nvmeq);
 			nvme_clear_queue(nvmeq);
 		}
@@ -2587,6 +2327,12 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
 	nvme_dev_unmap(dev);
 }
 
+static void nvme_dev_remove_admin(struct nvme_dev *dev)
+{
+	if (dev->admin_q && !blk_queue_dying(dev->admin_q))
+		blk_cleanup_queue(dev->admin_q);
+}
+
 static void nvme_dev_remove(struct nvme_dev *dev)
 {
 	struct nvme_ns *ns;
@@ -2668,7 +2414,7 @@ static void nvme_free_dev(struct kref *kref)
 	struct nvme_dev *dev = container_of(kref, struct nvme_dev, kref);
 
 	nvme_free_namespaces(dev);
-	free_percpu(dev->io_queue);
+	blk_mq_free_tag_set(&dev->tagset);
 	kfree(dev->queues);
 	kfree(dev->entry);
 	kfree(dev);
@@ -2795,7 +2541,7 @@ static void nvme_dev_reset(struct nvme_dev *dev)
 {
 	nvme_dev_shutdown(dev);
 	if (nvme_dev_resume(dev)) {
-		dev_err(&dev->pci_dev->dev, "Device failed to resume\n");
+		dev_warn(&dev->pci_dev->dev, "Device failed to resume\n");
 		kref_get(&dev->kref);
 		if (IS_ERR(kthread_run(nvme_remove_dead_ctrl, dev, "nvme%d",
 							dev->instance))) {
@@ -2820,28 +2566,28 @@ static void nvme_reset_workfn(struct work_struct *work)
 
 static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
-	int result = -ENOMEM;
+	int node, result = -ENOMEM;
 	struct nvme_dev *dev;
 
-	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	node = dev_to_node(&pdev->dev);
+	if (node == NUMA_NO_NODE)
+		set_dev_node(&pdev->dev, 0);
+
+	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
 	if (!dev)
 		return -ENOMEM;
-	dev->entry = kcalloc(num_possible_cpus(), sizeof(*dev->entry),
-								GFP_KERNEL);
+	dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
+							GFP_KERNEL, node);
 	if (!dev->entry)
 		goto free;
-	dev->queues = kcalloc(num_possible_cpus() + 1, sizeof(void *),
-								GFP_KERNEL);
+	dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
+							GFP_KERNEL, node);
 	if (!dev->queues)
 		goto free;
-	dev->io_queue = alloc_percpu(unsigned short);
-	if (!dev->io_queue)
-		goto free;
 
 	INIT_LIST_HEAD(&dev->namespaces);
 	dev->reset_workfn = nvme_reset_failed_dev;
 	INIT_WORK(&dev->reset_work, nvme_reset_workfn);
-	INIT_WORK(&dev->cpu_work, nvme_cpu_workfn);
 	dev->pci_dev = pdev;
 	pci_set_drvdata(pdev, dev);
 	result = nvme_set_instance(dev);
@@ -2876,6 +2622,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
  remove:
 	nvme_dev_remove(dev);
+	nvme_dev_remove_admin(dev);
 	nvme_free_namespaces(dev);
  shutdown:
 	nvme_dev_shutdown(dev);
@@ -2885,7 +2632,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
  release:
 	nvme_release_instance(dev);
  free:
-	free_percpu(dev->io_queue);
 	kfree(dev->queues);
 	kfree(dev->entry);
 	kfree(dev);
@@ -2918,12 +2664,12 @@ static void nvme_remove(struct pci_dev *pdev)
 
 	pci_set_drvdata(pdev, NULL);
 	flush_work(&dev->reset_work);
-	flush_work(&dev->cpu_work);
 	misc_deregister(&dev->miscdev);
 	nvme_dev_remove(dev);
 	nvme_dev_shutdown(dev);
+	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
-	rcu_barrier();
+	nvme_free_admin_tags(dev);
 	nvme_release_instance(dev);
 	nvme_release_prp_pools(dev);
 	kref_put(&dev->kref, nvme_free_dev);
@@ -3007,18 +2753,11 @@ static int __init nvme_init(void)
 	else if (result > 0)
 		nvme_major = result;
 
-	nvme_nb.notifier_call = &nvme_cpu_notify;
-	result = register_hotcpu_notifier(&nvme_nb);
-	if (result)
-		goto unregister_blkdev;
-
 	result = pci_register_driver(&nvme_driver);
 	if (result)
-		goto unregister_hotcpu;
+		goto unregister_blkdev;
 	return 0;
 
- unregister_hotcpu:
-	unregister_hotcpu_notifier(&nvme_nb);
  unregister_blkdev:
 	unregister_blkdev(nvme_major, "nvme");
  kill_workq:
diff --git a/drivers/block/nvme-scsi.c b/drivers/block/nvme-scsi.c
index a4cd6d6..52c0356 100644
--- a/drivers/block/nvme-scsi.c
+++ b/drivers/block/nvme-scsi.c
@@ -2105,7 +2105,7 @@ static int nvme_trans_do_nvme_io(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 
 		nvme_offset += unit_num_blocks;
 
-		nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+		nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
 		if (nvme_sc != NVME_SC_SUCCESS) {
 			nvme_unmap_user_pages(dev,
 				(is_write) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
@@ -2658,7 +2658,7 @@ static int nvme_trans_start_stop(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 			c.common.opcode = nvme_cmd_flush;
 			c.common.nsid = cpu_to_le32(ns->ns_id);
 
-			nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+			nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);
 			res = nvme_trans_status_code(hdr, nvme_sc);
 			if (res)
 				goto out;
@@ -2686,7 +2686,7 @@ static int nvme_trans_synchronize_cache(struct nvme_ns *ns,
 	c.common.opcode = nvme_cmd_flush;
 	c.common.nsid = cpu_to_le32(ns->ns_id);
 
-	nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+	nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);
 
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
@@ -2894,7 +2894,7 @@ static int nvme_trans_unmap(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	c.dsm.nr = cpu_to_le32(ndesc - 1);
 	c.dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
-	nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+	nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 
 	dma_free_coherent(&dev->pci_dev->dev, ndesc * sizeof(*range),
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index ed09074..258945f 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -19,6 +19,7 @@
 #include <linux/pci.h>
 #include <linux/miscdevice.h>
 #include <linux/kref.h>
+#include <linux/blk-mq.h>
 
 struct nvme_bar {
 	__u64			cap;	/* Controller Capabilities */
@@ -71,8 +72,10 @@ extern unsigned char nvme_io_timeout;
  */
 struct nvme_dev {
 	struct list_head node;
-	struct nvme_queue __rcu **queues;
-	unsigned short __percpu *io_queue;
+	struct nvme_queue **queues;
+	struct request_queue *admin_q;
+	struct blk_mq_tag_set tagset;
+	struct blk_mq_tag_set admin_tagset;
 	u32 __iomem *dbs;
 	struct pci_dev *pci_dev;
 	struct dma_pool *prp_page_pool;
@@ -91,7 +94,6 @@ struct nvme_dev {
 	struct miscdevice miscdev;
 	work_func_t reset_workfn;
 	struct work_struct reset_work;
-	struct work_struct cpu_work;
 	char name[12];
 	char serial[20];
 	char model[40];
@@ -135,7 +137,6 @@ struct nvme_iod {
 	int offset;		/* Of PRP list */
 	int nents;		/* Used in scatterlist */
 	int length;		/* Of data, in bytes */
-	unsigned long start_time;
 	dma_addr_t first_dma;
 	struct list_head node;
 	struct scatterlist sg[0];
@@ -153,12 +154,14 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
  */
 void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod);
 
-int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int , gfp_t);
+int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int, gfp_t);
 struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
 				unsigned long addr, unsigned length);
 void nvme_unmap_user_pages(struct nvme_dev *dev, int write,
 			struct nvme_iod *iod);
-int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_command *, u32 *);
+int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_ns *,
+						struct nvme_command *, u32 *);
+int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns);
 int nvme_submit_admin_cmd(struct nvme_dev *, struct nvme_command *,
 							u32 *result);
 int nvme_identify(struct nvme_dev *, unsigned nsid, unsigned cns,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v11] NVMe: Convert to blk-mq
  2014-07-26  9:07   ` Matias Bjørling
@ 2014-08-10 17:27     ` Matias Bjørling
  -1 siblings, 0 replies; 22+ messages in thread
From: Matias Bjørling @ 2014-08-10 17:27 UTC (permalink / raw)
  To: Matthew Wilcox, Keith Busch, Sam Bradshaw (sbradshaw),
	Jens Axboe, LKML, linux-nvme, Christoph Hellwig, Rob Nelson,
	Ming Lei
  Cc: Matias Bjørling

On Sat, Jul 26, 2014 at 11:07 AM, Matias Bjørling <m@bjorling.me> wrote:
> This converts the NVMe driver to a blk-mq request-based driver.
>

Willy, do you need me to make any changes to the conversion? Can you
pick it up for 3.17?

Thanks,
Matias

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11] NVMe: Convert to blk-mq
  2014-08-10 17:27     ` Matias Bjørling
@ 2014-08-13 22:27       ` Keith Busch
  -1 siblings, 0 replies; 22+ messages in thread
From: Keith Busch @ 2014-08-13 22:27 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Matthew Wilcox, Keith Busch, Sam Bradshaw (sbradshaw),
	Jens Axboe, LKML, linux-nvme, Christoph Hellwig, Rob Nelson,
	Ming Lei

On Sun, 10 Aug 2014, Matias Bjørling wrote:
> On Sat, Jul 26, 2014 at 11:07 AM, Matias Bjørling <m@bjorling.me> wrote:
>> This converts the NVMe driver to a blk-mq request-based driver.
>>
>
> Willy, do you need me to make any changes to the conversion? Can you
> pick it up for 3.17?

Hi Matias,

I'm starting to get a little more spare time to look at this again. I
think there are still some bugs here, or perhaps something better we
can do. I'll just start with one snippet of the code:

@@ -765,33 +619,49 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
  submit_iod:
 	spin_lock_irq(&nvmeq->q_lock);
 	if (nvmeq->q_suspended) {
 		spin_unlock_irq(&nvmeq->q_lock);
 		goto finish_cmd;
 	}

  <snip>

  finish_cmd:
 	nvme_finish_cmd(nvmeq, req->tag, NULL);
 	nvme_free_iod(nvmeq->dev, iod);
 	return result;
}


If the nvme queue is marked "suspended", this code just goto's the finish
without setting "result", so I don't think that's right.

But do we even need the "q_suspended" flag anymore? It was there because
we couldn't prevent incoming requests as a bio based driver and we needed
some way to mark that the h/w's IO queue was temporarily inactive, but
blk-mq has ways to start/stop a queue at a higher level, right? If so,
I think that's probably a better way than using this driver specific way.
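
Something along these lines is what I have in mind -- just a sketch
against the 3.16-era blk-mq helpers, untested:

 	/* Stop dispatch: blk-mq won't call ->queue_rq() on stopped hctxs. */
 	blk_mq_stop_hw_queues(ns->queue);

 	/* ... reset/reconfigure the hardware queues ... */

 	/* Restart the stopped hctxs and kick them so any requeued
 	 * requests get dispatched again. */
 	blk_mq_start_stopped_hw_queues(ns->queue, true);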

I haven't even tried debugging this next one: doing an insmod+rmmod
caused this warning followed by a panic:

Aug 13 15:41:41 kbgrz1 kernel: [   89.207525] ------------[ cut here ]------------
Aug 13 15:41:41 kbgrz1 kernel: [   89.207538] WARNING: CPU: 8 PID: 5768 at mm/slab_common.c:491 kmalloc_slab+0x33/0x8b()
Aug 13 15:41:41 kbgrz1 kernel: [   89.207541] Modules linked in: nvme(-) parport_pc ppdev lp parport dlm sctp libcrc32c configfs nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc md4 hmac cifs bridge stp llc jfs joydev hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support microcode pcspkr ehci_pci ehci_hcd usbcore acpi_cpufreq lpc_ich usb_common ioatdma mfd_core i2c_i801 evdev wmi tpm_tis ipmi_si tpm ipmi_msghandler processor thermal_sys button ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod crct10dif_generic crc_t10dif crct10dif_common nbd dm_mod crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd isci libsas igb ahci libahci scsi_transport_sas ptp pps_core libata i2c_algo_bit i2c_core scsi_mod dca
Aug 13 15:41:41 kbgrz1 kernel: [   89.207653] CPU: 8 PID: 5768 Comm: nvme1 Not tainted 3.16.0-rc6+ #24
Aug 13 15:41:41 kbgrz1 kernel: [   89.207656] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
Aug 13 15:41:41 kbgrz1 kernel: [   89.207659]  0000000000000000 0000000000000009 ffffffff8139f9ba 0000000000000000
Aug 13 15:41:41 kbgrz1 kernel: [   89.207664]  ffffffff8103db86 ffffe8ffff601d80 ffffffff810f0d59 0000000000000246
Aug 13 15:41:41 kbgrz1 kernel: [   89.207669]  0000000000000000 ffff880827bf28c0 0000000000008020 ffff88082b8d9d00
Aug 13 15:41:41 kbgrz1 kernel: [   89.207674] Call Trace:
Aug 13 15:41:41 kbgrz1 kernel: [   89.207685]  [<ffffffff8139f9ba>] ? dump_stack+0x41/0x51
Aug 13 15:41:41 kbgrz1 kernel: [   89.207694]  [<ffffffff8103db86>] ? warn_slowpath_common+0x7d/0x95
Aug 13 15:41:41 kbgrz1 kernel: [   89.207699]  [<ffffffff810f0d59>] ? kmalloc_slab+0x33/0x8b
Aug 13 15:41:41 kbgrz1 kernel: [   89.207704]  [<ffffffff810f0d59>] ? kmalloc_slab+0x33/0x8b
Aug 13 15:41:41 kbgrz1 kernel: [   89.207710]  [<ffffffff81115329>] ? __kmalloc+0x28/0xf1
Aug 13 15:41:41 kbgrz1 kernel: [   89.207719]  [<ffffffff811d0daf>] ? blk_mq_tag_busy_iter+0x30/0x7c
Aug 13 15:41:41 kbgrz1 kernel: [   89.207728]  [<ffffffffa052c426>] ? nvme_init_hctx+0x49/0x49 [nvme]
Aug 13 15:41:41 kbgrz1 kernel: [   89.207733]  [<ffffffff811d0daf>] ? blk_mq_tag_busy_iter+0x30/0x7c
Aug 13 15:41:41 kbgrz1 kernel: [   89.207738]  [<ffffffffa052c98b>] ? nvme_clear_queue+0x72/0x7d [nvme]
Aug 13 15:41:41 kbgrz1 kernel: [   89.207744]  [<ffffffffa052c9a8>] ? nvme_del_queue_end+0x12/0x26 [nvme]
Aug 13 15:41:41 kbgrz1 kernel: [   89.207750]  [<ffffffff810576e3>] ? kthread_worker_fn+0xb1/0x111
Aug 13 15:41:41 kbgrz1 kernel: [   89.207754]  [<ffffffff81057632>] ? kthread_create_on_node+0x171/0x171
Aug 13 15:41:41 kbgrz1 kernel: [   89.207758]  [<ffffffff81057632>] ? kthread_create_on_node+0x171/0x171
Aug 13 15:41:41 kbgrz1 kernel: [   89.207762]  [<ffffffff810574b9>] ? kthread+0x9e/0xa6
Aug 13 15:41:41 kbgrz1 kernel: [   89.207766]  [<ffffffff8105741b>] ? __kthread_parkme+0x5c/0x5c
Aug 13 15:41:41 kbgrz1 kernel: [   89.207773]  [<ffffffff813a3a2c>] ? ret_from_fork+0x7c/0xb0
Aug 13 15:41:41 kbgrz1 kernel: [   89.207777]  [<ffffffff8105741b>] ? __kthread_parkme+0x5c/0x5c
Aug 13 15:41:41 kbgrz1 kernel: [   89.207780] ---[ end trace 8dc4a4c97c467d4c ]---
Aug 13 15:41:41 kbgrz1 kernel: [   89.223627] PGD 0 
Aug 13 15:41:41 kbgrz1 kernel: [   89.226038] Oops: 0000 [#1] SMP 
Aug 13 15:41:41 kbgrz1 kernel: [   89.229917] Modules linked in: nvme(-) parport_pc ppdev lp parport dlm sctp libcrc32c configfs nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc md4 hmac cifs bridge stp llc jfs joydev hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support microcode pcspkr ehci_pci ehci_hcd usbcore acpi_cpufreq lpc_ich usb_common ioatdma mfd_core i2c_i801 evdev wmi tpm_tis ipmi_si tpm ipmi_msghandler processor thermal_sys button ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod crct10dif_generic crc_t10dif crct10dif_common nbd dm_mod crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd isci libsas igb ahci libahci scsi_transport_sas ptp pps_core libata i2c_algo_bit i2c_core scsi_mod dca
Aug 13 15:41:41 kbgrz1 kernel: [   89.315211] CPU: 8 PID: 5768 Comm: nvme1 Tainted: G        W     3.16.0-rc6+ #24
Aug 13 15:41:41 kbgrz1 kernel: [   89.323563] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
Aug 13 15:41:41 kbgrz1 kernel: [   89.335121] task: ffff88042ad92d70 ti: ffff880425ff0000 task.ti: ffff880425ff0000
Aug 13 15:41:41 kbgrz1 kernel: [   89.343574] RIP: 0010:[<ffffffff811d0d38>]  [<ffffffff811d0d38>] bt_for_each_free+0x31/0x78
Aug 13 15:41:41 kbgrz1 kernel: [   89.353144] RSP: 0018:ffff880425ff3de8  EFLAGS: 00010086
Aug 13 15:41:41 kbgrz1 kernel: [   89.359189] RAX: 0000000000000010 RBX: ffffffffffffffff RCX: 0000000000000007
Aug 13 15:41:41 kbgrz1 kernel: [   89.367276] RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffff880827bf2850
Aug 13 15:41:41 kbgrz1 kernel: [   89.375362] RBP: 0000000000000000 R08: 000000000000000f R09: 00000000fffffffe
Aug 13 15:41:41 kbgrz1 kernel: [   89.383448] R10: 0000000000000000 R11: 0000000000000046 R12: ffff880827bf2850
Aug 13 15:41:41 kbgrz1 kernel: [   89.391534] R13: 00000000ffffffff R14: 0000000000000010 R15: 0000000000000001
Aug 13 15:41:41 kbgrz1 kernel: [   89.399622] FS:  0000000000000000(0000) GS:ffff88083f200000(0000) knlGS:0000000000000000
Aug 13 15:41:41 kbgrz1 kernel: [   89.408805] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 13 15:41:41 kbgrz1 kernel: [   89.415340] CR2: 0000000000000007 CR3: 0000000001610000 CR4: 00000000000407e0
Aug 13 15:41:41 kbgrz1 kernel: [   89.423426] Stack:
Aug 13 15:41:41 kbgrz1 kernel: [   89.425775]  0000000000000007 0000000000000010 ffff880827bf2840 ffffffffa052c426
Aug 13 15:41:41 kbgrz1 kernel: [   89.434515]  ffff88082b8d9f00 ffff88042ad92d70 0000000000000000 ffffffff811d0dc6
Aug 13 15:41:41 kbgrz1 kernel: [   89.443254]  00000000fffffffe ffff88082b8d9f00 ffff88082b8d9f28 ffff88042ad92d70
Aug 13 15:41:41 kbgrz1 kernel: [   89.452012] Call Trace:
Aug 13 15:41:41 kbgrz1 kernel: [   89.454852]  [<ffffffffa052c426>] ? nvme_init_hctx+0x49/0x49 [nvme]
Aug 13 15:41:41 kbgrz1 kernel: [   89.461968]  [<ffffffff811d0dc6>] ? blk_mq_tag_busy_iter+0x47/0x7c
Aug 13 15:41:41 kbgrz1 kernel: [   89.468987]  [<ffffffffa052c98b>] ? nvme_clear_queue+0x72/0x7d [nvme]
Aug 13 15:41:41 kbgrz1 kernel: [   89.476298]  [<ffffffffa052c9a8>] ? nvme_del_queue_end+0x12/0x26 [nvme]
Aug 13 15:41:41 kbgrz1 kernel: [   89.483801]  [<ffffffff810576e3>] ? kthread_worker_fn+0xb1/0x111
Aug 13 15:41:41 kbgrz1 kernel: [   89.490625]  [<ffffffff81057632>] ? kthread_create_on_node+0x171/0x171
Aug 13 15:41:41 kbgrz1 kernel: [   89.498038]  [<ffffffff81057632>] ? kthread_create_on_node+0x171/0x171
Aug 13 15:41:41 kbgrz1 kernel: [   89.505446]  [<ffffffff810574b9>] ? kthread+0x9e/0xa6
Aug 13 15:41:41 kbgrz1 kernel: [   89.511200]  [<ffffffff8105741b>] ? __kthread_parkme+0x5c/0x5c
Aug 13 15:41:41 kbgrz1 kernel: [   89.517833]  [<ffffffff813a3a2c>] ? ret_from_fork+0x7c/0xb0
Aug 13 15:41:41 kbgrz1 kernel: [   89.524170]  [<ffffffff8105741b>] ? __kthread_parkme+0x5c/0x5c
Aug 13 15:41:41 kbgrz1 kernel: [   89.530799] Code: 57 41 bf 01 00 00 00 41 56 49 89 f6 41 55 41 89 d5 41 54 49 89 fc 55 31 ed 53 51 eb 42 48 63 dd 31 d2 48 c1 e3 06 49 03 5c 24 10 <48> 8b 73 08 48 63 d2 48 89 df e8 e2 28 02 00 48 63 c8 48 3b 4b 
Aug 13 15:41:41 kbgrz1 kernel: [   89.564696]  RSP <ffff880425ff3de8>
Aug 13 15:41:41 kbgrz1 kernel: [   89.568699] CR2: 0000000000000007
Aug 13 15:41:41 kbgrz1 kernel: [   89.572518] ---[ end trace 8dc4a4c97c467d4d ]---

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11] NVMe: Convert to blk-mq
  2014-08-13 22:27       ` Keith Busch
@ 2014-08-14  8:25         ` Matias Bjørling
  -1 siblings, 0 replies; 22+ messages in thread
From: Matias Bjørling @ 2014-08-14  8:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matthew Wilcox, Sam Bradshaw (sbradshaw),
	Jens Axboe, LKML, linux-nvme, Christoph Hellwig, Rob Nelson,
	Ming Lei

On 08/14/2014 12:27 AM, Keith Busch wrote:
> On Sun, 10 Aug 2014, Matias Bjørling wrote:
>> On Sat, Jul 26, 2014 at 11:07 AM, Matias Bjørling <m@bjorling.me> wrote:
>>> This converts the NVMe driver to a blk-mq request-based driver.
>>>
>>
>> Willy, do you need me to make any changes to the conversion? Can you
>> pick it up for 3.17?
>
> Hi Matias,
>

Hi Keith, thanks for taking the time to have another look.

> I'm starting to get a little more spare time to look at this again. I
> think there are still some bugs here, or perhaps something better we
> can do. I'll just start with one snippet of the code:
>
> @@ -765,33 +619,49 @@ static int nvme_submit_bio_queue(struct nvme_queue
> *nvmeq, struct nvme_ns *ns,
>   submit_iod:
>      spin_lock_irq(&nvmeq->q_lock);
>      if (nvmeq->q_suspended) {
>          spin_unlock_irq(&nvmeq->q_lock);
>          goto finish_cmd;
>      }
>
>   <snip>
>
>   finish_cmd:
>      nvme_finish_cmd(nvmeq, req->tag, NULL);
>      nvme_free_iod(nvmeq->dev, iod);
>      return result;
> }
>
>
> If the nvme queue is marked "suspended", this code just goto's the finish
> without setting "result", so I don't think that's right.

The result is set to BLK_MQ_RQ_QUEUE_ERROR, or am I mistaken?

>
> But do we even need the "q_suspended" flag anymore? It was there because
> we couldn't prevent incoming requests as a bio based driver and we needed
> some way to mark that the h/w's IO queue was temporarily inactive, but
> blk-mq has ways to start/stop a queue at a higher level, right? If so,
> I think that's probably a better way than using this driver specific way.

Not really; it's managed by the block layer. It's on purpose that I
haven't removed it. The patch is already big enough, and I want to keep
it free of extra noise that can be cleaned up by later patches.

Should I remove it anyway?

>
> I haven't even tried debugging this next one: doing an insmod+rmmod
> caused this warning followed by a panic:
>

I'll look into it. Thanks

> Aug 13 15:41:41 kbgrz1 kernel: [   89.207525] ------------[ cut here
> ]------------
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207538] WARNING: CPU: 8 PID: 5768
> at mm/slab_common.c:491 kmalloc_slab+0x33/0x8b()
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207541] Modules linked in: nvme(-)
> parport_pc ppdev lp parport dlm sctp libcrc32c configfs nfsd auth_rpcgss
> oid_registry nfs_acl nfs lockd fscache sunrpc md4 hmac cifs bridge stp
> llc jfs joydev hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal
> coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support microcode pcspkr
> ehci_pci ehci_hcd usbcore acpi_cpufreq lpc_ich usb_common ioatdma
> mfd_core i2c_i801 evdev wmi tpm_tis ipmi_si tpm ipmi_msghandler
> processor thermal_sys button ext4 crc16 jbd2 mbcache sg sr_mod cdrom
> sd_mod crct10dif_generic crc_t10dif crct10dif_common nbd dm_mod
> crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw
> gf128mul ablk_helper cryptd isci libsas igb ahci libahci
> scsi_transport_sas ptp pps_core libata i2c_algo_bit i2c_core scsi_mod dca
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207653] CPU: 8 PID: 5768 Comm:
> nvme1 Not tainted 3.16.0-rc6+ #24
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207656] Hardware name: Intel
> Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210
> 12/23/2013
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207659]  0000000000000000
> 0000000000000009 ffffffff8139f9ba 0000000000000000
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207664]  ffffffff8103db86
> ffffe8ffff601d80 ffffffff810f0d59 0000000000000246
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207669]  0000000000000000
> ffff880827bf28c0 0000000000008020 ffff88082b8d9d00
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207674] Call Trace:
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207685]  [<ffffffff8139f9ba>] ?
> dump_stack+0x41/0x51
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207694]  [<ffffffff8103db86>] ?
> warn_slowpath_common+0x7d/0x95
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207699]  [<ffffffff810f0d59>] ?
> kmalloc_slab+0x33/0x8b
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207704]  [<ffffffff810f0d59>] ?
> kmalloc_slab+0x33/0x8b
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207710]  [<ffffffff81115329>] ?
> __kmalloc+0x28/0xf1
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207719]  [<ffffffff811d0daf>] ?
> blk_mq_tag_busy_iter+0x30/0x7c
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207728]  [<ffffffffa052c426>] ?
> nvme_init_hctx+0x49/0x49 [nvme]
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207733]  [<ffffffff811d0daf>] ?
> blk_mq_tag_busy_iter+0x30/0x7c
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207738]  [<ffffffffa052c98b>] ?
> nvme_clear_queue+0x72/0x7d [nvme]
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207744]  [<ffffffffa052c9a8>] ?
> nvme_del_queue_end+0x12/0x26 [nvme]
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207750]  [<ffffffff810576e3>] ?
> kthread_worker_fn+0xb1/0x111
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207754]  [<ffffffff81057632>] ?
> kthread_create_on_node+0x171/0x171
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207758]  [<ffffffff81057632>] ?
> kthread_create_on_node+0x171/0x171
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207762]  [<ffffffff810574b9>] ?
> kthread+0x9e/0xa6
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207766]  [<ffffffff8105741b>] ?
> __kthread_parkme+0x5c/0x5c
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207773]  [<ffffffff813a3a2c>] ?
> ret_from_fork+0x7c/0xb0
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207777]  [<ffffffff8105741b>] ?
> __kthread_parkme+0x5c/0x5c
> Aug 13 15:41:41 kbgrz1 kernel: [   89.207780] ---[ end trace
> 8dc4a4c97c467d4c ]---
> Aug 13 15:41:41 kbgrz1 kernel: [   89.223627] PGD 0 Aug 13 15:41:41
> kbgrz1 kernel: [   89.226038] Oops: 0000 [#1] SMP Aug 13 15:41:41 kbgrz1
> kernel: [   89.229917] Modules linked in: nvme(-) parport_pc ppdev lp
> parport dlm sctp libcrc32c configfs nfsd auth_rpcgss oid_registry
> nfs_acl nfs lockd fscache sunrpc md4 hmac cifs bridge stp llc jfs joydev
> hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal coretemp
> kvm_intel kvm iTCO_wdt iTCO_vendor_support microcode pcspkr ehci_pci
> ehci_hcd usbcore acpi_cpufreq lpc_ich usb_common ioatdma mfd_core
> i2c_i801 evdev wmi tpm_tis ipmi_si tpm ipmi_msghandler processor
> thermal_sys button ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod
> crct10dif_generic crc_t10dif crct10dif_common nbd dm_mod crc32c_intel
> ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul
> ablk_helper cryptd isci libsas igb ahci libahci scsi_transport_sas ptp
> pps_core libata i2c_algo_bit i2c_core scsi_mod dca
> Aug 13 15:41:41 kbgrz1 kernel: [   89.315211] CPU: 8 PID: 5768 Comm:
> nvme1 Tainted: G        W     3.16.0-rc6+ #24
> Aug 13 15:41:41 kbgrz1 kernel: [   89.323563] Hardware name: Intel
> Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210
> 12/23/2013
> Aug 13 15:41:41 kbgrz1 kernel: [   89.335121] task: ffff88042ad92d70 ti:
> ffff880425ff0000 task.ti: ffff880425ff0000
> Aug 13 15:41:41 kbgrz1 kernel: [   89.343574] RIP:
> 0010:[<ffffffff811d0d38>]  [<ffffffff811d0d38>] bt_for_each_free+0x31/0x78
> Aug 13 15:41:41 kbgrz1 kernel: [   89.353144] RSP:
> 0018:ffff880425ff3de8  EFLAGS: 00010086
> Aug 13 15:41:41 kbgrz1 kernel: [   89.359189] RAX: 0000000000000010 RBX:
> ffffffffffffffff RCX: 0000000000000007
> Aug 13 15:41:41 kbgrz1 kernel: [   89.367276] RDX: 0000000000000000 RSI:
> 0000000000000010 RDI: ffff880827bf2850
> Aug 13 15:41:41 kbgrz1 kernel: [   89.375362] RBP: 0000000000000000 R08:
> 000000000000000f R09: 00000000fffffffe
> Aug 13 15:41:41 kbgrz1 kernel: [   89.383448] R10: 0000000000000000 R11:
> 0000000000000046 R12: ffff880827bf2850
> Aug 13 15:41:41 kbgrz1 kernel: [   89.391534] R13: 00000000ffffffff R14:
> 0000000000000010 R15: 0000000000000001
> Aug 13 15:41:41 kbgrz1 kernel: [   89.399622] FS:
> 0000000000000000(0000) GS:ffff88083f200000(0000) knlGS:0000000000000000
> Aug 13 15:41:41 kbgrz1 kernel: [   89.408805] CS:  0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
> Aug 13 15:41:41 kbgrz1 kernel: [   89.415340] CR2: 0000000000000007 CR3:
> 0000000001610000 CR4: 00000000000407e0
> Aug 13 15:41:41 kbgrz1 kernel: [   89.423426] Stack:
> Aug 13 15:41:41 kbgrz1 kernel: [   89.425775]  0000000000000007
> 0000000000000010 ffff880827bf2840 ffffffffa052c426
> Aug 13 15:41:41 kbgrz1 kernel: [   89.434515]  ffff88082b8d9f00
> ffff88042ad92d70 0000000000000000 ffffffff811d0dc6
> Aug 13 15:41:41 kbgrz1 kernel: [   89.443254]  00000000fffffffe
> ffff88082b8d9f00 ffff88082b8d9f28 ffff88042ad92d70
> Aug 13 15:41:41 kbgrz1 kernel: [   89.452012] Call Trace:
> Aug 13 15:41:41 kbgrz1 kernel: [   89.454852]  [<ffffffffa052c426>] ?
> nvme_init_hctx+0x49/0x49 [nvme]
> Aug 13 15:41:41 kbgrz1 kernel: [   89.461968]  [<ffffffff811d0dc6>] ?
> blk_mq_tag_busy_iter+0x47/0x7c
> Aug 13 15:41:41 kbgrz1 kernel: [   89.468987]  [<ffffffffa052c98b>] ?
> nvme_clear_queue+0x72/0x7d [nvme]
> Aug 13 15:41:41 kbgrz1 kernel: [   89.476298]  [<ffffffffa052c9a8>] ?
> nvme_del_queue_end+0x12/0x26 [nvme]
> Aug 13 15:41:41 kbgrz1 kernel: [   89.483801]  [<ffffffff810576e3>] ?
> kthread_worker_fn+0xb1/0x111
> Aug 13 15:41:41 kbgrz1 kernel: [   89.490625]  [<ffffffff81057632>] ?
> kthread_create_on_node+0x171/0x171
> Aug 13 15:41:41 kbgrz1 kernel: [   89.498038]  [<ffffffff81057632>] ?
> kthread_create_on_node+0x171/0x171
> Aug 13 15:41:41 kbgrz1 kernel: [   89.505446]  [<ffffffff810574b9>] ?
> kthread+0x9e/0xa6
> Aug 13 15:41:41 kbgrz1 kernel: [   89.511200]  [<ffffffff8105741b>] ?
> __kthread_parkme+0x5c/0x5c
> Aug 13 15:41:41 kbgrz1 kernel: [   89.517833]  [<ffffffff813a3a2c>] ?
> ret_from_fork+0x7c/0xb0
> Aug 13 15:41:41 kbgrz1 kernel: [   89.524170]  [<ffffffff8105741b>] ?
> __kthread_parkme+0x5c/0x5c
> Aug 13 15:41:41 kbgrz1 kernel: [   89.530799] Code: 57 41 bf 01 00 00 00
> 41 56 49 89 f6 41 55 41 89 d5 41 54 49 89 fc 55 31 ed 53 51 eb 42 48 63
> dd 31 d2 48 c1 e3 06 49 03 5c 24 10 <48> 8b 73 08 48 63 d2 48 89 df e8
> e2 28 02 00 48 63 c8 48 3b 4b Aug 13 15:41:41 kbgrz1 kernel: [
> 89.564696]  RSP <ffff880425ff3de8>
> Aug 13 15:41:41 kbgrz1 kernel: [   89.568699] CR2: 0000000000000007
> Aug 13 15:41:41 kbgrz1 kernel: [   89.572518] ---[ end trace
> 8dc4a4c97c467d4d ]---

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11] NVMe: Convert to blk-mq
  2014-08-14  8:25         ` Matias Bjørling
@ 2014-08-14 15:09           ` Jens Axboe
  -1 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2014-08-14 15:09 UTC (permalink / raw)
  To: Matias Bjørling, Keith Busch
  Cc: Matthew Wilcox, Sam Bradshaw (sbradshaw),
	LKML, linux-nvme, Christoph Hellwig, Rob Nelson, Ming Lei

On 08/14/2014 02:25 AM, Matias Bjørling wrote:
> On 08/14/2014 12:27 AM, Keith Busch wrote:
>> On Sun, 10 Aug 2014, Matias Bjørling wrote:
>>> On Sat, Jul 26, 2014 at 11:07 AM, Matias Bjørling <m@bjorling.me> wrote:
>>>> This converts the NVMe driver to a blk-mq request-based driver.
>>>>
>>>
>>> Willy, do you need me to make any changes to the conversion? Can you
>>> pick it up for 3.17?
>>
>> Hi Matias,
>>
> 
> Hi Keith, Thanks for taking the time to take another look.
> 
>> I'm starting to get a little more spare time to look at this again. I
>> think there are still some bugs here, or perhaps something better we
>> can do. I'll just start with one snippet of the code:
>>
>> @@ -765,33 +619,49 @@ static int nvme_submit_bio_queue(struct nvme_queue
>> *nvmeq, struct nvme_ns *ns,
>>   submit_iod:
>>      spin_lock_irq(&nvmeq->q_lock);
>>      if (nvmeq->q_suspended) {
>>          spin_unlock_irq(&nvmeq->q_lock);
>>          goto finish_cmd;
>>      }
>>
>>   <snip>
>>
>>   finish_cmd:
>>      nvme_finish_cmd(nvmeq, req->tag, NULL);
>>      nvme_free_iod(nvmeq->dev, iod);
>>      return result;
>> }
>>
>>
>> If the nvme queue is marked "suspended", this code just goto's the finish
>> without setting "result", so I don't think that's right.
> 
> The result is set to BLK_MQ_RQ_QUEUE_ERROR, or am I mistaken?

Looks OK to me. Looking at the code, 'result' is initialized to
BLK_MQ_RQ_QUEUE_BUSY, which looks correct; we don't want to error
on a suspended queue.
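
For reference, this is the ->queue_rq() return convention in play. A
rough sketch of how the blk-mq core of this era handles the return
value (paraphrased from block/blk-mq.c, not verbatim):

	int ret = q->mq_ops->queue_rq(hctx, rq);

	switch (ret) {
	case BLK_MQ_RQ_QUEUE_OK:
		break;				/* command is in flight */
	case BLK_MQ_RQ_QUEUE_BUSY:
		blk_mq_requeue_request(rq);	/* retried later, no error */
		break;
	default:				/* BLK_MQ_RQ_QUEUE_ERROR */
		rq->errors = -EIO;
		blk_mq_end_io(rq, rq->errors);	/* fail the request */
		break;
	}

So returning BUSY for a suspended queue just defers the request, which
is exactly the behavior we want.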

>> But do we even need the "q_suspended" flag anymore? It was there because
>> we couldn't prevent incoming requests as a bio based driver and we needed
>> some way to mark that the h/w's IO queue was temporarily inactive, but
>> blk-mq has ways to start/stop a queue at a higher level, right? If so,
>> I think that's probably a better way than using this driver specific way.
> 
> Not really, it's managed by the block layer. It's on purpose that I
> haven't removed it. The patch is already too big, and I want to keep
> the patch free from extra noise that can be removed by later patches.
> 
> Should I remove it anyway?

No point in keeping it if it's not needed...
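
A minimal sketch of the higher-level alternative, assuming the stock
blk-mq helpers of this kernel; the function names and the namespace
list walk are illustrative, not taken from the patch:

	/* Quiesce: blk-mq makes no further ->queue_rq() calls. */
	static void nvme_suspend_io_queues(struct nvme_dev *dev)
	{
		struct nvme_ns *ns;

		list_for_each_entry(ns, &dev->namespaces, list)
			blk_mq_stop_hw_queues(ns->queue);
	}

	/* Resume: restart the stopped hw queues and kick them. */
	static void nvme_resume_io_queues(struct nvme_dev *dev)
	{
		struct nvme_ns *ns;

		list_for_each_entry(ns, &dev->namespaces, list)
			blk_mq_start_stopped_hw_queues(ns->queue, true);
	}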

>> I haven't even tried debugging this next one: doing an insmod+rmmod
>> caused this warning followed by a panic:
>>
> 
> I'll look into it. Thanks

nr_tags must be uninitialized or screwed up somehow; otherwise I don't
see how that kmalloc() could warn on being too large. Keith, are you
running with slab debugging? Matias, it might be worth trying.
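
For context, the iterator sizes its allocation from tags->nr_tags, so a
dangling or garbage tags pointer surfaces as exactly this kind of
oversized kmalloc(). Roughly (simplified from block/blk-mq-tag.c of
this era, not verbatim):

	void blk_mq_tag_busy_iter(struct blk_mq_tags *tags,
				  void (*fn)(void *, unsigned long *), void *data)
	{
		unsigned long *tag_map;
		size_t map_size;

		/* nr_tags is read from the (possibly freed) tags struct;
		 * garbage here means a huge allocation request, which is
		 * what trips the WARN_ON in kmalloc_slab().
		 */
		map_size = BITS_TO_LONGS(tags->nr_tags) * sizeof(unsigned long);
		tag_map = kzalloc(map_size, GFP_ATOMIC);
		if (!tag_map)
			return;

		/* ... mark busy tags in tag_map, then hand it to fn() ... */
		fn(data, tag_map);
		kfree(tag_map);
	}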

FWIW, in general, we've run a bunch of testing internally at FB, all on
a backported blk-mq stack and nvme-mq. No issues observed, and
performance is good and overhead low. For other reasons that I can't go
into here, this is the stack on which we'll run nvme hardware. Other
features are much more easily implemented on top of a blk-mq based
driver than on a bio based one, similarly to the q_suspended part above.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11] NVMe: Convert to blk-mq
  2014-08-14 15:09           ` Jens Axboe
@ 2014-08-14 15:33             ` Keith Busch
  -1 siblings, 0 replies; 22+ messages in thread
From: Keith Busch @ 2014-08-14 15:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matias Bjørling, Keith Busch, Matthew Wilcox,
	Sam Bradshaw (sbradshaw),
	LKML, linux-nvme, Christoph Hellwig, Rob Nelson, Ming Lei

On Thu, 14 Aug 2014, Jens Axboe wrote:
> On 08/14/2014 02:25 AM, Matias Bjørling wrote:
>> The result is set to BLK_MQ_RQ_QUEUE_ERROR, or am I mistaken?
>
> Looks OK to me, looking at the code, 'result' is initialized to
> BLK_MQ_RQ_QUEUE_BUSY though. Which looks correct, we don't want to error
> on a suspended queue.

My mistake; I missed how the result was initialized.

> nr_tags must be uninitialized or screwed up somehow, otherwise I don't
> see how that kmalloc() could warn on being too large. Keith, are you
> running with slab debugging? Matias, might be worth trying.

I'm not running with slab debugging. If it's any clue at all, blk-mq is
using 16 of the 31 allocated h/w queues (which is okay as we discussed
earlier), and the oops happens when clearing the first unused queue.

I'll have time to mess with this more today, so I can either help find
the problem or apply a patch if one becomes available.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11] NVMe: Convert to blk-mq
  2014-08-14 15:09           ` Jens Axboe
@ 2014-08-14 15:39             ` Matias Bjorling
  -1 siblings, 0 replies; 22+ messages in thread
From: Matias Bjorling @ 2014-08-14 15:39 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch
  Cc: Matthew Wilcox, Sam Bradshaw (sbradshaw),
	LKML, linux-nvme, Christoph Hellwig, Rob Nelson, Ming Lei

>
>>> I haven't event tried debugging this next one: doing an insmod+rmmod
>>> caused this warning followed by a panic:
>>>
>>
>> I'll look into it. Thanks
>
> nr_tags must be uninitialized or screwed up somehow, otherwise I don't
> see how that kmalloc() could warn on being too large. Keith, are you
> running with slab debugging? Matias, might be worth trying.

Thanks for the hint.

Keith, I won't have a chance to fix it until tomorrow. Let me know if
you find it in the meantime.

I've put up an nvmemq_v12 branch on the GitHub repository with
q_suspended removed.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11] NVMe: Convert to blk-mq
  2014-08-14 15:39             ` Matias Bjorling
@ 2014-08-14 18:20               ` Keith Busch
  -1 siblings, 0 replies; 22+ messages in thread
From: Keith Busch @ 2014-08-14 18:20 UTC (permalink / raw)
  To: Matias Bjorling
  Cc: Jens Axboe, Keith Busch, Matthew Wilcox, Sam Bradshaw (sbradshaw),
	LKML, linux-nvme, Christoph Hellwig, Rob Nelson, Ming Lei

On Thu, 14 Aug 2014, Matias Bjorling wrote:
>> nr_tags must be uninitialized or screwed up somehow, otherwise I don't
>> see how that kmalloc() could warn on being too large. Keith, are you
>> running with slab debugging? Matias, might be worth trying.

The queue's tags were freed in 'blk_mq_map_swqueue' because some queues
weren't mapped to a s/w queue, but the driver has a pointer to that
freed memory, so it's a use-after-free error.

This part of the driver looks different than it did in v8 when I last
tested. The nvme_queue used to have a pointer to the 'hctx', but now it
points directly to the 'tags', which doesn't appear to be safe.
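
To make the lifetime problem concrete, the failing sequence is roughly
the following; the blk_mq_map_swqueue() fragment is paraphrased from
the 3.16 core code, not quoted verbatim:

	/* 1. At init, the driver caches a raw pointer into the tag set: */
	dev->queues[i]->tags = dev->tagset.tags[i - 1];

	/* 2. blk_mq_map_swqueue() later frees the tags of any hctx that
	 *    got no s/w queues mapped to it (here, 16 of 31 are mapped):
	 */
	if (!hctx->nr_ctx) {
		blk_mq_free_rq_map(set, set->tags[i], i);
		set->tags[i] = NULL;
		hctx->tags = NULL;	/* nvmeq->tags still points at it */
	}

	/* 3. nvme_clear_queue() then walks the freed tag map: */
	blk_mq_tag_busy_iter(nvmeq->tags, nvme_cancel_queue_ios, nvmeq);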

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11] NVMe: Convert to blk-mq
  2014-08-14 15:09           ` Jens Axboe
@ 2014-08-14 23:09             ` Keith Busch
  -1 siblings, 0 replies; 22+ messages in thread
From: Keith Busch @ 2014-08-14 23:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matias Bjørling, Keith Busch, Matthew Wilcox,
	Sam Bradshaw (sbradshaw),
	LKML, linux-nvme, Christoph Hellwig, Rob Nelson, Ming Lei

On Thu, 14 Aug 2014, Jens Axboe wrote:
> nr_tags must be uninitialized or screwed up somehow, otherwise I don't
> see how that kmalloc() could warn on being too large. Keith, are you
> running with slab debugging? Matias, might be worth trying.

The allocation and freeing of blk-mq parts seems a bit asymmetrical
to me. The 'tags' belong to the tagset, but any request_queue using
that tagset may free the tags. I looked into separating the tag
allocation concerns, but that needs more time than I have, so here is
my quick-fix driver patch, forcing tag access through the hw_ctx.

---
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 384dc91..91432d2 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -109,7 +109,7 @@ struct nvme_queue {
  	u8 cqe_seen;
  	u8 q_suspended;
  	struct async_cmd_info cmdinfo;
-	struct blk_mq_tags *tags;
+	struct blk_mq_hw_ctx *hctx;
  };

  /*
@@ -148,6 +148,7 @@ static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
  	struct nvme_queue *nvmeq = dev->queues[0];

  	hctx->driver_data = nvmeq;
+	nvmeq->hctx = hctx;
  	return 0;
  }

@@ -174,6 +175,7 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
  	irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
  								hctx->cpumask);
  	hctx->driver_data = nvmeq;
+	nvmeq->hctx = hctx;
  	return 0;
  }

@@ -280,8 +282,7 @@ static void async_completion(struct nvme_queue *nvmeq, void *ctx,
  static inline struct nvme_cmd_info *get_cmd_from_tag(struct nvme_queue *nvmeq,
  				  unsigned int tag)
  {
-	struct request *req = blk_mq_tag_to_rq(nvmeq->tags, tag);
-
+	struct request *req = blk_mq_tag_to_rq(nvmeq->hctx->tags, tag);
  	return blk_mq_rq_to_pdu(req);
  }

@@ -654,8 +655,6 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
  		nvme_submit_flush(nvmeq, ns, req->tag);
  	else
  		nvme_submit_iod(nvmeq, iod, ns);
-
- queued:
  	nvme_process_cq(nvmeq);
  	spin_unlock_irq(&nvmeq->q_lock);
  	return BLK_MQ_RQ_QUEUE_OK;
@@ -1051,9 +1050,8 @@ static void nvme_cancel_queue_ios(void *data, unsigned long *tag_map)
  		if (tag >= qdepth)
  			break;

-		req = blk_mq_tag_to_rq(nvmeq->tags, tag++);
+		req = blk_mq_tag_to_rq(nvmeq->hctx->tags, tag++);
  		cmd = blk_mq_rq_to_pdu(req);
  		if (cmd->ctx == CMD_CTX_CANCELLED)
  			continue;

@@ -1132,8 +1130,8 @@ static void nvme_clear_queue(struct nvme_queue *nvmeq)
  {
  	spin_lock_irq(&nvmeq->q_lock);
  	nvme_process_cq(nvmeq);
-	if (nvmeq->tags)
-		blk_mq_tag_busy_iter(nvmeq->tags, nvme_cancel_queue_ios, nvmeq);
+	if (nvmeq->hctx->tags)
+		blk_mq_tag_busy_iter(nvmeq->hctx->tags, nvme_cancel_queue_ios, nvmeq);
  	spin_unlock_irq(&nvmeq->q_lock);
  }

@@ -1353,8 +1351,6 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
  		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
  			return -ENOMEM;

-		dev->queues[0]->tags = dev->admin_tagset.tags[0];
-
  		dev->admin_q = blk_mq_init_queue(&dev->admin_tagset);
  		if (!dev->admin_q) {
  			blk_mq_free_tag_set(&dev->admin_tagset);
@@ -2055,9 +2051,6 @@ static int nvme_dev_add(struct nvme_dev *dev)
  	if (blk_mq_alloc_tag_set(&dev->tagset))
  		goto out;

-	for (i = 1; i < dev->online_queues; i++)
-		dev->queues[i]->tags = dev->tagset.tags[i - 1];
-
  	id_ns = mem;
  	for (i = 1; i <= nn; i++) {
  		res = nvme_identify(dev, i, 0, dma_addr);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v11] NVMe: Convert to blk-mq
  2014-08-14 23:09             ` Keith Busch
@ 2014-08-15 10:09               ` Matias Bjorling
  -1 siblings, 0 replies; 22+ messages in thread
From: Matias Bjorling @ 2014-08-15 10:09 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe
  Cc: Matthew Wilcox, Sam Bradshaw (sbradshaw),
	LKML, linux-nvme, Christoph Hellwig, Rob Nelson, Ming Lei

On 08/15/2014 01:09 AM, Keith Busch wrote:
>
> The allocation and freeing of blk-mq parts seems a bit asymmetrical
> to me. The 'tags' belong to the tagset, but any request_queue using
> that tagset may free the tags. I looked to separate the tag allocation
> concerns, but that's more time than I have, so this is my quick-fix
> driver patch, forcing tag access through the hw_ctx.
>

I moved nvmeq->hctx->tags into nvmeq->tags in the last version. I missed
the frees in blk_mq_map_swqueue. Good catch.

The previous method might have another problem: if there are two
namespaces sharing a tag set, the init_hctx fn could be called with a
different hctx for the same nvmeq, leading to false tag sharing between
nvme queues.
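
In other words, a hypothetical illustration of the hazard (hctx_ns1 and
hctx_ns2 are invented names, not code from either patch):

	/* Two namespaces share one tag set, so each hw queue index has
	 * two hctx instances, one per request_queue. init_hctx runs once
	 * per (request_queue, hctx) pair, so the second call silently
	 * overwrites the first association:
	 */
	nvme_init_hctx(hctx_ns1, dev, i);	/* nvmeq->hctx = ns1's hctx */
	nvme_init_hctx(hctx_ns2, dev, i);	/* nvmeq->hctx = ns2's hctx */

	/* Completions for ns1's requests are now resolved through ns2's
	 * hctx, so the nvme queue is effectively tied to the wrong hctx.
	 */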


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-08-15 10:09 UTC | newest]

Thread overview: 11 messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-26  9:07 [PATCH v11] Convert NVMe driver to blk-mq Matias Bjørling
2014-07-26  9:07 ` [PATCH v11] NVMe: Convert " Matias Bjørling
2014-08-10 17:27   ` Matias Bjørling
2014-08-13 22:27     ` Keith Busch
2014-08-14  8:25       ` Matias Bjørling
2014-08-14 15:09         ` Jens Axboe
2014-08-14 15:33           ` Keith Busch
2014-08-14 15:39           ` Matias Bjorling
2014-08-14 18:20             ` Keith Busch
2014-08-14 23:09           ` Keith Busch
2014-08-15 10:09             ` Matias Bjorling
