* block and nvme polling improvements V3
@ 2018-12-02 16:46 ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

Hi all,

this series optimizes a few bits in the block layer and nvme code
related to polling.

It starts by moving the recently introduced queue types entirely into
the block layer instead of requiring an indirect call for them.

It then switches nvme and the block layer to only allow polling
with separate poll queues, which allows us to realize the following
benefits:

 - poll queues can safely avoid disabling irqs on any locks
   (we already do that in NVMe, but it isn't 100% kosher as-is)
 - regular interrupt driven queues can drop the CQ lock entirely,
   as we won't race for completing CQs

Then we drop the NVMe RDMA polling code, as it doesn't follow the new model,
and remove the nvme multipath polling code including the block hooks
for it, which didn't make much sense to start with given that we
started bypassing the multipath code for single controller subsystems
early on.  Last but not least we enable polling in the block layer
by default if the underlying driver has poll queues, as that already
requires explicit user action.
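
To make the locking benefits listed above more concrete, here is a rough
sketch (illustrative only, with made-up demo_* names rather than the actual
pci.c code) of why a dedicated poll queue can get away with a plain
spin_lock(), while a purely interrupt driven queue doesn't need a CQ lock
at all once nothing else reaps its completions:

#include <linux/interrupt.h>
#include <linux/spinlock.h>

struct demo_queue {
	spinlock_t	cq_lock;
	/* ... CQ ring state would live here ... */
};

/* placeholder for the real CQE reaping, returns number of CQEs found */
static int demo_process_cq(struct demo_queue *q)
{
	return 0;
}

/*
 * Dedicated poll queue: never touched from hard irq context, so a plain
 * spin_lock() is enough and the irqsave/irqrestore overhead goes away.
 */
static int demo_poll(struct demo_queue *q)
{
	int found;

	spin_lock(&q->cq_lock);
	found = demo_process_cq(q);
	spin_unlock(&q->cq_lock);
	return found;
}

/*
 * Interrupt driven queue: completions are only ever reaped from the irq
 * handler, so there is nothing to race with and no CQ lock is needed.
 */
static irqreturn_t demo_irq(int irq, void *data)
{
	return demo_process_cq(data) ? IRQ_HANDLED : IRQ_NONE;
}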

Note that it would be really nice to have polling back for RDMA with
dedicated poll queues, but that might take a while.  Also based on
Jens' polling aio patches we could now implement a model in nvmet
where we have a thread polling both the backend nvme device and
the RDMA CQs, which might give us some pretty nice performance
(I know Sagi looked into something similar a while ago).

A git tree is also available at:

    git://git.infradead.org/users/hch/block.git nvme-polling

Gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-polling

Changes since v2:
 - fix a changelog typo
 - report a string instead of an index from the type sysfs attribute
 - move to per-queue completions for queue deletion
 - clear NVMEQ_DELETE_ERROR when initializing a queue

Changes since v1:
 - rebased to the latest block for-4.21 tree

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 01/13] block: move queues types to the block layer
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

Having another indirect call in the fast path doesn't really help
in our post-spectre world.  Also having too many queue types is just
going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the
first index now is the default queue for everything not explicitly
marked, and the optional ones are the read and poll queues.
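
As an illustration only (a hypothetical demo_* driver mirroring the
nvme_pci_map_queues() hunk below, not part of this patch): a driver now just
describes how many hardware queues back each fixed type from its
->map_queues callback, with nr_maps set to HCTX_MAX_TYPES at tag set setup,
instead of supplying an ->rq_flags_to_type hook:

#include <linux/blk-mq.h>

/* hypothetical example: 8 default, 2 read and 2 poll hardware queues */
static int demo_map_queues(struct blk_mq_tag_set *set)
{
	static const unsigned int nr[HCTX_MAX_TYPES] = {
		[HCTX_TYPE_DEFAULT]	= 8,
		[HCTX_TYPE_READ]	= 2,
		[HCTX_TYPE_POLL]	= 2,
	};
	unsigned int i, qoff = 0;

	for (i = 0; i < set->nr_maps; i++) {
		struct blk_mq_queue_map *map = &set->map[i];

		map->nr_queues = nr[i];
		map->queue_offset = qoff;
		qoff += map->nr_queues;
		blk_mq_map_queues(map);	/* default cpu <-> queue spreading */
	}
	return 0;
}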

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-sysfs.c    |  9 +++++-
 block/blk-mq.h          | 21 +++++++------
 drivers/nvme/host/pci.c | 68 +++++++++++++++--------------------------
 include/linux/blk-mq.h  | 15 ++++-----
 4 files changed, 51 insertions(+), 62 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 6efef1f679f0..9c2df137256a 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -173,9 +173,16 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static const char *const hctx_types[] = {
+	[HCTX_TYPE_DEFAULT]	= "default",
+	[HCTX_TYPE_READ]	= "read",
+	[HCTX_TYPE_POLL]	= "poll",
+};
+
 static ssize_t blk_mq_hw_sysfs_type_show(struct blk_mq_hw_ctx *hctx, char *page)
 {
-	return sprintf(page, "%u\n", hctx->type);
+	BUILD_BUG_ON(ARRAY_SIZE(hctx_types) != HCTX_MAX_TYPES);
+	return sprintf(page, "%s\n", hctx_types[hctx->type]);
 }
 
 static struct attribute *default_ctx_attrs[] = {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7291e5379358..a664ea44ffd4 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -81,16 +81,14 @@ extern int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int);
 /*
  * blk_mq_map_queue_type() - map (hctx_type,cpu) to hardware queue
  * @q: request queue
- * @hctx_type: the hctx type index
+ * @type: the hctx type index
  * @cpu: CPU
  */
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
-							  unsigned int hctx_type,
+							  enum hctx_type type,
 							  unsigned int cpu)
 {
-	struct blk_mq_tag_set *set = q->tag_set;
-
-	return q->queue_hw_ctx[set->map[hctx_type].mq_map[cpu]];
+	return q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]];
 }
 
 /*
@@ -103,12 +101,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 						     unsigned int flags,
 						     unsigned int cpu)
 {
-	int hctx_type = 0;
+	enum hctx_type type = HCTX_TYPE_DEFAULT;
+
+	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
+	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
+		type = HCTX_TYPE_POLL;
 
-	if (q->mq_ops->rq_flags_to_type)
-		hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
+	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
+		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
+		type = HCTX_TYPE_READ;
 
-	return blk_mq_map_queue_type(q, hctx_type, cpu);
+	return blk_mq_map_queue_type(q, type, cpu);
 }
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 527907aa6903..a1bb4bb92e7f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -95,13 +95,6 @@ struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 
-enum {
-	NVMEQ_TYPE_READ,
-	NVMEQ_TYPE_WRITE,
-	NVMEQ_TYPE_POLL,
-	NVMEQ_TYPE_NR,
-};
-
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
@@ -115,7 +108,7 @@ struct nvme_dev {
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
 	unsigned max_qid;
-	unsigned io_queues[NVMEQ_TYPE_NR];
+	unsigned io_queues[HCTX_MAX_TYPES];
 	unsigned int num_vecs;
 	int q_depth;
 	u32 db_stride;
@@ -499,10 +492,10 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 
 		map->nr_queues = dev->io_queues[i];
 		if (!map->nr_queues) {
-			BUG_ON(i == NVMEQ_TYPE_READ);
+			BUG_ON(i == HCTX_TYPE_DEFAULT);
 
 			/* shared set, reuse read set parameters */
-			map->nr_queues = dev->io_queues[NVMEQ_TYPE_READ];
+			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
 			qoff = 0;
 			offset = queue_irq_offset(dev);
 		}
@@ -512,7 +505,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 		 * affinity), so use the regular blk-mq cpu mapping
 		 */
 		map->queue_offset = qoff;
-		if (i != NVMEQ_TYPE_POLL)
+		if (i != HCTX_TYPE_POLL)
 			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
 		else
 			blk_mq_map_queues(map);
@@ -961,16 +954,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
-static int nvme_rq_flags_to_type(struct request_queue *q, unsigned int flags)
-{
-	if ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
-		return NVMEQ_TYPE_POLL;
-	if ((flags & REQ_OP_MASK) == REQ_OP_READ)
-		return NVMEQ_TYPE_READ;
-
-	return NVMEQ_TYPE_WRITE;
-}
-
 static void nvme_pci_complete_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1634,7 +1617,6 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 #define NVME_SHARED_MQ_OPS					\
 	.queue_rq		= nvme_queue_rq,		\
 	.commit_rqs		= nvme_commit_rqs,		\
-	.rq_flags_to_type	= nvme_rq_flags_to_type,	\
 	.complete		= nvme_pci_complete_rq,		\
 	.init_hctx		= nvme_init_hctx,		\
 	.init_request		= nvme_init_request,		\
@@ -1785,9 +1767,9 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
 	}
 
 	max = min(dev->max_qid, dev->ctrl.queue_count - 1);
-	if (max != 1 && dev->io_queues[NVMEQ_TYPE_POLL]) {
-		rw_queues = dev->io_queues[NVMEQ_TYPE_READ] +
-				dev->io_queues[NVMEQ_TYPE_WRITE];
+	if (max != 1 && dev->io_queues[HCTX_TYPE_POLL]) {
+		rw_queues = dev->io_queues[HCTX_TYPE_DEFAULT] +
+				dev->io_queues[HCTX_TYPE_READ];
 	} else {
 		rw_queues = max;
 	}
@@ -2076,9 +2058,9 @@ static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int nr_io_queues)
 	 * Setup read/write queue split
 	 */
 	if (nr_io_queues == 1) {
-		dev->io_queues[NVMEQ_TYPE_READ] = 1;
-		dev->io_queues[NVMEQ_TYPE_WRITE] = 0;
-		dev->io_queues[NVMEQ_TYPE_POLL] = 0;
+		dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
+		dev->io_queues[HCTX_TYPE_READ] = 0;
+		dev->io_queues[HCTX_TYPE_POLL] = 0;
 		return;
 	}
 
@@ -2095,10 +2077,10 @@ static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int nr_io_queues)
 			this_p_queues = nr_io_queues - 1;
 		}
 
-		dev->io_queues[NVMEQ_TYPE_POLL] = this_p_queues;
+		dev->io_queues[HCTX_TYPE_POLL] = this_p_queues;
 		nr_io_queues -= this_p_queues;
 	} else
-		dev->io_queues[NVMEQ_TYPE_POLL] = 0;
+		dev->io_queues[HCTX_TYPE_POLL] = 0;
 
 	/*
 	 * If 'write_queues' is set, ensure it leaves room for at least
@@ -2112,11 +2094,11 @@ static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int nr_io_queues)
 	 * a queue set.
 	 */
 	if (!this_w_queues) {
-		dev->io_queues[NVMEQ_TYPE_WRITE] = 0;
-		dev->io_queues[NVMEQ_TYPE_READ] = nr_io_queues;
+		dev->io_queues[HCTX_TYPE_DEFAULT] = nr_io_queues;
+		dev->io_queues[HCTX_TYPE_READ] = 0;
 	} else {
-		dev->io_queues[NVMEQ_TYPE_WRITE] = this_w_queues;
-		dev->io_queues[NVMEQ_TYPE_READ] = nr_io_queues - this_w_queues;
+		dev->io_queues[HCTX_TYPE_DEFAULT] = this_w_queues;
+		dev->io_queues[HCTX_TYPE_READ] = nr_io_queues - this_w_queues;
 	}
 }
 
@@ -2138,8 +2120,8 @@ static int nvme_setup_irqs(struct nvme_dev *dev, int nr_io_queues)
 	 */
 	do {
 		nvme_calc_io_queues(dev, nr_io_queues);
-		irq_sets[0] = dev->io_queues[NVMEQ_TYPE_READ];
-		irq_sets[1] = dev->io_queues[NVMEQ_TYPE_WRITE];
+		irq_sets[0] = dev->io_queues[HCTX_TYPE_DEFAULT];
+		irq_sets[1] = dev->io_queues[HCTX_TYPE_READ];
 		if (!irq_sets[1])
 			affd.nr_sets = 1;
 
@@ -2226,12 +2208,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 
 	dev->num_vecs = result;
 	result = max(result - 1, 1);
-	dev->max_qid = result + dev->io_queues[NVMEQ_TYPE_POLL];
+	dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL];
 
-	dev_info(dev->ctrl.device, "%d/%d/%d read/write/poll queues\n",
-					dev->io_queues[NVMEQ_TYPE_READ],
-					dev->io_queues[NVMEQ_TYPE_WRITE],
-					dev->io_queues[NVMEQ_TYPE_POLL]);
+	dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
+					dev->io_queues[HCTX_TYPE_DEFAULT],
+					dev->io_queues[HCTX_TYPE_READ],
+					dev->io_queues[HCTX_TYPE_POLL]);
 
 	/*
 	 * Should investigate if there's a performance win from allocating
@@ -2332,13 +2314,13 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	int ret;
 
 	if (!dev->ctrl.tagset) {
-		if (!dev->io_queues[NVMEQ_TYPE_POLL])
+		if (!dev->io_queues[HCTX_TYPE_POLL])
 			dev->tagset.ops = &nvme_mq_ops;
 		else
 			dev->tagset.ops = &nvme_mq_poll_noirq_ops;
 
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
-		dev->tagset.nr_maps = NVMEQ_TYPE_NR;
+		dev->tagset.nr_maps = HCTX_MAX_TYPES;
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
 		dev->tagset.numa_node = dev_to_node(dev->dev);
 		dev->tagset.queue_depth =
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 467f1dd21ccf..57eda7b20243 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -81,8 +81,12 @@ struct blk_mq_queue_map {
 	unsigned int queue_offset;
 };
 
-enum {
-	HCTX_MAX_TYPES = 3,
+enum hctx_type {
+	HCTX_TYPE_DEFAULT,	/* all I/O not otherwise accounted for */
+	HCTX_TYPE_READ,		/* just for READ I/O */
+	HCTX_TYPE_POLL,		/* polled I/O of any kind */
+
+	HCTX_MAX_TYPES,
 };
 
 struct blk_mq_tag_set {
@@ -118,8 +122,6 @@ struct blk_mq_queue_data {
 typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
 		const struct blk_mq_queue_data *);
 typedef void (commit_rqs_fn)(struct blk_mq_hw_ctx *);
-/* takes rq->cmd_flags as input, returns a hardware type index */
-typedef int (rq_flags_to_type_fn)(struct request_queue *, unsigned int);
 typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *);
 typedef void (put_budget_fn)(struct blk_mq_hw_ctx *);
 typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
@@ -154,11 +156,6 @@ struct blk_mq_ops {
 	 */
 	commit_rqs_fn		*commit_rqs;
 
-	/*
-	 * Return a queue map type for the given request/bio flags
-	 */
-	rq_flags_to_type_fn	*rq_flags_to_type;
-
 	/*
 	 * Reserve budget before queue request, once .queue_rq is
 	 * run, it is driver's responsibility to release the
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

This gets rid of all the messing with cq_vector and the ->polled field
by using an atomic bitop to mark the queue enabled or not.
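
Boiled down (illustrative demo_* names, not the patch itself), the pattern
is a single atomic flags word: the submission path does a cheap test_bit(),
and teardown claims the enabled -> disabled transition with
test_and_clear_bit():

#include <linux/bitops.h>
#include <linux/types.h>

#define DEMOQ_ENABLED	0	/* bit number in ->flags */

struct demo_queue {
	unsigned long flags;
};

/* submission fast path: no lock, just one bit test */
static bool demo_may_submit(struct demo_queue *q)
{
	return test_bit(DEMOQ_ENABLED, &q->flags);
}

/* returns 1 if the queue was already disabled, 0 if we just disabled it */
static int demo_suspend(struct demo_queue *q)
{
	if (!test_and_clear_bit(DEMOQ_ENABLED, &q->flags))
		return 1;
	mb();	/* make sure submitters see the cleared bit before teardown */
	return 0;
}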

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 43 ++++++++++++++---------------------------
 1 file changed, 15 insertions(+), 28 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a1bb4bb92e7f..022395a319f4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -201,7 +201,8 @@ struct nvme_queue {
 	u16 last_cq_head;
 	u16 qid;
 	u8 cq_phase;
-	u8 polled;
+	unsigned long flags;
+#define NVMEQ_ENABLED		0
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -927,7 +928,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	 * We should not need to do this, but we're still using this to
 	 * ensure we can drain requests on a dying queue.
 	 */
-	if (unlikely(nvmeq->cq_vector < 0 && !nvmeq->polled))
+	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
 		return BLK_STS_IOERR;
 
 	ret = nvme_setup_cmd(ns, req, &cmnd);
@@ -1395,31 +1396,19 @@ static void nvme_free_queues(struct nvme_dev *dev, int lowest)
  */
 static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 {
-	int vector;
-
-	spin_lock_irq(&nvmeq->cq_lock);
-	if (nvmeq->cq_vector == -1 && !nvmeq->polled) {
-		spin_unlock_irq(&nvmeq->cq_lock);
+	if (!test_and_clear_bit(NVMEQ_ENABLED, &nvmeq->flags))
 		return 1;
-	}
-	vector = nvmeq->cq_vector;
-	nvmeq->dev->online_queues--;
-	nvmeq->cq_vector = -1;
-	nvmeq->polled = false;
-	spin_unlock_irq(&nvmeq->cq_lock);
 
-	/*
-	 * Ensure that nvme_queue_rq() sees it ->cq_vector == -1 without
-	 * having to grab the lock.
-	 */
+	/* ensure that nvme_queue_rq() sees NVMEQ_ENABLED cleared */
 	mb();
 
+	nvmeq->dev->online_queues--;
 	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
 		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
-
-	if (vector != -1)
-		pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
-
+	if (nvmeq->cq_vector == -1)
+		return 0;
+	pci_free_irq(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector, nvmeq);
+	nvmeq->cq_vector = -1;
 	return 0;
 }
 
@@ -1578,13 +1567,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	else if (result)
 		goto release_cq;
 
-	/*
-	 * Set cq_vector after alloc cq/sq, otherwise nvme_suspend_queue will
-	 * invoke free_irq for it and cause a 'Trying to free already-free IRQ
-	 * xxx' warning if the create CQ/SQ command times out.
-	 */
 	nvmeq->cq_vector = vector;
-	nvmeq->polled = polled;
 	nvme_init_queue(nvmeq, qid);
 
 	if (vector != -1) {
@@ -1593,11 +1576,11 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 			goto release_sq;
 	}
 
+	set_bit(NVMEQ_ENABLED, &nvmeq->flags);
 	return result;
 
 release_sq:
 	nvmeq->cq_vector = -1;
-	nvmeq->polled = false;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
 release_cq:
@@ -1751,6 +1734,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 		return result;
 	}
 
+	set_bit(NVMEQ_ENABLED, &nvmeq->flags);
 	return result;
 }
 
@@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 
 	if (nr_io_queues == 0)
 		return 0;
+	
+	clear_bit(NVMEQ_ENABLED, &adminq->flags);
 
 	if (dev->cmb_use_sqes) {
 		result = nvme_cmb_qdepth(dev, nr_io_queues,
@@ -2227,6 +2213,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 		adminq->cq_vector = -1;
 		return result;
 	}
+	set_bit(NVMEQ_ENABLED, &adminq->flags);
 	return nvme_create_io_queues(dev);
 }
 
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 03/13] nvme-pci: cleanup SQ allocation a bit
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

Use a bit flag to mark if the SQ was allocated from the CMB, and clean
up the surrounding code a bit.
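
For reference, the free path (taken from the hunk below) then just keys off
that flag to pick the matching free routine:

	if (test_and_clear_bit(NVMEQ_SQ_CMB, &nvmeq->flags)) {
		pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
				nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth));
	} else {
		dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
				nvmeq->sq_cmds, nvmeq->sq_dma_addr);
	}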

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 022395a319f4..b820dd0351cb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -186,7 +186,6 @@ struct nvme_queue {
 	struct nvme_dev *dev;
 	spinlock_t sq_lock;
 	struct nvme_command *sq_cmds;
-	bool sq_cmds_is_io;
 	spinlock_t cq_lock ____cacheline_aligned_in_smp;
 	volatile struct nvme_completion *cqes;
 	struct blk_mq_tags **tags;
@@ -203,6 +202,7 @@ struct nvme_queue {
 	u8 cq_phase;
 	unsigned long flags;
 #define NVMEQ_ENABLED		0
+#define NVMEQ_SQ_CMB		1
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -1366,17 +1366,15 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
 				(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
+	if (!nvmeq->sq_cmds)
+		return;
 
-	if (nvmeq->sq_cmds) {
-		if (nvmeq->sq_cmds_is_io)
-			pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
-					nvmeq->sq_cmds,
-					SQ_SIZE(nvmeq->q_depth));
-		else
-			dma_free_coherent(nvmeq->q_dmadev,
-					  SQ_SIZE(nvmeq->q_depth),
-					  nvmeq->sq_cmds,
-					  nvmeq->sq_dma_addr);
+	if (test_and_clear_bit(NVMEQ_SQ_CMB, &nvmeq->flags)) {
+		pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
+				nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth));
+	} else {
+		dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
+				nvmeq->sq_cmds, nvmeq->sq_dma_addr);
 	}
 }
 
@@ -1462,15 +1460,14 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
 		nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
 		nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
 						nvmeq->sq_cmds);
-		nvmeq->sq_cmds_is_io = true;
-	}
-
-	if (!nvmeq->sq_cmds) {
-		nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
-					&nvmeq->sq_dma_addr, GFP_KERNEL);
-		nvmeq->sq_cmds_is_io = false;
+		if (nvmeq->sq_dma_addr) {
+			set_bit(NVMEQ_SQ_CMB, &nvmeq->flags);
+			return 0; 
+		}
 	}
 
+	nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+				&nvmeq->sq_dma_addr, GFP_KERNEL);
 	if (!nvmeq->sq_cmds)
 		return -ENOMEM;
 	return 0;
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 04/13] nvme-pci: only allow polling with separate poll queues
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

This will allow us to simplify both the regular NVMe interrupt handler
and the upcoming aio poll code.  In addition to that the separate
queues are generally a good idea for performance reasons.
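
Concretely (from the hunk below), only the ops table used for devices with
dedicated poll queues carries a ->poll handler at all, so everything else
never even sees the polling path:

	static const struct blk_mq_ops nvme_mq_ops = {
		NVME_SHARED_MQ_OPS,
	};

	static const struct blk_mq_ops nvme_mq_poll_ops = {
		NVME_SHARED_MQ_OPS,
		.poll			= nvme_poll,
	};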

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b820dd0351cb..d42bb76e5e78 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1089,13 +1089,6 @@ static int __nvme_poll(struct nvme_queue *nvmeq, unsigned int tag)
 }
 
 static int nvme_poll(struct blk_mq_hw_ctx *hctx)
-{
-	struct nvme_queue *nvmeq = hctx->driver_data;
-
-	return __nvme_poll(nvmeq, -1);
-}
-
-static int nvme_poll_noirq(struct blk_mq_hw_ctx *hctx)
 {
 	struct nvme_queue *nvmeq = hctx->driver_data;
 	u16 start, end;
@@ -1605,12 +1598,11 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 
 static const struct blk_mq_ops nvme_mq_ops = {
 	NVME_SHARED_MQ_OPS,
-	.poll			= nvme_poll,
 };
 
-static const struct blk_mq_ops nvme_mq_poll_noirq_ops = {
+static const struct blk_mq_ops nvme_mq_poll_ops = {
 	NVME_SHARED_MQ_OPS,
-	.poll			= nvme_poll_noirq,
+	.poll			= nvme_poll,
 };
 
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
@@ -2298,10 +2290,10 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	int ret;
 
 	if (!dev->ctrl.tagset) {
-		if (!dev->io_queues[HCTX_TYPE_POLL])
-			dev->tagset.ops = &nvme_mq_ops;
+		if (dev->io_queues[HCTX_TYPE_POLL])
+			dev->tagset.ops = &nvme_mq_poll_ops;
 		else
-			dev->tagset.ops = &nvme_mq_poll_noirq_ops;
+			dev->tagset.ops = &nvme_mq_ops;
 
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
 		dev->tagset.nr_maps = HCTX_MAX_TYPES;
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

We have three places that can poll for I/O completions on a normal
interrupt-enabled queue.  All of them are in slow path code, so
consolidate them into a single helper that uses spin_lock_irqsave and
removes the fast path cqe_pending check.
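
In other words, a minimal sketch of the idea (the demo_* names are made up;
see nvme_poll_irqdisable() in the hunk below for the real thing):

#include <linux/spinlock.h>

struct demo_queue {
	spinlock_t cq_lock;
};

/* placeholder for the real CQE reaping */
static int demo_process_cq(struct demo_queue *q)
{
	return 0;
}

/*
 * Callable from any context: process context with irqs on, callers that
 * already run with irqs off, and irq handlers alike.  A plain
 * spin_lock_irq()/spin_unlock_irq() pair would wrongly re-enable
 * interrupts in the latter cases.
 */
static int demo_poll_irqdisable(struct demo_queue *q)
{
	unsigned long flags;
	int found;

	spin_lock_irqsave(&q->cq_lock, flags);
	found = demo_process_cq(q);
	spin_unlock_irqrestore(&q->cq_lock, flags);
	return found;
}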

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 35 ++++++++++++-----------------------
 1 file changed, 12 insertions(+), 23 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d42bb76e5e78..10c26a2e355a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1072,17 +1072,19 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_NONE;
 }
 
-static int __nvme_poll(struct nvme_queue *nvmeq, unsigned int tag)
+/*
+ * Poll for completions on any queue, including those not dedicated to polling.
+ * Can be called from any context.
+ */
+static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
 {
+	unsigned long flags;
 	u16 start, end;
 	int found;
 
-	if (!nvme_cqe_pending(nvmeq))
-		return 0;
-
-	spin_lock_irq(&nvmeq->cq_lock);
+	spin_lock_irqsave(&nvmeq->cq_lock, flags);
 	found = nvme_process_cq(nvmeq, &start, &end, tag);
-	spin_unlock_irq(&nvmeq->cq_lock);
+	spin_unlock_irqrestore(&nvmeq->cq_lock, flags);
 
 	nvme_complete_cqes(nvmeq, start, end);
 	return found;
@@ -1279,7 +1281,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	/*
 	 * Did we miss an interrupt?
 	 */
-	if (__nvme_poll(nvmeq, req->tag)) {
+	if (nvme_poll_irqdisable(nvmeq, req->tag)) {
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, completion polled\n",
 			 req->tag, nvmeq->qid);
@@ -1406,18 +1408,13 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 static void nvme_disable_admin_queue(struct nvme_dev *dev, bool shutdown)
 {
 	struct nvme_queue *nvmeq = &dev->queues[0];
-	u16 start, end;
 
 	if (shutdown)
 		nvme_shutdown_ctrl(&dev->ctrl);
 	else
 		nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
 
-	spin_lock_irq(&nvmeq->cq_lock);
-	nvme_process_cq(nvmeq, &start, &end, -1);
-	spin_unlock_irq(&nvmeq->cq_lock);
-
-	nvme_complete_cqes(nvmeq, start, end);
+	nvme_poll_irqdisable(nvmeq, -1);
 }
 
 static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
@@ -2217,17 +2214,9 @@ static void nvme_del_queue_end(struct request *req, blk_status_t error)
 static void nvme_del_cq_end(struct request *req, blk_status_t error)
 {
 	struct nvme_queue *nvmeq = req->end_io_data;
-	u16 start, end;
 
-	if (!error) {
-		unsigned long flags;
-
-		spin_lock_irqsave(&nvmeq->cq_lock, flags);
-		nvme_process_cq(nvmeq, &start, &end, -1);
-		spin_unlock_irqrestore(&nvmeq->cq_lock, flags);
-
-		nvme_complete_cqes(nvmeq, start, end);
-	}
+	if (!error)
+		nvme_poll_irqdisable(nvmeq, -1);
 
 	nvme_del_queue_end(req, error);
 }
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

Pass the opcode for the delete SQ/CQ command as an argument instead of
the somewhat confusing pass loop.
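
With the opcode as an explicit argument the call site (last hunk below)
turns into two obvious passes, and the new bool return makes sure the CQ
deletions are only attempted if the SQ pass didn't time out:

	if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
		nvme_disable_io_queues(dev, nvme_admin_delete_cq);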

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 41 ++++++++++++++++++++---------------------
 1 file changed, 20 insertions(+), 21 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 10c26a2e355a..9ceba9900ca3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2244,31 +2244,29 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
 	return 0;
 }
 
-static void nvme_disable_io_queues(struct nvme_dev *dev)
+static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
 {
-	int pass, queues = dev->online_queues - 1;
+	int nr_queues = dev->online_queues - 1, sent = 0;
 	unsigned long timeout;
-	u8 opcode = nvme_admin_delete_sq;
 
-	for (pass = 0; pass < 2; pass++) {
-		int sent = 0, i = queues;
-
-		reinit_completion(&dev->ioq_wait);
+	reinit_completion(&dev->ioq_wait);
  retry:
-		timeout = ADMIN_TIMEOUT;
-		for (; i > 0; i--, sent++)
-			if (nvme_delete_queue(&dev->queues[i], opcode))
-				break;
-
-		while (sent--) {
-			timeout = wait_for_completion_io_timeout(&dev->ioq_wait, timeout);
-			if (timeout == 0)
-				return;
-			if (i)
-				goto retry;
-		}
-		opcode = nvme_admin_delete_cq;
+	timeout = ADMIN_TIMEOUT;
+	while (nr_queues > 0) {
+		if (nvme_delete_queue(&dev->queues[nr_queues], opcode))
+			break;
+		nr_queues--;
+		sent++;
 	}
+	while (sent--) {
+		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
+				timeout);
+		if (timeout == 0)
+			return false;
+		if (nr_queues)
+			goto retry;
+	}
+	return true;
 }
 
 /*
@@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead && dev->ctrl.queue_count > 0) {
-		nvme_disable_io_queues(dev);
+		if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
+			nvme_disable_io_queues(dev, nvme_admin_delete_cq);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
 	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

This is the last place outside of nvme_irq that handles CQEs from
interrupt context, and thus is in the way of removing the cq_lock for
normal queues, and avoiding lockdep warnings on the poll queues, for
which we already take it without IRQ disabling.
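
The effect (condensed from the hunks below) is that any remaining CQEs are
now reaped by the waiter in process context once the per-queue completion
fires, rather than from the ->end_io callback in irq context:

	timeout = wait_for_completion_io_timeout(&nvmeq->delete_done, timeout);
	if (timeout == 0)
		return false;

	/* handle any remaining CQEs */
	if (opcode == nvme_admin_delete_cq &&
	    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
		nvme_poll_irqdisable(nvmeq, -1);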

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9ceba9900ca3..2d5a468c35b1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -122,7 +122,6 @@ struct nvme_dev {
 	u32 cmbsz;
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
-	struct completion ioq_wait;
 
 	mempool_t *iod_mempool;
 
@@ -203,10 +202,12 @@ struct nvme_queue {
 	unsigned long flags;
 #define NVMEQ_ENABLED		0
 #define NVMEQ_SQ_CMB		1
+#define NVMEQ_DELETE_ERROR	2
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
 	u32 *dbbuf_cq_ei;
+	struct completion delete_done;
 };
 
 /*
@@ -1535,6 +1536,8 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	int result;
 	s16 vector;
 
+	clear_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
+
 	/*
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
@@ -2208,15 +2211,15 @@ static void nvme_del_queue_end(struct request *req, blk_status_t error)
 	struct nvme_queue *nvmeq = req->end_io_data;
 
 	blk_mq_free_request(req);
-	complete(&nvmeq->dev->ioq_wait);
+	complete(&nvmeq->delete_done);
 }
 
 static void nvme_del_cq_end(struct request *req, blk_status_t error)
 {
 	struct nvme_queue *nvmeq = req->end_io_data;
 
-	if (!error)
-		nvme_poll_irqdisable(nvmeq, -1);
+	if (error)
+		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
 
 	nvme_del_queue_end(req, error);
 }
@@ -2238,6 +2241,7 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
 	req->timeout = ADMIN_TIMEOUT;
 	req->end_io_data = nvmeq;
 
+	init_completion(&nvmeq->delete_done);
 	blk_execute_rq_nowait(q, NULL, req, false,
 			opcode == nvme_admin_delete_cq ?
 				nvme_del_cq_end : nvme_del_queue_end);
@@ -2249,7 +2253,6 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
 	int nr_queues = dev->online_queues - 1, sent = 0;
 	unsigned long timeout;
 
-	reinit_completion(&dev->ioq_wait);
  retry:
 	timeout = ADMIN_TIMEOUT;
 	while (nr_queues > 0) {
@@ -2258,11 +2261,20 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
 		nr_queues--;
 		sent++;
 	}
-	while (sent--) {
-		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
+	while (sent) {
+		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
+
+		timeout = wait_for_completion_io_timeout(&nvmeq->delete_done,
 				timeout);
 		if (timeout == 0)
 			return false;
+
+		/* handle any remaining CQEs */
+		if (opcode == nvme_admin_delete_cq &&
+		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
+			nvme_poll_irqdisable(nvmeq, -1);
+
+		sent--;
 		if (nr_queues)
 			goto retry;
 	}
@@ -2746,7 +2758,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
 	INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
 	mutex_init(&dev->shutdown_lock);
-	init_completion(&dev->ioq_wait);
 
 	result = nvme_setup_prp_pools(dev);
 	if (result)
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread
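
For readability, this is roughly how nvme_disable_io_queues() ends up
looking with patches 06 and 07 applied, reassembled from the hunks above
rather than quoted verbatim from the tree; nvme_delete_queue(),
nvme_poll_irqdisable() and the NVMEQ_DELETE_ERROR bit are the helpers
these two patches introduce:

static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
{
	int nr_queues = dev->online_queues - 1, sent = 0;
	unsigned long timeout;

 retry:
	timeout = ADMIN_TIMEOUT;
	/* fire off as many delete commands as the admin queue accepts */
	while (nr_queues > 0) {
		if (nvme_delete_queue(&dev->queues[nr_queues], opcode))
			break;
		nr_queues--;
		sent++;
	}
	/* then reap them, waiting on each queue's own completion */
	while (sent) {
		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];

		timeout = wait_for_completion_io_timeout(&nvmeq->delete_done,
				timeout);
		if (timeout == 0)
			return false;

		/* handle any remaining CQEs, now from process context */
		if (opcode == nvme_admin_delete_cq &&
		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
			nvme_poll_irqdisable(nvmeq, -1);

		sent--;
		if (nr_queues)
			goto retry;
	}
	return true;
}

nvme_dev_disable() drives this twice, first with nvme_admin_delete_sq
and, only if that pass fully succeeded, again with nvme_admin_delete_cq,
as shown in the patch 06 hunk quoted further up.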

* [PATCH 08/13] nvme-pci: remove the CQ lock for interrupt driven queues
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

Now that we can't poll regular, interrupt driven I/O queues there
is almost nothing that can race with an interrupt.  The only
possible other contexts polling a CQ are the error handler and
queue shutdown, and both are so far off in the slow path that
we can simply use the big hammer of disabling interrupts.

With that we can stop taking the cq_lock for normal queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2d5a468c35b1..4ccb4ea22ac6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -185,7 +185,8 @@ struct nvme_queue {
 	struct nvme_dev *dev;
 	spinlock_t sq_lock;
 	struct nvme_command *sq_cmds;
-	spinlock_t cq_lock ____cacheline_aligned_in_smp;
+	 /* only used for poll queues: */
+	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
 	volatile struct nvme_completion *cqes;
 	struct blk_mq_tags **tags;
 	dma_addr_t sq_dma_addr;
@@ -1050,12 +1051,16 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	irqreturn_t ret = IRQ_NONE;
 	u16 start, end;
 
-	spin_lock(&nvmeq->cq_lock);
+	/*
+	 * The rmb/wmb pair ensures we see all updates from a previous run of
+	 * the irq handler, even if that was on another CPU.
+	 */
+	rmb();
 	if (nvmeq->cq_head != nvmeq->last_cq_head)
 		ret = IRQ_HANDLED;
 	nvme_process_cq(nvmeq, &start, &end, -1);
 	nvmeq->last_cq_head = nvmeq->cq_head;
-	spin_unlock(&nvmeq->cq_lock);
+	wmb();
 
 	if (start != end) {
 		nvme_complete_cqes(nvmeq, start, end);
@@ -1079,13 +1084,24 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
  */
 static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
 {
-	unsigned long flags;
+	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
 	u16 start, end;
 	int found;
 
-	spin_lock_irqsave(&nvmeq->cq_lock, flags);
+	/*
+	 * For a poll queue we need to protect against the polling thread
+	 * using the CQ lock.  For normal interrupt driven threads we have
+	 * to disable the interrupt to avoid racing with it.
+	 */
+	if (nvmeq->cq_vector == -1)
+		spin_lock(&nvmeq->cq_poll_lock);
+	else
+		disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
 	found = nvme_process_cq(nvmeq, &start, &end, tag);
-	spin_unlock_irqrestore(&nvmeq->cq_lock, flags);
+	if (nvmeq->cq_vector == -1)
+		spin_unlock(&nvmeq->cq_poll_lock);
+	else
+		enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
 
 	nvme_complete_cqes(nvmeq, start, end);
 	return found;
@@ -1100,9 +1116,9 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx)
 	if (!nvme_cqe_pending(nvmeq))
 		return 0;
 
-	spin_lock(&nvmeq->cq_lock);
+	spin_lock(&nvmeq->cq_poll_lock);
 	found = nvme_process_cq(nvmeq, &start, &end, -1);
-	spin_unlock(&nvmeq->cq_lock);
+	spin_unlock(&nvmeq->cq_poll_lock);
 
 	nvme_complete_cqes(nvmeq, start, end);
 	return found;
@@ -1482,7 +1498,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
 	nvmeq->q_dmadev = dev->dev;
 	nvmeq->dev = dev;
 	spin_lock_init(&nvmeq->sq_lock);
-	spin_lock_init(&nvmeq->cq_lock);
+	spin_lock_init(&nvmeq->cq_poll_lock);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
@@ -1518,7 +1534,6 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 
-	spin_lock_irq(&nvmeq->cq_lock);
 	nvmeq->sq_tail = 0;
 	nvmeq->last_sq_tail = 0;
 	nvmeq->cq_head = 0;
@@ -1527,7 +1542,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
 	nvme_dbbuf_init(dev, nvmeq, qid);
 	dev->online_queues++;
-	spin_unlock_irq(&nvmeq->cq_lock);
+	wmb(); /* ensure the first interrupt sees the initialization */
 }
 
 static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread
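
Since each interrupt driven CQ now has a single consumer, the irq
handler itself, the fast path gets away with the rmb()/wmb() pair above,
and only the slow path still has to serialize explicitly.  Reassembled
from the hunk above (not quoted verbatim from the tree), the slow-path
poller picks between the poll lock and disabling the interrupt depending
on the queue type:

static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
{
	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
	u16 start, end;
	int found;

	/*
	 * For a poll queue we need to protect against the polling thread
	 * using the CQ lock.  For normal interrupt driven threads we have
	 * to disable the interrupt to avoid racing with it.
	 */
	if (nvmeq->cq_vector == -1)
		spin_lock(&nvmeq->cq_poll_lock);
	else
		disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
	found = nvme_process_cq(nvmeq, &start, &end, tag);
	if (nvmeq->cq_vector == -1)
		spin_unlock(&nvmeq->cq_poll_lock);
	else
		enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));

	nvme_complete_cqes(nvmeq, start, end);
	return found;
}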

* [PATCH 09/13] nvme-rdma: remove I/O polling support
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

The code was always a bit of a hack that dug far too much into
RDMA core internals.  Let's kick it out and reimplement proper
dedicated poll queues as needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/rdma.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ccfde6c7c0a5..a62e9f177c06 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1736,29 +1736,6 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_STS_IOERR;
 }
 
-static int nvme_rdma_poll(struct blk_mq_hw_ctx *hctx)
-{
-	struct nvme_rdma_queue *queue = hctx->driver_data;
-	struct ib_cq *cq = queue->ib_cq;
-	struct ib_wc wc;
-	int found = 0;
-
-	while (ib_poll_cq(cq, 1, &wc) > 0) {
-		struct ib_cqe *cqe = wc.wr_cqe;
-
-		if (cqe) {
-			if (cqe->done == nvme_rdma_recv_done) {
-				nvme_rdma_recv_done(cq, &wc);
-				found++;
-			} else {
-				cqe->done(cq, &wc);
-			}
-		}
-	}
-
-	return found;
-}
-
 static void nvme_rdma_complete_rq(struct request *rq)
 {
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
@@ -1780,7 +1757,6 @@ static const struct blk_mq_ops nvme_rdma_mq_ops = {
 	.init_request	= nvme_rdma_init_request,
 	.exit_request	= nvme_rdma_exit_request,
 	.init_hctx	= nvme_rdma_init_hctx,
-	.poll		= nvme_rdma_poll,
 	.timeout	= nvme_rdma_timeout,
 	.map_queues	= nvme_rdma_map_queues,
 };
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 10/13] nvme-mpath: remove I/O polling support
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

The ->poll_fn has been stale for a while, as a lot of places check for mq
ops.  But there is no real point in it anyway, as we don't even use
the multipath code for subsystems without multiple ports, which is usually
what we do high performance I/O to.  If it really becomes an issue we
should rework the nvme code to also skip the multipath code for any
private namespace, even if that could mean some trouble when rescanning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/multipath.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index ffebdd0ae34b..ec310b1b9267 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -220,21 +220,6 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
 	return ret;
 }
 
-static int nvme_ns_head_poll(struct request_queue *q, blk_qc_t qc, bool spin)
-{
-	struct nvme_ns_head *head = q->queuedata;
-	struct nvme_ns *ns;
-	int found = 0;
-	int srcu_idx;
-
-	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = srcu_dereference(head->current_path[numa_node_id()], &head->srcu);
-	if (likely(ns && nvme_path_is_optimized(ns)))
-		found = ns->queue->poll_fn(q, qc, spin);
-	srcu_read_unlock(&head->srcu, srcu_idx);
-	return found;
-}
-
 static void nvme_requeue_work(struct work_struct *work)
 {
 	struct nvme_ns_head *head =
@@ -281,7 +266,6 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 		goto out;
 	q->queuedata = head;
 	blk_queue_make_request(q, nvme_ns_head_make_request);
-	q->poll_fn = nvme_ns_head_poll;
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
 	/* set to a default value for 512 until disk is validated */
 	blk_queue_logical_block_size(q, 512);
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 11/13] block: remove ->poll_fn
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

This was intended to support users like nvme multipath, but is just
getting in the way and adding another indirect call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c       | 23 -----------------------
 block/blk-mq.c         | 24 +++++++++++++++++++-----
 include/linux/blkdev.h |  2 --
 3 files changed, 19 insertions(+), 30 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3f6f5e6c2fe4..942276399085 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1249,29 +1249,6 @@ blk_qc_t submit_bio(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio);
 
-/**
- * blk_poll - poll for IO completions
- * @q:  the queue
- * @cookie: cookie passed back at IO submission time
- * @spin: whether to spin for completions
- *
- * Description:
- *    Poll for completions on the passed in queue. Returns number of
- *    completed entries found. If @spin is true, then blk_poll will continue
- *    looping until at least one completion is found, unless the task is
- *    otherwise marked running (or we need to reschedule).
- */
-int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
-{
-	if (!q->poll_fn || !blk_qc_t_valid(cookie))
-		return 0;
-
-	if (current->plug)
-		blk_flush_plug_list(current->plug, false);
-	return q->poll_fn(q, cookie, spin);
-}
-EXPORT_SYMBOL_GPL(blk_poll);
-
 /**
  * blk_cloned_rq_check_limits - Helper function to check a cloned request
  *                              for new the queue limits
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7dcef565dc0f..9c90c5038d07 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -38,7 +38,6 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
-static int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
@@ -2823,8 +2822,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	spin_lock_init(&q->requeue_lock);
 
 	blk_queue_make_request(q, blk_mq_make_request);
-	if (q->mq_ops->poll)
-		q->poll_fn = blk_mq_poll;
 
 	/*
 	 * Do this after blk_queue_make_request() overrides it...
@@ -3385,14 +3382,30 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, hctx, rq);
 }
 
-static int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+/**
+ * blk_poll - poll for IO completions
+ * @q:  the queue
+ * @cookie: cookie passed back at IO submission time
+ * @spin: whether to spin for completions
+ *
+ * Description:
+ *    Poll for completions on the passed in queue. Returns number of
+ *    completed entries found. If @spin is true, then blk_poll will continue
+ *    looping until at least one completion is found, unless the task is
+ *    otherwise marked running (or we need to reschedule).
+ */
+int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
 	struct blk_mq_hw_ctx *hctx;
 	long state;
 
-	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!blk_qc_t_valid(cookie) ||
+	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		return 0;
 
+	if (current->plug)
+		blk_flush_plug_list(current->plug, false);
+
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
 
 	/*
@@ -3433,6 +3446,7 @@ static int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	__set_current_state(TASK_RUNNING);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(blk_poll);
 
 unsigned int blk_mq_rq_cpu(struct request *rq)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 08d940f85fa0..0b3874bdbc6a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -283,7 +283,6 @@ static inline unsigned short req_get_ioprio(struct request *req)
 struct blk_queue_ctx;
 
 typedef blk_qc_t (make_request_fn) (struct request_queue *q, struct bio *bio);
-typedef int (poll_q_fn) (struct request_queue *q, blk_qc_t, bool spin);
 
 struct bio_vec;
 typedef int (dma_drain_needed_fn)(struct request *);
@@ -401,7 +400,6 @@ struct request_queue {
 	struct rq_qos		*rq_qos;
 
 	make_request_fn		*make_request_fn;
-	poll_q_fn		*poll_fn;
 	dma_drain_needed_fn	*dma_drain_needed;
 
 	const struct blk_mq_ops	*mq_ops;
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread
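
With the indirection gone, polling callers invoke blk_poll() on the mq
request_queue directly.  A hypothetical synchronous caller, modeled
loosely on the direct I/O polling loop; the bio, the done flag and the
queue variable here are illustrative and not lifted from this series:

	blk_qc_t qc;
	bool done = false;	/* set to true by the bio's end_io handler */

	bio->bi_opf |= REQ_HIPRI;	/* route to a poll queue if possible */
	qc = submit_bio(bio);

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (READ_ONCE(done))
			break;
		/* spin == true: keep spinning until a completion is found */
		if (!blk_poll(q, qc, true))
			io_schedule();
	}
	__set_current_state(TASK_RUNNING);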

* [PATCH 12/13] block: only allow polling if a poll queue_map exists
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

This avoids having to have different mq_ops for setups with or
without poll queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-sysfs.c       |  2 +-
 drivers/nvme/host/pci.c | 29 +++++++++--------------------
 2 files changed, 10 insertions(+), 21 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9f0cb370b39b..bb7c642ffefa 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -402,7 +402,7 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	unsigned long poll_on;
 	ssize_t ret;
 
-	if (!q->mq_ops || !q->mq_ops->poll)
+	if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL)
 		return -EINVAL;
 
 	ret = queue_var_store(&poll_on, page, count);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4ccb4ea22ac6..7732c4979a4e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1602,22 +1602,15 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 	.timeout	= nvme_timeout,
 };
 
-#define NVME_SHARED_MQ_OPS					\
-	.queue_rq		= nvme_queue_rq,		\
-	.commit_rqs		= nvme_commit_rqs,		\
-	.complete		= nvme_pci_complete_rq,		\
-	.init_hctx		= nvme_init_hctx,		\
-	.init_request		= nvme_init_request,		\
-	.map_queues		= nvme_pci_map_queues,		\
-	.timeout		= nvme_timeout			\
-
 static const struct blk_mq_ops nvme_mq_ops = {
-	NVME_SHARED_MQ_OPS,
-};
-
-static const struct blk_mq_ops nvme_mq_poll_ops = {
-	NVME_SHARED_MQ_OPS,
-	.poll			= nvme_poll,
+	.queue_rq	= nvme_queue_rq,
+	.complete	= nvme_pci_complete_rq,
+	.commit_rqs	= nvme_commit_rqs,
+	.init_hctx	= nvme_init_hctx,
+	.init_request	= nvme_init_request,
+	.map_queues	= nvme_pci_map_queues,
+	.timeout	= nvme_timeout,
+	.poll		= nvme_poll,
 };
 
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
@@ -2304,11 +2297,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	int ret;
 
 	if (!dev->ctrl.tagset) {
-		if (dev->io_queues[HCTX_TYPE_POLL])
-			dev->tagset.ops = &nvme_mq_poll_ops;
-		else
-			dev->tagset.ops = &nvme_mq_ops;
-
+		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
 		dev->tagset.nr_maps = HCTX_MAX_TYPES;
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread
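
The practical consequence is that ->poll is now always wired up in
nvme_mq_ops, and whether polling is actually usable is decided by the
queue maps instead; the sysfs io_poll attribute is gated on the same
condition.  Condensed from the hunks above, with comments added:

	/* nvme_dev_add(): one ops table for all setups, ->poll included */
	dev->tagset.ops = &nvme_mq_ops;
	dev->tagset.nr_maps = HCTX_MAX_TYPES;

	/* queue_poll_store(): refuse to enable io_poll without a poll map */
	if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL)
		return -EINVAL;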

* [PATCH 13/13] block: enable polling by default if a poll map is initialized
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-02 16:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-02 16:46 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

If the user did set up polling in the driver we should not require
another knob in the block layer to enable it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9c90c5038d07..a550a00ac00c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2811,6 +2811,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	q->tag_set = set;
 
 	q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
+	if (set->nr_maps > HCTX_TYPE_POLL)
+		blk_queue_flag_set(QUEUE_FLAG_POLL, q);
 
 	if (!(set->flags & BLK_MQ_F_SG_MERGE))
 		blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread
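
The effect is that the QUEUE_FLAG_POLL test blk_poll() performs now
passes out of the box whenever the driver registered a poll queue map.
Condensed from this hunk and the blk_poll() hunk in patch 11, with
comments added:

	/* blk_mq_init_allocated_queue(): polling defaults on ... */
	if (set->nr_maps > HCTX_TYPE_POLL)
		blk_queue_flag_set(QUEUE_FLAG_POLL, q);

	/* ... which is exactly what blk_poll() gates on before polling */
	if (!blk_qc_t_valid(cookie) ||
	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
		return 0;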

* Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-03 18:15     ` Keith Busch
  -1 siblings, 0 replies; 92+ messages in thread
From: Keith Busch @ 2018-12-03 18:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Max Gurtovoy, linux-nvme, linux-block

On Sun, Dec 02, 2018 at 08:46:22AM -0800, Christoph Hellwig wrote:
> This is the last place outside of nvme_irq that handles CQEs from
> interrupt context, and thus is in the way of removing the cq_lock for
> normal queues, and avoiding lockdep warnings on the poll queues, for
> which we already take it without IRQ disabling.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 10/13] nvme-mpath: remove I/O polling support
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-03 18:22     ` Keith Busch
  -1 siblings, 0 replies; 92+ messages in thread
From: Keith Busch @ 2018-12-03 18:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Max Gurtovoy, linux-nvme, linux-block

On Sun, Dec 02, 2018 at 08:46:25AM -0800, Christoph Hellwig wrote:
> The ->poll_fn has been stale for a while, as a lot of places check for mq
> ops.  But there is no real point in it anyway, as we don't even use
> the multipath code for subsystems without multiple ports, which is usually
> what we do high performance I/O to.  If it really becomes an issue we
> should rework the nvme code to also skip the multipath code for any
> private namespace, even if that could mean some trouble when rescanning.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

This was a bit flawed anyway since the head's current path could change,
and you end up polling the wrong request_queue. Not really harmful other
than some wasted CPU cycles, but might be worth thinking about if we
want to bring mpath polling back.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 04/13] nvme-pci: only allow polling with separate poll queues
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-03 18:23     ` Keith Busch
  -1 siblings, 0 replies; 92+ messages in thread
From: Keith Busch @ 2018-12-03 18:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Max Gurtovoy, linux-nvme, linux-block

On Sun, Dec 02, 2018 at 08:46:19AM -0800, Christoph Hellwig wrote:
> This will allow us to simplify both the regular NVMe interrupt handler
> and the upcoming aio poll code.  In addition to that the separate
> queues are generally a good idea for performance reasons.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 01/13] block: move queues types to the block layer
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  0:49     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  0:49 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block



On 12/2/18 8:46 AM, Christoph Hellwig wrote:
> Having another indirect call in the fast path doesn't really help
> in our post-spectre world.  Also having too many queue types is just
> going to create confusion, so I'd rather manage them centrally.
> 
> Note that the queue type naming and ordering changes a bit - the
> first index now is the default queue for everything not explicitly
> marked, the optional ones are read and poll queues.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-sysfs.c    |  9 +++++-
>   block/blk-mq.h          | 21 +++++++------
>   drivers/nvme/host/pci.c | 68 +++++++++++++++--------------------------
>   include/linux/blk-mq.h  | 15 ++++-----
>   4 files changed, 51 insertions(+), 62 deletions(-)
> 
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 6efef1f679f0..9c2df137256a 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -173,9 +173,16 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>   	return ret;
>   }
>   
> +static const char *const hctx_types[] = {
> +	[HCTX_TYPE_DEFAULT]	= "default",
> +	[HCTX_TYPE_READ]	= "read",
> +	[HCTX_TYPE_POLL]	= "poll",
> +};
> +
>   static ssize_t blk_mq_hw_sysfs_type_show(struct blk_mq_hw_ctx *hctx, char *page)
>   {
> -	return sprintf(page, "%u\n", hctx->type);
> +	BUILD_BUG_ON(ARRAY_SIZE(hctx_types) != HCTX_MAX_TYPES);
> +	return sprintf(page, "%s\n", hctx_types[hctx->type]);
>   }
>   
>   static struct attribute *default_ctx_attrs[] = {
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 7291e5379358..a664ea44ffd4 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -81,16 +81,14 @@ extern int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int);
>   /*
>    * blk_mq_map_queue_type() - map (hctx_type,cpu) to hardware queue
>    * @q: request queue
> - * @hctx_type: the hctx type index
> + * @type: the hctx type index
>    * @cpu: CPU
>    */
>   static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
> -							  unsigned int hctx_type,
> +							  enum hctx_type type,
>   							  unsigned int cpu)
>   {
> -	struct blk_mq_tag_set *set = q->tag_set;
> -
> -	return q->queue_hw_ctx[set->map[hctx_type].mq_map[cpu]];
> +	return q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]];
>   }
>   
>   /*
> @@ -103,12 +101,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>   						     unsigned int flags,
>   						     unsigned int cpu)
>   {
> -	int hctx_type = 0;
> +	enum hctx_type type = HCTX_TYPE_DEFAULT;
> +
> +	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
> +	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
> +		type = HCTX_TYPE_POLL;
>   
> -	if (q->mq_ops->rq_flags_to_type)
> -		hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
> +	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
> +		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
> +		type = HCTX_TYPE_READ;

Nit, there seems to be an extra newline that can be omitted here before
the else if statement (if I'm reading this correctly)...

Otherwise looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread
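
For reference, with the stray blank line Sagi points out removed, the
type selection in the quoted blk_mq_map_queue() hunk would read as
follows (only the lines visible in the quote; the rest of the function
is unchanged):

	enum hctx_type type = HCTX_TYPE_DEFAULT;

	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
		type = HCTX_TYPE_POLL;
	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
		type = HCTX_TYPE_READ;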

* Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  0:54     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  0:54 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block


> @@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>   
>   	if (nr_io_queues == 0)
>   		return 0;
> +	
> +	clear_bit(NVMEQ_ENABLED, &adminq->flags);
>   

This is a change of behavior, but it looks correct, as we can fail
nvme_setup_irqs after we have freed the admin vector.  Needs
documentation, though.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 03/13] nvme-pci: cleanup SQ allocation a bit
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  0:55     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  0:55 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 04/13] nvme-pci: only allow polling with separate poll queues
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  0:56     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  0:56 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  0:58     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  0:58 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block


> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)

Do we still need to carry the tag around?

Other than that,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  1:00     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:00 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block


> @@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>   	nvme_stop_queues(&dev->ctrl);
>   
>   	if (!dead && dev->ctrl.queue_count > 0) {
> -		nvme_disable_io_queues(dev);
> +		if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
> +			nvme_disable_io_queues(dev, nvme_admin_delete_cq);

It would be nice if the opcode change were kept inside but still
split, like:

static void nvme_disable_io_queues(struct nvme_dev *dev)
{
	if (__nvme_disable_io_queues(dev, nvme_admin_delete_sq))
		__nvme_disable_io_queues(dev, nvme_admin_delete_cq);
}

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  1:05     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:05 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 08/13] nvme-pci: remove the CQ lock for interrupt driven queues
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  1:08     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:08 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block


> Now that we can't poll regular, interrupt driven I/O queues there
> is almost nothing that can race with an interrupt.  The only
> possible other contexts polling a CQ are the error handler and
> queue shutdown, and both are so far off in the slow path that
> we can simply use the big hammer of disabling interrupts.
> 
> With that we can stop taking the cq_lock for normal queues.

Nice,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
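
As a rough sketch of the "big hammer" described above (assumed helper names,
not the actual driver code), the rare slow-path callers disable the queue's
interrupt line around the CQ reap instead of taking a lock:

#include <linux/interrupt.h>
#include <linux/pci.h>

struct nvme_queue;				/* opaque here; sketch only */
int process_cq(struct nvme_queue *nvmeq, unsigned int tag);	/* hypothetical */

static int slow_path_poll_cq(struct pci_dev *pdev, int cq_vector,
			     struct nvme_queue *nvmeq, unsigned int tag)
{
	int found;

	/* block the only other CQ consumer: the queue's own hard IRQ handler */
	disable_irq(pci_irq_vector(pdev, cq_vector));
	found = process_cq(nvmeq, tag);
	enable_irq(pci_irq_vector(pdev, cq_vector));

	return found;
}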

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 08/13] nvme-pci: remove the CQ lock for interrupt driven queues
@ 2018-12-04  1:08     ` Sagi Grimberg
  0 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:08 UTC (permalink / raw)



> Now that we can't poll regular, interrupt driven I/O queues there
> is almost nothing that can race with an interrupt.  The only
> possible other contexts polling a CQ are the error handler and
> queue shutdown, and both are so far off in the slow path that
> we can simply use the big hammer of disabling interrupts.
> 
> With that we can stop taking the cq_lock for normal queues.

Nice,

Reviewed-by: Sagi Grimberg <sagi at grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 10/13] nvme-mpath: remove I/O polling support
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  1:11     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:11 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block

> If it really becomes an issue we
> should rework the nvme code to also skip the multipath code for any
> private namespace, even if that could mean some trouble when rescanning.
>

This requires some explanation: skip the multipath code how?

Other than that,
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 10/13] nvme-mpath: remove I/O polling support
@ 2018-12-04  1:11     ` Sagi Grimberg
  0 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:11 UTC (permalink / raw)


> If it really becomes an issue we
> should rework the nvme code to also skip the multipath code for any
> private namespace, even if that could mean some trouble when rescanning.
>

This requires some explanation: skip the multipath code how?

Other than that,
Reviewed-by: Sagi Grimberg <sagi at grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 11/13] block: remove ->poll_fn
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  1:11     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:11 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 11/13] block: remove ->poll_fn
@ 2018-12-04  1:11     ` Sagi Grimberg
  0 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:11 UTC (permalink / raw)


Reviewed-by: Sagi Grimberg <sagi at grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 12/13] block: only allow polling if a poll queue_map exists
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  1:14     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:14 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 12/13] block: only allow polling if a poll queue_map exists
@ 2018-12-04  1:14     ` Sagi Grimberg
  0 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:14 UTC (permalink / raw)


Reviewed-by: Sagi Grimberg <sagi at grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 13/13] block: enable polling by default if a poll map is initalized
  2018-12-02 16:46   ` Christoph Hellwig
@ 2018-12-04  1:14     ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:14 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Max Gurtovoy, linux-nvme, linux-block

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 13/13] block: enable polling by default if a poll map is initalized
@ 2018-12-04  1:14     ` Sagi Grimberg
  0 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04  1:14 UTC (permalink / raw)


Reviewed-by: Sagi Grimberg <sagi at grimberg.me>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 01/13] block: move queues types to the block layer
  2018-12-04  0:49     ` Sagi Grimberg
@ 2018-12-04 15:00       ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:00 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Max Gurtovoy,
	linux-nvme, linux-block

On Mon, Dec 03, 2018 at 04:49:56PM -0800, Sagi Grimberg wrote:
>> @@ -103,12 +101,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>>   						     unsigned int flags,
>>   						     unsigned int cpu)
>>   {
>> -	int hctx_type = 0;
>> +	enum hctx_type type = HCTX_TYPE_DEFAULT;
>> +
>> +	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
>> +	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
>> +		type = HCTX_TYPE_POLL;
>>   -	if (q->mq_ops->rq_flags_to_type)
>> -		hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
>> +	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
>> +		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
>> +		type = HCTX_TYPE_READ;
>
> Nit, there seems to be an extra newline that can be omitted here before
> the else if statement (if I'm reading this correctly)...

Empty lines can always be omitted, but in this case I actually like it
as it seems to help readability.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 01/13] block: move queues types to the block layer
@ 2018-12-04 15:00       ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:00 UTC (permalink / raw)


On Mon, Dec 03, 2018@04:49:56PM -0800, Sagi Grimberg wrote:
>> @@ -103,12 +101,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>>   						     unsigned int flags,
>>   						     unsigned int cpu)
>>   {
>> -	int hctx_type = 0;
>> +	enum hctx_type type = HCTX_TYPE_DEFAULT;
>> +
>> +	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
>> +	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
>> +		type = HCTX_TYPE_POLL;
>>   -	if (q->mq_ops->rq_flags_to_type)
>> -		hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
>> +	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
>> +		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
>> +		type = HCTX_TYPE_READ;
>
> Nit, there seems to be an extra newline that can be omitted here before
> the else if statement (if I'm reading this correctly)...

Empty lines can always be omitted, but in this case I actually like it
as it seems to help readability.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled
  2018-12-04  0:54     ` Sagi Grimberg
@ 2018-12-04 15:04       ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:04 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Max Gurtovoy,
	linux-nvme, linux-block

On Mon, Dec 03, 2018 at 04:54:15PM -0800, Sagi Grimberg wrote:
>
>> @@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>>     	if (nr_io_queues == 0)
>>   		return 0;
>> +	
>> +	clear_bit(NVMEQ_ENABLED, &adminq->flags);
>>   
>
> This is a change of behavior, looks correct though as we can fail
> nvme_setup_irqs after we freed the admin vector. Needs documentation 
> though..

I have a hard time parsing the above; can you point to where the problem
is?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled
@ 2018-12-04 15:04       ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:04 UTC (permalink / raw)


On Mon, Dec 03, 2018@04:54:15PM -0800, Sagi Grimberg wrote:
>
>> @@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>>     	if (nr_io_queues == 0)
>>   		return 0;
>> +	
>> +	clear_bit(NVMEQ_ENABLED, &adminq->flags);
>>   
>
> This is a change of behavior, looks correct though as we can fail
> nvme_setup_irqs after we freed the admin vector. Needs documentation 
> though..

I have a hard time parsing the above; can you point to where the problem
is?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
  2018-12-04  0:58     ` Sagi Grimberg
@ 2018-12-04 15:04       ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:04 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Max Gurtovoy,
	linux-nvme, linux-block

On Mon, Dec 03, 2018 at 04:58:25PM -0800, Sagi Grimberg wrote:
>
>> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
>
> Do we still need to carry the tag around?

Yes, the timeout handler polls for a specific tag.
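
To make the tag's role concrete, here is a minimal sketch of a timeout handler
using tag-aware polling (assumed names and return convention, not the actual
driver code): if the tag is found on the CQ the interrupt was merely lost,
otherwise the command is genuinely stuck.

#include <linux/blk-mq.h>

struct nvme_queue;				/* opaque here; sketch only */
/* hypothetical: returns non-zero if the given tag was reaped from the CQ */
int poll_cq_for_tag(struct nvme_queue *nvmeq, unsigned int tag);
struct nvme_queue *req_to_nvmeq(struct request *req);	/* hypothetical */

static enum blk_eh_timer_return sketch_timeout(struct request *req)
{
	struct nvme_queue *nvmeq = req_to_nvmeq(req);

	/* lost interrupt: the poll already completed the request */
	if (poll_cq_for_tag(nvmeq, req->tag))
		return BLK_EH_DONE;

	/* no completion found; real recovery (abort/reset) is elided here */
	return BLK_EH_RESET_TIMER;
}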

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
@ 2018-12-04 15:04       ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:04 UTC (permalink / raw)


On Mon, Dec 03, 2018@04:58:25PM -0800, Sagi Grimberg wrote:
>
>> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
>
> Do we still need to carry the tag around?

Yes, the timeout handler polls for a specific tag.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues
  2018-12-04  1:00     ` Sagi Grimberg
@ 2018-12-04 15:05       ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:05 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Max Gurtovoy,
	linux-nvme, linux-block

On Mon, Dec 03, 2018 at 05:00:59PM -0800, Sagi Grimberg wrote:
>
>> @@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>   	nvme_stop_queues(&dev->ctrl);
>>     	if (!dead && dev->ctrl.queue_count > 0) {
>> -		nvme_disable_io_queues(dev);
>> +		if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
>> +			
>
> Would be nice if the opcode change would be kept inside but still
> split like:

I actually like not having another wrapper to step through.

Keith, Jens, any preference?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues
@ 2018-12-04 15:05       ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:05 UTC (permalink / raw)


On Mon, Dec 03, 2018@05:00:59PM -0800, Sagi Grimberg wrote:
>
>> @@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>   	nvme_stop_queues(&dev->ctrl);
>>     	if (!dead && dev->ctrl.queue_count > 0) {
>> -		nvme_disable_io_queues(dev);
>> +		if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
>> +			
>
> Would be nice if the opcode change would be kept inside but still
> split like:

I actually like not having another wrapper to step through.

Keith, Jens, any preference?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 10/13] nvme-mpath: remove I/O polling support
  2018-12-04  1:11     ` Sagi Grimberg
@ 2018-12-04 15:07       ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:07 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Max Gurtovoy,
	linux-nvme, linux-block

On Mon, Dec 03, 2018 at 05:11:43PM -0800, Sagi Grimberg wrote:
>> If it really becomes an issue we
>> should rework the nvme code to also skip the multipath code for any
>> private namespace, even if that could mean some trouble when rescanning.
>>
>
> This requires some explanation? skip the multipath code how?

We currently always go through the multipath node as long as the
controller is multipath capable.  If we care about e.g. polling
on a private namespace on a dual ported U.2 drive we'd have to make
sure we don't go through the multipath device node for private namespaces
that can only have one path, and only use it for shared namespaces.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 10/13] nvme-mpath: remove I/O polling support
@ 2018-12-04 15:07       ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-12-04 15:07 UTC (permalink / raw)


On Mon, Dec 03, 2018@05:11:43PM -0800, Sagi Grimberg wrote:
>> If it really becomes an issue we
>> should rework the nvme code to also skip the multipath code for any
>> private namespace, even if that could mean some trouble when rescanning.
>>
>
> This requires some explanation? skip the multipath code how?

We currently always go through the multipath node as long as the
controller is multipath capable.  If we care about e.g. polling
on a private namespace on a dual ported U.2 drive we'd have to make
sure we don't go through the multipath device node for private namespaces
that can only have one path, and only use it for shared namespaces.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 01/13] block: move queues types to the block layer
  2018-12-04 15:00       ` Christoph Hellwig
@ 2018-12-04 17:08         ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04 17:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Max Gurtovoy, linux-nvme, linux-block


>> Nit, there seems to be an extra newline that can be omitted here before
>> the else if statement (if I'm reading this correctly)...
> 
> Empty lines can always be ommited, but in this case I actually like it
> as it seems to help readability..

If you think its useful I'm fine with it as is...

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 01/13] block: move queues types to the block layer
@ 2018-12-04 17:08         ` Sagi Grimberg
  0 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04 17:08 UTC (permalink / raw)



>> Nit, there seems to be an extra newline that can be omitted here before
>> the else if statement (if I'm reading this correctly)...
> 
> Empty lines can always be ommited, but in this case I actually like it
> as it seems to help readability..

If you think its useful I'm fine with it as is...

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled
  2018-12-04 15:04       ` Christoph Hellwig
@ 2018-12-04 17:11         ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04 17:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Max Gurtovoy, linux-nvme, linux-block


>>> @@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>>>      	if (nr_io_queues == 0)
>>>    		return 0;
>>> +	
>>> +	clear_bit(NVMEQ_ENABLED, &adminq->flags);
>>>    
>>
>> This is a change of behavior, looks correct though as we can fail
>> nvme_setup_irqs after we freed the admin vector. Needs documentation
>> though..
> 
> I have a hard time parsing the above, can you point to where the problem
> is?

This flag replaces cq_vector = -1 for indicating the queue state, so I'd
expect it not to appear where the former didn't. In this case
we clear the flag in a place where we did not set cq_vector
before, which nevertheless seems correct to me.
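
Spelled out as a sketch (assumed names, not the driver code), the pattern
under review tracks per-queue state with atomic bitops on a flags word
instead of overloading cq_vector == -1:

#include <linux/bitops.h>
#include <linux/types.h>

#define NVMEQ_ENABLED	0		/* bit number, mirroring the patch */

struct queue_state_sketch {
	unsigned long flags;
};

static void queue_mark_enabled(struct queue_state_sketch *q)
{
	set_bit(NVMEQ_ENABLED, &q->flags);	/* atomic, no lock needed */
}

static void queue_mark_disabled(struct queue_state_sketch *q)
{
	/* e.g. before tearing down or re-allocating interrupt vectors */
	clear_bit(NVMEQ_ENABLED, &q->flags);
}

static bool queue_is_enabled(struct queue_state_sketch *q)
{
	return test_bit(NVMEQ_ENABLED, &q->flags);
}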

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled
@ 2018-12-04 17:11         ` Sagi Grimberg
  0 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04 17:11 UTC (permalink / raw)



>>> @@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>>>      	if (nr_io_queues == 0)
>>>    		return 0;
>>> +	
>>> +	clear_bit(NVMEQ_ENABLED, &adminq->flags);
>>>    
>>
>> This is a change of behavior, looks correct though as we can fail
>> nvme_setup_irqs after we freed the admin vector. Needs documentation
>> though..
> 
> I have a hard time parsing the above, can you point to where the problem
> is?

This flag replaces cq_vector = -1 for indicating the queue state, so I'd
expect it not to appear where the former didn't. In this case
we clear the flag in a place where we did not set cq_vector
before, which nevertheless seems correct to me.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
  2018-12-04 15:04       ` Christoph Hellwig
@ 2018-12-04 17:13         ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04 17:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Max Gurtovoy, linux-nvme, linux-block


>>> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
>>
>> Do we still need to carry the tag around?
> 
> Yes, the timeout handler polls for a specific tag.

Does it have to? The documentation suggests that we missed
an interrupt, so the completion is probably sitting on the completion queue.

I'd say it would be better if the tag search were implemented
in the timeout handler alone, so we don't have to pass the tag around
for every caller... Thoughts?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
@ 2018-12-04 17:13         ` Sagi Grimberg
  0 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04 17:13 UTC (permalink / raw)



>>> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
>>
>> Do we still need to carry the tag around?
> 
> Yes, the timeout handler polls for a specific tag.

Does it have to? The documentation suggests that we missed
an interrupt, so the completion is probably sitting on the completion queue.

I'd say it would be better if the tag search were implemented
in the timeout handler alone, so we don't have to pass the tag around
for every caller... Thoughts?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 10/13] nvme-mpath: remove I/O polling support
  2018-12-04 15:07       ` Christoph Hellwig
@ 2018-12-04 17:18         ` Sagi Grimberg
  -1 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04 17:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Max Gurtovoy, linux-nvme, linux-block


>>> If it really becomes an issue we
>>> should rework the nvme code to also skip the multipath code for any
>>> private namespace, even if that could mean some trouble when rescanning.
>>>
>>
>> This requires some explanation? skip the multipath code how?
> 
> We currently always go through the multipath node as long the the
> controller is multipath capable.  If we care about e.g. polling
> on a private namespace on a dual ported U.2 drive we'd have to make
> sure we don't go through the multipath device node for private namespaces
> that can only have one path, but only for shared namespaces.

But we'd still use the multipath node for shared namespaces (and also
polling if needed). I agree that private namespaces can skip the
multipath node.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 10/13] nvme-mpath: remove I/O polling support
@ 2018-12-04 17:18         ` Sagi Grimberg
  0 siblings, 0 replies; 92+ messages in thread
From: Sagi Grimberg @ 2018-12-04 17:18 UTC (permalink / raw)



>>> If it really becomes an issue we
>>> should rework the nvme code to also skip the multipath code for any
>>> private namespace, even if that could mean some trouble when rescanning.
>>>
>>
>> This requires some explanation? skip the multipath code how?
> 
> We currently always go through the multipath node as long the the
> controller is multipath capable.  If we care about e.g. polling
> on a private namespace on a dual ported U.2 drive we'd have to make
> sure we don't go through the multipath device node for private namespaces
> that can only have one path, but only for shared namespaces.

But we'd still use the multipath node for shared namespaces (and also
polling if needed). I agree that private namespaces can skip the
multipath node.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
  2018-12-04 17:13         ` Sagi Grimberg
@ 2018-12-04 18:19           ` Jens Axboe
  -1 siblings, 0 replies; 92+ messages in thread
From: Jens Axboe @ 2018-12-04 18:19 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: Keith Busch, Max Gurtovoy, linux-nvme, linux-block

On 12/4/18 10:13 AM, Sagi Grimberg wrote:
> 
>>>> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
>>>
>>> Do we still need to carry the tag around?
>>
>> Yes, the timeout handler polls for a specific tag.
> 
> Does it have to? the documentation suggests that we missed
> an interrupt, so it is probably waiting on the completion queue.
> 
> I'd say that it'd be better if the tag search would be implemented
> on the timeout handler alone so we don't have to pass the tag around
> for everyone... Thoughts?

Without that you don't know if that's the request that completed. You
have to be able to look that up from the timeout handler.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
@ 2018-12-04 18:19           ` Jens Axboe
  0 siblings, 0 replies; 92+ messages in thread
From: Jens Axboe @ 2018-12-04 18:19 UTC (permalink / raw)


On 12/4/18 10:13 AM, Sagi Grimberg wrote:
> 
>>>> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
>>>
>>> Do we still need to carry the tag around?
>>
>> Yes, the timeout handler polls for a specific tag.
> 
> Does it have to? the documentation suggests that we missed
> an interrupt, so it is probably waiting on the completion queue.
> 
> I'd say that it'd be better if the tag search would be implemented
> on the timeout handler alone so we don't have to pass the tag around
> for everyone... Thoughts?

Without that you don't know if that's the request that completed. You
have to be able to look that up from the timeout handler.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues
  2018-12-04 15:05       ` Christoph Hellwig
@ 2018-12-04 18:19         ` Jens Axboe
  -1 siblings, 0 replies; 92+ messages in thread
From: Jens Axboe @ 2018-12-04 18:19 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Max Gurtovoy, linux-nvme, linux-block

On 12/4/18 8:05 AM, Christoph Hellwig wrote:
> On Mon, Dec 03, 2018 at 05:00:59PM -0800, Sagi Grimberg wrote:
>>
>>> @@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>>   	nvme_stop_queues(&dev->ctrl);
>>>     	if (!dead && dev->ctrl.queue_count > 0) {
>>> -		nvme_disable_io_queues(dev);
>>> +		if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
>>> +			
>>
>> Would be nice if the opcode change would be kept inside but still
>> split like:
> 
> I actually like not having another wrapper to stop through..
> 
> Keith, Jens, any preference?

Fine either way, prefer not having a wrapper.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues
@ 2018-12-04 18:19         ` Jens Axboe
  0 siblings, 0 replies; 92+ messages in thread
From: Jens Axboe @ 2018-12-04 18:19 UTC (permalink / raw)


On 12/4/18 8:05 AM, Christoph Hellwig wrote:
> On Mon, Dec 03, 2018@05:00:59PM -0800, Sagi Grimberg wrote:
>>
>>> @@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>>   	nvme_stop_queues(&dev->ctrl);
>>>     	if (!dead && dev->ctrl.queue_count > 0) {
>>> -		nvme_disable_io_queues(dev);
>>> +		if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
>>> +			
>>
>> Would be nice if the opcode change would be kept inside but still
>> split like:
> 
> I actually like not having another wrapper to stop through..
> 
> Keith, Jens, any preference?

Fine either way, prefer not having a wrapper.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: block and nvme polling improvements V3
  2018-12-02 16:46 ` Christoph Hellwig
@ 2018-12-04 18:40   ` Jens Axboe
  -1 siblings, 0 replies; 92+ messages in thread
From: Jens Axboe @ 2018-12-04 18:40 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

On 12/2/18 9:46 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series optimizes a few bits in the block layer and nvme code
> related to polling.
> 
> It starts by moving the queue types recently introduce entirely into
> the block layer instead of requiring an indirect call for them.
> 
> It then switches nvme and the block layer to only allow polling
> with separate poll queues, which allows us to realize the following
> benefits:
> 
>  - poll queues can safely avoid disabling irqs on any locks
>    (we already do that in NVMe, but it isn't 100% kosher as-is)
>  - regular interrupt driven queues can drop the CQ lock entirely,
>    as we won't race for completing CQs
> 
> Then we drop the NVMe RDMA code, as it doesn't follow the new mode,
> and remove the nvme multipath polling code including the block hooks
> for it, which didn't make much sense to start with given that we
> started bypassing the multipath code for single controller subsystems
> early on.  Last but not least we enable polling in the block layer
> by default if the underlying driver has poll queues, as that already
> requires explicit user action.
> 
> Note that it would be really nice to have polling back for RDMA with
> dedicated poll queues, but that might take a while.  Also based on
> Jens' polling aio patches we could now implement a model in nvmet
> where we have a thread polling both the backend nvme device and
> the RDMA CQs, which might give us some pretty nice performace
> (I know Sagi looked into something similar a while ago).

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 92+ messages in thread

* block and nvme polling improvements V3
@ 2018-12-04 18:40   ` Jens Axboe
  0 siblings, 0 replies; 92+ messages in thread
From: Jens Axboe @ 2018-12-04 18:40 UTC (permalink / raw)


On 12/2/18 9:46 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series optimizes a few bits in the block layer and nvme code
> related to polling.
> 
> It starts by moving the queue types recently introduce entirely into
> the block layer instead of requiring an indirect call for them.
> 
> It then switches nvme and the block layer to only allow polling
> with separate poll queues, which allows us to realize the following
> benefits:
> 
>  - poll queues can safely avoid disabling irqs on any locks
>    (we already do that in NVMe, but it isn't 100% kosher as-is)
>  - regular interrupt driven queues can drop the CQ lock entirely,
>    as we won't race for completing CQs
> 
> Then we drop the NVMe RDMA code, as it doesn't follow the new mode,
> and remove the nvme multipath polling code including the block hooks
> for it, which didn't make much sense to start with given that we
> started bypassing the multipath code for single controller subsystems
> early on.  Last but not least we enable polling in the block layer
> by default if the underlying driver has poll queues, as that already
> requires explicit user action.
> 
> Note that it would be really nice to have polling back for RDMA with
> dedicated poll queues, but that might take a while.  Also based on
> Jens' polling aio patches we could now implement a model in nvmet
> where we have a thread polling both the backend nvme device and
> the RDMA CQs, which might give us some pretty nice performace
> (I know Sagi looked into something similar a while ago).

Applied, thanks.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
  2018-11-30  8:08       ` Christoph Hellwig
@ 2018-11-30 14:45         ` Keith Busch
  -1 siblings, 0 replies; 92+ messages in thread
From: Keith Busch @ 2018-11-30 14:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Max Gurtovoy, linux-nvme, linux-block

On Fri, Nov 30, 2018 at 12:08:09AM -0800, Christoph Hellwig wrote:
> On Thu, Nov 29, 2018 at 01:36:32PM -0700, Keith Busch wrote:
> > On Thu, Nov 29, 2018 at 08:13:04PM +0100, Christoph Hellwig wrote:
> > > +
> > > +		/* handle any remaining CQEs */
> > > +		if (opcode == nvme_admin_delete_cq &&
> > > +		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
> > > +			nvme_poll_irqdisable(nvmeq, -1);
> > 
> > We're dispatchig lots of queue deletions in parallel, and they may
> > complete in any order. I don't see how you can guarantee that the
> > wait_for_completion() will return for the nvmeq that you're polling.
> 
> True.  I thought about moving the completion to the queue so that
> we have one completion per queue, and I should have done that after
> all.  Note sure how I got the idea that not doing it is fine.

You may also move the completion polling into its own loop outside the
deletion loop.
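
One way to read that suggestion, sketched with assumed helper names
(illustrative only): keep the send/wait loop as it is and reap leftover CQEs
in a separate pass once nothing is in flight, so it no longer matters which
queue's deletion completed a given wait.

#include <linux/nvme.h>
#include <linux/types.h>

struct nvme_queue;			/* opaque here; sketch only */
/* hypothetical helpers standing in for the driver's real code */
bool send_and_wait_for_deletions(struct nvme_queue **queues, int nr, u8 opcode);
void reap_remaining_cqes(struct nvme_queue *q);

static bool disable_io_queues_sketch(struct nvme_queue **queues, int nr,
				     u8 opcode)
{
	int i;

	if (!send_and_wait_for_deletions(queues, nr, opcode))
		return false;

	/* completion polling in its own loop, outside the deletion loop */
	if (opcode == nvme_admin_delete_cq)
		for (i = 0; i < nr; i++)
			reap_remaining_cqes(queues[i]);

	return true;
}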

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
@ 2018-11-30 14:45         ` Keith Busch
  0 siblings, 0 replies; 92+ messages in thread
From: Keith Busch @ 2018-11-30 14:45 UTC (permalink / raw)


On Fri, Nov 30, 2018@12:08:09AM -0800, Christoph Hellwig wrote:
> On Thu, Nov 29, 2018@01:36:32PM -0700, Keith Busch wrote:
> > On Thu, Nov 29, 2018@08:13:04PM +0100, Christoph Hellwig wrote:
> > > +
> > > +		/* handle any remaining CQEs */
> > > +		if (opcode == nvme_admin_delete_cq &&
> > > +		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
> > > +			nvme_poll_irqdisable(nvmeq, -1);
> > 
> > We're dispatchig lots of queue deletions in parallel, and they may
> > complete in any order. I don't see how you can guarantee that the
> > wait_for_completion() will return for the nvmeq that you're polling.
> 
> True.  I thought about moving the completion to the queue so that
> we have one completion per queue, and I should have done that after
> all.  Note sure how I got the idea that not doing it is fine.

You may also move the completion polling into its own loop outside the
deletion loop.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
  2018-11-29 20:36     ` Keith Busch
@ 2018-11-30  8:08       ` Christoph Hellwig
  -1 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-11-30  8:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Max Gurtovoy,
	linux-nvme, linux-block

On Thu, Nov 29, 2018 at 01:36:32PM -0700, Keith Busch wrote:
> On Thu, Nov 29, 2018 at 08:13:04PM +0100, Christoph Hellwig wrote:
> > This is the last place outside of nvme_irq that handles CQEs from
> > interrupt context, and thus is in the way of removing the cq_lock for
> > normal queues, and avoiding lockdep warnings on the poll queues, for
> > which we already take it without IRQ disabling.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  drivers/nvme/host/pci.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 9ceba9900ca3..fb8db7d8170a 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -203,6 +203,7 @@ struct nvme_queue {
> >  	unsigned long flags;
> >  #define NVMEQ_ENABLED		0
> >  #define NVMEQ_SQ_CMB		1
> > +#define NVMEQ_DELETE_ERROR	2
> >  	u32 *dbbuf_sq_db;
> >  	u32 *dbbuf_cq_db;
> >  	u32 *dbbuf_sq_ei;
> > @@ -2216,7 +2217,7 @@ static void nvme_del_cq_end(struct request *req, blk_status_t error)
> >  	struct nvme_queue *nvmeq = req->end_io_data;
> >  
> >  	if (!error)
> > -		nvme_poll_irqdisable(nvmeq, -1);
> > +		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
> >  
> >  	nvme_del_queue_end(req, error);
> >  }
> > @@ -2258,11 +2259,20 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
> >  		nr_queues--;
> >  		sent++;
> >  	}
> > -	while (sent--) {
> > +	while (sent) {
> > +		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
> > +
> >  		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
> >  				timeout);
> >  		if (timeout == 0)
> >  			return false;
> > +
> > +		/* handle any remaining CQEs */
> > +		if (opcode == nvme_admin_delete_cq &&
> > +		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
> > +			nvme_poll_irqdisable(nvmeq, -1);
> 
> We're dispatchig lots of queue deletions in parallel, and they may
> complete in any order. I don't see how you can guarantee that the
> wait_for_completion() will return for the nvmeq that you're polling.

True.  I thought about moving the completion to the queue so that
we have one completion per queue, and I should have done that after
all.  Not sure how I got the idea that not doing it is fine.

> You also need to clear NVMEQ_DELETE_ERROR somewhere later, maybe in
> nvme_init_queue().

Indeed.
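
Sketched with assumed names (illustrative only), the direction agreed on
above amounts to one completion per queue plus clearing the flag when the
queue is (re)initialized:

#include <linux/bitops.h>
#include <linux/completion.h>

#define NVMEQ_DELETE_ERROR	2	/* bit number, mirroring the patch */

struct queue_delete_sketch {
	unsigned long flags;
	struct completion delete_done;	/* per-queue, not a device-wide ioq_wait */
};

/* end_io side of the delete command: wake the waiter for exactly this queue */
static void delete_cmd_done(struct queue_delete_sketch *q)
{
	complete(&q->delete_done);
}

/* queue (re)initialization: reset the flag and the completion */
static void queue_init_sketch(struct queue_delete_sketch *q)
{
	clear_bit(NVMEQ_DELETE_ERROR, &q->flags);
	init_completion(&q->delete_done);
}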

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
@ 2018-11-30  8:08       ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-11-30  8:08 UTC (permalink / raw)


On Thu, Nov 29, 2018@01:36:32PM -0700, Keith Busch wrote:
> On Thu, Nov 29, 2018@08:13:04PM +0100, Christoph Hellwig wrote:
> > This is the last place outside of nvme_irq that handles CQEs from
> > interrupt context, and thus is in the way of removing the cq_lock for
> > normal queues, and avoiding lockdep warnings on the poll queues, for
> > which we already take it without IRQ disabling.
> > 
> > Signed-off-by: Christoph Hellwig <hch at lst.de>
> > ---
> >  drivers/nvme/host/pci.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 9ceba9900ca3..fb8db7d8170a 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -203,6 +203,7 @@ struct nvme_queue {
> >  	unsigned long flags;
> >  #define NVMEQ_ENABLED		0
> >  #define NVMEQ_SQ_CMB		1
> > +#define NVMEQ_DELETE_ERROR	2
> >  	u32 *dbbuf_sq_db;
> >  	u32 *dbbuf_cq_db;
> >  	u32 *dbbuf_sq_ei;
> > @@ -2216,7 +2217,7 @@ static void nvme_del_cq_end(struct request *req, blk_status_t error)
> >  	struct nvme_queue *nvmeq = req->end_io_data;
> >  
> >  	if (!error)
> > -		nvme_poll_irqdisable(nvmeq, -1);
> > +		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
> >  
> >  	nvme_del_queue_end(req, error);
> >  }
> > @@ -2258,11 +2259,20 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
> >  		nr_queues--;
> >  		sent++;
> >  	}
> > -	while (sent--) {
> > +	while (sent) {
> > +		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
> > +
> >  		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
> >  				timeout);
> >  		if (timeout == 0)
> >  			return false;
> > +
> > +		/* handle any remaining CQEs */
> > +		if (opcode == nvme_admin_delete_cq &&
> > +		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
> > +			nvme_poll_irqdisable(nvmeq, -1);
> 
> We're dispatchig lots of queue deletions in parallel, and they may
> complete in any order. I don't see how you can guarantee that the
> wait_for_completion() will return for the nvmeq that you're polling.

True.  I thought about moving the completion to the queue so that
we have one completion per queue, and I should have done that after
all.  Not sure how I got the idea that not doing it is fine.

> You also need to clear NVMEQ_DELETE_ERROR somewhere later, maybe in
> nvme_init_queue().

Indeed.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
  2018-11-29 19:13   ` Christoph Hellwig
@ 2018-11-29 20:36     ` Keith Busch
  -1 siblings, 0 replies; 92+ messages in thread
From: Keith Busch @ 2018-11-29 20:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Max Gurtovoy, linux-nvme, linux-block

On Thu, Nov 29, 2018 at 08:13:04PM +0100, Christoph Hellwig wrote:
> This is the last place outside of nvme_irq that handles CQEs from
> interrupt context, and thus is in the way of removing the cq_lock for
> normal queues, and avoiding lockdep warnings on the poll queues, for
> which we already take it without IRQ disabling.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/pci.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 9ceba9900ca3..fb8db7d8170a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -203,6 +203,7 @@ struct nvme_queue {
>  	unsigned long flags;
>  #define NVMEQ_ENABLED		0
>  #define NVMEQ_SQ_CMB		1
> +#define NVMEQ_DELETE_ERROR	2
>  	u32 *dbbuf_sq_db;
>  	u32 *dbbuf_cq_db;
>  	u32 *dbbuf_sq_ei;
> @@ -2216,7 +2217,7 @@ static void nvme_del_cq_end(struct request *req, blk_status_t error)
>  	struct nvme_queue *nvmeq = req->end_io_data;
>  
>  	if (!error)
> -		nvme_poll_irqdisable(nvmeq, -1);
> +		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
>  
>  	nvme_del_queue_end(req, error);
>  }
> @@ -2258,11 +2259,20 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
>  		nr_queues--;
>  		sent++;
>  	}
> -	while (sent--) {
> +	while (sent) {
> +		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
> +
>  		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
>  				timeout);
>  		if (timeout == 0)
>  			return false;
> +
> +		/* handle any remaining CQEs */
> +		if (opcode == nvme_admin_delete_cq &&
> +		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
> +			nvme_poll_irqdisable(nvmeq, -1);

We're dispatching lots of queue deletions in parallel, and they may
complete in any order. I don't see how you can guarantee that the
wait_for_completion() will return for the nvmeq that you're polling.

You also need to clear NVMEQ_DELETE_ERROR somewhere later, maybe in
nvme_init_queue().

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
@ 2018-11-29 20:36     ` Keith Busch
  0 siblings, 0 replies; 92+ messages in thread
From: Keith Busch @ 2018-11-29 20:36 UTC (permalink / raw)


On Thu, Nov 29, 2018@08:13:04PM +0100, Christoph Hellwig wrote:
> This is the last place outside of nvme_irq that handles CQEs from
> interrupt context, and thus is in the way of removing the cq_lock for
> normal queues, and avoiding lockdep warnings on the poll queues, for
> which we already take it without IRQ disabling.
> 
> Signed-off-by: Christoph Hellwig <hch at lst.de>
> ---
>  drivers/nvme/host/pci.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 9ceba9900ca3..fb8db7d8170a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -203,6 +203,7 @@ struct nvme_queue {
>  	unsigned long flags;
>  #define NVMEQ_ENABLED		0
>  #define NVMEQ_SQ_CMB		1
> +#define NVMEQ_DELETE_ERROR	2
>  	u32 *dbbuf_sq_db;
>  	u32 *dbbuf_cq_db;
>  	u32 *dbbuf_sq_ei;
> @@ -2216,7 +2217,7 @@ static void nvme_del_cq_end(struct request *req, blk_status_t error)
>  	struct nvme_queue *nvmeq = req->end_io_data;
>  
>  	if (!error)
> -		nvme_poll_irqdisable(nvmeq, -1);
> +		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
>  
>  	nvme_del_queue_end(req, error);
>  }
> @@ -2258,11 +2259,20 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
>  		nr_queues--;
>  		sent++;
>  	}
> -	while (sent--) {
> +	while (sent) {
> +		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
> +
>  		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
>  				timeout);
>  		if (timeout == 0)
>  			return false;
> +
> +		/* handle any remaining CQEs */
> +		if (opcode == nvme_admin_delete_cq &&
> +		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
> +			nvme_poll_irqdisable(nvmeq, -1);

We're dispatching lots of queue deletions in parallel, and they may
complete in any order. I don't see how you can guarantee that the
wait_for_completion() will return for the nvmeq that you're polling.

You also need to clear NVMEQ_DELETE_ERROR somewhere later, maybe in
nvme_init_queue().

^ permalink raw reply	[flat|nested] 92+ messages in thread

* [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
  2018-11-29 19:12 block and nvme polling improvements V2 Christoph Hellwig
@ 2018-11-29 19:13   ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-11-29 19:13 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

This is the last place outside of nvme_irq that handles CQEs from
interrupt context, and thus is in the way of removing the cq_lock for
normal queues, and avoiding lockdep warnings on the poll queues, for
which we already take it without IRQ disabling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9ceba9900ca3..fb8db7d8170a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -203,6 +203,7 @@ struct nvme_queue {
 	unsigned long flags;
 #define NVMEQ_ENABLED		0
 #define NVMEQ_SQ_CMB		1
+#define NVMEQ_DELETE_ERROR	2
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -2216,7 +2217,7 @@ static void nvme_del_cq_end(struct request *req, blk_status_t error)
 	struct nvme_queue *nvmeq = req->end_io_data;
 
 	if (!error)
-		nvme_poll_irqdisable(nvmeq, -1);
+		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
 
 	nvme_del_queue_end(req, error);
 }
@@ -2258,11 +2259,20 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
 		nr_queues--;
 		sent++;
 	}
-	while (sent--) {
+	while (sent) {
+		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
+
 		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
 				timeout);
 		if (timeout == 0)
 			return false;
+
+		/* handle any remaining CQEs */
+		if (opcode == nvme_admin_delete_cq &&
+		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
+			nvme_poll_irqdisable(nvmeq, -1);
+
+		sent--;
 		if (nr_queues)
 			goto retry;
 	}
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
@ 2018-11-29 19:13   ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-11-29 19:13 UTC (permalink / raw)


This is the last place outside of nvme_irq that handles CQEs from
interrupt context, and thus is in the way of removing the cq_lock for
normal queues, and avoiding lockdep warnings on the poll queues, for
which we already take it without IRQ disabling.

Signed-off-by: Christoph Hellwig <hch at lst.de>
---
 drivers/nvme/host/pci.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 9ceba9900ca3..fb8db7d8170a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -203,6 +203,7 @@ struct nvme_queue {
 	unsigned long flags;
 #define NVMEQ_ENABLED		0
 #define NVMEQ_SQ_CMB		1
+#define NVMEQ_DELETE_ERROR	2
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -2216,7 +2217,7 @@ static void nvme_del_cq_end(struct request *req, blk_status_t error)
 	struct nvme_queue *nvmeq = req->end_io_data;
 
 	if (!error)
-		nvme_poll_irqdisable(nvmeq, -1);
+		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
 
 	nvme_del_queue_end(req, error);
 }
@@ -2258,11 +2259,20 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
 		nr_queues--;
 		sent++;
 	}
-	while (sent--) {
+	while (sent) {
+		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
+
 		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
 				timeout);
 		if (timeout == 0)
 			return false;
+
+		/* handle any remaining CQEs */
+		if (opcode == nvme_admin_delete_cq &&
+		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
+			nvme_poll_irqdisable(nvmeq, -1);
+
+		sent--;
 		if (nr_queues)
 			goto retry;
 	}
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
  2018-11-21 16:23 block and nvme polling improvements Christoph Hellwig
@ 2018-11-21 16:23   ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-11-21 16:23 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Max Gurtovoy, linux-nvme, linux-block

This is the last place outside of nvme_irq that handles CQEs from
interrupt context, and thus is in the way of removing the cq_lock for
normal queues, and avoiding lockdep warnings on the poll queues, for
which we already take it without IRQ disabling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 29072ad0a268..ef10279f4e58 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -202,6 +202,7 @@ struct nvme_queue {
 	unsigned long flags;
 #define NVMEQ_ENABLED		0
 #define NVMEQ_SQ_CMB		1
+#define NVMEQ_DELETE_ERROR	2
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -2185,7 +2186,7 @@ static void nvme_del_cq_end(struct request *req, blk_status_t error)
 	struct nvme_queue *nvmeq = req->end_io_data;
 
 	if (!error)
-		nvme_poll_irqdisable(nvmeq, -1);
+		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
 
 	nvme_del_queue_end(req, error);
 }
@@ -2227,11 +2228,20 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
 		nr_queues--;
 		sent++;
 	}
-	while (sent--) {
+	while (sent) {
+		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
+
 		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
 				timeout);
 		if (timeout == 0)
 			return false;
+
+		/* handle any remaining CQEs */
+		if (opcode == nvme_admin_delete_cq &&
+		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
+			nvme_poll_irqdisable(nvmeq, -1);
+
+		sent--;
 		if (nr_queues)
 			goto retry;
 	}
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 92+ messages in thread

* [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
@ 2018-11-21 16:23   ` Christoph Hellwig
  0 siblings, 0 replies; 92+ messages in thread
From: Christoph Hellwig @ 2018-11-21 16:23 UTC (permalink / raw)


This is the last place outside of nvme_irq that handles CQEs from
interrupt context, and thus is in the way of removing the cq_lock for
normal queues, and avoiding lockdep warnings on the poll queues, for
which we already take it without IRQ disabling.

Signed-off-by: Christoph Hellwig <hch at lst.de>
---
 drivers/nvme/host/pci.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 29072ad0a268..ef10279f4e58 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -202,6 +202,7 @@ struct nvme_queue {
 	unsigned long flags;
 #define NVMEQ_ENABLED		0
 #define NVMEQ_SQ_CMB		1
+#define NVMEQ_DELETE_ERROR	2
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -2185,7 +2186,7 @@ static void nvme_del_cq_end(struct request *req, blk_status_t error)
 	struct nvme_queue *nvmeq = req->end_io_data;
 
 	if (!error)
-		nvme_poll_irqdisable(nvmeq, -1);
+		set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
 
 	nvme_del_queue_end(req, error);
 }
@@ -2227,11 +2228,20 @@ static bool nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode)
 		nr_queues--;
 		sent++;
 	}
-	while (sent--) {
+	while (sent) {
+		struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
+
 		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
 				timeout);
 		if (timeout == 0)
 			return false;
+
+		/* handle any remaining CQEs */
+		if (opcode == nvme_admin_delete_cq &&
+		    !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
+			nvme_poll_irqdisable(nvmeq, -1);
+
+		sent--;
 		if (nr_queues)
 			goto retry;
 	}
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 92+ messages in thread

end of thread, other threads:[~2018-12-04 18:40 UTC | newest]

Thread overview: 92+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-02 16:46 block and nvme polling improvements V3 Christoph Hellwig
2018-12-02 16:46 ` Christoph Hellwig
2018-12-02 16:46 ` [PATCH 01/13] block: move queues types to the block layer Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-04  0:49   ` Sagi Grimberg
2018-12-04  0:49     ` Sagi Grimberg
2018-12-04 15:00     ` Christoph Hellwig
2018-12-04 15:00       ` Christoph Hellwig
2018-12-04 17:08       ` Sagi Grimberg
2018-12-04 17:08         ` Sagi Grimberg
2018-12-02 16:46 ` [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-04  0:54   ` Sagi Grimberg
2018-12-04  0:54     ` Sagi Grimberg
2018-12-04 15:04     ` Christoph Hellwig
2018-12-04 15:04       ` Christoph Hellwig
2018-12-04 17:11       ` Sagi Grimberg
2018-12-04 17:11         ` Sagi Grimberg
2018-12-02 16:46 ` [PATCH 03/13] nvme-pci: cleanup SQ allocation a bit Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-04  0:55   ` Sagi Grimberg
2018-12-04  0:55     ` Sagi Grimberg
2018-12-02 16:46 ` [PATCH 04/13] nvme-pci: only allow polling with separate poll queues Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-03 18:23   ` Keith Busch
2018-12-03 18:23     ` Keith Busch
2018-12-04  0:56   ` Sagi Grimberg
2018-12-04  0:56     ` Sagi Grimberg
2018-12-02 16:46 ` [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-04  0:58   ` Sagi Grimberg
2018-12-04  0:58     ` Sagi Grimberg
2018-12-04 15:04     ` Christoph Hellwig
2018-12-04 15:04       ` Christoph Hellwig
2018-12-04 17:13       ` Sagi Grimberg
2018-12-04 17:13         ` Sagi Grimberg
2018-12-04 18:19         ` Jens Axboe
2018-12-04 18:19           ` Jens Axboe
2018-12-02 16:46 ` [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-04  1:00   ` Sagi Grimberg
2018-12-04  1:00     ` Sagi Grimberg
2018-12-04 15:05     ` Christoph Hellwig
2018-12-04 15:05       ` Christoph Hellwig
2018-12-04 18:19       ` Jens Axboe
2018-12-04 18:19         ` Jens Axboe
2018-12-02 16:46 ` [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-03 18:15   ` Keith Busch
2018-12-03 18:15     ` Keith Busch
2018-12-04  1:05   ` Sagi Grimberg
2018-12-04  1:05     ` Sagi Grimberg
2018-12-02 16:46 ` [PATCH 08/13] nvme-pci: remove the CQ lock for interrupt driven queues Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-04  1:08   ` Sagi Grimberg
2018-12-04  1:08     ` Sagi Grimberg
2018-12-02 16:46 ` [PATCH 09/13] nvme-rdma: remove I/O polling support Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-02 16:46 ` [PATCH 10/13] nvme-mpath: " Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-03 18:22   ` Keith Busch
2018-12-03 18:22     ` Keith Busch
2018-12-04  1:11   ` Sagi Grimberg
2018-12-04  1:11     ` Sagi Grimberg
2018-12-04 15:07     ` Christoph Hellwig
2018-12-04 15:07       ` Christoph Hellwig
2018-12-04 17:18       ` Sagi Grimberg
2018-12-04 17:18         ` Sagi Grimberg
2018-12-02 16:46 ` [PATCH 11/13] block: remove ->poll_fn Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-04  1:11   ` Sagi Grimberg
2018-12-04  1:11     ` Sagi Grimberg
2018-12-02 16:46 ` [PATCH 12/13] block: only allow polling if a poll queue_map exists Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-04  1:14   ` Sagi Grimberg
2018-12-04  1:14     ` Sagi Grimberg
2018-12-02 16:46 ` [PATCH 13/13] block: enable polling by default if a poll map is initalized Christoph Hellwig
2018-12-02 16:46   ` Christoph Hellwig
2018-12-04  1:14   ` Sagi Grimberg
2018-12-04  1:14     ` Sagi Grimberg
2018-12-04 18:40 ` block and nvme polling improvements V3 Jens Axboe
2018-12-04 18:40   ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2018-11-29 19:12 block and nvme polling improvements V2 Christoph Hellwig
2018-11-29 19:13 ` [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues Christoph Hellwig
2018-11-29 19:13   ` Christoph Hellwig
2018-11-29 20:36   ` Keith Busch
2018-11-29 20:36     ` Keith Busch
2018-11-30  8:08     ` Christoph Hellwig
2018-11-30  8:08       ` Christoph Hellwig
2018-11-30 14:45       ` Keith Busch
2018-11-30 14:45         ` Keith Busch
2018-11-21 16:23 block and nvme polling improvements Christoph Hellwig
2018-11-21 16:23 ` [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues Christoph Hellwig
2018-11-21 16:23   ` Christoph Hellwig
