* [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps
@ 2018-10-29 16:37 Jens Axboe
  2018-10-29 16:37 ` [PATCH 01/14] blk-mq: kill q->mq_map Jens Axboe
                   ` (13 more replies)
  0 siblings, 14 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel

This series adds support for multiple queue maps in blk-mq.
Since blk-mq was introduced, it has only supported a single
queue map. That means you can have one set of queues, and the
mapping depends purely on which CPU an IO originated from. With
this patch set, drivers can implement mappings that depend on
both the CPU and the request type - and they can have multiple
sets of mappings.
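
To make that concrete, here's a rough sketch of the lookup the
series converges on (patches 2 and 7 below); the helper name is
made up for illustration, the real one is blk_mq_map_queue():

	/*
	 * Illustrative sketch only: the request/bio flags select a
	 * queue map, the submitting CPU selects an entry in that map.
	 */
	static struct blk_mq_hw_ctx *lookup_hctx(struct request_queue *q,
						 unsigned int flags, int cpu)
	{
		struct blk_mq_tag_set *set = q->tag_set;
		int type = 0;

		if (q->mq_ops->flags_to_type)
			type = q->mq_ops->flags_to_type(q, flags);

		return q->queue_hw_ctx[set->map[type].mq_map[cpu]];
	}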

NVMe is used as a proof of concept. It adds support for a separate
write queue set. One way to use this would be to limit the number
of write queues to favor reads, since NVMe does round-robin service
of queues. An easy extension of this would be to add multiple
sets of queues, for prioritized IO.

NVMe also uses this feature to finally make polling work
efficiently, without triggering interrupts. This increases
performance and decreases latency, at a lower system load. At
the same time it's more flexible, as you don't have to worry
about IRQ coalescing and redirection to keep interrupts from
disturbing the workload. This is how polling should have worked
from day 1.

This is on top of the mq-conversions branch and the series just
posted. It can also be found in my mq-maps branch.

Changes since v1:

- Ensure irq_calc_affinity_vectors() doesn't return more than 'maxvec'
- Rebase on top of current mq-conversions series

 block/blk-flush.c                     |   7 +-
 block/blk-mq-cpumap.c                 |  19 +--
 block/blk-mq-debugfs.c                |   4 +-
 block/blk-mq-pci.c                    |  10 +-
 block/blk-mq-rdma.c                   |   4 +-
 block/blk-mq-sched.c                  |  18 ++-
 block/blk-mq-sysfs.c                  |  10 ++
 block/blk-mq-tag.c                    |   5 +-
 block/blk-mq-virtio.c                 |   8 +-
 block/blk-mq.c                        | 213 ++++++++++++++++++++----------
 block/blk-mq.h                        |  29 ++++-
 block/blk.h                           |   6 +-
 block/kyber-iosched.c                 |   6 +-
 drivers/block/virtio_blk.c            |   2 +-
 drivers/nvme/host/pci.c               | 238 ++++++++++++++++++++++++++++++----
 drivers/scsi/qla2xxx/qla_os.c         |   5 +-
 drivers/scsi/scsi_lib.c               |   2 +-
 drivers/scsi/smartpqi/smartpqi_init.c |   3 +-
 drivers/scsi/virtio_scsi.c            |   3 +-
 fs/block_dev.c                        |   2 +
 fs/direct-io.c                        |   2 +
 fs/iomap.c                            |   9 +-
 include/linux/blk-mq-pci.h            |   4 +-
 include/linux/blk-mq-virtio.h         |   4 +-
 include/linux/blk-mq.h                |  24 +++-
 include/linux/blk_types.h             |   4 +-
 include/linux/blkdev.h                |   2 -
 include/linux/interrupt.h             |   4 +
 kernel/irq/affinity.c                 |  40 ++++--
 29 files changed, 526 insertions(+), 161 deletions(-)




* [PATCH 01/14] blk-mq: kill q->mq_map
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 16:46   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 02/14] blk-mq: abstract out queue map Jens Axboe
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

It's just a pointer to set->mq_map; use that instead.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c         | 13 ++++---------
 block/blk-mq.h         |  4 +++-
 include/linux/blkdev.h |  2 --
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 21e4147c4810..22d5beaab5a0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2321,7 +2321,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	 * If the cpu isn't present, the cpu is mapped to first hctx.
 	 */
 	for_each_possible_cpu(i) {
-		hctx_idx = q->mq_map[i];
+		hctx_idx = set->mq_map[i];
 		/* unmapped hw queue can be remapped after CPU topo changed */
 		if (!set->tags[hctx_idx] &&
 		    !__blk_mq_alloc_rq_map(set, hctx_idx)) {
@@ -2331,7 +2331,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 			 * case, remap the current ctx to hctx[0] which
 			 * is guaranteed to always have tags allocated
 			 */
-			q->mq_map[i] = 0;
+			set->mq_map[i] = 0;
 		}
 
 		ctx = per_cpu_ptr(q->queue_ctx, i);
@@ -2429,8 +2429,6 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 				     struct request_queue *q)
 {
-	q->tag_set = set;
-
 	mutex_lock(&set->tag_list_lock);
 
 	/*
@@ -2467,8 +2465,6 @@ void blk_mq_release(struct request_queue *q)
 		kobject_put(&hctx->kobj);
 	}
 
-	q->mq_map = NULL;
-
 	kfree(q->queue_hw_ctx);
 
 	/*
@@ -2588,7 +2584,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		int node;
 		struct blk_mq_hw_ctx *hctx;
 
-		node = blk_mq_hw_queue_to_node(q->mq_map, i);
+		node = blk_mq_hw_queue_to_node(set->mq_map, i);
 		/*
 		 * If the hw queue has been mapped to another numa node,
 		 * we need to realloc the hctx. If allocation fails, fallback
@@ -2665,8 +2661,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_percpu;
 
-	q->mq_map = set->mq_map;
-
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
@@ -2675,6 +2669,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
 	q->nr_queues = nr_cpu_ids;
+	q->tag_set = set;
 
 	q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47e2526..9536be06d022 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -75,7 +75,9 @@ extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 		int cpu)
 {
-	return q->queue_hw_ctx[q->mq_map[cpu]];
+	struct blk_mq_tag_set *set = q->tag_set;
+
+	return q->queue_hw_ctx[set->mq_map[cpu]];
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c675e2b5af62..4223ae2d2198 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -412,8 +412,6 @@ struct request_queue {
 
 	const struct blk_mq_ops	*mq_ops;
 
-	unsigned int		*mq_map;
-
 	/* sw queues */
 	struct blk_mq_ctx __percpu	*queue_ctx;
 	unsigned int		nr_queues;
-- 
2.17.1



* [PATCH 02/14] blk-mq: abstract out queue map
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
  2018-10-29 16:37 ` [PATCH 01/14] blk-mq: kill q->mq_map Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 18:33   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 03/14] blk-mq: provide dummy blk_mq_map_queue_type() helper Jens Axboe
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

This is in preparation for allowing multiple sets of maps per
queue, if so desired.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-cpumap.c                 | 10 ++++----
 block/blk-mq-pci.c                    | 10 ++++----
 block/blk-mq-rdma.c                   |  4 ++--
 block/blk-mq-virtio.c                 |  8 +++----
 block/blk-mq.c                        | 34 ++++++++++++++-------------
 block/blk-mq.h                        |  8 +++----
 drivers/block/virtio_blk.c            |  2 +-
 drivers/nvme/host/pci.c               |  2 +-
 drivers/scsi/qla2xxx/qla_os.c         |  5 ++--
 drivers/scsi/scsi_lib.c               |  2 +-
 drivers/scsi/smartpqi/smartpqi_init.c |  3 ++-
 drivers/scsi/virtio_scsi.c            |  3 ++-
 include/linux/blk-mq-pci.h            |  4 ++--
 include/linux/blk-mq-virtio.h         |  4 ++--
 include/linux/blk-mq.h                | 13 ++++++++--
 15 files changed, 63 insertions(+), 49 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..6e6686c55984 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,10 +30,10 @@ static int get_first_sibling(unsigned int cpu)
 	return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+int blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
-	unsigned int *map = set->mq_map;
-	unsigned int nr_queues = set->nr_hw_queues;
+	unsigned int *map = qmap->mq_map;
+	unsigned int nr_queues = qmap->nr_queues;
 	unsigned int cpu, first_sibling;
 
 	for_each_possible_cpu(cpu) {
@@ -62,12 +62,12 @@ EXPORT_SYMBOL_GPL(blk_mq_map_queues);
  * We have no quick way of doing reverse lookups. This is only used at
  * queue init time, so runtime isn't important.
  */
-int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
+int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int index)
 {
 	int i;
 
 	for_each_possible_cpu(i) {
-		if (index == mq_map[i])
+		if (index == qmap->mq_map[i])
 			return local_memory_node(cpu_to_node(i));
 	}
 
diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index db644ec624f5..40333d60a850 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -31,26 +31,26 @@
  * that maps a queue to the CPUs that have irq affinity for the corresponding
  * vector.
  */
-int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev,
+int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
 			    int offset)
 {
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
-	for (queue = 0; queue < set->nr_hw_queues; queue++) {
+	for (queue = 0; queue < qmap->nr_queues; queue++) {
 		mask = pci_irq_get_affinity(pdev, queue + offset);
 		if (!mask)
 			goto fallback;
 
 		for_each_cpu(cpu, mask)
-			set->mq_map[cpu] = queue;
+			qmap->mq_map[cpu] = queue;
 	}
 
 	return 0;
 
 fallback:
-	WARN_ON_ONCE(set->nr_hw_queues > 1);
-	blk_mq_clear_mq_map(set);
+	WARN_ON_ONCE(qmap->nr_queues > 1);
+	blk_mq_clear_mq_map(qmap);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(blk_mq_pci_map_queues);
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..a71576aff3a5 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -41,12 +41,12 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 			goto fallback;
 
 		for_each_cpu(cpu, mask)
-			set->mq_map[cpu] = queue;
+			set->map[0].mq_map[cpu] = queue;
 	}
 
 	return 0;
 
 fallback:
-	return blk_mq_map_queues(set);
+	return blk_mq_map_queues(&set->map[0]);
 }
 EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c
index c3afbca11299..661fbfef480f 100644
--- a/block/blk-mq-virtio.c
+++ b/block/blk-mq-virtio.c
@@ -29,7 +29,7 @@
  * that maps a queue to the CPUs that have irq affinity for the corresponding
  * vector.
  */
-int blk_mq_virtio_map_queues(struct blk_mq_tag_set *set,
+int blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
 		struct virtio_device *vdev, int first_vec)
 {
 	const struct cpumask *mask;
@@ -38,17 +38,17 @@ int blk_mq_virtio_map_queues(struct blk_mq_tag_set *set,
 	if (!vdev->config->get_vq_affinity)
 		goto fallback;
 
-	for (queue = 0; queue < set->nr_hw_queues; queue++) {
+	for (queue = 0; queue < qmap->nr_queues; queue++) {
 		mask = vdev->config->get_vq_affinity(vdev, first_vec + queue);
 		if (!mask)
 			goto fallback;
 
 		for_each_cpu(cpu, mask)
-			set->mq_map[cpu] = queue;
+			qmap->mq_map[cpu] = queue;
 	}
 
 	return 0;
 fallback:
-	return blk_mq_map_queues(set);
+	return blk_mq_map_queues(qmap);
 }
 EXPORT_SYMBOL_GPL(blk_mq_virtio_map_queues);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 22d5beaab5a0..fa2e5176966e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1974,7 +1974,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	struct blk_mq_tags *tags;
 	int node;
 
-	node = blk_mq_hw_queue_to_node(set->mq_map, hctx_idx);
+	node = blk_mq_hw_queue_to_node(&set->map[0], hctx_idx);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 
@@ -2030,7 +2030,7 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	size_t rq_size, left;
 	int node;
 
-	node = blk_mq_hw_queue_to_node(set->mq_map, hctx_idx);
+	node = blk_mq_hw_queue_to_node(&set->map[0], hctx_idx);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 
@@ -2321,7 +2321,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	 * If the cpu isn't present, the cpu is mapped to first hctx.
 	 */
 	for_each_possible_cpu(i) {
-		hctx_idx = set->mq_map[i];
+		hctx_idx = set->map[0].mq_map[i];
 		/* unmapped hw queue can be remapped after CPU topo changed */
 		if (!set->tags[hctx_idx] &&
 		    !__blk_mq_alloc_rq_map(set, hctx_idx)) {
@@ -2331,7 +2331,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 			 * case, remap the current ctx to hctx[0] which
 			 * is guaranteed to always have tags allocated
 			 */
-			set->mq_map[i] = 0;
+			set->map[0].mq_map[i] = 0;
 		}
 
 		ctx = per_cpu_ptr(q->queue_ctx, i);
@@ -2584,7 +2584,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		int node;
 		struct blk_mq_hw_ctx *hctx;
 
-		node = blk_mq_hw_queue_to_node(set->mq_map, i);
+		node = blk_mq_hw_queue_to_node(&set->map[0], i);
 		/*
 		 * If the hw queue has been mapped to another numa node,
 		 * we need to realloc the hctx. If allocation fails, fallback
@@ -2793,18 +2793,18 @@ static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 		 * for (queue = 0; queue < set->nr_hw_queues; queue++) {
 		 * 	mask = get_cpu_mask(queue)
 		 * 	for_each_cpu(cpu, mask)
-		 * 		set->mq_map[cpu] = queue;
+		 * 		set->map.mq_map[cpu] = queue;
 		 * }
 		 *
 		 * When we need to remap, the table has to be cleared for
 		 * killing stale mapping since one CPU may not be mapped
 		 * to any hw queue.
 		 */
-		blk_mq_clear_mq_map(set);
+		blk_mq_clear_mq_map(&set->map[0]);
 
 		return set->ops->map_queues(set);
 	} else
-		return blk_mq_map_queues(set);
+		return blk_mq_map_queues(&set->map[0]);
 }
 
 /*
@@ -2859,10 +2859,12 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 		return -ENOMEM;
 
 	ret = -ENOMEM;
-	set->mq_map = kcalloc_node(nr_cpu_ids, sizeof(*set->mq_map),
-				   GFP_KERNEL, set->numa_node);
-	if (!set->mq_map)
+	set->map[0].mq_map = kcalloc_node(nr_cpu_ids,
+						sizeof(*set->map[0].mq_map),
+				   		GFP_KERNEL, set->numa_node);
+	if (!set->map[0].mq_map)
 		goto out_free_tags;
+	set->map[0].nr_queues = set->nr_hw_queues;
 
 	ret = blk_mq_update_queue_map(set);
 	if (ret)
@@ -2878,8 +2880,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	return 0;
 
 out_free_mq_map:
-	kfree(set->mq_map);
-	set->mq_map = NULL;
+	kfree(set->map[0].mq_map);
+	set->map[0].mq_map = NULL;
 out_free_tags:
 	kfree(set->tags);
 	set->tags = NULL;
@@ -2894,8 +2896,8 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 	for (i = 0; i < nr_cpu_ids; i++)
 		blk_mq_free_map_and_requests(set, i);
 
-	kfree(set->mq_map);
-	set->mq_map = NULL;
+	kfree(set->map[0].mq_map);
+	set->map[0].mq_map = NULL;
 
 	kfree(set->tags);
 	set->tags = NULL;
@@ -3056,7 +3058,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 			pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
 					nr_hw_queues, prev_nr_hw_queues);
 			set->nr_hw_queues = prev_nr_hw_queues;
-			blk_mq_map_queues(set);
+			blk_mq_map_queues(&set->map[0]);
 			goto fallback;
 		}
 		blk_mq_map_swqueue(q);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9536be06d022..889f0069dd80 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -70,14 +70,14 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 /*
  * CPU -> queue mappings
  */
-extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
+extern int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int);
 
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 		int cpu)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 
-	return q->queue_hw_ctx[set->mq_map[cpu]];
+	return q->queue_hw_ctx[set->map[0].mq_map[cpu]];
 }
 
 /*
@@ -206,12 +206,12 @@ static inline void blk_mq_put_driver_tag(struct request *rq)
 	__blk_mq_put_driver_tag(hctx, rq);
 }
 
-static inline void blk_mq_clear_mq_map(struct blk_mq_tag_set *set)
+static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 {
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		set->mq_map[cpu] = 0;
+		qmap->mq_map[cpu] = 0;
 }
 
 #endif
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 086c6bb12baa..6e869d05f91e 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -624,7 +624,7 @@ static int virtblk_map_queues(struct blk_mq_tag_set *set)
 {
 	struct virtio_blk *vblk = set->driver_data;
 
-	return blk_mq_virtio_map_queues(set, vblk->vdev, 0);
+	return blk_mq_virtio_map_queues(&set->map[0], vblk->vdev, 0);
 }
 
 #ifdef CONFIG_VIRTIO_BLK_SCSI
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f30031945ee4..e5d783cb6937 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -435,7 +435,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_dev *dev = set->driver_data;
 
-	return blk_mq_pci_map_queues(set, to_pci_dev(dev->dev),
+	return blk_mq_pci_map_queues(&set->map[0], to_pci_dev(dev->dev),
 			dev->num_vecs > 1 ? 1 /* admin queue */ : 0);
 }
 
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 3e2665c66bc4..ca9ac124f218 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -6934,11 +6934,12 @@ static int qla2xxx_map_queues(struct Scsi_Host *shost)
 {
 	int rc;
 	scsi_qla_host_t *vha = (scsi_qla_host_t *)shost->hostdata;
+	struct blk_mq_queue_map *qmap = &shost->tag_set.map[0];
 
 	if (USER_CTRL_IRQ(vha->hw))
-		rc = blk_mq_map_queues(&shost->tag_set);
+		rc = blk_mq_map_queues(qmap);
 	else
-		rc = blk_mq_pci_map_queues(&shost->tag_set, vha->hw->pdev, 0);
+		rc = blk_mq_pci_map_queues(qmap, vha->hw->pdev, 0);
 	return rc;
 }
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 651be30ba96a..ed81b8e74cfe 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1812,7 +1812,7 @@ static int scsi_map_queues(struct blk_mq_tag_set *set)
 
 	if (shost->hostt->map_queues)
 		return shost->hostt->map_queues(shost);
-	return blk_mq_map_queues(set);
+	return blk_mq_map_queues(&set->map[0]);
 }
 
 void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index a25a07a0b7f0..bac084260d80 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -5319,7 +5319,8 @@ static int pqi_map_queues(struct Scsi_Host *shost)
 {
 	struct pqi_ctrl_info *ctrl_info = shost_to_hba(shost);
 
-	return blk_mq_pci_map_queues(&shost->tag_set, ctrl_info->pci_dev, 0);
+	return blk_mq_pci_map_queues(&shost->tag_set.map[0],
+					ctrl_info->pci_dev, 0);
 }
 
 static int pqi_getpciinfo_ioctl(struct pqi_ctrl_info *ctrl_info,
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 1c72db94270e..c3c95b314286 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -719,8 +719,9 @@ static void virtscsi_target_destroy(struct scsi_target *starget)
 static int virtscsi_map_queues(struct Scsi_Host *shost)
 {
 	struct virtio_scsi *vscsi = shost_priv(shost);
+	struct blk_mq_queue_map *qmap = &shost->tag_set.map[0];
 
-	return blk_mq_virtio_map_queues(&shost->tag_set, vscsi->vdev, 2);
+	return blk_mq_virtio_map_queues(qmap, vscsi->vdev, 2);
 }
 
 /*
diff --git a/include/linux/blk-mq-pci.h b/include/linux/blk-mq-pci.h
index 9f4c17f0d2d8..0b1f45c62623 100644
--- a/include/linux/blk-mq-pci.h
+++ b/include/linux/blk-mq-pci.h
@@ -2,10 +2,10 @@
 #ifndef _LINUX_BLK_MQ_PCI_H
 #define _LINUX_BLK_MQ_PCI_H
 
-struct blk_mq_tag_set;
+struct blk_mq_queue_map;
 struct pci_dev;
 
-int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev,
+int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
 			  int offset);
 
 #endif /* _LINUX_BLK_MQ_PCI_H */
diff --git a/include/linux/blk-mq-virtio.h b/include/linux/blk-mq-virtio.h
index 69b4da262c45..687ae287e1dc 100644
--- a/include/linux/blk-mq-virtio.h
+++ b/include/linux/blk-mq-virtio.h
@@ -2,10 +2,10 @@
 #ifndef _LINUX_BLK_MQ_VIRTIO_H
 #define _LINUX_BLK_MQ_VIRTIO_H
 
-struct blk_mq_tag_set;
+struct blk_mq_queue_map;
 struct virtio_device;
 
-int blk_mq_virtio_map_queues(struct blk_mq_tag_set *set,
+int blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
 		struct virtio_device *vdev, int first_vec);
 
 #endif /* _LINUX_BLK_MQ_VIRTIO_H */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5c8418ebbfd6..71fd205b4213 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -74,8 +74,17 @@ struct blk_mq_hw_ctx {
 	struct srcu_struct	srcu[0];
 };
 
+struct blk_mq_queue_map {
+	unsigned int *mq_map;
+	unsigned int nr_queues;
+};
+
+enum {
+	HCTX_MAX_TYPES = 1,
+};
+
 struct blk_mq_tag_set {
-	unsigned int		*mq_map;
+	struct blk_mq_queue_map	map[HCTX_MAX_TYPES];
 	const struct blk_mq_ops	*ops;
 	unsigned int		nr_hw_queues;
 	unsigned int		queue_depth;	/* max hw supported */
@@ -294,7 +303,7 @@ void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set);
+int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
-- 
2.17.1



* [PATCH 03/14] blk-mq: provide dummy blk_mq_map_queue_type() helper
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
  2018-10-29 16:37 ` [PATCH 01/14] blk-mq: kill q->mq_map Jens Axboe
  2018-10-29 16:37 ` [PATCH 02/14] blk-mq: abstract out queue map Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 17:22   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 04/14] blk-mq: pass in request/bio flags to queue mapping Jens Axboe
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

It doesn't do anything right now, but it's needed as a prep patch
to get the interfaces right.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 889f0069dd80..79c300faa7ce 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -80,6 +80,12 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 	return q->queue_hw_ctx[set->map[0].mq_map[cpu]];
 }
 
+static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
+							  int type, int cpu)
+{
+	return blk_mq_map_queue(q, cpu);
+}
+
 /*
  * sysfs helpers
  */
-- 
2.17.1



* [PATCH 04/14] blk-mq: pass in request/bio flags to queue mapping
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (2 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 03/14] blk-mq: provide dummy blk_mq_map_queue_type() helper Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 17:30   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 05/14] blk-mq: allow software queue to map to multiple hardware queues Jens Axboe
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

Prep patch for being able to place requests based not just on
CPU location, but also on the type of request.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-flush.c      |  7 +++---
 block/blk-mq-debugfs.c |  4 +++-
 block/blk-mq-sched.c   | 16 ++++++++++----
 block/blk-mq-tag.c     |  5 +++--
 block/blk-mq.c         | 50 +++++++++++++++++++++++-------------------
 block/blk-mq.h         |  8 ++++---
 block/blk.h            |  6 ++---
 7 files changed, 58 insertions(+), 38 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 9baa9a119447..7922dba81497 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -219,7 +219,7 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 
 	/* release the tag's ownership to the req cloned from */
 	spin_lock_irqsave(&fq->mq_flush_lock, flags);
-	hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu);
+	hctx = blk_mq_map_queue(q, flush_rq->cmd_flags, flush_rq->mq_ctx->cpu);
 	if (!q->elevator) {
 		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
 		flush_rq->tag = -1;
@@ -307,7 +307,8 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	if (!q->elevator) {
 		fq->orig_rq = first_rq;
 		flush_rq->tag = first_rq->tag;
-		hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
+		hctx = blk_mq_map_queue(q, first_rq->cmd_flags,
+					first_rq->mq_ctx->cpu);
 		blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
 	} else {
 		flush_rq->internal_tag = first_rq->internal_tag;
@@ -330,7 +331,7 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
 	unsigned long flags;
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
-	hctx = blk_mq_map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
 
 	if (q->elevator) {
 		WARN_ON(rq->tag < 0);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 9ed43a7c70b5..fac70c81b7de 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -427,8 +427,10 @@ struct show_busy_params {
 static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 {
 	const struct show_busy_params *params = data;
+	struct blk_mq_hw_ctx *hctx;
 
-	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx)
+	hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
+	if (hctx == params->hctx)
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
 }
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 29bfe8017a2d..8125e9393ec2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -311,7 +311,7 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 {
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx->cpu);
 	bool ret = false;
 
 	if (e && e->type->ops.mq.bio_merge) {
@@ -367,7 +367,9 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
 
 	/* flush rq in flush machinery need to be dispatched directly */
 	if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
@@ -400,9 +402,15 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
 				  struct blk_mq_ctx *ctx,
 				  struct list_head *list, bool run_queue_async)
 {
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-	struct elevator_queue *e = hctx->queue->elevator;
+	struct blk_mq_hw_ctx *hctx;
+	struct elevator_queue *e;
+	struct request *rq;
+
+	/* For list inserts, requests better be on the same hw queue */
+	rq = list_first_entry(list, struct request, queuelist);
+	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
 
+	e = hctx->queue->elevator;
 	if (e && e->type->ops.mq.insert_requests)
 		e->type->ops.mq.insert_requests(hctx, list, false);
 	else {
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 4254e74c1446..478a959357f5 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -168,7 +168,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		io_schedule();
 
 		data->ctx = blk_mq_get_ctx(data->q);
-		data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
+		data->hctx = blk_mq_map_queue(data->q, data->cmd_flags,
+						data->ctx->cpu);
 		tags = blk_mq_tags_from_data(data);
 		if (data->flags & BLK_MQ_REQ_RESERVED)
 			bt = &tags->breserved_tags;
@@ -530,7 +531,7 @@ u32 blk_mq_unique_tag(struct request *rq)
 	struct blk_mq_hw_ctx *hctx;
 	int hwq = 0;
 
-	hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
+	hctx = blk_mq_map_queue(q, rq->cmd_flags, rq->mq_ctx->cpu);
 	hwq = hctx->queue_num;
 
 	return (hwq << BLK_MQ_UNIQUE_TAG_BITS) |
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fa2e5176966e..e6ea7da99125 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -332,8 +332,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 }
 
 static struct request *blk_mq_get_request(struct request_queue *q,
-		struct bio *bio, unsigned int op,
-		struct blk_mq_alloc_data *data)
+					  struct bio *bio,
+					  struct blk_mq_alloc_data *data)
 {
 	struct elevator_queue *e = q->elevator;
 	struct request *rq;
@@ -347,8 +347,9 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 		put_ctx_on_error = true;
 	}
 	if (likely(!data->hctx))
-		data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
-	if (op & REQ_NOWAIT)
+		data->hctx = blk_mq_map_queue(q, data->cmd_flags,
+						data->ctx->cpu);
+	if (data->cmd_flags & REQ_NOWAIT)
 		data->flags |= BLK_MQ_REQ_NOWAIT;
 
 	if (e) {
@@ -359,9 +360,10 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 		 * dispatch list. Don't include reserved tags in the
 		 * limiting, as it isn't useful.
 		 */
-		if (!op_is_flush(op) && e->type->ops.mq.limit_depth &&
+		if (!op_is_flush(data->cmd_flags) &&
+		    e->type->ops.mq.limit_depth &&
 		    !(data->flags & BLK_MQ_REQ_RESERVED))
-			e->type->ops.mq.limit_depth(op, data);
+			e->type->ops.mq.limit_depth(data->cmd_flags, data);
 	} else {
 		blk_mq_tag_busy(data->hctx);
 	}
@@ -376,8 +378,8 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 		return NULL;
 	}
 
-	rq = blk_mq_rq_ctx_init(data, tag, op);
-	if (!op_is_flush(op)) {
+	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags);
+	if (!op_is_flush(data->cmd_flags)) {
 		rq->elv.icq = NULL;
 		if (e && e->type->ops.mq.prepare_request) {
 			if (e->type->icq_cache && rq_ioc(bio))
@@ -394,7 +396,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 		blk_mq_req_flags_t flags)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags };
+	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
 	struct request *rq;
 	int ret;
 
@@ -402,7 +404,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 	if (ret)
 		return ERR_PTR(ret);
 
-	rq = blk_mq_get_request(q, NULL, op, &alloc_data);
+	rq = blk_mq_get_request(q, NULL, &alloc_data);
 	blk_queue_exit(q);
 
 	if (!rq)
@@ -420,7 +422,7 @@ EXPORT_SYMBOL(blk_mq_alloc_request);
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags };
+	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
 	struct request *rq;
 	unsigned int cpu;
 	int ret;
@@ -453,7 +455,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
 	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
 
-	rq = blk_mq_get_request(q, NULL, op, &alloc_data);
+	rq = blk_mq_get_request(q, NULL, &alloc_data);
 	blk_queue_exit(q);
 
 	if (!rq)
@@ -467,7 +469,7 @@ static void __blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
 	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
@@ -484,7 +486,7 @@ void blk_mq_free_request(struct request *rq)
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
 
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.mq.finish_request)
@@ -976,8 +978,9 @@ bool blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
-		.hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
+		.hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu),
 		.flags = BLK_MQ_REQ_NOWAIT,
+		.cmd_flags = rq->cmd_flags,
 	};
 	bool shared;
 
@@ -1141,7 +1144,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 
 		rq = list_first_entry(list, struct request, queuelist);
 
-		hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
+		hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
 		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
 			break;
 
@@ -1572,7 +1575,8 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, rq->cmd_flags,
+							ctx->cpu);
 
 	spin_lock(&hctx->lock);
 	list_add_tail(&rq->queuelist, &hctx->dispatch);
@@ -1782,7 +1786,8 @@ blk_status_t blk_mq_request_issue_directly(struct request *rq)
 	int srcu_idx;
 	blk_qc_t unused_cookie;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, rq->cmd_flags,
+							ctx->cpu);
 
 	hctx_lock(hctx, &srcu_idx);
 	ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);
@@ -1816,7 +1821,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
 	const int is_flush_fua = op_is_flush(bio->bi_opf);
-	struct blk_mq_alloc_data data = { .flags = 0 };
+	struct blk_mq_alloc_data data = { .flags = 0, .cmd_flags = bio->bi_opf };
 	struct request *rq;
 	unsigned int request_count = 0;
 	struct blk_plug *plug;
@@ -1839,7 +1844,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 	rq_qos_throttle(q, bio, NULL);
 
-	rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);
+	rq = blk_mq_get_request(q, bio, &data);
 	if (unlikely(!rq)) {
 		rq_qos_cleanup(q, bio);
 		if (bio->bi_opf & REQ_NOWAIT)
@@ -1908,6 +1913,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 		if (same_queue_rq) {
 			data.hctx = blk_mq_map_queue(q,
+					same_queue_rq->cmd_flags,
 					same_queue_rq->mq_ctx->cpu);
 			blk_mq_try_issue_directly(data.hctx, same_queue_rq,
 					&cookie);
@@ -2262,7 +2268,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		 * Set local node, IFF we have more than one hw queue. If
 		 * not, we remain on the home node of the device
 		 */
-		hctx = blk_mq_map_queue(q, i);
+		hctx = blk_mq_map_queue_type(q, 0, i);
 		if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
 			hctx->numa_node = local_memory_node(cpu_to_node(i));
 	}
@@ -2335,7 +2341,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		}
 
 		ctx = per_cpu_ptr(q->queue_ctx, i);
-		hctx = blk_mq_map_queue(q, i);
+		hctx = blk_mq_map_queue_type(q, 0, i);
 
 		cpumask_set_cpu(i, hctx->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 79c300faa7ce..55428b92c019 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -73,7 +73,8 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 extern int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int);
 
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
-		int cpu)
+						     unsigned int flags,
+						     int cpu)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 
@@ -83,7 +84,7 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
 							  int type, int cpu)
 {
-	return blk_mq_map_queue(q, cpu);
+	return blk_mq_map_queue(q, type, cpu);
 }
 
 /*
@@ -134,6 +135,7 @@ struct blk_mq_alloc_data {
 	struct request_queue *q;
 	blk_mq_req_flags_t flags;
 	unsigned int shallow_depth;
+	unsigned int cmd_flags;
 
 	/* input & output parameter */
 	struct blk_mq_ctx *ctx;
@@ -208,7 +210,7 @@ static inline void blk_mq_put_driver_tag(struct request *rq)
 	if (rq->tag == -1 || rq->internal_tag == -1)
 		return;
 
-	hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
+	hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
 	__blk_mq_put_driver_tag(hctx, rq);
 }
 
diff --git a/block/blk.h b/block/blk.h
index 2bf1cfeeb9c0..78ae94886acf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -104,10 +104,10 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 	__clear_bit(flag, &q->queue_flags);
 }
 
-static inline struct blk_flush_queue *blk_get_flush_queue(
-		struct request_queue *q, struct blk_mq_ctx *ctx)
+static inline struct blk_flush_queue *
+blk_get_flush_queue(struct request_queue *q, struct blk_mq_ctx *ctx)
 {
-	return blk_mq_map_queue(q, ctx->cpu)->fq;
+	return blk_mq_map_queue(q, REQ_OP_FLUSH, ctx->cpu)->fq;
 }
 
 static inline void __blk_get_queue(struct request_queue *q)
-- 
2.17.1



* [PATCH 05/14] blk-mq: allow software queue to map to multiple hardware queues
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (3 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 04/14] blk-mq: pass in request/bio flags to queue mapping Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 17:34   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 06/14] blk-mq: add 'type' attribute to the sysfs hctx directory Jens Axboe
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

The mapping used to depend on just the CPU location, but now
it's a { type, cpu } tuple instead. This is a prep patch for
allowing a single software queue to map to multiple hardware
queues. No functional changes in this patch.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-sched.c   |  2 +-
 block/blk-mq.c         | 18 ++++++++++++------
 block/blk-mq.h         |  2 +-
 block/kyber-iosched.c  |  6 +++---
 include/linux/blk-mq.h |  3 ++-
 5 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 8125e9393ec2..d232ecf3290c 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -110,7 +110,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
 					  struct blk_mq_ctx *ctx)
 {
-	unsigned idx = ctx->index_hw;
+	unsigned short idx = ctx->index_hw[hctx->type];
 
 	if (++idx == hctx->nr_ctx)
 		idx = 0;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e6ea7da99125..fab84c6bda18 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -75,14 +75,18 @@ static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
 				     struct blk_mq_ctx *ctx)
 {
-	if (!sbitmap_test_bit(&hctx->ctx_map, ctx->index_hw))
-		sbitmap_set_bit(&hctx->ctx_map, ctx->index_hw);
+	const int bit = ctx->index_hw[hctx->type];
+
+	if (!sbitmap_test_bit(&hctx->ctx_map, bit))
+		sbitmap_set_bit(&hctx->ctx_map, bit);
 }
 
 static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 				      struct blk_mq_ctx *ctx)
 {
-	sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
+	const int bit = ctx->index_hw[hctx->type];
+
+	sbitmap_clear_bit(&hctx->ctx_map, bit);
 }
 
 struct mq_inflight {
@@ -954,7 +958,7 @@ static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
 struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_ctx *start)
 {
-	unsigned off = start ? start->index_hw : 0;
+	unsigned off = start ? start->index_hw[hctx->type] : 0;
 	struct dispatch_rq_data data = {
 		.hctx = hctx,
 		.rq   = NULL,
@@ -2342,10 +2346,12 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
 		ctx = per_cpu_ptr(q->queue_ctx, i);
 		hctx = blk_mq_map_queue_type(q, 0, i);
-
+		hctx->type = 0;
 		cpumask_set_cpu(i, hctx->cpumask);
-		ctx->index_hw = hctx->nr_ctx;
+		ctx->index_hw[hctx->type] = hctx->nr_ctx;
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
+		/* wrap */
+		BUG_ON(!hctx->nr_ctx);
 	}
 
 	mutex_unlock(&q->sysfs_lock);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 55428b92c019..7b5a790acdbf 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -17,7 +17,7 @@ struct blk_mq_ctx {
 	}  ____cacheline_aligned_in_smp;
 
 	unsigned int		cpu;
-	unsigned int		index_hw;
+	unsigned short		index_hw[HCTX_MAX_TYPES];
 
 	/* incremented at dispatch time */
 	unsigned long		rq_dispatched[2];
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 728757a34fa0..b824a639d5d4 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -576,7 +576,7 @@ static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
 {
 	struct kyber_hctx_data *khd = hctx->sched_data;
 	struct blk_mq_ctx *ctx = blk_mq_get_ctx(hctx->queue);
-	struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw];
+	struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]];
 	unsigned int sched_domain = kyber_sched_domain(bio->bi_opf);
 	struct list_head *rq_list = &kcq->rq_list[sched_domain];
 	bool merged;
@@ -602,7 +602,7 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
 
 	list_for_each_entry_safe(rq, next, rq_list, queuelist) {
 		unsigned int sched_domain = kyber_sched_domain(rq->cmd_flags);
-		struct kyber_ctx_queue *kcq = &khd->kcqs[rq->mq_ctx->index_hw];
+		struct kyber_ctx_queue *kcq = &khd->kcqs[rq->mq_ctx->index_hw[hctx->type]];
 		struct list_head *head = &kcq->rq_list[sched_domain];
 
 		spin_lock(&kcq->lock);
@@ -611,7 +611,7 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
 		else
 			list_move_tail(&rq->queuelist, head);
 		sbitmap_set_bit(&khd->kcq_map[sched_domain],
-				rq->mq_ctx->index_hw);
+				rq->mq_ctx->index_hw[hctx->type]);
 		blk_mq_sched_request_inserted(rq);
 		spin_unlock(&kcq->lock);
 	}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 71fd205b4213..f9e19962a22f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -37,7 +37,8 @@ struct blk_mq_hw_ctx {
 	struct blk_mq_ctx	*dispatch_from;
 	unsigned int		dispatch_busy;
 
-	unsigned int		nr_ctx;
+	unsigned short		type;
+	unsigned short		nr_ctx;
 	struct blk_mq_ctx	**ctxs;
 
 	spinlock_t		dispatch_wait_lock;
-- 
2.17.1



* [PATCH 06/14] blk-mq: add 'type' attribute to the sysfs hctx directory
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (4 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 05/14] blk-mq: allow software queue to map to multiple hardware queues Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 17:40   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 07/14] blk-mq: support multiple hctx maps Jens Axboe
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

It can be useful for a user to verify what type a given hardware
queue is, so expose this information in sysfs.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-sysfs.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index aafb44224c89..2d737f9e7ba7 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -161,6 +161,11 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static ssize_t blk_mq_hw_sysfs_type_show(struct blk_mq_hw_ctx *hctx, char *page)
+{
+	return sprintf(page, "%u\n", hctx->type);
+}
+
 static struct attribute *default_ctx_attrs[] = {
 	NULL,
 };
@@ -177,11 +182,16 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_cpus = {
 	.attr = {.name = "cpu_list", .mode = 0444 },
 	.show = blk_mq_hw_sysfs_cpus_show,
 };
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_type = {
+	.attr = {.name = "type", .mode = 0444 },
+	.show = blk_mq_hw_sysfs_type_show,
+};
 
 static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_nr_tags.attr,
 	&blk_mq_hw_sysfs_nr_reserved_tags.attr,
 	&blk_mq_hw_sysfs_cpus.attr,
+	&blk_mq_hw_sysfs_type.attr,
 	NULL,
 };
 
-- 
2.17.1



* [PATCH 07/14] blk-mq: support multiple hctx maps
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (5 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 06/14] blk-mq: add 'type' attribute to the sysfs hctx directory Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 18:15   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 08/14] blk-mq: separate number of hardware queues from nr_cpu_ids Jens Axboe
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

Add support for the tag set carrying multiple queue maps, and
for the driver to inform blk-mq how many it wishes to support
by setting set->nr_maps.

This adds an mq_ops helper for drivers that support more than one
map, mq_ops->flags_to_type(). The function takes the request queue
and the request/bio flags, and returns a queue map index for them.
We then use the type information in blk_mq_map_queue() to index
the map set.
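
As a driver-side sketch (the function and the write/default split
are hypothetical; only the ->flags_to_type() hook, set->nr_maps and
the HCTX_MAX_TYPES bound come from this series), steering writes to
a second map could look like:

	/*
	 * Hypothetical example, not part of this series: map[1] for
	 * writes, map[0] for everything else. Wired up via the driver's
	 * blk_mq_ops.flags_to_type, with set->nr_maps = 2 and both
	 * set->map[] entries filled in by the driver's ->map_queues().
	 */
	static int foo_flags_to_type(struct request_queue *q, unsigned int flags)
	{
		return op_is_write(flags) ? 1 : 0;
	}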

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c         | 85 ++++++++++++++++++++++++++++--------------
 block/blk-mq.h         | 19 ++++++----
 include/linux/blk-mq.h |  7 ++++
 3 files changed, 76 insertions(+), 35 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index fab84c6bda18..0fab36372ace 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2257,7 +2257,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 static void blk_mq_init_cpu_queues(struct request_queue *q,
 				   unsigned int nr_hw_queues)
 {
-	unsigned int i;
+	struct blk_mq_tag_set *set = q->tag_set;
+	unsigned int i, j;
 
 	for_each_possible_cpu(i) {
 		struct blk_mq_ctx *__ctx = per_cpu_ptr(q->queue_ctx, i);
@@ -2272,9 +2273,11 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		 * Set local node, IFF we have more than one hw queue. If
 		 * not, we remain on the home node of the device
 		 */
-		hctx = blk_mq_map_queue_type(q, 0, i);
-		if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
-			hctx->numa_node = local_memory_node(cpu_to_node(i));
+		for (j = 0; j < set->nr_maps; j++) {
+			hctx = blk_mq_map_queue_type(q, j, i);
+			if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
+				hctx->numa_node = local_memory_node(cpu_to_node(i));
+		}
 	}
 }
 
@@ -2309,7 +2312,7 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
 
 static void blk_mq_map_swqueue(struct request_queue *q)
 {
-	unsigned int i, hctx_idx;
+	unsigned int i, j, hctx_idx;
 	struct blk_mq_hw_ctx *hctx;
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -2345,13 +2348,23 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		}
 
 		ctx = per_cpu_ptr(q->queue_ctx, i);
-		hctx = blk_mq_map_queue_type(q, 0, i);
-		hctx->type = 0;
-		cpumask_set_cpu(i, hctx->cpumask);
-		ctx->index_hw[hctx->type] = hctx->nr_ctx;
-		hctx->ctxs[hctx->nr_ctx++] = ctx;
-		/* wrap */
-		BUG_ON(!hctx->nr_ctx);
+		for (j = 0; j < set->nr_maps; j++) {
+			hctx = blk_mq_map_queue_type(q, j, i);
+			hctx->type = j;
+
+			/*
+			 * If the CPU is already set in the mask, then we've
+			 * mapped this one already. This can happen if
+			 * devices share queues across queue maps.
+			 */
+			if (cpumask_test_cpu(i, hctx->cpumask))
+				continue;
+			cpumask_set_cpu(i, hctx->cpumask);
+			ctx->index_hw[hctx->type] = hctx->nr_ctx;
+			hctx->ctxs[hctx->nr_ctx++] = ctx;
+			/* wrap */
+			BUG_ON(!hctx->nr_ctx);
+		}
 	}
 
 	mutex_unlock(&q->sysfs_lock);
@@ -2519,6 +2532,7 @@ struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
 	memset(set, 0, sizeof(*set));
 	set->ops = ops;
 	set->nr_hw_queues = 1;
+	set->nr_maps = 1;
 	set->queue_depth = queue_depth;
 	set->numa_node = NUMA_NO_NODE;
 	set->flags = set_flags;
@@ -2798,6 +2812,8 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
 	if (set->ops->map_queues) {
+		int i;
+
 		/*
 		 * transport .map_queues is usually done in the following
 		 * way:
@@ -2805,18 +2821,21 @@ static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 		 * for (queue = 0; queue < set->nr_hw_queues; queue++) {
 		 * 	mask = get_cpu_mask(queue)
 		 * 	for_each_cpu(cpu, mask)
-		 * 		set->map.mq_map[cpu] = queue;
+		 * 		set->map[x].mq_map[cpu] = queue;
 		 * }
 		 *
 		 * When we need to remap, the table has to be cleared for
 		 * killing stale mapping since one CPU may not be mapped
 		 * to any hw queue.
 		 */
-		blk_mq_clear_mq_map(&set->map[0]);
+		for (i = 0; i < set->nr_maps; i++)
+			blk_mq_clear_mq_map(&set->map[i]);
 
 		return set->ops->map_queues(set);
-	} else
+	} else {
+		BUG_ON(set->nr_maps > 1);
 		return blk_mq_map_queues(&set->map[0]);
+	}
 }
 
 /*
@@ -2827,7 +2846,7 @@ static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
  */
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 {
-	int ret;
+	int i, ret;
 
 	BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
 
@@ -2850,6 +2869,11 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 		set->queue_depth = BLK_MQ_MAX_DEPTH;
 	}
 
+	if (!set->nr_maps)
+		set->nr_maps = 1;
+	else if (set->nr_maps > HCTX_MAX_TYPES)
+		return -EINVAL;
+
 	/*
 	 * If a crashdump is active, then we are potentially in a very
 	 * memory constrained environment. Limit us to 1 queue and
@@ -2871,12 +2895,14 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 		return -ENOMEM;
 
 	ret = -ENOMEM;
-	set->map[0].mq_map = kcalloc_node(nr_cpu_ids,
-						sizeof(*set->map[0].mq_map),
-				   		GFP_KERNEL, set->numa_node);
-	if (!set->map[0].mq_map)
-		goto out_free_tags;
-	set->map[0].nr_queues = set->nr_hw_queues;
+	for (i = 0; i < set->nr_maps; i++) {
+		set->map[i].mq_map = kcalloc_node(nr_cpu_ids,
+						  sizeof(struct blk_mq_queue_map),
+						  GFP_KERNEL, set->numa_node);
+		if (!set->map[i].mq_map)
+			goto out_free_mq_map;
+		set->map[i].nr_queues = set->nr_hw_queues;
+	}
 
 	ret = blk_mq_update_queue_map(set);
 	if (ret)
@@ -2892,9 +2918,10 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	return 0;
 
 out_free_mq_map:
-	kfree(set->map[0].mq_map);
-	set->map[0].mq_map = NULL;
-out_free_tags:
+	for (i = 0; i < set->nr_maps; i++) {
+		kfree(set->map[i].mq_map);
+		set->map[i].mq_map = NULL;
+	}
 	kfree(set->tags);
 	set->tags = NULL;
 	return ret;
@@ -2903,13 +2930,15 @@ EXPORT_SYMBOL(blk_mq_alloc_tag_set);
 
 void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 {
-	int i;
+	int i, j;
 
 	for (i = 0; i < nr_cpu_ids; i++)
 		blk_mq_free_map_and_requests(set, i);
 
-	kfree(set->map[0].mq_map);
-	set->map[0].mq_map = NULL;
+	for (j = 0; j < set->nr_maps; j++) {
+		kfree(set->map[j].mq_map);
+		set->map[j].mq_map = NULL;
+	}
 
 	kfree(set->tags);
 	set->tags = NULL;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7b5a790acdbf..e27c6f8dc86c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -72,19 +72,24 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
  */
 extern int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int);
 
-static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
-						     unsigned int flags,
-						     int cpu)
+static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
+							  int type, int cpu)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 
-	return q->queue_hw_ctx[set->map[0].mq_map[cpu]];
+	return q->queue_hw_ctx[set->map[type].mq_map[cpu]];
 }
 
-static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
-							  int type, int cpu)
+static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
+						     unsigned int flags,
+						     int cpu)
 {
-	return blk_mq_map_queue(q, type, cpu);
+	int type = 0;
+
+	if (q->mq_ops->flags_to_type)
+		type = q->mq_ops->flags_to_type(q, flags);
+
+	return blk_mq_map_queue_type(q, type, cpu);
 }
 
 /*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f9e19962a22f..837087cf07cc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -86,6 +86,7 @@ enum {
 
 struct blk_mq_tag_set {
 	struct blk_mq_queue_map	map[HCTX_MAX_TYPES];
+	unsigned int		nr_maps;
 	const struct blk_mq_ops	*ops;
 	unsigned int		nr_hw_queues;
 	unsigned int		queue_depth;	/* max hw supported */
@@ -109,6 +110,7 @@ struct blk_mq_queue_data {
 
 typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
 		const struct blk_mq_queue_data *);
+typedef int (flags_to_type_fn)(struct request_queue *, unsigned int);
 typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *);
 typedef void (put_budget_fn)(struct blk_mq_hw_ctx *);
 typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
@@ -133,6 +135,11 @@ struct blk_mq_ops {
 	 */
 	queue_rq_fn		*queue_rq;
 
+	/*
+	 * Return a queue map type for the given request/bio flags
+	 */
+	flags_to_type_fn	*flags_to_type;
+
 	/*
 	 * Reserve budget before queue request, once .queue_rq is
 	 * run, it is driver's responsibility to release the
-- 
2.17.1



* [PATCH 08/14] blk-mq: separate number of hardware queues from nr_cpu_ids
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (6 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 07/14] blk-mq: support multiple hctx maps Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 18:31   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues Jens Axboe
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

With multiple maps, nr_cpu_ids is no longer the maximum number of
hardware queues we support on a given device. The initializer of
the tag_set may have set ->nr_hw_queues larger than the available
number of CPUs, since the total can exceed that with multiple
queue maps.
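
For example (numbers purely illustrative), a driver with two queue
maps on a 16-CPU machine might now set up its tag set as:

	/* Illustrative only: 16 "default" queues plus 8 "write" queues */
	set->nr_maps = 2;
	set->nr_hw_queues = 24;		/* no longer clamped to nr_cpu_ids */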

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0fab36372ace..60a951c4934c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2663,6 +2663,19 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	mutex_unlock(&q->sysfs_lock);
 }
 
+/*
+ * Maximum number of queues we support. For single sets, we'll never have
+ * more than the CPUs (software queues). For multiple sets, the tag_set
+ * user may have set ->nr_hw_queues larger.
+ */
+static unsigned int nr_hw_queues(struct blk_mq_tag_set *set)
+{
+	if (set->nr_maps == 1)
+		return nr_cpu_ids;
+
+	return max(set->nr_hw_queues, nr_cpu_ids);
+}
+
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q)
 {
@@ -2682,7 +2695,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	/* init q->mq_kobj and sw queues' kobjects */
 	blk_mq_sysfs_init(q);
 
-	q->queue_hw_ctx = kcalloc_node(nr_cpu_ids, sizeof(*(q->queue_hw_ctx)),
+	q->nr_queues = nr_hw_queues(set);
+	q->queue_hw_ctx = kcalloc_node(q->nr_queues, sizeof(*(q->queue_hw_ctx)),
 						GFP_KERNEL, set->numa_node);
 	if (!q->queue_hw_ctx)
 		goto err_percpu;
@@ -2694,7 +2708,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
-	q->nr_queues = nr_cpu_ids;
 	q->tag_set = set;
 
 	q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
@@ -2884,12 +2897,13 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 		set->queue_depth = min(64U, set->queue_depth);
 	}
 	/*
-	 * There is no use for more h/w queues than cpus.
+	 * There is no use for more h/w queues than cpus if we just have
+	 * a single map
 	 */
-	if (set->nr_hw_queues > nr_cpu_ids)
+	if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
 		set->nr_hw_queues = nr_cpu_ids;
 
-	set->tags = kcalloc_node(nr_cpu_ids, sizeof(struct blk_mq_tags *),
+	set->tags = kcalloc_node(nr_hw_queues(set), sizeof(struct blk_mq_tags *),
 				 GFP_KERNEL, set->numa_node);
 	if (!set->tags)
 		return -ENOMEM;
@@ -2932,7 +2946,7 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 {
 	int i, j;
 
-	for (i = 0; i < nr_cpu_ids; i++)
+	for (i = 0; i < nr_hw_queues(set); i++)
 		blk_mq_free_map_and_requests(set, i);
 
 	for (j = 0; j < set->nr_maps; j++) {
@@ -3064,7 +3078,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 
 	lockdep_assert_held(&set->tag_list_lock);
 
-	if (nr_hw_queues > nr_cpu_ids)
+	if (set->nr_maps == 1 && nr_hw_queues > nr_cpu_ids)
 		nr_hw_queues = nr_cpu_ids;
 	if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
 		return;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (7 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 08/14] blk-mq: separate number of hardware queues from nr_cpu_ids Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 19:27   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 10/14] blk-mq: initial support for multiple queue maps Jens Axboe
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

Since we insert per hardware queue, we have to ensure that every
request on the plug list being inserted belongs to the same
hardware queue.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 60a951c4934c..52b07188b39a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1621,6 +1621,27 @@ static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
 		  blk_rq_pos(rqa) < blk_rq_pos(rqb)));
 }
 
+/*
+ * Need to ensure that the hardware queue matches, so we don't submit
+ * a list of requests that end up on different hardware queues.
+ */
+static bool ctx_match(struct request *req, struct blk_mq_ctx *ctx,
+		      unsigned int flags)
+{
+	if (req->mq_ctx != ctx)
+		return false;
+
+	/*
+	 * If we just have one map, then we know the hctx will match
+	 * if the ctx matches
+	 */
+	if (req->q->tag_set->nr_maps == 1)
+		return true;
+
+	return blk_mq_map_queue(req->q, req->cmd_flags, ctx->cpu) ==
+		blk_mq_map_queue(req->q, flags, ctx->cpu);
+}
+
 void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
 	struct blk_mq_ctx *this_ctx;
@@ -1628,7 +1649,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	struct request *rq;
 	LIST_HEAD(list);
 	LIST_HEAD(ctx_list);
-	unsigned int depth;
+	unsigned int depth, this_flags;
 
 	list_splice_init(&plug->mq_list, &list);
 
@@ -1636,13 +1657,14 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 
 	this_q = NULL;
 	this_ctx = NULL;
+	this_flags = 0;
 	depth = 0;
 
 	while (!list_empty(&list)) {
 		rq = list_entry_rq(list.next);
 		list_del_init(&rq->queuelist);
 		BUG_ON(!rq->q);
-		if (rq->mq_ctx != this_ctx) {
+		if (!ctx_match(rq, this_ctx, this_flags)) {
 			if (this_ctx) {
 				trace_block_unplug(this_q, depth, !from_schedule);
 				blk_mq_sched_insert_requests(this_q, this_ctx,
@@ -1650,6 +1672,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 								from_schedule);
 			}
 
+			this_flags = rq->cmd_flags;
 			this_ctx = rq->mq_ctx;
 			this_q = rq->q;
 			depth = 0;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 10/14] blk-mq: initial support for multiple queue maps
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (8 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 19:40   ` Bart Van Assche
  2018-10-29 16:37 ` [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs Jens Axboe
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

Add a queue offset to the tag map. This enables users to map
iteratively, for each queue map type they support.

Bump the maximum number of supported maps to 2; we're now fully
able to support more than one map.
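
As an illustration of how the offset composes (invented numbers, not
part of the patch), a driver mapping two sets back to back ends up
with something like:

	/*
	 * map[0]: queue_offset = 0, nr_queues = 6  ->  hctx 0..5
	 * map[1]: queue_offset = 6, nr_queues = 2  ->  hctx 6..7
	 *
	 * Each map's mq_map[cpu] entries point into that map's own
	 * slice of the hardware queues, which is what the
	 * qmap->queue_offset additions below implement.
	 */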

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-cpumap.c  | 9 +++++----
 block/blk-mq-pci.c     | 2 +-
 block/blk-mq-virtio.c  | 2 +-
 include/linux/blk-mq.h | 3 ++-
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 6e6686c55984..03a534820271 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -14,9 +14,10 @@
 #include "blk.h"
 #include "blk-mq.h"
 
-static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
+static int cpu_to_queue_index(struct blk_mq_queue_map *qmap,
+			      unsigned int nr_queues, const int cpu)
 {
-	return cpu % nr_queues;
+	return qmap->queue_offset + (cpu % nr_queues);
 }
 
 static int get_first_sibling(unsigned int cpu)
@@ -44,11 +45,11 @@ int blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 		 * performace optimizations.
 		 */
 		if (cpu < nr_queues) {
-			map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+			map[cpu] = cpu_to_queue_index(qmap, nr_queues, cpu);
 		} else {
 			first_sibling = get_first_sibling(cpu);
 			if (first_sibling == cpu)
-				map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+				map[cpu] = cpu_to_queue_index(qmap, nr_queues, cpu);
 			else
 				map[cpu] = map[first_sibling];
 		}
diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index 40333d60a850..1dce18553984 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -43,7 +43,7 @@ int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
 			goto fallback;
 
 		for_each_cpu(cpu, mask)
-			qmap->mq_map[cpu] = queue;
+			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
 
 	return 0;
diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c
index 661fbfef480f..370827163835 100644
--- a/block/blk-mq-virtio.c
+++ b/block/blk-mq-virtio.c
@@ -44,7 +44,7 @@ int blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
 			goto fallback;
 
 		for_each_cpu(cpu, mask)
-			qmap->mq_map[cpu] = queue;
+			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
 
 	return 0;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 837087cf07cc..b5ae2b5677c1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -78,10 +78,11 @@ struct blk_mq_hw_ctx {
 struct blk_mq_queue_map {
 	unsigned int *mq_map;
 	unsigned int nr_queues;
+	unsigned int queue_offset;
 };
 
 enum {
-	HCTX_MAX_TYPES = 1,
+	HCTX_MAX_TYPES = 2,
 };
 
 struct blk_mq_tag_set {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (9 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 10/14] blk-mq: initial support for multiple queue maps Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 17:08   ` Thomas Gleixner
                     ` (2 more replies)
  2018-10-29 16:37 ` [PATCH 12/14] nvme: utilize two queue maps, one for reads and one for writes Jens Axboe
                   ` (2 subsequent siblings)
  13 siblings, 3 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe, Thomas Gleixner

A driver may have a need to allocate multiple sets of MSI/MSI-X
interrupts, and have them appropriately affinitized. Add support for
defining a number of sets in the irq_affinity structure, of varying
sizes, and get each set affinitized correctly across the machine.
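
For illustration, a PCI driver might describe two separately
affinitized vector sets like this (a minimal sketch modeled on the
NVMe conversion later in this series; the function name and the set
sizes are made up):

	static int example_alloc_vectors(struct pci_dev *pdev)
	{
		int sets[2] = { 6, 2 };	/* e.g. read vectors, write vectors */
		struct irq_affinity affd = {
			.pre_vectors	= 1,	/* admin vector, not spread */
			.nr_sets	= ARRAY_SIZE(sets),
			.sets		= sets,
		};

		/* 1 pre-vector + 6 + 2 set vectors = 9 */
		return pci_alloc_irq_vectors_affinity(pdev, 1, 9,
				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
	}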

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/interrupt.h |  4 ++++
 kernel/irq/affinity.c     | 40 ++++++++++++++++++++++++++++++---------
 2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 1d6711c28271..ca397ff40836 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -247,10 +247,14 @@ struct irq_affinity_notify {
  *			the MSI(-X) vector space
  * @post_vectors:	Don't apply affinity to @post_vectors at end of
  *			the MSI(-X) vector space
+ * @nr_sets:		Length of passed in *sets array
+ * @sets:		Sizes of the individual affinitized sets
  */
 struct irq_affinity {
 	int	pre_vectors;
 	int	post_vectors;
+	int	nr_sets;
+	int	*sets;
 };
 
 #if defined(CONFIG_SMP)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index f4f29b9d90ee..2046a0f0f0f1 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	int curvec, usedvecs;
 	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
 	struct cpumask *masks = NULL;
+	int i, nr_sets;
 
 	/*
 	 * If there aren't any vectors left after applying the pre/post
@@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	get_online_cpus();
 	build_node_to_cpumask(node_to_cpumask);
 
-	/* Spread on present CPUs starting from affd->pre_vectors */
-	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
-					    node_to_cpumask, cpu_present_mask,
-					    nmsk, masks);
+	/*
+	 * Spread on present CPUs starting from affd->pre_vectors. If we
+	 * have multiple sets, build each set's affinity mask separately.
+	 */
+	nr_sets = affd->nr_sets;
+	if (!nr_sets)
+		nr_sets = 1;
+
+	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
+		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
+		int nr;
+
+		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
+					      node_to_cpumask, cpu_present_mask,
+					      nmsk, masks + usedvecs);
+		usedvecs += nr;
+	}
 
 	/*
 	 * Spread on non present CPUs starting from the next vector to be
@@ -258,13 +272,21 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
 {
 	int resv = affd->pre_vectors + affd->post_vectors;
 	int vecs = maxvec - resv;
-	int ret;
+	int set_vecs;
 
 	if (resv > minvec)
 		return 0;
 
-	get_online_cpus();
-	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
-	put_online_cpus();
-	return ret;
+	if (affd->nr_sets) {
+		int i;
+
+		for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
+			set_vecs += affd->sets[i];
+	} else {
+		get_online_cpus();
+		set_vecs = cpumask_weight(cpu_possible_mask);
+		put_online_cpus();
+	}
+
+	return resv + min(set_vecs, vecs);
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 12/14] nvme: utilize two queue maps, one for reads and one for writes
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (10 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 16:37 ` [PATCH 13/14] block: add REQ_HIPRI and inherit it from IOCB_HIPRI Jens Axboe
  2018-10-29 16:37 ` [PATCH 14/14] nvme: add separate poll queue map Jens Axboe
  13 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

NVMe does round-robin between queues by default, which means that
sharing a queue map for both reads and writes can be problematic
in terms of read servicing. It's much easier to flood the queue
with writes and reduce the read servicing.

Implement two queue maps, one for reads and one for writes. The
write queue count is configurable through the 'write_queues'
parameter.

By default, we retain the previous behavior of having a single
queue set, shared between reads and writes. Setting 'write_queues'
to a non-zero value will create two queue sets, one for reads and
one for writes, the latter using the configurable number of
queues (hardware queue counts permitting).
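
As a concrete illustration of the split done by nvme_calc_io_queues()
below (invented numbers, assuming the controller grants 8 IO queues):

	write_queues = 0:  io_queues[READ] = 8, io_queues[WRITE] = 0 (shared)
	write_queues = 2:  io_queues[READ] = 6, io_queues[WRITE] = 2
	write_queues = 9:  io_queues[READ] = 1, io_queues[WRITE] = 7 (clamped
	                   so at least one read queue remains)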

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 139 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 131 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e5d783cb6937..658c9a2f4114 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -74,11 +74,29 @@ static int io_queue_depth = 1024;
 module_param_cb(io_queue_depth, &io_queue_depth_ops, &io_queue_depth, 0644);
 MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2");
 
+static int queue_count_set(const char *val, const struct kernel_param *kp);
+static const struct kernel_param_ops queue_count_ops = {
+	.set = queue_count_set,
+	.get = param_get_int,
+};
+
+static int write_queues;
+module_param_cb(write_queues, &queue_count_ops, &write_queues, 0644);
+MODULE_PARM_DESC(write_queues,
+	"Number of queues to use for writes. If not set, reads and writes "
+	"will share a queue set.");
+
 struct nvme_dev;
 struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 
+enum {
+	NVMEQ_TYPE_READ,
+	NVMEQ_TYPE_WRITE,
+	NVMEQ_TYPE_NR,
+};
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
@@ -92,6 +110,7 @@ struct nvme_dev {
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
 	unsigned max_qid;
+	unsigned io_queues[NVMEQ_TYPE_NR];
 	unsigned int num_vecs;
 	int q_depth;
 	u32 db_stride;
@@ -134,6 +153,17 @@ static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
 	return param_set_int(val, kp);
 }
 
+static int queue_count_set(const char *val, const struct kernel_param *kp)
+{
+	int n = 0, ret;
+
+	ret = kstrtoint(val, 10, &n);
+	if (n > num_possible_cpus())
+		n = num_possible_cpus();
+
+	return param_set_int(val, kp);
+}
+
 static inline unsigned int sq_idx(unsigned int qid, u32 stride)
 {
 	return qid * 2 * stride;
@@ -218,9 +248,20 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
 }
 
+static unsigned int max_io_queues(void)
+{
+	return num_possible_cpus() + write_queues;
+}
+
+static unsigned int max_queue_count(void)
+{
+	/* IO queues + admin queue */
+	return 1 + max_io_queues();
+}
+
 static inline unsigned int nvme_dbbuf_size(u32 stride)
 {
-	return ((num_possible_cpus() + 1) * 8 * stride);
+	return (max_queue_count() * 8 * stride);
 }
 
 static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
@@ -431,12 +472,41 @@ static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req,
 	return 0;
 }
 
+static int queue_irq_offset(struct nvme_dev *dev)
+{
+	/* if we have more than 1 vec, admin queue offsets us 1 */
+	if (dev->num_vecs > 1)
+		return 1;
+
+	return 0;
+}
+
 static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_dev *dev = set->driver_data;
+	int i, qoff, offset;
+
+	offset = queue_irq_offset(dev);
+	for (i = 0, qoff = 0; i < set->nr_maps; i++) {
+		struct blk_mq_queue_map *map = &set->map[i];
+
+		map->nr_queues = dev->io_queues[i];
+		if (!map->nr_queues) {
+			BUG_ON(i == NVMEQ_TYPE_READ);
 
-	return blk_mq_pci_map_queues(&set->map[0], to_pci_dev(dev->dev),
-			dev->num_vecs > 1 ? 1 /* admin queue */ : 0);
+			/* shared set, reuse read set parameters */
+			map->nr_queues = dev->io_queues[NVMEQ_TYPE_READ];
+			qoff = 0;
+			offset = queue_irq_offset(dev);
+		}
+
+		map->queue_offset = qoff;
+		blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+		qoff += map->nr_queues;
+		offset += map->nr_queues;
+	}
+
+	return 0;
 }
 
 /**
@@ -849,6 +919,14 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+static int nvme_flags_to_type(struct request_queue *q, unsigned int flags)
+{
+	if ((flags & REQ_OP_MASK) == REQ_OP_READ)
+		return NVMEQ_TYPE_READ;
+
+	return NVMEQ_TYPE_WRITE;
+}
+
 static void nvme_pci_complete_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1476,6 +1554,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
+	.flags_to_type	= nvme_flags_to_type,
 	.complete	= nvme_pci_complete_rq,
 	.init_hctx	= nvme_init_hctx,
 	.init_request	= nvme_init_request,
@@ -1888,18 +1967,53 @@ static int nvme_setup_host_mem(struct nvme_dev *dev)
 	return ret;
 }
 
+static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int nr_io_queues)
+{
+	unsigned int this_w_queues = write_queues;
+
+	/*
+	 * Setup read/write queue split
+	 */
+	if (nr_io_queues == 1) {
+		dev->io_queues[NVMEQ_TYPE_READ] = 1;
+		dev->io_queues[NVMEQ_TYPE_WRITE] = 0;
+		return;
+	}
+
+	/*
+	 * If 'write_queues' is set, ensure it leaves room for at least
+	 * one read queue
+	 */
+	if (this_w_queues >= nr_io_queues)
+		this_w_queues = nr_io_queues - 1;
+
+	/*
+	 * If 'write_queues' is set to zero, reads and writes will share
+	 * a queue set.
+	 */
+	if (!this_w_queues) {
+		dev->io_queues[NVMEQ_TYPE_WRITE] = 0;
+		dev->io_queues[NVMEQ_TYPE_READ] = nr_io_queues;
+	} else {
+		dev->io_queues[NVMEQ_TYPE_WRITE] = this_w_queues;
+		dev->io_queues[NVMEQ_TYPE_READ] = nr_io_queues - this_w_queues;
+	}
+}
+
 static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
 	struct nvme_queue *adminq = &dev->queues[0];
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int result, nr_io_queues;
 	unsigned long size;
-
+	int irq_sets[2];
 	struct irq_affinity affd = {
-		.pre_vectors = 1
+		.pre_vectors = 1,
+		.nr_sets = ARRAY_SIZE(irq_sets),
+		.sets = irq_sets,
 	};
 
-	nr_io_queues = num_possible_cpus();
+	nr_io_queues = max_io_queues();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
 	if (result < 0)
 		return result;
@@ -1929,6 +2043,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	/* Deregister the admin queue's interrupt */
 	pci_free_irq(pdev, 0, adminq);
 
+	nvme_calc_io_queues(dev, nr_io_queues);
+	irq_sets[0] = dev->io_queues[NVMEQ_TYPE_READ];
+	irq_sets[1] = dev->io_queues[NVMEQ_TYPE_WRITE];
+	if (!irq_sets[1])
+		affd.nr_sets = 1;
+
 	/*
 	 * If we enable msix early due to not intx, disable it again before
 	 * setting up the full range we need.
@@ -1941,6 +2061,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	dev->num_vecs = result;
 	dev->max_qid = max(result - 1, 1);
 
+	nvme_calc_io_queues(dev, dev->max_qid);
+
 	/*
 	 * Should investigate if there's a performance win from allocating
 	 * more queues than interrupt vectors; it might allow the submission
@@ -2042,6 +2164,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	if (!dev->ctrl.tagset) {
 		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
+		dev->tagset.nr_maps = NVMEQ_TYPE_NR;
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
 		dev->tagset.numa_node = dev_to_node(dev->dev);
 		dev->tagset.queue_depth =
@@ -2489,8 +2612,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (!dev)
 		return -ENOMEM;
 
-	dev->queues = kcalloc_node(num_possible_cpus() + 1,
-			sizeof(struct nvme_queue), GFP_KERNEL, node);
+	dev->queues = kcalloc_node(max_queue_count(), sizeof(struct nvme_queue),
+					GFP_KERNEL, node);
 	if (!dev->queues)
 		goto free;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 13/14] block: add REQ_HIPRI and inherit it from IOCB_HIPRI
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (11 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 12/14] nvme: utilize two queue maps, one for reads and one for writes Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  2018-10-29 16:37 ` [PATCH 14/14] nvme: add separate poll queue map Jens Axboe
  13 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

We use IOCB_HIPRI to poll for IO in the caller instead of scheduling.
This information is not available for (or after) IO submission. The
driver may make different queue choices based on the type of IO, so
make the fact that we will poll for this IO known to the lower layers
as well.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c            | 2 ++
 fs/direct-io.c            | 2 ++
 fs/iomap.c                | 9 ++++++++-
 include/linux/blk_types.h | 4 +++-
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 38b8ce05cbc7..8bb8090c57a7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -232,6 +232,8 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		bio.bi_opf = dio_bio_write_op(iocb);
 		task_io_account_write(ret);
 	}
+	if (iocb->ki_flags & IOCB_HIPRI)
+		bio.bi_opf |= REQ_HIPRI;
 
 	qc = submit_bio(&bio);
 	for (;;) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 093fb54cd316..ffb46b7aa5f7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1265,6 +1265,8 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	} else {
 		dio->op = REQ_OP_READ;
 	}
+	if (iocb->ki_flags & IOCB_HIPRI)
+		dio->op_flags |= REQ_HIPRI;
 
 	/*
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
diff --git a/fs/iomap.c b/fs/iomap.c
index ec15cf2ec696..50ad8c8d1dcb 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1554,6 +1554,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 		unsigned len)
 {
 	struct page *page = ZERO_PAGE(0);
+	int flags = REQ_SYNC | REQ_IDLE;
 	struct bio *bio;
 
 	bio = bio_alloc(GFP_KERNEL, 1);
@@ -1562,9 +1563,12 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
+	if (dio->iocb->ki_flags & IOCB_HIPRI)
+		flags |= REQ_HIPRI;
+
 	get_page(page);
 	__bio_add_page(bio, page, len, 0);
-	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
 
 	atomic_inc(&dio->ref);
 	return submit_bio(bio);
@@ -1663,6 +1667,9 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 				bio_set_pages_dirty(bio);
 		}
 
+		if (dio->iocb->ki_flags & IOCB_HIPRI)
+			bio->bi_opf |= REQ_HIPRI;
+
 		iov_iter_advance(dio->submit.iter, n);
 
 		dio->size += n;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 093a818c5b68..d6c2558d6b73 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -322,6 +322,8 @@ enum req_flag_bits {
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
+	__REQ_HIPRI,
+
 	/* for driver use */
 	__REQ_DRV,
 	__REQ_SWAP,		/* swapping request. */
@@ -342,8 +344,8 @@ enum req_flag_bits {
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
-
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
+#define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 14/14] nvme: add separate poll queue map
  2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
                   ` (12 preceding siblings ...)
  2018-10-29 16:37 ` [PATCH 13/14] block: add REQ_HIPRI and inherit it from IOCB_HIPRI Jens Axboe
@ 2018-10-29 16:37 ` Jens Axboe
  13 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:37 UTC (permalink / raw)
  To: linux-block, linux-scsi, linux-kernel; +Cc: Jens Axboe

Adds support for defining a variable number of poll queues, currently
configurable with the 'poll_queues' module parameter. Defaults to
a single poll queue.

And now we finally have poll support without triggering interrupts!
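
For example (invented numbers, following the extended
nvme_calc_io_queues() below and assuming the full vector count is
granted): with 8 IO queues, write_queues=2 and poll_queues=1 works
out to

	io_queues[NVMEQ_TYPE_READ]  = 5
	io_queues[NVMEQ_TYPE_WRITE] = 2
	io_queues[NVMEQ_TYPE_POLL]  = 1	/* no IRQ, polled only */

i.e. the "5/2/1 r/w/p queues" line printed at probe time.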

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 103 +++++++++++++++++++++++++++++++++-------
 include/linux/blk-mq.h  |   2 +-
 2 files changed, 88 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 658c9a2f4114..cce5d06f11c5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -86,6 +86,10 @@ MODULE_PARM_DESC(write_queues,
 	"Number of queues to use for writes. If not set, reads and writes "
 	"will share a queue set.");
 
+static int poll_queues = 1;
+module_param_cb(poll_queues, &queue_count_ops, &poll_queues, 0644);
+MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
+
 struct nvme_dev;
 struct nvme_queue;
 
@@ -94,6 +98,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 enum {
 	NVMEQ_TYPE_READ,
 	NVMEQ_TYPE_WRITE,
+	NVMEQ_TYPE_POLL,
 	NVMEQ_TYPE_NR,
 };
 
@@ -202,6 +207,7 @@ struct nvme_queue {
 	u16 last_cq_head;
 	u16 qid;
 	u8 cq_phase;
+	u8 polled;
 	u32 *dbbuf_sq_db;
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
@@ -250,7 +256,7 @@ static inline void _nvme_check_size(void)
 
 static unsigned int max_io_queues(void)
 {
-	return num_possible_cpus() + write_queues;
+	return num_possible_cpus() + write_queues + poll_queues;
 }
 
 static unsigned int max_queue_count(void)
@@ -500,8 +506,15 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 			offset = queue_irq_offset(dev);
 		}
 
+		/*
+		 * The poll queue(s) doesn't have an IRQ (and hence IRQ
+		 * affinity), so use the regular blk-mq cpu mapping
+		 */
 		map->queue_offset = qoff;
-		blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+		if (i != NVMEQ_TYPE_POLL)
+			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+		else
+			blk_mq_map_queues(map);
 		qoff += map->nr_queues;
 		offset += map->nr_queues;
 	}
@@ -892,7 +905,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	 * We should not need to do this, but we're still using this to
 	 * ensure we can drain requests on a dying queue.
 	 */
-	if (unlikely(nvmeq->cq_vector < 0))
+	if (unlikely(nvmeq->cq_vector < 0 && !nvmeq->polled))
 		return BLK_STS_IOERR;
 
 	ret = nvme_setup_cmd(ns, req, &cmnd);
@@ -921,6 +934,8 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 static int nvme_flags_to_type(struct request_queue *q, unsigned int flags)
 {
+	if (flags & REQ_HIPRI)
+		return NVMEQ_TYPE_POLL;
 	if ((flags & REQ_OP_MASK) == REQ_OP_READ)
 		return NVMEQ_TYPE_READ;
 
@@ -1094,7 +1109,10 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 		struct nvme_queue *nvmeq, s16 vector)
 {
 	struct nvme_command c;
-	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
+	int flags = NVME_QUEUE_PHYS_CONTIG;
+
+	if (vector != -1)
+		flags |= NVME_CQ_IRQ_ENABLED;
 
 	/*
 	 * Note: we (ab)use the fact that the prp fields survive if no data
@@ -1106,7 +1124,10 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cqid = cpu_to_le16(qid);
 	c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
 	c.create_cq.cq_flags = cpu_to_le16(flags);
-	c.create_cq.irq_vector = cpu_to_le16(vector);
+	if (vector != -1)
+		c.create_cq.irq_vector = cpu_to_le16(vector);
+	else
+		c.create_cq.irq_vector = 0;
 
 	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
@@ -1348,13 +1369,14 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	int vector;
 
 	spin_lock_irq(&nvmeq->cq_lock);
-	if (nvmeq->cq_vector == -1) {
+	if (nvmeq->cq_vector == -1 && !nvmeq->polled) {
 		spin_unlock_irq(&nvmeq->cq_lock);
 		return 1;
 	}
 	vector = nvmeq->cq_vector;
 	nvmeq->dev->online_queues--;
 	nvmeq->cq_vector = -1;
+	nvmeq->polled = false;
 	spin_unlock_irq(&nvmeq->cq_lock);
 
 	/*
@@ -1366,7 +1388,8 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
 		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
 
-	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
+	if (vector != -1)
+		pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
 
 	return 0;
 }
@@ -1500,7 +1523,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	spin_unlock_irq(&nvmeq->cq_lock);
 }
 
-static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
+static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
@@ -1510,7 +1533,11 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	vector = dev->num_vecs == 1 ? 0 : qid;
+	if (!polled)
+		vector = dev->num_vecs == 1 ? 0 : qid;
+	else
+		vector = -1;
+
 	result = adapter_alloc_cq(dev, qid, nvmeq, vector);
 	if (result)
 		return result;
@@ -1527,15 +1554,20 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * xxx' warning if the create CQ/SQ command times out.
 	 */
 	nvmeq->cq_vector = vector;
+	nvmeq->polled = polled;
 	nvme_init_queue(nvmeq, qid);
-	result = queue_request_irq(nvmeq);
-	if (result < 0)
-		goto release_sq;
+
+	if (vector != -1) {
+		result = queue_request_irq(nvmeq);
+		if (result < 0)
+			goto release_sq;
+	}
 
 	return result;
 
 release_sq:
 	nvmeq->cq_vector = -1;
+	nvmeq->polled = false;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
 release_cq:
@@ -1686,7 +1718,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 
 static int nvme_create_io_queues(struct nvme_dev *dev)
 {
-	unsigned i, max;
+	unsigned i, max, rw_queues;
 	int ret = 0;
 
 	for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
@@ -1697,8 +1729,17 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
 	}
 
 	max = min(dev->max_qid, dev->ctrl.queue_count - 1);
+	if (max != 1 && dev->io_queues[NVMEQ_TYPE_POLL]) {
+		rw_queues = dev->io_queues[NVMEQ_TYPE_READ] +
+				dev->io_queues[NVMEQ_TYPE_WRITE];
+	} else {
+		rw_queues = max;
+	}
+
 	for (i = dev->online_queues; i <= max; i++) {
-		ret = nvme_create_queue(&dev->queues[i], i);
+		bool polled = i > rw_queues;
+
+		ret = nvme_create_queue(&dev->queues[i], i, polled);
 		if (ret)
 			break;
 	}
@@ -1970,6 +2011,7 @@ static int nvme_setup_host_mem(struct nvme_dev *dev)
 static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int nr_io_queues)
 {
 	unsigned int this_w_queues = write_queues;
+	unsigned int this_p_queues = poll_queues;
 
 	/*
 	 * Setup read/write queue split
@@ -1977,9 +2019,28 @@ static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int nr_io_queues)
 	if (nr_io_queues == 1) {
 		dev->io_queues[NVMEQ_TYPE_READ] = 1;
 		dev->io_queues[NVMEQ_TYPE_WRITE] = 0;
+		dev->io_queues[NVMEQ_TYPE_POLL] = 0;
 		return;
 	}
 
+	/*
+	 * Configure number of poll queues, if set
+	 */
+	if (this_p_queues) {
+		/*
+		 * We need at least one queue left. With just one queue, we'll
+		 * have a single shared read/write set.
+		 */
+		if (this_p_queues >= nr_io_queues) {
+			this_w_queues = 0;
+			this_p_queues = nr_io_queues - 1;
+		}
+
+		dev->io_queues[NVMEQ_TYPE_POLL] = this_p_queues;
+		nr_io_queues -= this_p_queues;
+	} else
+		dev->io_queues[NVMEQ_TYPE_POLL] = 0;
+
 	/*
 	 * If 'write_queues' is set, ensure it leaves room for at least
 	 * one read queue
@@ -2049,19 +2110,29 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (!irq_sets[1])
 		affd.nr_sets = 1;
 
+	/*
+	 * Need IRQs for read+write queues, and one for the admin queue
+	 */
+	nr_io_queues = irq_sets[0] + irq_sets[1] + 1;
+
 	/*
 	 * If we enable msix early due to not intx, disable it again before
 	 * setting up the full range we need.
 	 */
 	pci_free_irq_vectors(pdev);
-	result = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
+	result = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues,
 			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
 	if (result <= 0)
 		return -EIO;
 	dev->num_vecs = result;
-	dev->max_qid = max(result - 1, 1);
+	result = max(result - 1, 1);
+	dev->max_qid = result + dev->io_queues[NVMEQ_TYPE_POLL];
 
 	nvme_calc_io_queues(dev, dev->max_qid);
+	dev_info(dev->ctrl.device, "%d/%d/%d r/w/p queues\n",
+					dev->io_queues[NVMEQ_TYPE_READ],
+					dev->io_queues[NVMEQ_TYPE_WRITE],
+					dev->io_queues[NVMEQ_TYPE_POLL]);
 
 	/*
 	 * Should investigate if there's a performance win from allocating
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b5ae2b5677c1..6ee1d19c6dec 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -82,7 +82,7 @@ struct blk_mq_queue_map {
 };
 
 enum {
-	HCTX_MAX_TYPES = 2,
+	HCTX_MAX_TYPES = 3,
 };
 
 struct blk_mq_tag_set {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 01/14] blk-mq: kill q->mq_map
  2018-10-29 16:37 ` [PATCH 01/14] blk-mq: kill q->mq_map Jens Axboe
@ 2018-10-29 16:46   ` Bart Van Assche
  2018-10-29 16:51     ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 16:46 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> It's just a pointer to set->mq_map, use that instead.

Please clarify in the patch description that the q->tag_set assignment
has been moved because this patch needs that pointer available earlier,
so that it is clear the move is intentional.
Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 01/14] blk-mq: kill q->mq_map
  2018-10-29 16:46   ` Bart Van Assche
@ 2018-10-29 16:51     ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 16:51 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 10:46 AM, Bart Van Assche wrote:
> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>> It's just a pointer to set->mq_map, use that instead.
> 
> Please clarify in the patch description that the q->tag_set assignment
> has been moved because this patch needs that pointer available earlier,
> so that it is clear the move is intentional.
> Anyway:
> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Good point, I'll improve the verbiage in the commit message. Thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-29 16:37 ` [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs Jens Axboe
@ 2018-10-29 17:08   ` Thomas Gleixner
  2018-10-29 17:09     ` Jens Axboe
  2018-10-30  9:25   ` Ming Lei
  2018-10-30 14:26   ` Keith Busch
  2 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2018-10-29 17:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, linux-kernel

Jens,

On Mon, 29 Oct 2018, Jens Axboe wrote:

> A driver may have a need to allocate multiple sets of MSI/MSI-X
> interrupts, and have them appropriately affinitized. Add support for
> defining a number of sets in the irq_affinity structure, of varying
> sizes, and get each set affinitized correctly across the machine.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

This looks good.

Vs. merge logistics: I'm expecting some other changes in that area as per
discussion with megasas (IIRC) folks. So I'd like to apply that myself
right after -rc1 and provide it to you as a single commit to pull from so
we can avoid collisions in next and the merge window.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-29 17:08   ` Thomas Gleixner
@ 2018-10-29 17:09     ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 17:09 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-block, linux-scsi, linux-kernel

On 10/29/18 11:08 AM, Thomas Gleixner wrote:
> Jens,
> 
> On Mon, 29 Oct 2018, Jens Axboe wrote:
> 
>> A driver may have a need to allocate multiple sets of MSI/MSI-X
>> interrupts, and have them appropriately affinitized. Add support for
>> defining a number of sets in the irq_affinity structure, of varying
>> sizes, and get each set affinitized correctly across the machine.
>>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: linux-kernel@vger.kernel.org
>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> This looks good.
> 
> Vs. merge logistics: I'm expecting some other changes in that area as per
> discussion with megasas (IIRC) folks. So I'd like to apply that myself
> right after -rc1 and provide it to you as a single commit to pull from so
> we can avoid collisions in next and the merge window.

That sounds fine, thanks Thomas!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 03/14] blk-mq: provide dummy blk_mq_map_queue_type() helper
  2018-10-29 16:37 ` [PATCH 03/14] blk-mq: provide dummy blk_mq_map_queue_type() helper Jens Axboe
@ 2018-10-29 17:22   ` Bart Van Assche
  2018-10-29 17:27     ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 17:22 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 889f0069dd80..79c300faa7ce 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -80,6 +80,12 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>  	return q->queue_hw_ctx[set->map[0].mq_map[cpu]];
>  }
>  
> +static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
> +							  int type, int cpu)
> +{
> +	return blk_mq_map_queue(q, cpu);
> +}
> +
>  /*
>   * sysfs helpers
>   */

How about renaming 'type' into something like hw_ctx_type to make it more
clear what its meaning is? How about declaring both the 'type' and 'cpu'
arguments as unsigned ints?
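
A sketch of what that proposal would amount to (hw_ctx_type is just
the name suggested above, not something used in the series):

	static inline struct blk_mq_hw_ctx *
	blk_mq_map_queue_type(struct request_queue *q, unsigned int hw_ctx_type,
			      unsigned int cpu)
	{
		return blk_mq_map_queue(q, cpu);
	}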

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 03/14] blk-mq: provide dummy blk_mq_map_queue_type() helper
  2018-10-29 17:22   ` Bart Van Assche
@ 2018-10-29 17:27     ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 17:27 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 11:22 AM, Bart Van Assche wrote:
> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>> diff --git a/block/blk-mq.h b/block/blk-mq.h
>> index 889f0069dd80..79c300faa7ce 100644
>> --- a/block/blk-mq.h
>> +++ b/block/blk-mq.h
>> @@ -80,6 +80,12 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>>  	return q->queue_hw_ctx[set->map[0].mq_map[cpu]];
>>  }
>>  
>> +static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
>> +							  int type, int cpu)
>> +{
>> +	return blk_mq_map_queue(q, cpu);
>> +}
>> +
>>  /*
>>   * sysfs helpers
>>   */
> 
> How about renaming 'type' into something like hw_ctx_type to make it more
> clear what its meaning is? How about declaring both the 'type' and 'cpu'
> arguments as unsigned ints?

I can do that. For the CPU type, the existing prototype is already
int for CPU. But may as well just update them both in the same patch.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 04/14] blk-mq: pass in request/bio flags to queue mapping
  2018-10-29 16:37 ` [PATCH 04/14] blk-mq: pass in request/bio flags to queue mapping Jens Axboe
@ 2018-10-29 17:30   ` Bart Van Assche
  2018-10-29 17:33     ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 17:30 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> @@ -400,9 +402,15 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
>  				  struct blk_mq_ctx *ctx,
>  				  struct list_head *list, bool run_queue_async)
>  {
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> -	struct elevator_queue *e = hctx->queue->elevator;
> +	struct blk_mq_hw_ctx *hctx;
> +	struct elevator_queue *e;
> +	struct request *rq;
> +
> +	/* For list inserts, requests better be on the same hw queue */
> +	rq = list_first_entry(list, struct request, queuelist);
> +	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);

Passing all request cmd_flags bits to blk_mq_map_queue() makes it possible
for that function to depend on every single cmd_flags bit, even if different
requests have different cmd_flags. Have you considered passing only the
hw_ctx type to blk_mq_map_queue(), to avoid that function starting to
depend on other cmd_flags?

Additionally, what guarantees that all requests in queuelist have the same
hw_ctx type? If a later patch guarantees that, please mention it in
the comment about list_first_entry().

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 04/14] blk-mq: pass in request/bio flags to queue mapping
  2018-10-29 17:30   ` Bart Van Assche
@ 2018-10-29 17:33     ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 17:33 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 11:30 AM, Bart Van Assche wrote:
> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>> @@ -400,9 +402,15 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
>>  				  struct blk_mq_ctx *ctx,
>>  				  struct list_head *list, bool run_queue_async)
>>  {
>> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
>> -	struct elevator_queue *e = hctx->queue->elevator;
>> +	struct blk_mq_hw_ctx *hctx;
>> +	struct elevator_queue *e;
>> +	struct request *rq;
>> +
>> +	/* For list inserts, requests better be on the same hw queue */
>> +	rq = list_first_entry(list, struct request, queuelist);
>> +	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
> 
> Passing all request cmd_flags bits to blk_mq_map_queue() makes it possible
> for that function to depend on every single cmd_flags bit, even if different
> requests have different cmd_flags. Have you considered passing only the
> hw_ctx type to blk_mq_map_queue(), to avoid that function starting to
> depend on other cmd_flags?

The core only knows about the number of types, not what each type means
nor how to map it outside of using the mapping functions. So I don't
want to expose this as an explicit type, as that would then mean that
blk-mq had to know about them.
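
To illustrate, a purely hypothetical driver (none of these names exist
in the series) could define its own type values and blk-mq would only
ever use them as an index into set->map[]:

	/* hypothetical driver: one "fast" map and one "background" map */
	enum { FOO_HCTX_FAST, FOO_HCTX_BG, FOO_HCTX_NR };

	static int foo_flags_to_type(struct request_queue *q, unsigned int flags)
	{
		/* route background writeback to its own queue set */
		if (flags & REQ_BACKGROUND)
			return FOO_HCTX_BG;
		return FOO_HCTX_FAST;
	}

	/* at tag set setup time: set->nr_maps = FOO_HCTX_NR; */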

> Additionally, what guarantees that all requests in queuelist have the same
> hw_ctx type? If a later patch guarantees that, please mention it in
> the comment about list_first_entry().

When the code is introduced, it's always the same hctx. Later on when we
do support multiple sets, the user of the list insert (plugging) explicitly
makes sure that a list only contains requests for the same hardware queue.

I'll improve the comment.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 05/14] blk-mq: allow software queue to map to multiple hardware queues
  2018-10-29 16:37 ` [PATCH 05/14] blk-mq: allow software queue to map to multiple hardware queues Jens Axboe
@ 2018-10-29 17:34   ` Bart Van Assche
  2018-10-29 17:35     ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 17:34 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> The mapping used to be dependent on just the CPU location, but
> now it's a tuple of { type, cpu} instead. This is a prep patch
> for allowing a single software queue to map to multiple hardware
> queues. No functional changes in this patch.

A nitpick: mathematicians usually use the { } notation for sets and () for
tuples. See also https://en.wikipedia.org/wiki/Tuple.

> +		/* wrap */
> +		BUG_ON(!hctx->nr_ctx);

Another nitpick: how about changing the "wrap" comment into "detect integer
overflow" or so to make the purpose of the BUG_ON() statement more clear?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 05/14] blk-mq: allow software queue to map to multiple hardware queues
  2018-10-29 17:34   ` Bart Van Assche
@ 2018-10-29 17:35     ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 17:35 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 11:34 AM, Bart Van Assche wrote:
> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>> The mapping used to be dependent on just the CPU location, but
>> now it's a tuple of { type, cpu} instead. This is a prep patch
>> for allowing a single software queue to map to multiple hardware
>> queues. No functional changes in this patch.
> 
> A nitpick: mathematicians usually use the { } notation for sets and () for
> tuples. See also https://en.wikipedia.org/wiki/Tuple.

Alright :-)

>> +		/* wrap */
>> +		BUG_ON(!hctx->nr_ctx);
> 
> Another nitpick: how about changing the "wrap" comment into "detect integer
> overflow" or so to make the purpose of the BUG_ON() statement more clear?

Sure, I'll improve that one too.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 06/14] blk-mq: add 'type' attribute to the sysfs hctx directory
  2018-10-29 16:37 ` [PATCH 06/14] blk-mq: add 'type' attribute to the sysfs hctx directory Jens Axboe
@ 2018-10-29 17:40   ` Bart Van Assche
  0 siblings, 0 replies; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 17:40 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> It can be useful for a user to verify what type a given hardware
> queue is, so expose this information in sysfs.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 07/14] blk-mq: support multiple hctx maps
  2018-10-29 16:37 ` [PATCH 07/14] blk-mq: support multiple hctx maps Jens Axboe
@ 2018-10-29 18:15   ` Bart Van Assche
  2018-10-29 19:24     ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 18:15 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> -static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
> -						     unsigned int flags,
> -						     int cpu)
> +static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
> +							  int type, int cpu)
>  {
>  	struct blk_mq_tag_set *set = q->tag_set;
>  
> -	return q->queue_hw_ctx[set->map[0].mq_map[cpu]];
> +	return q->queue_hw_ctx[set->map[type].mq_map[cpu]];
>  }
>  
> -static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
> -							  int type, int cpu)
> +static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
> +						     unsigned int flags,
> +						     int cpu)
>  {
> -	return blk_mq_map_queue(q, type, cpu);
> +	int type = 0;
> +
> +	if (q->mq_ops->flags_to_type)
> +		type = q->mq_ops->flags_to_type(q, flags);
> +
> +	return blk_mq_map_queue_type(q, type, cpu);
>  }

How about adding a comment above both these functions that explains their
purpose? Since blk_mq_map_queue() maps rq->cmd_flags and the cpu index to
a hardware context, how about renaming blk_mq_map_queue() into
blk_mq_map_cmd_flags()?

>  /*
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index f9e19962a22f..837087cf07cc 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -86,6 +86,7 @@ enum {
>  
>  struct blk_mq_tag_set {
>  	struct blk_mq_queue_map	map[HCTX_MAX_TYPES];
> +	unsigned int		nr_maps;
>  	const struct blk_mq_ops	*ops;
>  	unsigned int		nr_hw_queues;
>  	unsigned int		queue_depth;	/* max hw supported */
> @@ -109,6 +110,7 @@ struct blk_mq_queue_data {
>  
>  typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
>  		const struct blk_mq_queue_data *);
> +typedef int (flags_to_type_fn)(struct request_queue *, unsigned int);

How about adding a comment that the format of the second argument of
flags_to_type_fn is the same as that of rq->cmd_flags?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 08/14] blk-mq: separate number of hardware queues from nr_cpu_ids
  2018-10-29 16:37 ` [PATCH 08/14] blk-mq: separate number of hardware queues from nr_cpu_ids Jens Axboe
@ 2018-10-29 18:31   ` Bart Van Assche
  0 siblings, 0 replies; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 18:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> With multiple maps, nr_cpu_ids is no longer the maximum number of
> hardware queues we support on a given device. The initializer of
> the tag_set can have set ->nr_hw_queues larger than the available
> number of CPUs, since we can exceed that with multiple queue maps.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-mq.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0fab36372ace..60a951c4934c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2663,6 +2663,19 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
>  	mutex_unlock(&q->sysfs_lock);
>  }
>  
> +/*
> + * Maximum number of queues we support. For single sets, we'll never have
                       ^
                   hardware?
> + * more than the CPUs (software queues). For multiple sets, the tag_set
> + * user may have set ->nr_hw_queues larger.
> + */

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 02/14] blk-mq: abstract out queue map
  2018-10-29 16:37 ` [PATCH 02/14] blk-mq: abstract out queue map Jens Axboe
@ 2018-10-29 18:33   ` Bart Van Assche
  0 siblings, 0 replies; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 18:33 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> struct blk_mq_tag_set {
> -	unsigned int		*mq_map;
> +	struct blk_mq_queue_map	map[HCTX_MAX_TYPES];
>  	const struct blk_mq_ops	*ops;
>  	unsigned int		nr_hw_queues;
>  	unsigned int		queue_depth;	/* max hw supported */

How about documenting that nr_hw_queues is the number of hardware queues
across all hardware queue types, rather than the count for a single type?
This only became clear to me after reviewing a later patch in this series.
Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 07/14] blk-mq: support multiple hctx maps
  2018-10-29 18:15   ` Bart Van Assche
@ 2018-10-29 19:24     ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 19:24 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 12:15 PM, Bart Van Assche wrote:
> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>> -static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>> -						     unsigned int flags,
>> -						     int cpu)
>> +static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
>> +							  int type, int cpu)
>>  {
>>  	struct blk_mq_tag_set *set = q->tag_set;
>>  
>> -	return q->queue_hw_ctx[set->map[0].mq_map[cpu]];
>> +	return q->queue_hw_ctx[set->map[type].mq_map[cpu]];
>>  }
>>  
>> -static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
>> -							  int type, int cpu)
>> +static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>> +						     unsigned int flags,
>> +						     int cpu)
>>  {
>> -	return blk_mq_map_queue(q, type, cpu);
>> +	int type = 0;
>> +
>> +	if (q->mq_ops->flags_to_type)
>> +		type = q->mq_ops->flags_to_type(q, flags);
>> +
>> +	return blk_mq_map_queue_type(q, type, cpu);
>>  }
> 
> How about adding a comment above both these functions that explains their
> purpose? Since blk_mq_map_queue() maps rq->cmd_flags and the cpu index to
> a hardware context, how about renaming blk_mq_map_queue() into
> blk_mq_map_cmd_flags()?

I don't want to rename it, but I've added a kerneldoc function header to
both of them.

>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index f9e19962a22f..837087cf07cc 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -86,6 +86,7 @@ enum {
>>  
>>  struct blk_mq_tag_set {
>>  	struct blk_mq_queue_map	map[HCTX_MAX_TYPES];
>> +	unsigned int		nr_maps;
>>  	const struct blk_mq_ops	*ops;
>>  	unsigned int		nr_hw_queues;
>>  	unsigned int		queue_depth;	/* max hw supported */
>> @@ -109,6 +110,7 @@ struct blk_mq_queue_data {
>>  
>>  typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
>>  		const struct blk_mq_queue_data *);
>> +typedef int (flags_to_type_fn)(struct request_queue *, unsigned int);
> 
> How about adding a comment that the format of the second argument of
> flags_to_type_fn is the same as that of rq->cmd_flags?

Done

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues
  2018-10-29 16:37 ` [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues Jens Axboe
@ 2018-10-29 19:27   ` Bart Van Assche
  2018-10-29 19:30     ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 19:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>  void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  {
>  	struct blk_mq_ctx *this_ctx;
> @@ -1628,7 +1649,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  	struct request *rq;
>  	LIST_HEAD(list);
>  	LIST_HEAD(ctx_list);
> -	unsigned int depth;
> +	unsigned int depth, this_flags;
>  
>  	list_splice_init(&plug->mq_list, &list);
>  
> @@ -1636,13 +1657,14 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  
>  	this_q = NULL;
>  	this_ctx = NULL;
> +	this_flags = 0;
>  	depth = 0;
>  
>  	while (!list_empty(&list)) {
>  		rq = list_entry_rq(list.next);
>  		list_del_init(&rq->queuelist);
>  		BUG_ON(!rq->q);
> -		if (rq->mq_ctx != this_ctx) {
> +		if (!ctx_match(rq, this_ctx, this_flags)) {
>  			if (this_ctx) {
>  				trace_block_unplug(this_q, depth, !from_schedule);
>  				blk_mq_sched_insert_requests(this_q, this_ctx,
> @@ -1650,6 +1672,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  								from_schedule);
>  			}
>  
> +			this_flags = rq->cmd_flags;
>  			this_ctx = rq->mq_ctx;
>  			this_q = rq->q;
>  			depth = 0;

This patch will cause the function stored in the flags_to_type pointer to be
called 2 * (n - 1) times where n is the number of elements in 'list' when
blk_mq_sched_insert_requests() is called. Have you considered to rearrange
the code such that that number of calls is reduced to n?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues
  2018-10-29 19:27   ` Bart Van Assche
@ 2018-10-29 19:30     ` Jens Axboe
  2018-10-29 19:49       ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 19:30 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 1:27 PM, Bart Van Assche wrote:
> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>>  void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>>  {
>>  	struct blk_mq_ctx *this_ctx;
>> @@ -1628,7 +1649,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>>  	struct request *rq;
>>  	LIST_HEAD(list);
>>  	LIST_HEAD(ctx_list);
>> -	unsigned int depth;
>> +	unsigned int depth, this_flags;
>>  
>>  	list_splice_init(&plug->mq_list, &list);
>>  
>> @@ -1636,13 +1657,14 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>>  
>>  	this_q = NULL;
>>  	this_ctx = NULL;
>> +	this_flags = 0;
>>  	depth = 0;
>>  
>>  	while (!list_empty(&list)) {
>>  		rq = list_entry_rq(list.next);
>>  		list_del_init(&rq->queuelist);
>>  		BUG_ON(!rq->q);
>> -		if (rq->mq_ctx != this_ctx) {
>> +		if (!ctx_match(rq, this_ctx, this_flags)) {
>>  			if (this_ctx) {
>>  				trace_block_unplug(this_q, depth, !from_schedule);
>>  				blk_mq_sched_insert_requests(this_q, this_ctx,
>> @@ -1650,6 +1672,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>>  								from_schedule);
>>  			}
>>  
>> +			this_flags = rq->cmd_flags;
>>  			this_ctx = rq->mq_ctx;
>>  			this_q = rq->q;
>>  			depth = 0;
> 
> This patch will cause the function stored in the flags_to_type pointer to be
> called 2 * (n - 1) times where n is the number of elements in 'list' when
> blk_mq_sched_insert_requests() is called. Have you considered to rearrange
> the code such that that number of calls is reduced to n?

One alternative is to improve the sorting, but then we'd need to call
it in there instead. My longer term plan is something like the below,
which will reduce the number of calls in general, not just for
this path. But that is a separate change and should not be mixed
with this one.


diff --git a/block/blk-flush.c b/block/blk-flush.c
index 7922dba81497..397985808b75 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -219,7 +219,7 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 
 	/* release the tag's ownership to the req cloned from */
 	spin_lock_irqsave(&fq->mq_flush_lock, flags);
-	hctx = blk_mq_map_queue(q, flush_rq->cmd_flags, flush_rq->mq_ctx->cpu);
+	hctx = flush_rq->mq_hctx;
 	if (!q->elevator) {
 		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
 		flush_rq->tag = -1;
@@ -307,13 +307,13 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	if (!q->elevator) {
 		fq->orig_rq = first_rq;
 		flush_rq->tag = first_rq->tag;
-		hctx = blk_mq_map_queue(q, first_rq->cmd_flags,
-					first_rq->mq_ctx->cpu);
+		hctx = flush_rq->mq_hctx;
 		blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
 	} else {
 		flush_rq->internal_tag = first_rq->internal_tag;
 	}
 
+	flush_rq->mq_hctx = first_rq->mq_hctx;
 	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
 	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
 	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
@@ -326,13 +326,11 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	unsigned long flags;
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
-	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
-
 	if (q->elevator) {
 		WARN_ON(rq->tag < 0);
 		blk_mq_put_driver_tag_hctx(hctx, rq);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index fac70c81b7de..cde19be36135 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -427,10 +427,8 @@ struct show_busy_params {
 static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 {
 	const struct show_busy_params *params = data;
-	struct blk_mq_hw_ctx *hctx;
 
-	hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
-	if (hctx == params->hctx)
+	if (rq->mq_hctx == params->hctx)
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
 }
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d232ecf3290c..8bc1f37acca2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -367,9 +367,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx;
-
-	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
 	/* flush rq in flush machinery need to be dispatched directly */
 	if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
@@ -408,7 +406,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
 
 	/* For list inserts, requests better be on the same hw queue */
 	rq = list_first_entry(list, struct request, queuelist);
-	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
+	hctx = rq->mq_hctx;
 
 	e = hctx->queue->elevator;
 	if (e && e->type->ops.mq.insert_requests)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 478a959357f5..fb836d818b80 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -527,14 +527,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
  */
 u32 blk_mq_unique_tag(struct request *rq)
 {
-	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx;
-	int hwq = 0;
-
-	hctx = blk_mq_map_queue(q, rq->cmd_flags, rq->mq_ctx->cpu);
-	hwq = hctx->queue_num;
-
-	return (hwq << BLK_MQ_UNIQUE_TAG_BITS) |
+	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
 		(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
 }
 EXPORT_SYMBOL(blk_mq_unique_tag);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 52b07188b39a..6b74e186be8a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -300,6 +300,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	/* csd/requeue_work/fifo_time is initialized before use */
 	rq->q = data->q;
 	rq->mq_ctx = data->ctx;
+	rq->mq_hctx = data->hctx;
 	rq->rq_flags = rq_flags;
 	rq->cpu = -1;
 	rq->cmd_flags = op;
@@ -473,10 +474,11 @@ static void __blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
+	rq->mq_hctx = NULL;
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -490,7 +492,7 @@ void blk_mq_free_request(struct request *rq)
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.mq.finish_request)
@@ -982,7 +984,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
-		.hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu),
+		.hctx = rq->mq_hctx,
 		.flags = BLK_MQ_REQ_NOWAIT,
 		.cmd_flags = rq->cmd_flags,
 	};
@@ -1148,7 +1150,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 
 		rq = list_first_entry(list, struct request, queuelist);
 
-		hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
+		hctx = rq->mq_hctx;
 		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
 			break;
 
@@ -1578,9 +1580,7 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
  */
 void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
 {
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, rq->cmd_flags,
-							ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
 	spin_lock(&hctx->lock);
 	list_add_tail(&rq->queuelist, &hctx->dispatch);
@@ -1638,8 +1638,7 @@ static bool ctx_match(struct request *req, struct blk_mq_ctx *ctx,
 	if (req->q->tag_set->nr_maps == 1)
 		return true;
 
-	return blk_mq_map_queue(req->q, req->cmd_flags, ctx->cpu) ==
-		blk_mq_map_queue(req->q, flags, ctx->cpu);
+	return req->mq_hctx == blk_mq_map_queue(req->q, flags, ctx->cpu);
 }
 
 void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
@@ -1812,9 +1811,7 @@ blk_status_t blk_mq_request_issue_directly(struct request *rq)
 	blk_status_t ret;
 	int srcu_idx;
 	blk_qc_t unused_cookie;
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, rq->cmd_flags,
-							ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
 	hctx_lock(hctx, &srcu_idx);
 	ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);
@@ -1939,9 +1936,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		blk_mq_put_ctx(data.ctx);
 
 		if (same_queue_rq) {
-			data.hctx = blk_mq_map_queue(q,
-					same_queue_rq->cmd_flags,
-					same_queue_rq->mq_ctx->cpu);
+			data.hctx = same_queue_rq->mq_hctx;
 			blk_mq_try_issue_directly(data.hctx, same_queue_rq,
 					&cookie);
 		}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e27c6f8dc86c..f3d58f9a4552 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -210,13 +210,10 @@ static inline void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
 
 static inline void blk_mq_put_driver_tag(struct request *rq)
 {
-	struct blk_mq_hw_ctx *hctx;
-
 	if (rq->tag == -1 || rq->internal_tag == -1)
 		return;
 
-	hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
-	__blk_mq_put_driver_tag(hctx, rq);
+	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
 }
 
 static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6e506044a309..e5e8e05ada4e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -129,6 +129,7 @@ enum mq_rq_state {
 struct request {
 	struct request_queue *q;
 	struct blk_mq_ctx *mq_ctx;
+	struct blk_mq_hw_ctx *mq_hctx;
 
 	int cpu;
 	unsigned int cmd_flags;		/* op and common flags */

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/14] blk-mq: initial support for multiple queue maps
  2018-10-29 16:37 ` [PATCH 10/14] blk-mq: initial support for multiple queue maps Jens Axboe
@ 2018-10-29 19:40   ` Bart Van Assche
  2018-10-29 19:53     ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 19:40 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> -static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
> +static int cpu_to_queue_index(struct blk_mq_queue_map *qmap,
> +			      unsigned int nr_queues, const int cpu)
>  {
> -	return cpu % nr_queues;
> +	return qmap->queue_offset + (cpu % nr_queues);
>  }
> 
> [ ... ]
>  
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -78,10 +78,11 @@ struct blk_mq_hw_ctx {
>  struct blk_mq_queue_map {
>  	unsigned int *mq_map;
>  	unsigned int nr_queues;
> +	unsigned int queue_offset;
>  };

I think it's unfortunate that the blk-mq core uses the .queue_offset member but
that mapping functions in block drivers are responsible for setting that member.
Since the block driver mapping functions have to set blk_mq_queue_map.nr_queues,
how about adding a loop in blk_mq_update_queue_map() that derives .queue_offset
from the .nr_queues values of the previous array entries?
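
Something like this minimal sketch, where 'set' is the tag_set being
updated (hypothetical, just to illustrate the idea):

	unsigned int offset = 0;
	int j;

	for (j = 0; j < set->nr_maps; j++) {
		set->map[j].queue_offset = offset;
		offset += set->map[j].nr_queues;
	}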

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues
  2018-10-29 19:30     ` Jens Axboe
@ 2018-10-29 19:49       ` Jens Axboe
  2018-10-30  8:08         ` Ming Lei
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 19:49 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 1:30 PM, Jens Axboe wrote:
> On 10/29/18 1:27 PM, Bart Van Assche wrote:
>> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>>>  void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>>>  {
>>>  	struct blk_mq_ctx *this_ctx;
>>> @@ -1628,7 +1649,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>>>  	struct request *rq;
>>>  	LIST_HEAD(list);
>>>  	LIST_HEAD(ctx_list);
>>> -	unsigned int depth;
>>> +	unsigned int depth, this_flags;
>>>  
>>>  	list_splice_init(&plug->mq_list, &list);
>>>  
>>> @@ -1636,13 +1657,14 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>>>  
>>>  	this_q = NULL;
>>>  	this_ctx = NULL;
>>> +	this_flags = 0;
>>>  	depth = 0;
>>>  
>>>  	while (!list_empty(&list)) {
>>>  		rq = list_entry_rq(list.next);
>>>  		list_del_init(&rq->queuelist);
>>>  		BUG_ON(!rq->q);
>>> -		if (rq->mq_ctx != this_ctx) {
>>> +		if (!ctx_match(rq, this_ctx, this_flags)) {
>>>  			if (this_ctx) {
>>>  				trace_block_unplug(this_q, depth, !from_schedule);
>>>  				blk_mq_sched_insert_requests(this_q, this_ctx,
>>> @@ -1650,6 +1672,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>>>  								from_schedule);
>>>  			}
>>>  
>>> +			this_flags = rq->cmd_flags;
>>>  			this_ctx = rq->mq_ctx;
>>>  			this_q = rq->q;
>>>  			depth = 0;
>>
>> This patch will cause the function stored in the flags_to_type pointer to be
>> called 2 * (n - 1) times where n is the number of elements in 'list' when
>> blk_mq_sched_insert_requests() is called. Have you considered to rearrange
>> the code such that that number of calls is reduced to n?
> 
> One alternative is to improve the sorting, but then we'd need to call
> it in there instead. My longer term plan is something like the below,
> which will reduce the number of calls in general, not just for
> this path. But that is a separate change, should not be mixed
> with this one.

Here's an updated one that applies on top of the current tree,
and also uses this information to sort efficiently. This eliminates
all this, and also makes the whole thing more clear.

I'll split this into two patches; I just didn't want to include them in
the series yet.


diff --git a/block/blk-flush.c b/block/blk-flush.c
index 7922dba81497..397985808b75 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -219,7 +219,7 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 
 	/* release the tag's ownership to the req cloned from */
 	spin_lock_irqsave(&fq->mq_flush_lock, flags);
-	hctx = blk_mq_map_queue(q, flush_rq->cmd_flags, flush_rq->mq_ctx->cpu);
+	hctx = flush_rq->mq_hctx;
 	if (!q->elevator) {
 		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
 		flush_rq->tag = -1;
@@ -307,13 +307,13 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	if (!q->elevator) {
 		fq->orig_rq = first_rq;
 		flush_rq->tag = first_rq->tag;
-		hctx = blk_mq_map_queue(q, first_rq->cmd_flags,
-					first_rq->mq_ctx->cpu);
+		hctx = flush_rq->mq_hctx;
 		blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
 	} else {
 		flush_rq->internal_tag = first_rq->internal_tag;
 	}
 
+	flush_rq->mq_hctx = first_rq->mq_hctx;
 	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
 	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
 	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
@@ -326,13 +326,11 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	unsigned long flags;
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
-	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
-
 	if (q->elevator) {
 		WARN_ON(rq->tag < 0);
 		blk_mq_put_driver_tag_hctx(hctx, rq);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index fac70c81b7de..cde19be36135 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -427,10 +427,8 @@ struct show_busy_params {
 static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 {
 	const struct show_busy_params *params = data;
-	struct blk_mq_hw_ctx *hctx;
 
-	hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
-	if (hctx == params->hctx)
+	if (rq->mq_hctx == params->hctx)
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
 }
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d232ecf3290c..25c558358255 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -367,9 +367,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx;
-
-	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
 	/* flush rq in flush machinery need to be dispatched directly */
 	if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
@@ -399,16 +397,10 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 }
 
 void blk_mq_sched_insert_requests(struct request_queue *q,
-				  struct blk_mq_ctx *ctx,
+				  struct blk_mq_hw_ctx *hctx,
 				  struct list_head *list, bool run_queue_async)
 {
-	struct blk_mq_hw_ctx *hctx;
 	struct elevator_queue *e;
-	struct request *rq;
-
-	/* For list inserts, requests better be on the same hw queue */
-	rq = list_first_entry(list, struct request, queuelist);
-	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
 
 	e = hctx->queue->elevator;
 	if (e && e->type->ops.mq.insert_requests)
@@ -424,7 +416,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
 			if (list_empty(list))
 				return;
 		}
-		blk_mq_insert_requests(hctx, ctx, list);
+		blk_mq_insert_requests(hctx, list);
 	}
 
 	blk_mq_run_hw_queue(hctx, run_queue_async);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 8a9544203173..a42547213f58 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -20,7 +20,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 				 bool run_queue, bool async);
 void blk_mq_sched_insert_requests(struct request_queue *q,
-				  struct blk_mq_ctx *ctx,
+				  struct blk_mq_hw_ctx *hctx,
 				  struct list_head *list, bool run_queue_async);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 478a959357f5..fb836d818b80 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -527,14 +527,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
  */
 u32 blk_mq_unique_tag(struct request *rq)
 {
-	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx;
-	int hwq = 0;
-
-	hctx = blk_mq_map_queue(q, rq->cmd_flags, rq->mq_ctx->cpu);
-	hwq = hctx->queue_num;
-
-	return (hwq << BLK_MQ_UNIQUE_TAG_BITS) |
+	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
 		(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
 }
 EXPORT_SYMBOL(blk_mq_unique_tag);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 37310cc55733..17ea522bd7c1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -300,6 +300,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	/* csd/requeue_work/fifo_time is initialized before use */
 	rq->q = data->q;
 	rq->mq_ctx = data->ctx;
+	rq->mq_hctx = data->hctx;
 	rq->rq_flags = rq_flags;
 	rq->cpu = -1;
 	rq->cmd_flags = op;
@@ -473,10 +474,11 @@ static void __blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
+	rq->mq_hctx = NULL;
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -490,7 +492,7 @@ void blk_mq_free_request(struct request *rq)
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.mq.finish_request)
@@ -982,7 +984,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
-		.hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu),
+		.hctx = rq->mq_hctx,
 		.flags = BLK_MQ_REQ_NOWAIT,
 		.cmd_flags = rq->cmd_flags,
 	};
@@ -1148,7 +1150,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 
 		rq = list_first_entry(list, struct request, queuelist);
 
-		hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
+		hctx = rq->mq_hctx;
 		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
 			break;
 
@@ -1578,9 +1580,7 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
  */
 void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
 {
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, rq->cmd_flags,
-							ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
 	spin_lock(&hctx->lock);
 	list_add_tail(&rq->queuelist, &hctx->dispatch);
@@ -1590,10 +1590,10 @@ void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
 		blk_mq_run_hw_queue(hctx, false);
 }
 
-void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
-			    struct list_head *list)
+void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 
 {
+	struct blk_mq_ctx *ctx = NULL;
 	struct request *rq;
 
 	/*
@@ -1601,7 +1601,8 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	 * offline now
 	 */
 	list_for_each_entry(rq, list, queuelist) {
-		BUG_ON(rq->mq_ctx != ctx);
+		BUG_ON(ctx && rq->mq_ctx != ctx);
+		ctx = rq->mq_ctx;
 		trace_block_rq_insert(hctx->queue, rq);
 	}
 
@@ -1611,84 +1612,61 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	spin_unlock(&ctx->lock);
 }
 
-static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
+static int plug_hctx_cmp(void *priv, struct list_head *a, struct list_head *b)
 {
 	struct request *rqa = container_of(a, struct request, queuelist);
 	struct request *rqb = container_of(b, struct request, queuelist);
 
-	return !(rqa->mq_ctx < rqb->mq_ctx ||
-		 (rqa->mq_ctx == rqb->mq_ctx &&
+	return !(rqa->mq_hctx < rqb->mq_hctx ||
+		 (rqa->mq_hctx == rqb->mq_hctx &&
 		  blk_rq_pos(rqa) < blk_rq_pos(rqb)));
 }
 
-/*
- * Need to ensure that the hardware queue matches, so we don't submit
- * a list of requests that end up on different hardware queues.
- */
-static bool ctx_match(struct request *req, struct blk_mq_ctx *ctx,
-		      unsigned int flags)
-{
-	if (req->mq_ctx != ctx)
-		return false;
-
-	/*
-	 * If we just have one map, then we know the hctx will match
-	 * if the ctx matches
-	 */
-	if (req->q->tag_set->nr_maps == 1)
-		return true;
-
-	return blk_mq_map_queue(req->q, req->cmd_flags, ctx->cpu) ==
-		blk_mq_map_queue(req->q, flags, ctx->cpu);
-}
-
 void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
-	struct blk_mq_ctx *this_ctx;
+	struct blk_mq_hw_ctx *this_hctx;
 	struct request_queue *this_q;
 	struct request *rq;
 	LIST_HEAD(list);
-	LIST_HEAD(ctx_list);
-	unsigned int depth, this_flags;
+	LIST_HEAD(hctx_list);
+	unsigned int depth;
 
 	list_splice_init(&plug->mq_list, &list);
 
-	list_sort(NULL, &list, plug_ctx_cmp);
+	list_sort(NULL, &list, plug_hctx_cmp);
 
 	this_q = NULL;
-	this_ctx = NULL;
-	this_flags = 0;
+	this_hctx = NULL;
 	depth = 0;
 
 	while (!list_empty(&list)) {
 		rq = list_entry_rq(list.next);
 		list_del_init(&rq->queuelist);
 		BUG_ON(!rq->q);
-		if (!ctx_match(rq, this_ctx, this_flags)) {
-			if (this_ctx) {
+		if (rq->mq_hctx != this_hctx) {
+			if (this_hctx) {
 				trace_block_unplug(this_q, depth, !from_schedule);
-				blk_mq_sched_insert_requests(this_q, this_ctx,
-								&ctx_list,
+				blk_mq_sched_insert_requests(this_q, this_hctx,
+								&hctx_list,
 								from_schedule);
 			}
 
-			this_flags = rq->cmd_flags;
-			this_ctx = rq->mq_ctx;
+			this_hctx = rq->mq_hctx;
 			this_q = rq->q;
 			depth = 0;
 		}
 
 		depth++;
-		list_add_tail(&rq->queuelist, &ctx_list);
+		list_add_tail(&rq->queuelist, &hctx_list);
 	}
 
 	/*
-	 * If 'this_ctx' is set, we know we have entries to complete
-	 * on 'ctx_list'. Do those.
+	 * If 'this_hctx' is set, we know we have entries to complete
+	 * on 'hctx_list'. Do those.
 	 */
-	if (this_ctx) {
+	if (this_hctx) {
 		trace_block_unplug(this_q, depth, !from_schedule);
-		blk_mq_sched_insert_requests(this_q, this_ctx, &ctx_list,
+		blk_mq_sched_insert_requests(this_q, this_hctx, &hctx_list,
 						from_schedule);
 	}
 }
@@ -1812,9 +1790,7 @@ blk_status_t blk_mq_request_issue_directly(struct request *rq)
 	blk_status_t ret;
 	int srcu_idx;
 	blk_qc_t unused_cookie;
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, rq->cmd_flags,
-							ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
 	hctx_lock(hctx, &srcu_idx);
 	ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);
@@ -1939,9 +1915,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		blk_mq_put_ctx(data.ctx);
 
 		if (same_queue_rq) {
-			data.hctx = blk_mq_map_queue(q,
-					same_queue_rq->cmd_flags,
-					same_queue_rq->mq_ctx->cpu);
+			data.hctx = same_queue_rq->mq_hctx;
 			blk_mq_try_issue_directly(data.hctx, same_queue_rq,
 					&cookie);
 		}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8329017badc8..c74804b99a33 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -59,8 +59,7 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 				bool at_head);
 void blk_mq_request_bypass_insert(struct request *rq, bool run_queue);
-void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
-				struct list_head *list);
+void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 
 /* Used by blk_insert_cloned_request() to issue request directly */
 blk_status_t blk_mq_request_issue_directly(struct request *rq);
@@ -223,13 +222,10 @@ static inline void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
 
 static inline void blk_mq_put_driver_tag(struct request *rq)
 {
-	struct blk_mq_hw_ctx *hctx;
-
 	if (rq->tag == -1 || rq->internal_tag == -1)
 		return;
 
-	hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
-	__blk_mq_put_driver_tag(hctx, rq);
+	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
 }
 
 static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4223ae2d2198..7b351210ebcd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -129,6 +129,7 @@ enum mq_rq_state {
 struct request {
 	struct request_queue *q;
 	struct blk_mq_ctx *mq_ctx;
+	struct blk_mq_hw_ctx *mq_hctx;
 
 	int cpu;
 	unsigned int cmd_flags;		/* op and common flags */

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/14] blk-mq: initial support for multiple queue maps
  2018-10-29 19:40   ` Bart Van Assche
@ 2018-10-29 19:53     ` Jens Axboe
  2018-10-29 20:00       ` Bart Van Assche
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 19:53 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 1:40 PM, Bart Van Assche wrote:
> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>> -static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
>> +static int cpu_to_queue_index(struct blk_mq_queue_map *qmap,
>> +			      unsigned int nr_queues, const int cpu)
>>  {
>> -	return cpu % nr_queues;
>> +	return qmap->queue_offset + (cpu % nr_queues);
>>  }
>>
>> [ ... ]
>>  
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -78,10 +78,11 @@ struct blk_mq_hw_ctx {
>>  struct blk_mq_queue_map {
>>  	unsigned int *mq_map;
>>  	unsigned int nr_queues;
>> +	unsigned int queue_offset;
>>  };
> 
> I think it's unfortunate that the blk-mq core uses the .queue_offset member but
> that mapping functions in block drivers are responsible for setting that member.
> Since the block driver mapping functions have to set blk_mq_queue_map.nr_queues,
> how about adding a loop in blk_mq_update_queue_map() that derives .queue_offset
> from .nr_queues from previous array entries?

It's not a simple increment, so the driver has to be the one setting it. If
we end up sharing queues, for instance, then the driver will need to set
it to the start offset of that set. If you go two patches forward you
can see that exact construct.

IOW, it's the driver that controls the offset, not the core.
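
As a rough sketch of that construct (queue counts and the read/write
split are hypothetical here, with type 0 being reads and type 1 writes):

	set->map[0].nr_queues = nr_read_queues;
	set->map[0].queue_offset = 0;

	if (nr_write_queues) {
		/* dedicated write queues start right after the read set */
		set->map[1].nr_queues = nr_write_queues;
		set->map[1].queue_offset = nr_read_queues;
	} else {
		/* no dedicated write queues, share the read set */
		set->map[1].nr_queues = nr_read_queues;
		set->map[1].queue_offset = 0;
	}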

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/14] blk-mq: initial support for multiple queue maps
  2018-10-29 19:53     ` Jens Axboe
@ 2018-10-29 20:00       ` Bart Van Assche
  2018-10-29 20:09         ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 20:00 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 13:53 -0600, Jens Axboe wrote:
> On 10/29/18 1:40 PM, Bart Van Assche wrote:
> > On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> > > -static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
> > > +static int cpu_to_queue_index(struct blk_mq_queue_map *qmap,
> > > +			      unsigned int nr_queues, const int cpu)
> > >  {
> > > -	return cpu % nr_queues;
> > > +	return qmap->queue_offset + (cpu % nr_queues);
> > >  }
> > > 
> > > [ ... ]
> > >  
> > > --- a/include/linux/blk-mq.h
> > > +++ b/include/linux/blk-mq.h
> > > @@ -78,10 +78,11 @@ struct blk_mq_hw_ctx {
> > >  struct blk_mq_queue_map {
> > >  	unsigned int *mq_map;
> > >  	unsigned int nr_queues;
> > > +	unsigned int queue_offset;
> > >  };
> > 
> > I think it's unfortunate that the blk-mq core uses the .queue_offset member but
> > that mapping functions in block drivers are responsible for setting that member.
> > Since the block driver mapping functions have to set blk_mq_queue_map.nr_queues,
> > how about adding a loop in blk_mq_update_queue_map() that derives .queue_offset
> > from .nr_queues from previous array entries?
> 
> It's not a simple increment, so the driver has to be the one setting it. If
> we end up sharing queues, for instance, then the driver will need to set
> it to the start offset of that set. If you go two patches forward you
> can see that exact construct.
> 
> IOW, it's the driver that controls the offset, not the core.

If sharing of hardware queues between hardware queue types is supported,
what should hctx->type be set to? Additionally, patch 5 adds code that uses
hctx->type as an array index. How can that code work if a single hardware
queue can be shared by multiple hardware queue types?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/14] blk-mq: initial support for multiple queue maps
  2018-10-29 20:00       ` Bart Van Assche
@ 2018-10-29 20:09         ` Jens Axboe
  2018-10-29 20:25           ` Bart Van Assche
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 20:09 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 2:00 PM, Bart Van Assche wrote:
> On Mon, 2018-10-29 at 13:53 -0600, Jens Axboe wrote:
>> On 10/29/18 1:40 PM, Bart Van Assche wrote:
>>> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
>>>> -static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
>>>> +static int cpu_to_queue_index(struct blk_mq_queue_map *qmap,
>>>> +			      unsigned int nr_queues, const int cpu)
>>>>  {
>>>> -	return cpu % nr_queues;
>>>> +	return qmap->queue_offset + (cpu % nr_queues);
>>>>  }
>>>>
>>>> [ ... ]
>>>>  
>>>> --- a/include/linux/blk-mq.h
>>>> +++ b/include/linux/blk-mq.h
>>>> @@ -78,10 +78,11 @@ struct blk_mq_hw_ctx {
>>>>  struct blk_mq_queue_map {
>>>>  	unsigned int *mq_map;
>>>>  	unsigned int nr_queues;
>>>> +	unsigned int queue_offset;
>>>>  };
>>>
>>> I think it's unfortunate that the blk-mq core uses the .queue_offset member but
>>> that mapping functions in block drivers are responsible for setting that member.
>>> Since the block driver mapping functions have to set blk_mq_queue_map.nr_queues,
>>> how about adding a loop in blk_mq_update_queue_map() that derives .queue_offset
>>> from .nr_queues from previous array entries?
>>
>> It's not a simple increment, so the driver has to be the one setting it. If
>> we end up sharing queues, for instance, then the driver will need to set
>> it to the start offset of that set. If you go two patches forward you
>> can see that exact construct.
>>
>> IOW, it's the driver that controls the offset, not the core.
> 
> If sharing of hardware queues between hardware queue types is supported,
> what should hctx->type be set to? Additionally, patch 5 adds code that uses
> hctx->type as an array index. How can that code work if a single hardware
> queue can be shared by multiple hardware queue types?

hctx->type will be set to the value of the first type. This is all driver
private, blk-mq could not care less what the value of the type means.

As to the other question, it works just fine since that is the queue
that is being accessed. There's no confusion there. I think you're
misunderstanding how it's set up. To use nvme as the example, type 0
would be reads, 1 writes, and 2 pollable queues. If reads and writes
share the same set of hardware queues, then type 1 simply doesn't
exist in terms of ->flags_to_type() return value. This is purely
driven by the driver. That hook is the only decider of where something
will go. If we share hctx sets, we share the same hardware queue as
well. There is just the one set for that case.
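
A sketch of what such a hook can look like for the nvme example, with
io_is_pollable() and have_write_queues standing in for whatever the
driver actually checks:

static int nvme_flags_to_type(struct request_queue *q, unsigned int flags)
{
	if (io_is_pollable(flags))		/* placeholder check */
		return 2;			/* poll queues */
	if (op_is_write(flags) && have_write_queues)
		return 1;			/* write queues */
	return 0;				/* read queues */
}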

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/14] blk-mq: initial support for multiple queue maps
  2018-10-29 20:09         ` Jens Axboe
@ 2018-10-29 20:25           ` Bart Van Assche
  2018-10-29 20:29             ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2018-10-29 20:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, linux-kernel

On Mon, 2018-10-29 at 14:09 -0600, Jens Axboe wrote:
> hctx->type will be set to the value of the first type. This is all driver
> private, blk-mq could not care less what the value of the type means.
> 
> As to the other question, it works just fine since that is the queue
> that is being accessed. There's no confusion there. I think you're
> misunderstanding how it's seutp. To use nvme as the example, type 0
> would be reads, 1 writes, and 2 pollable queues. If reads and writes
> share the same set of hardware queues, then type 1 simply doesn't
> exist in terms of ->flags_to_type() return value. This is purely
> driven by the driver. That hook is the only decider of where something
> will go. If we share hctx sets, we share the same hardware queue as
> well. There is just the one set for that case.

How about adding a comment in blk-mq.h that explains that hardware queues can
be shared among different hardware queue types? I think this is nontrivial and
deserves a comment.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/14] blk-mq: initial support for multiple queue maps
  2018-10-29 20:25           ` Bart Van Assche
@ 2018-10-29 20:29             ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-29 20:29 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/29/18 2:25 PM, Bart Van Assche wrote:
> On Mon, 2018-10-29 at 14:09 -0600, Jens Axboe wrote:
>> hctx->type will be set to the value of the first type. This is all driver
>> private, blk-mq could not care less what the value of the type means.
>>
>> As to the other question, it works just fine since that is the queue
>> that is being accessed. There's no confusion there. I think you're
>> misunderstanding how it's set up. To use nvme as the example, type 0
>> would be reads, 1 writes, and 2 pollable queues. If reads and writes
>> share the same set of hardware queues, then type 1 simply doesn't
>> exist in terms of ->flags_to_type() return value. This is purely
>> driven by the driver. That hook is the only decider of where something
>> will go. If we share hctx sets, we share the same hardware queue as
>> well. There is just the one set for that case.
> 
> How about adding a comment in blk-mq.h that explains that hardware queues can
> be shared among different hardware queue types? I think this is nontrivial and
> deserves a comment.

Sure, I can do that. I guess a key concept that is confusing, based on
your question above, is that the sets don't have to be consecutive.
It's perfectly valid, for example, to have 0 and 2 be the available
queue types, and nothing for 1.

BTW, I split up the incremental patch; find the two parts here:

http://git.kernel.dk/cgit/linux-block/commit/?h=mq-maps&id=6890d88deecfd3723ce620d82f5fc80485f9caec

and

http://git.kernel.dk/cgit/linux-block/commit/?h=mq-maps&id=907725dff2f8cc6d1502a9123f930b8d3708bd02

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues
  2018-10-29 19:49       ` Jens Axboe
@ 2018-10-30  8:08         ` Ming Lei
  2018-10-30 17:22           ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Ming Lei @ 2018-10-30  8:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On Mon, Oct 29, 2018 at 01:49:18PM -0600, Jens Axboe wrote:
> On 10/29/18 1:30 PM, Jens Axboe wrote:
> > On 10/29/18 1:27 PM, Bart Van Assche wrote:
> >> On Mon, 2018-10-29 at 10:37 -0600, Jens Axboe wrote:
> >>>  void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >>>  {
> >>>  	struct blk_mq_ctx *this_ctx;
> >>> @@ -1628,7 +1649,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >>>  	struct request *rq;
> >>>  	LIST_HEAD(list);
> >>>  	LIST_HEAD(ctx_list);
> >>> -	unsigned int depth;
> >>> +	unsigned int depth, this_flags;
> >>>  
> >>>  	list_splice_init(&plug->mq_list, &list);
> >>>  
> >>> @@ -1636,13 +1657,14 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >>>  
> >>>  	this_q = NULL;
> >>>  	this_ctx = NULL;
> >>> +	this_flags = 0;
> >>>  	depth = 0;
> >>>  
> >>>  	while (!list_empty(&list)) {
> >>>  		rq = list_entry_rq(list.next);
> >>>  		list_del_init(&rq->queuelist);
> >>>  		BUG_ON(!rq->q);
> >>> -		if (rq->mq_ctx != this_ctx) {
> >>> +		if (!ctx_match(rq, this_ctx, this_flags)) {
> >>>  			if (this_ctx) {
> >>>  				trace_block_unplug(this_q, depth, !from_schedule);
> >>>  				blk_mq_sched_insert_requests(this_q, this_ctx,
> >>> @@ -1650,6 +1672,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >>>  								from_schedule);
> >>>  			}
> >>>  
> >>> +			this_flags = rq->cmd_flags;
> >>>  			this_ctx = rq->mq_ctx;
> >>>  			this_q = rq->q;
> >>>  			depth = 0;
> >>
> >> This patch will cause the function stored in the flags_to_type pointer to be
> >> called 2 * (n - 1) times where n is the number of elements in 'list' when
> >> blk_mq_sched_insert_requests() is called. Have you considered to rearrange
> >> the code such that that number of calls is reduced to n?
> > 
> > One alternative is to improve the sorting, but then we'd need to call
> > it in there instead. My longer term plan is something like the below,
> > which will reduce the number of calls in general, not just for
> > this path. But that is a separate change, should not be mixed
> > with this one.
> 
> Here's an updated one that applies on top of the current tree,
> and also uses this information to sort efficiently. This eliminates
> all this, and also makes the whole thing more clear.
> 
> I'll split this into two patches, just didn't want to include in
> the series just yet.
> 
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 7922dba81497..397985808b75 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -219,7 +219,7 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
>  
>  	/* release the tag's ownership to the req cloned from */
>  	spin_lock_irqsave(&fq->mq_flush_lock, flags);
> -	hctx = blk_mq_map_queue(q, flush_rq->cmd_flags, flush_rq->mq_ctx->cpu);
> +	hctx = flush_rq->mq_hctx;
>  	if (!q->elevator) {
>  		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
>  		flush_rq->tag = -1;
> @@ -307,13 +307,13 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
>  	if (!q->elevator) {
>  		fq->orig_rq = first_rq;
>  		flush_rq->tag = first_rq->tag;
> -		hctx = blk_mq_map_queue(q, first_rq->cmd_flags,
> -					first_rq->mq_ctx->cpu);
> +		hctx = flush_rq->mq_hctx;
>  		blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
>  	} else {
>  		flush_rq->internal_tag = first_rq->internal_tag;
>  	}
>  
> +	flush_rq->mq_hctx = first_rq->mq_hctx;
>  	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
>  	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
>  	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
> @@ -326,13 +326,11 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
>  static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
>  	unsigned long flags;
>  	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
>  
> -	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
> -
>  	if (q->elevator) {
>  		WARN_ON(rq->tag < 0);
>  		blk_mq_put_driver_tag_hctx(hctx, rq);
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index fac70c81b7de..cde19be36135 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -427,10 +427,8 @@ struct show_busy_params {
>  static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
>  {
>  	const struct show_busy_params *params = data;
> -	struct blk_mq_hw_ctx *hctx;
>  
> -	hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
> -	if (hctx == params->hctx)
> +	if (rq->mq_hctx == params->hctx)
>  		__blk_mq_debugfs_rq_show(params->m,
>  					 list_entry_rq(&rq->queuelist));
>  }
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index d232ecf3290c..25c558358255 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -367,9 +367,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
>  	struct request_queue *q = rq->q;
>  	struct elevator_queue *e = q->elevator;
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
> -	struct blk_mq_hw_ctx *hctx;
> -
> -	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  
>  	/* flush rq in flush machinery need to be dispatched directly */
>  	if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
> @@ -399,16 +397,10 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
>  }
>  
>  void blk_mq_sched_insert_requests(struct request_queue *q,
> -				  struct blk_mq_ctx *ctx,
> +				  struct blk_mq_hw_ctx *hctx,
>  				  struct list_head *list, bool run_queue_async)
>  {
> -	struct blk_mq_hw_ctx *hctx;
>  	struct elevator_queue *e;
> -	struct request *rq;
> -
> -	/* For list inserts, requests better be on the same hw queue */
> -	rq = list_first_entry(list, struct request, queuelist);
> -	hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
>  
>  	e = hctx->queue->elevator;
>  	if (e && e->type->ops.mq.insert_requests)
> @@ -424,7 +416,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
>  			if (list_empty(list))
>  				return;
>  		}
> -		blk_mq_insert_requests(hctx, ctx, list);
> +		blk_mq_insert_requests(hctx, list);
>  	}
>  
>  	blk_mq_run_hw_queue(hctx, run_queue_async);
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 8a9544203173..a42547213f58 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -20,7 +20,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
>  void blk_mq_sched_insert_request(struct request *rq, bool at_head,
>  				 bool run_queue, bool async);
>  void blk_mq_sched_insert_requests(struct request_queue *q,
> -				  struct blk_mq_ctx *ctx,
> +				  struct blk_mq_hw_ctx *hctx,
>  				  struct list_head *list, bool run_queue_async);
>  
>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 478a959357f5..fb836d818b80 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -527,14 +527,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>   */
>  u32 blk_mq_unique_tag(struct request *rq)
>  {
> -	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx;
> -	int hwq = 0;
> -
> -	hctx = blk_mq_map_queue(q, rq->cmd_flags, rq->mq_ctx->cpu);
> -	hwq = hctx->queue_num;
> -
> -	return (hwq << BLK_MQ_UNIQUE_TAG_BITS) |
> +	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
>  		(rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
>  }
>  EXPORT_SYMBOL(blk_mq_unique_tag);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 37310cc55733..17ea522bd7c1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -300,6 +300,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  	/* csd/requeue_work/fifo_time is initialized before use */
>  	rq->q = data->q;
>  	rq->mq_ctx = data->ctx;
> +	rq->mq_hctx = data->hctx;
>  	rq->rq_flags = rq_flags;
>  	rq->cpu = -1;
>  	rq->cmd_flags = op;
> @@ -473,10 +474,11 @@ static void __blk_mq_free_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  	const int sched_tag = rq->internal_tag;
>  
>  	blk_pm_mark_last_busy(rq);
> +	rq->mq_hctx = NULL;
>  	if (rq->tag != -1)
>  		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
>  	if (sched_tag != -1)
> @@ -490,7 +492,7 @@ void blk_mq_free_request(struct request *rq)
>  	struct request_queue *q = rq->q;
>  	struct elevator_queue *e = q->elevator;
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->cmd_flags, ctx->cpu);
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  
>  	if (rq->rq_flags & RQF_ELVPRIV) {
>  		if (e && e->type->ops.mq.finish_request)
> @@ -982,7 +984,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
>  {
>  	struct blk_mq_alloc_data data = {
>  		.q = rq->q,
> -		.hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu),
> +		.hctx = rq->mq_hctx,
>  		.flags = BLK_MQ_REQ_NOWAIT,
>  		.cmd_flags = rq->cmd_flags,
>  	};
> @@ -1148,7 +1150,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  
>  		rq = list_first_entry(list, struct request, queuelist);
>  
> -		hctx = blk_mq_map_queue(rq->q, rq->cmd_flags, rq->mq_ctx->cpu);
> +		hctx = rq->mq_hctx;
>  		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
>  			break;
>  
> @@ -1578,9 +1580,7 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   */
>  void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
>  {
> -	struct blk_mq_ctx *ctx = rq->mq_ctx;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, rq->cmd_flags,
> -							ctx->cpu);
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  
>  	spin_lock(&hctx->lock);
>  	list_add_tail(&rq->queuelist, &hctx->dispatch);
> @@ -1590,10 +1590,10 @@ void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
>  		blk_mq_run_hw_queue(hctx, false);
>  }
>  
> -void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> -			    struct list_head *list)
> +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  
>  {
> +	struct blk_mq_ctx *ctx = NULL;
>  	struct request *rq;
>  
>  	/*
> @@ -1601,7 +1601,8 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>  	 * offline now
>  	 */
>  	list_for_each_entry(rq, list, queuelist) {
> -		BUG_ON(rq->mq_ctx != ctx);
> +		BUG_ON(ctx && rq->mq_ctx != ctx);
> +		ctx = rq->mq_ctx;
>  		trace_block_rq_insert(hctx->queue, rq);
>  	}
>  
> @@ -1611,84 +1612,61 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>  	spin_unlock(&ctx->lock);
>  }
>  
> -static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
> +static int plug_hctx_cmp(void *priv, struct list_head *a, struct list_head *b)
>  {
>  	struct request *rqa = container_of(a, struct request, queuelist);
>  	struct request *rqb = container_of(b, struct request, queuelist);
>  
> -	return !(rqa->mq_ctx < rqb->mq_ctx ||
> -		 (rqa->mq_ctx == rqb->mq_ctx &&
> +	return !(rqa->mq_hctx < rqb->mq_hctx ||
> +		 (rqa->mq_hctx == rqb->mq_hctx &&
>  		  blk_rq_pos(rqa) < blk_rq_pos(rqb)));
>  }
>  
> -/*
> - * Need to ensure that the hardware queue matches, so we don't submit
> - * a list of requests that end up on different hardware queues.
> - */
> -static bool ctx_match(struct request *req, struct blk_mq_ctx *ctx,
> -		      unsigned int flags)
> -{
> -	if (req->mq_ctx != ctx)
> -		return false;
> -
> -	/*
> -	 * If we just have one map, then we know the hctx will match
> -	 * if the ctx matches
> -	 */
> -	if (req->q->tag_set->nr_maps == 1)
> -		return true;
> -
> -	return blk_mq_map_queue(req->q, req->cmd_flags, ctx->cpu) ==
> -		blk_mq_map_queue(req->q, flags, ctx->cpu);
> -}
> -
>  void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  {
> -	struct blk_mq_ctx *this_ctx;
> +	struct blk_mq_hw_ctx *this_hctx;
>  	struct request_queue *this_q;
>  	struct request *rq;
>  	LIST_HEAD(list);
> -	LIST_HEAD(ctx_list);
> -	unsigned int depth, this_flags;
> +	LIST_HEAD(hctx_list);
> +	unsigned int depth;
>  
>  	list_splice_init(&plug->mq_list, &list);
>  
> -	list_sort(NULL, &list, plug_ctx_cmp);
> +	list_sort(NULL, &list, plug_hctx_cmp);
>  
>  	this_q = NULL;
> -	this_ctx = NULL;
> -	this_flags = 0;
> +	this_hctx = NULL;
>  	depth = 0;
>  
>  	while (!list_empty(&list)) {
>  		rq = list_entry_rq(list.next);
>  		list_del_init(&rq->queuelist);
>  		BUG_ON(!rq->q);
> -		if (!ctx_match(rq, this_ctx, this_flags)) {
> -			if (this_ctx) {
> +		if (rq->mq_hctx != this_hctx) {
> +			if (this_hctx) {
>  				trace_block_unplug(this_q, depth, !from_schedule);
> -				blk_mq_sched_insert_requests(this_q, this_ctx,
> -								&ctx_list,
> +				blk_mq_sched_insert_requests(this_q, this_hctx,
> +								&hctx_list,
>  								from_schedule);
>  			}

Requests can be added to the plug list from different ctxs because of
preemption. However, blk_mq_sched_insert_requests() requires that
all requests in 'hctx_list' belong to the same ctx.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-29 16:37 ` [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs Jens Axboe
  2018-10-29 17:08   ` Thomas Gleixner
@ 2018-10-30  9:25   ` Ming Lei
  2018-10-30 14:26   ` Keith Busch
  2 siblings, 0 replies; 63+ messages in thread
From: Ming Lei @ 2018-10-30  9:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
> A driver may have a need to allocate multiple sets of MSI/MSI-X
> interrupts, and have them appropriately affinitized. Add support for
> defining a number of sets in the irq_affinity structure, of varying
> sizes, and get each set affinitized correctly across the machine.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/interrupt.h |  4 ++++
>  kernel/irq/affinity.c     | 40 ++++++++++++++++++++++++++++++---------
>  2 files changed, 35 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 1d6711c28271..ca397ff40836 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -247,10 +247,14 @@ struct irq_affinity_notify {
>   *			the MSI(-X) vector space
>   * @post_vectors:	Don't apply affinity to @post_vectors at end of
>   *			the MSI(-X) vector space
> + * @nr_sets:		Length of passed in *sets array
> + * @sets:		Number of affinitized sets
>   */
>  struct irq_affinity {
>  	int	pre_vectors;
>  	int	post_vectors;
> +	int	nr_sets;
> +	int	*sets;
>  };
>  
>  #if defined(CONFIG_SMP)
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index f4f29b9d90ee..2046a0f0f0f1 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	int curvec, usedvecs;
>  	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
>  	struct cpumask *masks = NULL;
> +	int i, nr_sets;
>  
>  	/*
>  	 * If there aren't any vectors left after applying the pre/post
> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	get_online_cpus();
>  	build_node_to_cpumask(node_to_cpumask);
>  
> -	/* Spread on present CPUs starting from affd->pre_vectors */
> -	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> -					    node_to_cpumask, cpu_present_mask,
> -					    nmsk, masks);
> +	/*
> +	 * Spread on present CPUs starting from affd->pre_vectors. If we
> +	 * have multiple sets, build each sets affinity mask separately.
> +	 */
> +	nr_sets = affd->nr_sets;
> +	if (!nr_sets)
> +		nr_sets = 1;
> +
> +	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> +		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> +		int nr;
> +
> +		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> +					      node_to_cpumask, cpu_present_mask,
> +					      nmsk, masks + usedvecs);
> +		usedvecs += nr;
> +	}
>  
>  	/*
>  	 * Spread on non present CPUs starting from the next vector to be
> @@ -258,13 +272,21 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>  {
>  	int resv = affd->pre_vectors + affd->post_vectors;
>  	int vecs = maxvec - resv;
> -	int ret;
> +	int set_vecs;
>  
>  	if (resv > minvec)
>  		return 0;
>  
> -	get_online_cpus();
> -	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
> -	put_online_cpus();
> -	return ret;
> +	if (affd->nr_sets) {
> +		int i;
> +
> +		for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
> +			set_vecs += affd->sets[i];
> +	} else {
> +		get_online_cpus();
> +		set_vecs = cpumask_weight(cpu_possible_mask);
> +		put_online_cpus();
> +	}
> +
> +	return resv + min(set_vecs, vecs);
>  }
> -- 
> 2.17.1
> 

Looks fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

-- 
Ming

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-29 16:37 ` [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs Jens Axboe
  2018-10-29 17:08   ` Thomas Gleixner
  2018-10-30  9:25   ` Ming Lei
@ 2018-10-30 14:26   ` Keith Busch
  2018-10-30 14:36     ` Jens Axboe
  2 siblings, 1 reply; 63+ messages in thread
From: Keith Busch @ 2018-10-30 14:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index f4f29b9d90ee..2046a0f0f0f1 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	int curvec, usedvecs;
>  	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
>  	struct cpumask *masks = NULL;
> +	int i, nr_sets;
>  
>  	/*
>  	 * If there aren't any vectors left after applying the pre/post
> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	get_online_cpus();
>  	build_node_to_cpumask(node_to_cpumask);
>  
> -	/* Spread on present CPUs starting from affd->pre_vectors */
> -	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> -					    node_to_cpumask, cpu_present_mask,
> -					    nmsk, masks);
> +	/*
> +	 * Spread on present CPUs starting from affd->pre_vectors. If we
> +	 * have multiple sets, build each sets affinity mask separately.
> +	 */
> +	nr_sets = affd->nr_sets;
> +	if (!nr_sets)
> +		nr_sets = 1;
> +
> +	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> +		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> +		int nr;
> +
> +		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> +					      node_to_cpumask, cpu_present_mask,
> +					      nmsk, masks + usedvecs);
> +		usedvecs += nr;
> +	}


While the code below returns the appropriate number of possible vectors
when a set requested too many, the above code is still using the value
from the set, which may exceed 'nvecs' used to kcalloc 'masks', so
'masks + usedvecs' may go out of bounds.
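
A minimal sketch of the kind of bounds check this implies, using the variable
names from the hunk above and assuming 'affvecs' (nvecs minus the pre/post
vectors) is the number of 'masks' entries available for spreading; the series
ends up enforcing the invariant at the caller/PCI layer instead, so this is
illustration only:

	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
		int nr;

		/* never spread past the entries allocated for 'masks' */
		if (this_vecs > affvecs - usedvecs)
			this_vecs = affvecs - usedvecs;
		if (!this_vecs)
			break;

		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
					      node_to_cpumask, cpu_present_mask,
					      nmsk, masks + usedvecs);
		usedvecs += nr;
	}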

  
>  	/*
>  	 * Spread on non present CPUs starting from the next vector to be
> @@ -258,13 +272,21 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>  {
>  	int resv = affd->pre_vectors + affd->post_vectors;
>  	int vecs = maxvec - resv;
> -	int ret;
> +	int set_vecs;
>  
>  	if (resv > minvec)
>  		return 0;
>  
> -	get_online_cpus();
> -	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
> -	put_online_cpus();
> -	return ret;
> +	if (affd->nr_sets) {
> +		int i;
> +
> +		for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
> +			set_vecs += affd->sets[i];
> +	} else {
> +		get_online_cpus();
> +		set_vecs = cpumask_weight(cpu_possible_mask);
> +		put_online_cpus();
> +	}
> +
> +	return resv + min(set_vecs, vecs);
>  }



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 14:26   ` Keith Busch
@ 2018-10-30 14:36     ` Jens Axboe
  2018-10-30 14:45       ` Keith Busch
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 14:36 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On 10/30/18 8:26 AM, Keith Busch wrote:
> On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
>> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
>> index f4f29b9d90ee..2046a0f0f0f1 100644
>> --- a/kernel/irq/affinity.c
>> +++ b/kernel/irq/affinity.c
>> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>  	int curvec, usedvecs;
>>  	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
>>  	struct cpumask *masks = NULL;
>> +	int i, nr_sets;
>>  
>>  	/*
>>  	 * If there aren't any vectors left after applying the pre/post
>> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>  	get_online_cpus();
>>  	build_node_to_cpumask(node_to_cpumask);
>>  
>> -	/* Spread on present CPUs starting from affd->pre_vectors */
>> -	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
>> -					    node_to_cpumask, cpu_present_mask,
>> -					    nmsk, masks);
>> +	/*
>> +	 * Spread on present CPUs starting from affd->pre_vectors. If we
>> +	 * have multiple sets, build each sets affinity mask separately.
>> +	 */
>> +	nr_sets = affd->nr_sets;
>> +	if (!nr_sets)
>> +		nr_sets = 1;
>> +
>> +	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
>> +		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
>> +		int nr;
>> +
>> +		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
>> +					      node_to_cpumask, cpu_present_mask,
>> +					      nmsk, masks + usedvecs);
>> +		usedvecs += nr;
>> +	}
> 
> 
> While the code below returns the appropriate number of possible vectors
> when a set requested too many, the above code is still using the value
> from the set, which may exceed 'nvecs' used to kcalloc 'masks', so
> 'masks + usedvecs' may go out of bounds.

How so? nvecs must be the max number of vecs; the sum of the sets can't
exceed that value.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 14:36     ` Jens Axboe
@ 2018-10-30 14:45       ` Keith Busch
  2018-10-30 14:53         ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Keith Busch @ 2018-10-30 14:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On Tue, Oct 30, 2018 at 08:36:35AM -0600, Jens Axboe wrote:
> On 10/30/18 8:26 AM, Keith Busch wrote:
> > On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
> >> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> >> index f4f29b9d90ee..2046a0f0f0f1 100644
> >> --- a/kernel/irq/affinity.c
> >> +++ b/kernel/irq/affinity.c
> >> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
> >>  	int curvec, usedvecs;
> >>  	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
> >>  	struct cpumask *masks = NULL;
> >> +	int i, nr_sets;
> >>  
> >>  	/*
> >>  	 * If there aren't any vectors left after applying the pre/post
> >> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
> >>  	get_online_cpus();
> >>  	build_node_to_cpumask(node_to_cpumask);
> >>  
> >> -	/* Spread on present CPUs starting from affd->pre_vectors */
> >> -	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> >> -					    node_to_cpumask, cpu_present_mask,
> >> -					    nmsk, masks);
> >> +	/*
> >> +	 * Spread on present CPUs starting from affd->pre_vectors. If we
> >> +	 * have multiple sets, build each sets affinity mask separately.
> >> +	 */
> >> +	nr_sets = affd->nr_sets;
> >> +	if (!nr_sets)
> >> +		nr_sets = 1;
> >> +
> >> +	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> >> +		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> >> +		int nr;
> >> +
> >> +		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> >> +					      node_to_cpumask, cpu_present_mask,
> >> +					      nmsk, masks + usedvecs);
> >> +		usedvecs += nr;
> >> +	}
> > 
> > 
> > While the code below returns the appropriate number of possible vectors
> > when a set requested too many, the above code is still using the value
> > from the set, which may exceed 'nvecs' used to kcalloc 'masks', so
> > 'masks + usedvecs' may go out of bounds.
> 
> How so? nvecs must the max number of vecs, the sum of the sets can't
> exceed that value.

'nvecs' is what irq_calc_affinity_vectors() returns, which is the min
of either the requested max or the sum of the set, and the sum of the set
isn't guaranteed to be the smaller value.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 14:45       ` Keith Busch
@ 2018-10-30 14:53         ` Jens Axboe
  2018-10-30 15:08           ` Keith Busch
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 14:53 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On 10/30/18 8:45 AM, Keith Busch wrote:
> On Tue, Oct 30, 2018 at 08:36:35AM -0600, Jens Axboe wrote:
>> On 10/30/18 8:26 AM, Keith Busch wrote:
>>> On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
>>>> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
>>>> index f4f29b9d90ee..2046a0f0f0f1 100644
>>>> --- a/kernel/irq/affinity.c
>>>> +++ b/kernel/irq/affinity.c
>>>> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>>>  	int curvec, usedvecs;
>>>>  	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
>>>>  	struct cpumask *masks = NULL;
>>>> +	int i, nr_sets;
>>>>  
>>>>  	/*
>>>>  	 * If there aren't any vectors left after applying the pre/post
>>>> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>>>  	get_online_cpus();
>>>>  	build_node_to_cpumask(node_to_cpumask);
>>>>  
>>>> -	/* Spread on present CPUs starting from affd->pre_vectors */
>>>> -	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
>>>> -					    node_to_cpumask, cpu_present_mask,
>>>> -					    nmsk, masks);
>>>> +	/*
>>>> +	 * Spread on present CPUs starting from affd->pre_vectors. If we
>>>> +	 * have multiple sets, build each sets affinity mask separately.
>>>> +	 */
>>>> +	nr_sets = affd->nr_sets;
>>>> +	if (!nr_sets)
>>>> +		nr_sets = 1;
>>>> +
>>>> +	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
>>>> +		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
>>>> +		int nr;
>>>> +
>>>> +		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
>>>> +					      node_to_cpumask, cpu_present_mask,
>>>> +					      nmsk, masks + usedvecs);
>>>> +		usedvecs += nr;
>>>> +	}
>>>
>>>
>>> While the code below returns the appropriate number of possible vectors
>>> when a set requested too many, the above code is still using the value
>>> from the set, which may exceed 'nvecs' used to kcalloc 'masks', so
>>> 'masks + usedvecs' may go out of bounds.
>>
>> How so? nvecs must the max number of vecs, the sum of the sets can't
>> exceed that value.
> 
> 'nvecs' is what irq_calc_affinity_vectors() returns, which is the min
> of either the requested max or the sum of the set, and the sum of the set
> isn't guaranteed to be the smaller value.

The sum of the sets can't exceed the nvecs passed in; the sum should be
less than or equal to nvecs. Granted this isn't enforced, and perhaps
that should be the case.
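
A minimal sketch of what enforcing it could look like inside
irq_create_affinity_masks(), next to where affvecs is computed; the series
ultimately enforces minvec == maxvec in the PCI MSI code instead, so this is
only an illustration:

	if (affd->nr_sets) {
		int i, sum = 0;

		for (i = 0; i < affd->nr_sets; i++)
			sum += affd->sets[i];
		/* the sets must fit in the vectors being spread */
		if (WARN_ON_ONCE(sum > affvecs))
			return NULL;
	}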

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 14:53         ` Jens Axboe
@ 2018-10-30 15:08           ` Keith Busch
  2018-10-30 15:18             ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Keith Busch @ 2018-10-30 15:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
> should be the less than or equal to nvecs. Granted this isn't enforced,
> and perhaps that should be the case.

That should at least initially be true for a properly functioning
driver. It's not enforced, as you mentioned, but that's only
tangentially related to the issue I'm referring to.

The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
and max_vecs, but a range of allowable vector allocations doesn't make
sense when using sets.
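
A minimal sketch of the calling convention this argues for, mirroring the nvme
change later in the thread: pass an exact count (minvec == maxvec) and shrink
the sets on -ENOSPC. driver_calc_sets(), nr_vecs and max_vecs_wanted are
illustrative names, not existing API:

	struct irq_affinity affd = { .pre_vectors = 1 };
	int irq_sets[2], ret, nr_vecs = max_vecs_wanted;

	do {
		driver_calc_sets(dev, nr_vecs, irq_sets);	/* hypothetical helper */
		affd.nr_sets = ARRAY_SIZE(irq_sets);
		affd.sets = irq_sets;

		/* exact count; any shortfall comes back as -ENOSPC */
		ret = pci_alloc_irq_vectors_affinity(pdev, nr_vecs, nr_vecs,
				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
		if (ret == -ENOSPC && --nr_vecs > 0)
			continue;
		break;
	} while (1);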

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 15:08           ` Keith Busch
@ 2018-10-30 15:18             ` Jens Axboe
  2018-10-30 16:02               ` Keith Busch
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 15:18 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On 10/30/18 9:08 AM, Keith Busch wrote:
> On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
>> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
>> should be the less than or equal to nvecs. Granted this isn't enforced,
>> and perhaps that should be the case.
> 
> That should at least initially be true for a proper functioning
> driver. It's not enforced as you mentioned, but that's only related to
> the issue I'm referring to.
> 
> The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
> and max_vecs, but a range of allowable vector allocations doesn't make
> sense when using sets.

I feel like we're going in circles here; I'm not sure what you think the
issue is now. The range is fine: whoever uses sets will need to adjust
their sets based on what pci_alloc_irq_vectors_affinity() returns, if it
didn't return the desired max that was passed in.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 15:18             ` Jens Axboe
@ 2018-10-30 16:02               ` Keith Busch
  2018-10-30 16:42                 ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Keith Busch @ 2018-10-30 16:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On Tue, Oct 30, 2018 at 09:18:05AM -0600, Jens Axboe wrote:
> On 10/30/18 9:08 AM, Keith Busch wrote:
> > On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
> >> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
> >> should be the less than or equal to nvecs. Granted this isn't enforced,
> >> and perhaps that should be the case.
> > 
> > That should at least initially be true for a proper functioning
> > driver. It's not enforced as you mentioned, but that's only related to
> > the issue I'm referring to.
> > 
> > The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
> > and max_vecs, but a range of allowable vector allocations doesn't make
> > sense when using sets.
> 
> I feel like we're going in circles here, not sure what you feel the
> issue is now? The range is fine, whoever uses sets will need to adjust
> their sets based on what pci_alloc_irq_vectors_affinity() returns,
> if it didn't return the passed in desired max.

Sorry, let me try again.

pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
that doesn't work, it will iterate down to min_vecs without returning to
the caller. The caller doesn't have a chance to adjust its sets between
iterations when you provide a range.

The 'masks' overrun problem happens if the caller provides min_vecs
as a smaller value than the sum of the set (plus any reserved).

If it's up to the caller to ensure that doesn't happen, then min and
max must both be the same value, and that value must also be the same as
the set sum + reserved vectors. The range just becomes redundant since
it is already bounded by the set.

Using the nvme example, it would need something like this to prevent the
'masks' overrun:

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a8747b956e43..625eff570eaa 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2120,7 +2120,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * setting up the full range we need.
 	 */
 	pci_free_irq_vectors(pdev);
-	result = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues,
+	result = pci_alloc_irq_vectors_affinity(pdev, nr_io_queues, nr_io_queues,
 			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
 	if (result <= 0)
 		return -EIO;
--

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 16:02               ` Keith Busch
@ 2018-10-30 16:42                 ` Jens Axboe
  2018-10-30 17:09                   ` Jens Axboe
  2018-10-30 17:25                   ` Thomas Gleixner
  0 siblings, 2 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 16:42 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On 10/30/18 10:02 AM, Keith Busch wrote:
> On Tue, Oct 30, 2018 at 09:18:05AM -0600, Jens Axboe wrote:
>> On 10/30/18 9:08 AM, Keith Busch wrote:
>>> On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
>>>> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
>>>> should be the less than or equal to nvecs. Granted this isn't enforced,
>>>> and perhaps that should be the case.
>>>
>>> That should at least initially be true for a proper functioning
>>> driver. It's not enforced as you mentioned, but that's only related to
>>> the issue I'm referring to.
>>>
>>> The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
>>> and max_vecs, but a range of allowable vector allocations doesn't make
>>> sense when using sets.
>>
>> I feel like we're going in circles here, not sure what you feel the
>> issue is now? The range is fine, whoever uses sets will need to adjust
>> their sets based on what pci_alloc_irq_vectors_affinity() returns,
>> if it didn't return the passed in desired max.
> 
> Sorry, let me to try again.
> 
> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
> that doesn't work, it will iterate down to min_vecs without returning to
> the caller. The caller doesn't have a chance to adjust its sets between
> iterations when you provide a range.
> 
> The 'masks' overrun problem happens if the caller provides min_vecs
> as a smaller value than the sum of the set (plus any reserved).
> 
> If it's up to the caller to ensure that doesn't happen, then min and
> max must both be the same value, and that value must also be the same as
> the set sum + reserved vectors. The range just becomes redundant since
> it is already bounded by the set.
> 
> Using the nvme example, it would need something like this to prevent the
> 'masks' overrun:

OK, now I hear what you are saying. And you are right, the caller needs
to provide minvec == maxvec for sets, and then have a loop around that
to adjust as needed.

I'll make that change in nvme.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 16:42                 ` Jens Axboe
@ 2018-10-30 17:09                   ` Jens Axboe
  2018-10-30 17:22                     ` Keith Busch
  2018-10-30 17:25                   ` Thomas Gleixner
  1 sibling, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 17:09 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On 10/30/18 10:42 AM, Jens Axboe wrote:
> On 10/30/18 10:02 AM, Keith Busch wrote:
>> On Tue, Oct 30, 2018 at 09:18:05AM -0600, Jens Axboe wrote:
>>> On 10/30/18 9:08 AM, Keith Busch wrote:
>>>> On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
>>>>> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
>>>>> should be the less than or equal to nvecs. Granted this isn't enforced,
>>>>> and perhaps that should be the case.
>>>>
>>>> That should at least initially be true for a proper functioning
>>>> driver. It's not enforced as you mentioned, but that's only related to
>>>> the issue I'm referring to.
>>>>
>>>> The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
>>>> and max_vecs, but a range of allowable vector allocations doesn't make
>>>> sense when using sets.
>>>
>>> I feel like we're going in circles here, not sure what you feel the
>>> issue is now? The range is fine, whoever uses sets will need to adjust
>>> their sets based on what pci_alloc_irq_vectors_affinity() returns,
>>> if it didn't return the passed in desired max.
>>
>> Sorry, let me to try again.
>>
>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
>> that doesn't work, it will iterate down to min_vecs without returning to
>> the caller. The caller doesn't have a chance to adjust its sets between
>> iterations when you provide a range.
>>
>> The 'masks' overrun problem happens if the caller provides min_vecs
>> as a smaller value than the sum of the set (plus any reserved).
>>
>> If it's up to the caller to ensure that doesn't happen, then min and
>> max must both be the same value, and that value must also be the same as
>> the set sum + reserved vectors. The range just becomes redundant since
>> it is already bounded by the set.
>>
>> Using the nvme example, it would need something like this to prevent the
>> 'masks' overrun:
> 
> OK, now I hear what you are saying. And you are right, the callers needs
> to provide minvec == maxvec for sets, and then have a loop around that
> to adjust as needed.
> 
> I'll make that change in nvme.

Pretty trivial, below. This also keeps the queue mapping calculations
more clean, as we don't have to do one after we're done allocating
IRQs.


commit e8a35d023a192e34540c60f779fe755970b8eeb2
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Oct 30 11:06:29 2018 -0600

    nvme: utilize two queue maps, one for reads and one for writes
    
    NVMe does round-robin between queues by default, which means that
    sharing a queue map for both reads and writes can be problematic
    in terms of read servicing. It's much easier to flood the queue
    with writes and reduce the read servicing.
    
    Implement two queue maps, one for reads and one for writes. The
    write queue count is configurable through the 'write_queues'
    parameter.
    
    By default, we retain the previous behavior of having a single
    queue set, shared between reads and writes. Setting 'write_queues'
    to a non-zero value will create two queue sets, one for reads and
    one for writes, the latter using the configurable number of
    queues (hardware queue counts permitting).
    
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e5d783cb6937..17170686105f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -74,11 +74,29 @@ static int io_queue_depth = 1024;
 module_param_cb(io_queue_depth, &io_queue_depth_ops, &io_queue_depth, 0644);
 MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2");
 
+static int queue_count_set(const char *val, const struct kernel_param *kp);
+static const struct kernel_param_ops queue_count_ops = {
+	.set = queue_count_set,
+	.get = param_get_int,
+};
+
+static int write_queues;
+module_param_cb(write_queues, &queue_count_ops, &write_queues, 0644);
+MODULE_PARM_DESC(write_queues,
+	"Number of queues to use for writes. If not set, reads and writes "
+	"will share a queue set.");
+
 struct nvme_dev;
 struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 
+enum {
+	NVMEQ_TYPE_READ,
+	NVMEQ_TYPE_WRITE,
+	NVMEQ_TYPE_NR,
+};
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
@@ -92,6 +110,7 @@ struct nvme_dev {
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
 	unsigned max_qid;
+	unsigned io_queues[NVMEQ_TYPE_NR];
 	unsigned int num_vecs;
 	int q_depth;
 	u32 db_stride;
@@ -134,6 +153,17 @@ static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
 	return param_set_int(val, kp);
 }
 
+static int queue_count_set(const char *val, const struct kernel_param *kp)
+{
+	int n = 0, ret;
+
+	ret = kstrtoint(val, 10, &n);
+	if (n > num_possible_cpus())
+		n = num_possible_cpus();
+
+	return param_set_int(val, kp);
+}
+
 static inline unsigned int sq_idx(unsigned int qid, u32 stride)
 {
 	return qid * 2 * stride;
@@ -218,9 +248,20 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
 }
 
+static unsigned int max_io_queues(void)
+{
+	return num_possible_cpus() + write_queues;
+}
+
+static unsigned int max_queue_count(void)
+{
+	/* IO queues + admin queue */
+	return 1 + max_io_queues();
+}
+
 static inline unsigned int nvme_dbbuf_size(u32 stride)
 {
-	return ((num_possible_cpus() + 1) * 8 * stride);
+	return (max_queue_count() * 8 * stride);
 }
 
 static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
@@ -431,12 +472,41 @@ static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req,
 	return 0;
 }
 
+static int queue_irq_offset(struct nvme_dev *dev)
+{
+	/* if we have more than 1 vec, admin queue offsets us 1 */
+	if (dev->num_vecs > 1)
+		return 1;
+
+	return 0;
+}
+
 static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_dev *dev = set->driver_data;
+	int i, qoff, offset;
+
+	offset = queue_irq_offset(dev);
+	for (i = 0, qoff = 0; i < set->nr_maps; i++) {
+		struct blk_mq_queue_map *map = &set->map[i];
 
-	return blk_mq_pci_map_queues(&set->map[0], to_pci_dev(dev->dev),
-			dev->num_vecs > 1 ? 1 /* admin queue */ : 0);
+		map->nr_queues = dev->io_queues[i];
+		if (!map->nr_queues) {
+			BUG_ON(i == NVMEQ_TYPE_READ);
+
+			/* shared set, reuse read set parameters */
+			map->nr_queues = dev->io_queues[NVMEQ_TYPE_READ];
+			qoff = 0;
+			offset = queue_irq_offset(dev);
+		}
+
+		map->queue_offset = qoff;
+		blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+		qoff += map->nr_queues;
+		offset += map->nr_queues;
+	}
+
+	return 0;
 }
 
 /**
@@ -849,6 +919,14 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+static int nvme_flags_to_type(struct request_queue *q, unsigned int flags)
+{
+	if ((flags & REQ_OP_MASK) == REQ_OP_READ)
+		return NVMEQ_TYPE_READ;
+
+	return NVMEQ_TYPE_WRITE;
+}
+
 static void nvme_pci_complete_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1476,6 +1554,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
+	.flags_to_type	= nvme_flags_to_type,
 	.complete	= nvme_pci_complete_rq,
 	.init_hctx	= nvme_init_hctx,
 	.init_request	= nvme_init_request,
@@ -1888,18 +1967,53 @@ static int nvme_setup_host_mem(struct nvme_dev *dev)
 	return ret;
 }
 
+static void nvme_calc_io_queues(struct nvme_dev *dev, unsigned int nr_io_queues)
+{
+	unsigned int this_w_queues = write_queues;
+
+	/*
+	 * Setup read/write queue split
+	 */
+	if (nr_io_queues == 1) {
+		dev->io_queues[NVMEQ_TYPE_READ] = 1;
+		dev->io_queues[NVMEQ_TYPE_WRITE] = 0;
+		return;
+	}
+
+	/*
+	 * If 'write_queues' is set, ensure it leaves room for at least
+	 * one read queue
+	 */
+	if (this_w_queues >= nr_io_queues)
+		this_w_queues = nr_io_queues - 1;
+
+	/*
+	 * If 'write_queues' is set to zero, reads and writes will share
+	 * a queue set.
+	 */
+	if (!this_w_queues) {
+		dev->io_queues[NVMEQ_TYPE_WRITE] = 0;
+		dev->io_queues[NVMEQ_TYPE_READ] = nr_io_queues;
+	} else {
+		dev->io_queues[NVMEQ_TYPE_WRITE] = this_w_queues;
+		dev->io_queues[NVMEQ_TYPE_READ] = nr_io_queues - this_w_queues;
+	}
+}
+
 static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
 	struct nvme_queue *adminq = &dev->queues[0];
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int result, nr_io_queues;
 	unsigned long size;
-
+	int irq_sets[2];
 	struct irq_affinity affd = {
-		.pre_vectors = 1
+		.pre_vectors = 1,
+		.nr_sets = ARRAY_SIZE(irq_sets),
+		.sets = irq_sets,
 	};
 
-	nr_io_queues = num_possible_cpus();
+	nr_io_queues = max_io_queues();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
 	if (result < 0)
 		return result;
@@ -1934,13 +2048,48 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * setting up the full range we need.
 	 */
 	pci_free_irq_vectors(pdev);
-	result = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
-			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
-	if (result <= 0)
-		return -EIO;
+
+	/*
+	 * For irq sets, we have to ask for minvec == maxvec. This passes
+	 * any reduction back to us, so we can adjust our queue counts and
+	 * IRQ vector needs.
+	 */
+	do {
+		nvme_calc_io_queues(dev, nr_io_queues);
+		irq_sets[0] = dev->io_queues[NVMEQ_TYPE_READ];
+		irq_sets[1] = dev->io_queues[NVMEQ_TYPE_WRITE];
+		if (!irq_sets[1])
+			affd.nr_sets = 1;
+
+		/*
+		 * Need IRQs for read+write queues, and one for the admin queue
+		 */
+		nr_io_queues = irq_sets[0] + irq_sets[1] + 1;
+
+		result = pci_alloc_irq_vectors_affinity(pdev, nr_io_queues,
+				nr_io_queues,
+				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
+
+		/*
+		 * Need to reduce our vec counts
+		 */
+		if (result == -ENOSPC) {
+			nr_io_queues--;
+			if (!nr_io_queues)
+				return result;
+			continue;
+		} else if (result <= 0)
+			return -EIO;
+		break;
+	} while (1);
+
 	dev->num_vecs = result;
 	dev->max_qid = max(result - 1, 1);
 
+	dev_info(dev->ctrl.device, "%d/%d read/write queues\n",
+					dev->io_queues[NVMEQ_TYPE_READ],
+					dev->io_queues[NVMEQ_TYPE_WRITE]);
+
 	/*
 	 * Should investigate if there's a performance win from allocating
 	 * more queues than interrupt vectors; it might allow the submission
@@ -2042,6 +2191,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	if (!dev->ctrl.tagset) {
 		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
+		dev->tagset.nr_maps = NVMEQ_TYPE_NR;
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
 		dev->tagset.numa_node = dev_to_node(dev->dev);
 		dev->tagset.queue_depth =
@@ -2489,8 +2639,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (!dev)
 		return -ENOMEM;
 
-	dev->queues = kcalloc_node(num_possible_cpus() + 1,
-			sizeof(struct nvme_queue), GFP_KERNEL, node);
+	dev->queues = kcalloc_node(max_queue_count(), sizeof(struct nvme_queue),
+					GFP_KERNEL, node);
 	if (!dev->queues)
 		goto free;
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 17:09                   ` Jens Axboe
@ 2018-10-30 17:22                     ` Keith Busch
  2018-10-30 17:33                       ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Keith Busch @ 2018-10-30 17:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On Tue, Oct 30, 2018 at 11:09:04AM -0600, Jens Axboe wrote:
> Pretty trivial, below. This also keeps the queue mapping calculations
> more clean, as we don't have to do one after we're done allocating
> IRQs.

Yep, this addresses my concern. It's less efficient than PCI, since PCI
can usually jump straight to a valid vector count in a single iteration
whereas this only subtracts by 1. I really can't be bothered to care
about optimizing that, so this works for me! :)
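
For reference, a sketch of the shortcut being alluded to, as a drop-in for the
-ENOSPC branch of the loop in the patch above; it ignores -ENOSPC causes other
than MSI-X table size, so it is an illustration rather than a proposed change:

	if (result == -ENOSPC) {
		int supported = pci_msix_vec_count(pdev);

		if (supported > 0 && supported < nr_io_queues)
			nr_io_queues = supported;	/* jump straight there */
		else
			nr_io_queues--;			/* fall back to -1 steps */
		if (!nr_io_queues)
			return result;
		continue;
	}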

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues
  2018-10-30  8:08         ` Ming Lei
@ 2018-10-30 17:22           ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 17:22 UTC (permalink / raw)
  To: Ming Lei; +Cc: Bart Van Assche, linux-block, linux-scsi, linux-kernel

On 10/30/18 2:08 AM, Ming Lei wrote:
> Requests can be added to plug list from different ctx because of
> preemption. However, blk_mq_sched_insert_requests() requires that
> all requests in 'hctx_list' belong to same ctx. 

Yeah, I tried to get around it, but I think I'll just respin and keep
the 'ctx' argument to keep that perfectly clear. It'll work just fine
with different ctxs, but they will end up on a non-matching ctx, which
isn't ideal.
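
A rough sketch of what keeping the 'ctx' argument means for the plug flush
loop: close the current batch whenever either the hctx or the ctx changes, so
each list handed to blk_mq_sched_insert_requests() is uniform in both.
rq_to_hctx() and the exact insert signature shown here are illustrative, not
the respun code:

	struct request *rq, *next;
	struct blk_mq_hw_ctx *this_hctx = NULL;
	struct blk_mq_ctx *this_ctx = NULL;
	struct request_queue *this_q = NULL;
	LIST_HEAD(hctx_list);

	list_for_each_entry_safe(rq, next, &plug_list, queuelist) {
		if (this_hctx &&
		    (rq_to_hctx(rq) != this_hctx || rq->mq_ctx != this_ctx)) {
			/* batch is uniform in both hctx and ctx; insert it */
			blk_mq_sched_insert_requests(this_q, this_hctx, this_ctx,
						     &hctx_list, from_schedule);
			INIT_LIST_HEAD(&hctx_list);
		}
		this_q = rq->q;
		this_hctx = rq_to_hctx(rq);	/* hypothetical mapping helper */
		this_ctx = rq->mq_ctx;
		list_move_tail(&rq->queuelist, &hctx_list);
	}
	if (this_hctx)
		blk_mq_sched_insert_requests(this_q, this_hctx, this_ctx,
					     &hctx_list, from_schedule);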

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 16:42                 ` Jens Axboe
  2018-10-30 17:09                   ` Jens Axboe
@ 2018-10-30 17:25                   ` Thomas Gleixner
  2018-10-30 17:34                     ` Jens Axboe
  1 sibling, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2018-10-30 17:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, linux-block, linux-scsi, linux-kernel

Jens,

On Tue, 30 Oct 2018, Jens Axboe wrote:
> On 10/30/18 10:02 AM, Keith Busch wrote:
> > pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
> > that doesn't work, it will iterate down to min_vecs without returning to
> > the caller. The caller doesn't have a chance to adjust its sets between
> > iterations when you provide a range.
> > 
> > The 'masks' overrun problem happens if the caller provides min_vecs
> > as a smaller value than the sum of the set (plus any reserved).
> > 
> > If it's up to the caller to ensure that doesn't happen, then min and
> > max must both be the same value, and that value must also be the same as
> > the set sum + reserved vectors. The range just becomes redundant since
> > it is already bounded by the set.
> > 
> > Using the nvme example, it would need something like this to prevent the
> > 'masks' overrun:
> 
> OK, now I hear what you are saying. And you are right, the callers needs
> to provide minvec == maxvec for sets, and then have a loop around that
> to adjust as needed.

But then we should enforce it in the core code, right?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 17:22                     ` Keith Busch
@ 2018-10-30 17:33                       ` Jens Axboe
  2018-10-30 17:35                         ` Keith Busch
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 17:33 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On 10/30/18 11:22 AM, Keith Busch wrote:
> On Tue, Oct 30, 2018 at 11:09:04AM -0600, Jens Axboe wrote:
>> Pretty trivial, below. This also keeps the queue mapping calculations
>> more clean, as we don't have to do one after we're done allocating
>> IRQs.
> 
> Yep, this addresses my concern. It less efficient than PCI since PCI
> can usually jump straight to a valid vector count in a single iteration
> where this only subtracts by 1. I really can't be bothered to care for
> optimizing that, so this works for me! :) 

It definitely is less efficient than just getting the count that we
can support, but it's at probe time so I could not really be bothered
either.

Can I add your reviewed-by?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 17:25                   ` Thomas Gleixner
@ 2018-10-30 17:34                     ` Jens Axboe
  2018-10-30 17:43                       ` Jens Axboe
  2018-10-30 17:46                       ` Thomas Gleixner
  0 siblings, 2 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 17:34 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Keith Busch, linux-block, linux-scsi, linux-kernel

On 10/30/18 11:25 AM, Thomas Gleixner wrote:
> Jens,
> 
> On Tue, 30 Oct 2018, Jens Axboe wrote:
>> On 10/30/18 10:02 AM, Keith Busch wrote:
>>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
>>> that doesn't work, it will iterate down to min_vecs without returning to
>>> the caller. The caller doesn't have a chance to adjust its sets between
>>> iterations when you provide a range.
>>>
>>> The 'masks' overrun problem happens if the caller provides min_vecs
>>> as a smaller value than the sum of the set (plus any reserved).
>>>
>>> If it's up to the caller to ensure that doesn't happen, then min and
>>> max must both be the same value, and that value must also be the same as
>>> the set sum + reserved vectors. The range just becomes redundant since
>>> it is already bounded by the set.
>>>
>>> Using the nvme example, it would need something like this to prevent the
>>> 'masks' overrun:
>>
>> OK, now I hear what you are saying. And you are right, the callers needs
>> to provide minvec == maxvec for sets, and then have a loop around that
>> to adjust as needed.
> 
> But then we should enforce it in the core code, right?

Yes, I was going to ask you if you want a followup patch for that, or
an updated version of the original?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 17:33                       ` Jens Axboe
@ 2018-10-30 17:35                         ` Keith Busch
  0 siblings, 0 replies; 63+ messages in thread
From: Keith Busch @ 2018-10-30 17:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, linux-kernel, Thomas Gleixner

On Tue, Oct 30, 2018 at 11:33:51AM -0600, Jens Axboe wrote:
> On 10/30/18 11:22 AM, Keith Busch wrote:
> > On Tue, Oct 30, 2018 at 11:09:04AM -0600, Jens Axboe wrote:
> >> Pretty trivial, below. This also keeps the queue mapping calculations
> >> more clean, as we don't have to do one after we're done allocating
> >> IRQs.
> > 
> > Yep, this addresses my concern. It less efficient than PCI since PCI
> > can usually jump straight to a valid vector count in a single iteration
> > where this only subtracts by 1. I really can't be bothered to care for
> > optimizing that, so this works for me! :) 
> 
> It definitely is less efficient than just getting the count that we
> can support, but it's at probe time so I could not really be bothered
> either.
> 
> Can I add your reviewed-by?

Yes, please.

Reviewed-by: Keith Busch <keith.busch@intel.com>

> -- 
> Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 17:34                     ` Jens Axboe
@ 2018-10-30 17:43                       ` Jens Axboe
  2018-10-30 17:46                       ` Thomas Gleixner
  1 sibling, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 17:43 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Keith Busch, linux-block, linux-scsi, linux-kernel

On 10/30/18 11:34 AM, Jens Axboe wrote:
> On 10/30/18 11:25 AM, Thomas Gleixner wrote:
>> Jens,
>>
>> On Tue, 30 Oct 2018, Jens Axboe wrote:
>>> On 10/30/18 10:02 AM, Keith Busch wrote:
>>>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
>>>> that doesn't work, it will iterate down to min_vecs without returning to
>>>> the caller. The caller doesn't have a chance to adjust its sets between
>>>> iterations when you provide a range.
>>>>
>>>> The 'masks' overrun problem happens if the caller provides min_vecs
>>>> as a smaller value than the sum of the set (plus any reserved).
>>>>
>>>> If it's up to the caller to ensure that doesn't happen, then min and
>>>> max must both be the same value, and that value must also be the same as
>>>> the set sum + reserved vectors. The range just becomes redundant since
>>>> it is already bounded by the set.
>>>>
>>>> Using the nvme example, it would need something like this to prevent the
>>>> 'masks' overrun:
>>>
>>> OK, now I hear what you are saying. And you are right, the callers needs
>>> to provide minvec == maxvec for sets, and then have a loop around that
>>> to adjust as needed.
>>
>> But then we should enforce it in the core code, right?
> 
> Yes, I was going to ask you if you want a followup patch for that, or
> an updated version of the original?

Here's an incremental, I'm going to fold this into the original unless
I hear otherwise.


diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index af24ed50a245..e6c6e10b9ceb 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1036,6 +1036,13 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
 	if (maxvec < minvec)
 		return -ERANGE;
 
+	/*
+	 * If the caller is passing in sets, we can't support a range of
+	 * vectors. The caller needs to handle that.
+	 */
+	if (affd->nr_sets && minvec != maxvec)
+		return -EINVAL;
+
 	if (WARN_ON_ONCE(dev->msi_enabled))
 		return -EINVAL;
 
@@ -1087,6 +1094,13 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 	if (maxvec < minvec)
 		return -ERANGE;
 
+	/*
+	 * If the caller is passing in sets, we can't support a range of
+	 * supported vectors. The caller needs to handle that.
+	 */
+	if (affd->nr_sets && minvec != maxvec)
+		return -EINVAL;
+
 	if (WARN_ON_ONCE(dev->msix_enabled))
 		return -EINVAL;
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 17:34                     ` Jens Axboe
  2018-10-30 17:43                       ` Jens Axboe
@ 2018-10-30 17:46                       ` Thomas Gleixner
  2018-10-30 17:47                         ` Jens Axboe
  1 sibling, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2018-10-30 17:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, linux-block, linux-scsi, linux-kernel

On Tue, 30 Oct 2018, Jens Axboe wrote:
> On 10/30/18 11:25 AM, Thomas Gleixner wrote:
> > Jens,
> > 
> > On Tue, 30 Oct 2018, Jens Axboe wrote:
> >> On 10/30/18 10:02 AM, Keith Busch wrote:
> >>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
> >>> that doesn't work, it will iterate down to min_vecs without returning to
> >>> the caller. The caller doesn't have a chance to adjust its sets between
> >>> iterations when you provide a range.
> >>>
> >>> The 'masks' overrun problem happens if the caller provides min_vecs
> >>> as a smaller value than the sum of the set (plus any reserved).
> >>>
> >>> If it's up to the caller to ensure that doesn't happen, then min and
> >>> max must both be the same value, and that value must also be the same as
> >>> the set sum + reserved vectors. The range just becomes redundant since
> >>> it is already bounded by the set.
> >>>
> >>> Using the nvme example, it would need something like this to prevent the
> >>> 'masks' overrun:
> >>
> >> OK, now I hear what you are saying. And you are right, the callers needs
> >> to provide minvec == maxvec for sets, and then have a loop around that
> >> to adjust as needed.
> > 
> > But then we should enforce it in the core code, right?
> 
> Yes, I was going to ask you if you want a followup patch for that, or
> an updated version of the original?

Updated combo patch would be nice :)

Thanks

	lazytglx

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-30 17:46                       ` Thomas Gleixner
@ 2018-10-30 17:47                         ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-30 17:47 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Keith Busch, linux-block, linux-scsi, linux-kernel

On 10/30/18 11:46 AM, Thomas Gleixner wrote:
> On Tue, 30 Oct 2018, Jens Axboe wrote:
>> On 10/30/18 11:25 AM, Thomas Gleixner wrote:
>>> Jens,
>>>
>>> On Tue, 30 Oct 2018, Jens Axboe wrote:
>>>> On 10/30/18 10:02 AM, Keith Busch wrote:
>>>>> pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
>>>>> that doesn't work, it will iterate down to min_vecs without returning to
>>>>> the caller. The caller doesn't have a chance to adjust its sets between
>>>>> iterations when you provide a range.
>>>>>
>>>>> The 'masks' overrun problem happens if the caller provides min_vecs
>>>>> as a smaller value than the sum of the set (plus any reserved).
>>>>>
>>>>> If it's up to the caller to ensure that doesn't happen, then min and
>>>>> max must both be the same value, and that value must also be the same as
>>>>> the set sum + reserved vectors. The range just becomes redundant since
>>>>> it is already bounded by the set.
>>>>>
>>>>> Using the nvme example, it would need something like this to prevent the
>>>>> 'masks' overrun:
>>>>
>>>> OK, now I hear what you are saying. And you are right, the callers needs
>>>> to provide minvec == maxvec for sets, and then have a loop around that
>>>> to adjust as needed.
>>>
>>> But then we should enforce it in the core code, right?
>>
>> Yes, I was going to ask you if you want a followup patch for that, or
>> an updated version of the original?
> 
> Updated combo patch would be nice :)

I'll re-post the series with the updated combo some time later today.

> 	lazytglx

I understand :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-25 21:16 ` [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs Jens Axboe
  2018-10-25 21:52   ` Keith Busch
@ 2018-10-29  7:43   ` Hannes Reinecke
  1 sibling, 0 replies; 63+ messages in thread
From: Hannes Reinecke @ 2018-10-29  7:43 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme; +Cc: Thomas Gleixner, linux-kernel

On 10/25/18 11:16 PM, Jens Axboe wrote:
> A driver may have a need to allocate multiple sets of MSI/MSI-X
> interrupts, and have them appropriately affinitized. Add support for
> defining a number of sets in the irq_affinity structure, of varying
> sizes, and get each set affinitized correctly across the machine.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>   include/linux/interrupt.h |  4 ++++
>   kernel/irq/affinity.c     | 31 +++++++++++++++++++++++++------
>   2 files changed, 29 insertions(+), 6 deletions(-)
> 

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-25 21:52   ` Keith Busch
@ 2018-10-25 23:07     ` Jens Axboe
  0 siblings, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-25 23:07 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-nvme, Thomas Gleixner, linux-kernel

On 10/25/18 3:52 PM, Keith Busch wrote:
> On Thu, Oct 25, 2018 at 03:16:23PM -0600, Jens Axboe wrote:
>> A driver may have a need to allocate multiple sets of MSI/MSI-X
>> interrupts, and have them appropriately affinitized. Add support for
>> defining a number of sets in the irq_affinity structure, of varying
>> sizes, and get each set affinitized correctly across the machine.
> 
> <>
> 
>> @@ -258,13 +272,18 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>>  {
>>  	int resv = affd->pre_vectors + affd->post_vectors;
>>  	int vecs = maxvec - resv;
>> +	int i, set_vecs;
>>  	int ret;
>>  
>>  	if (resv > minvec)
>>  		return 0;
>>  
>>  	get_online_cpus();
>> -	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
>> +	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs);
>>  	put_online_cpus();
>> -	return ret;
>> +
>> +	for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
>> +		set_vecs += affd->sets[i];
>> +
>> +	return resv + max(ret, set_vecs);
>>  }
> 
> This is looking pretty good, but we may risk getting into an infinite
> loop in __pci_enable_msix_range() if we're requesting too many vectors
> in a set: the above code may continue returning set_vecs, overriding
> the reduced nvec that pci requested, and pci msix initialization will
> continue to fail because it is repeatedly requesting to activate the
> same vector count that failed before.

Good catch, we always want to be using min() with the passed in maxvec
in there. How about this incremental?


diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 0055e252e438..2046a0f0f0f1 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -272,18 +272,21 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
 {
 	int resv = affd->pre_vectors + affd->post_vectors;
 	int vecs = maxvec - resv;
-	int i, set_vecs;
-	int ret;
+	int set_vecs;
 
 	if (resv > minvec)
 		return 0;
 
-	get_online_cpus();
-	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs);
-	put_online_cpus();
+	if (affd->nr_sets) {
+		int i;
 
-	for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
-		set_vecs += affd->sets[i];
+		for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
+			set_vecs += affd->sets[i];
+	} else {
+		get_online_cpus();
+		set_vecs = cpumask_weight(cpu_possible_mask);
+		put_online_cpus();
+	}
 
-	return resv + max(ret, set_vecs);
+	return resv + min(set_vecs, vecs);
 }

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
  2018-10-25 21:16 ` [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs Jens Axboe
@ 2018-10-25 21:52   ` Keith Busch
  2018-10-25 23:07     ` Jens Axboe
  2018-10-29  7:43   ` Hannes Reinecke
  1 sibling, 1 reply; 63+ messages in thread
From: Keith Busch @ 2018-10-25 21:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-nvme, Thomas Gleixner, linux-kernel

On Thu, Oct 25, 2018 at 03:16:23PM -0600, Jens Axboe wrote:
> A driver may have a need to allocate multiple sets of MSI/MSI-X
> interrupts, and have them appropriately affinitized. Add support for
> defining a number of sets in the irq_affinity structure, of varying
> sizes, and get each set affinitized correctly across the machine.

<>

> @@ -258,13 +272,18 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>  {
>  	int resv = affd->pre_vectors + affd->post_vectors;
>  	int vecs = maxvec - resv;
> +	int i, set_vecs;
>  	int ret;
>  
>  	if (resv > minvec)
>  		return 0;
>  
>  	get_online_cpus();
> -	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
> +	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs);
>  	put_online_cpus();
> -	return ret;
> +
> +	for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
> +		set_vecs += affd->sets[i];
> +
> +	return resv + max(ret, set_vecs);
>  }

This is looking pretty good, but we may risk getting into an infinite
loop in __pci_enable_msix_range() if we're requesting too many vectors
in a set: the above code may continue returning set_vecs, overriding
the reduced nvec that pci requested, and pci msix initialization will
continue to fail because it is repeatedly requesting to activate the
same vector count that failed before.
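
To make the failure mode concrete, a heavily simplified paraphrase of the
MSI-X enable retry loop (not the actual msi.c code); try_enable_msix() is a
hypothetical stand-in that returns 0 on success, a smaller supportable count
on partial failure, or a negative error:

	static int enable_msix_range_sketch(struct pci_dev *dev, int minvec,
					    int nvec, const struct irq_affinity *affd)
	{
		int rc;

		for (;;) {
			/* pre-fix calc could bump nvec back up to the set sum */
			nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
			rc = try_enable_msix(dev, nvec);	/* hypothetical stand-in */
			if (rc == 0)
				return nvec;
			if (rc < 0)
				return rc;
			if (rc < minvec)
				return -ENOSPC;
			nvec = rc;	/* retry with what the hardware offered... */
			/* ...only for irq_calc_affinity_vectors() to override it again */
		}
	}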

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs
       [not found] <20181025211626.12692-1-axboe@kernel.dk>
@ 2018-10-25 21:16 ` Jens Axboe
  2018-10-25 21:52   ` Keith Busch
  2018-10-29  7:43   ` Hannes Reinecke
  0 siblings, 2 replies; 63+ messages in thread
From: Jens Axboe @ 2018-10-25 21:16 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: Jens Axboe, Thomas Gleixner, linux-kernel

A driver may have a need to allocate multiple sets of MSI/MSI-X
interrupts, and have them appropriately affinitized. Add support for
defining a number of sets in the irq_affinity structure, of varying
sizes, and get each set affinitized correctly across the machine.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/interrupt.h |  4 ++++
 kernel/irq/affinity.c     | 31 +++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index eeceac3376fc..9fce2131902c 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -247,10 +247,14 @@ struct irq_affinity_notify {
  *			the MSI(-X) vector space
  * @post_vectors:	Don't apply affinity to @post_vectors at end of
  *			the MSI(-X) vector space
+ * @nr_sets:		Length of passed in *sets array
+ * @sets:		Size of each affinitized set (one entry per set)
  */
 struct irq_affinity {
 	int	pre_vectors;
 	int	post_vectors;
+	int	nr_sets;
+	int	*sets;
 };
 
 #if defined(CONFIG_SMP)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index f4f29b9d90ee..0055e252e438 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	int curvec, usedvecs;
 	cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
 	struct cpumask *masks = NULL;
+	int i, nr_sets;
 
 	/*
 	 * If there aren't any vectors left after applying the pre/post
@@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	get_online_cpus();
 	build_node_to_cpumask(node_to_cpumask);
 
-	/* Spread on present CPUs starting from affd->pre_vectors */
-	usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
-					    node_to_cpumask, cpu_present_mask,
-					    nmsk, masks);
+	/*
+	 * Spread on present CPUs starting from affd->pre_vectors. If we
+	 * have multiple sets, build each set's affinity mask separately.
+	 */
+	nr_sets = affd->nr_sets;
+	if (!nr_sets)
+		nr_sets = 1;
+
+	for (i = 0, usedvecs = 0; i < nr_sets; i++) {
+		int this_vecs = affd->sets ? affd->sets[i] : affvecs;
+		int nr;
+
+		nr = irq_build_affinity_masks(affd, curvec, this_vecs,
+					      node_to_cpumask, cpu_present_mask,
+					      nmsk, masks + usedvecs);
+		usedvecs += nr;
+	}
 
 	/*
 	 * Spread on non present CPUs starting from the next vector to be
@@ -258,13 +272,18 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
 {
 	int resv = affd->pre_vectors + affd->post_vectors;
 	int vecs = maxvec - resv;
+	int i, set_vecs;
 	int ret;
 
 	if (resv > minvec)
 		return 0;
 
 	get_online_cpus();
-	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
+	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs);
 	put_online_cpus();
-	return ret;
+
+	for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
+		set_vecs += affd->sets[i];
+
+	return resv + max(ret, set_vecs);
 }
-- 
2.17.1
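
A minimal usage sketch of the new fields, modeled on the nvme conversion
discussed elsewhere in this thread; pdev, nr_read_queues and nr_write_queues
are illustrative:

	int irq_sets[2] = { nr_read_queues, nr_write_queues };
	struct irq_affinity affd = {
		.pre_vectors	= 1,	/* e.g. one vector reserved for an admin queue */
		.nr_sets	= ARRAY_SIZE(irq_sets),
		.sets		= irq_sets,
	};
	int nvecs = irq_sets[0] + irq_sets[1] + affd.pre_vectors;
	int ret;

	/* with sets, callers pass an exact count (see the minvec == maxvec discussion) */
	ret = pci_alloc_irq_vectors_affinity(pdev, nvecs, nvecs,
			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);

Each set is then spread across the present CPUs independently by
irq_create_affinity_masks().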


^ permalink raw reply related	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2018-10-30 17:47 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-29 16:37 [PATCHSET v2 0/14] blk-mq: Add support for multiple queue maps Jens Axboe
2018-10-29 16:37 ` [PATCH 01/14] blk-mq: kill q->mq_map Jens Axboe
2018-10-29 16:46   ` Bart Van Assche
2018-10-29 16:51     ` Jens Axboe
2018-10-29 16:37 ` [PATCH 02/14] blk-mq: abstract out queue map Jens Axboe
2018-10-29 18:33   ` Bart Van Assche
2018-10-29 16:37 ` [PATCH 03/14] blk-mq: provide dummy blk_mq_map_queue_type() helper Jens Axboe
2018-10-29 17:22   ` Bart Van Assche
2018-10-29 17:27     ` Jens Axboe
2018-10-29 16:37 ` [PATCH 04/14] blk-mq: pass in request/bio flags to queue mapping Jens Axboe
2018-10-29 17:30   ` Bart Van Assche
2018-10-29 17:33     ` Jens Axboe
2018-10-29 16:37 ` [PATCH 05/14] blk-mq: allow software queue to map to multiple hardware queues Jens Axboe
2018-10-29 17:34   ` Bart Van Assche
2018-10-29 17:35     ` Jens Axboe
2018-10-29 16:37 ` [PATCH 06/14] blk-mq: add 'type' attribute to the sysfs hctx directory Jens Axboe
2018-10-29 17:40   ` Bart Van Assche
2018-10-29 16:37 ` [PATCH 07/14] blk-mq: support multiple hctx maps Jens Axboe
2018-10-29 18:15   ` Bart Van Assche
2018-10-29 19:24     ` Jens Axboe
2018-10-29 16:37 ` [PATCH 08/14] blk-mq: separate number of hardware queues from nr_cpu_ids Jens Axboe
2018-10-29 18:31   ` Bart Van Assche
2018-10-29 16:37 ` [PATCH 09/14] blk-mq: ensure that plug lists don't straddle hardware queues Jens Axboe
2018-10-29 19:27   ` Bart Van Assche
2018-10-29 19:30     ` Jens Axboe
2018-10-29 19:49       ` Jens Axboe
2018-10-30  8:08         ` Ming Lei
2018-10-30 17:22           ` Jens Axboe
2018-10-29 16:37 ` [PATCH 10/14] blk-mq: initial support for multiple queue maps Jens Axboe
2018-10-29 19:40   ` Bart Van Assche
2018-10-29 19:53     ` Jens Axboe
2018-10-29 20:00       ` Bart Van Assche
2018-10-29 20:09         ` Jens Axboe
2018-10-29 20:25           ` Bart Van Assche
2018-10-29 20:29             ` Jens Axboe
2018-10-29 16:37 ` [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs Jens Axboe
2018-10-29 17:08   ` Thomas Gleixner
2018-10-29 17:09     ` Jens Axboe
2018-10-30  9:25   ` Ming Lei
2018-10-30 14:26   ` Keith Busch
2018-10-30 14:36     ` Jens Axboe
2018-10-30 14:45       ` Keith Busch
2018-10-30 14:53         ` Jens Axboe
2018-10-30 15:08           ` Keith Busch
2018-10-30 15:18             ` Jens Axboe
2018-10-30 16:02               ` Keith Busch
2018-10-30 16:42                 ` Jens Axboe
2018-10-30 17:09                   ` Jens Axboe
2018-10-30 17:22                     ` Keith Busch
2018-10-30 17:33                       ` Jens Axboe
2018-10-30 17:35                         ` Keith Busch
2018-10-30 17:25                   ` Thomas Gleixner
2018-10-30 17:34                     ` Jens Axboe
2018-10-30 17:43                       ` Jens Axboe
2018-10-30 17:46                       ` Thomas Gleixner
2018-10-30 17:47                         ` Jens Axboe
2018-10-29 16:37 ` [PATCH 12/14] nvme: utilize two queue maps, one for reads and one for writes Jens Axboe
2018-10-29 16:37 ` [PATCH 13/14] block: add REQ_HIPRI and inherit it from IOCB_HIPRI Jens Axboe
2018-10-29 16:37 ` [PATCH 14/14] nvme: add separate poll queue map Jens Axboe
     [not found] <20181025211626.12692-1-axboe@kernel.dk>
2018-10-25 21:16 ` [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs Jens Axboe
2018-10-25 21:52   ` Keith Busch
2018-10-25 23:07     ` Jens Axboe
2018-10-29  7:43   ` Hannes Reinecke
